Search Results

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > quantization > compressed_tensors > schemes > compressed_tensors_w8a8_mxfp8

compressed_tensors_w8a8_mxfp8

2+ weeks, 3+ days ago (50+ words) Compressed-tensors scheme for MXFP8 quantization (W8A8). Loads pre-quantized MXFP8 weights from compressed-tensors checkpoints. Activations are dynamically quantized to MXFP8 at runtime. MXFP8 format:
- 8-bit float weights (E4M3) stored as float8_e4m3fn
- Per-group E8M0 scales (uint8) with group_size=32
- Activations dynamically quantized to MXFP8 during inference...
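The dynamic activation quantization mentioned in this snippet can be illustrated with a small pure-Python sketch. It assumes the power-of-two scale rule commonly described for the OCP Microscaling (MX) formats: the shared exponent is floor(log2(amax)) minus the largest E4M3 exponent. The helper names (`decode_e4m3fn`, `quantize_group`) are hypothetical, rounding is done by nearest-value lookup rather than hardware rounding, and none of this is taken from vLLM's kernels.

```python
import math

GROUP_SIZE = 32   # group size stated in the snippet
E8M0_BIAS = 127   # assumption: E8M0 stores a power-of-two exponent with bias 127
E4M3_EMAX = 8     # largest E4M3 value is 448 = 1.75 * 2**8


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN pattern; the format has no infinity
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# Every finite E4M3 value paired with its byte encoding.
FINITE_E4M3 = [(decode_e4m3fn(b), b) for b in range(256)
               if not math.isnan(decode_e4m3fn(b))]


def quantize_group(values):
    """Return (E8M0 scale byte, list of E4M3 bytes) for one group of 32 floats."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        shared_exp = 0  # arbitrary choice for an all-zero group
    else:
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
        shared_exp = math.frexp(amax)[1] - 1 - E4M3_EMAX
    shared_exp = max(-E8M0_BIAS, min(E8M0_BIAS, shared_exp))
    scale = 2.0 ** shared_exp
    # Nearest representable value; anything above 448 saturates to 448.
    codes = [min(FINITE_E4M3, key=lambda t: abs(t[0] - v / scale))[1]
             for v in values]
    return shared_exp + E8M0_BIAS, codes
```

For a group of 32 ones, this sketch picks a scale of 2**-8 (byte 119) and stores every element as 256.0 (byte 0x78), so the product recovers 1.0 exactly.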

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > quantization > compressed_tensors > compressed_tensors_moe > compressed_tensors_moe_w8a8_mxfp8

compressed_tensors_moe_w8a8_mxfp8

2+ weeks, 3+ days ago (42+ words) Compressed-tensors MoE method for pre-quantized MXFP8 (W8A8) checkpoints. Loads FP8 (E4M3) weights with E8M0 uint8 per-group scales (group_size=32) from checkpoint. Activations are dynamically quantized to MXFP8 at runtime. Supports FlashInfer TRT-LLM and Marlin backends (auto-selected)....

vLLM Docs
docs.vllm.ai > en > stable > api > vllm > model_executor > models > gemma4_utils

gemma4_utils

2+ weeks, 3+ days ago (340+ words) Gemma4 output parsing utilities for offline inference. Standalone functions that parse decoded model text to extract structured thinking content and tool calls from Gemma4 models. These are pure-Python utilities with zero heavy dependencies; they work on raw decoded strings from any inference…...

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > kernels > linear > mxfp8 > Mxfp8LinearKernel

Mxfp8LinearKernel

2+ weeks, 4+ days ago (46+ words) Base class for MXFP8 quantized linear kernels. Each subclass implements a specific GEMM backend (FlashInfer CUTLASS, Marlin, emulation). Configuration for an MXFP8 linear layer. All MXFP8 layers share the same structure: FP8-E4M3 weights with uint8 (E8M0) per-block scales at block size 32....

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > kernels > linear > mxfp8 > emulation

emulation

2+ weeks, 4+ days ago (10+ words) Software emulation fallback for MXFP8 (dequant to BF16)....
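What "dequant" means for one MXFP8 block can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the E8M0 byte is read as a power-of-two exponent with bias 127, the function names are hypothetical, and results are ordinary Python floats rather than BF16 tensors. It is not the vLLM emulation kernel.

```python
import math

E8M0_BIAS = 127  # assumption: scale = 2 ** (byte - 127)


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN pattern; the format has no infinity
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequantize_block(codes, scale_byte):
    """Expand one block of E4M3 bytes using its shared E8M0 scale byte."""
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [decode_e4m3fn(c) * scale for c in codes]
```

Because the scale is a power of two, multiplying only shifts the exponent. An E4M3 value carries three mantissa bits, so each product fits in BF16's seven mantissa bits as long as the resulting exponent stays in range.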

vLLM Recipes
docs.vllm.ai > projects > recipes > en > latest > Tencent-Hunyuan > Hunyuan-Instruct.html

Hunyuan-A13B Instruct Usage Guide

2+ weeks, 3+ days ago (62+ words) This guide provides instructions to install and run Hunyuan-A13B-Instruct on AMD GPUs. Note: The vLLM wheel for ROCm requires Python 3.12 and glibc >= 2.35. If your environment does not meet these requirements,…...

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > kernels > linear > mxfp8

mxfp8 - vLLM

2+ weeks, 4+ days ago (28+ words) Configuration for an MXFP8 linear layer. All MXFP8 layers share the same structure: FP8-E4M3 weights with uint8 (E8M0) per-block scales at block size 32....

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > transformers_utils > processors > fireredlid

fireredlid

2+ weeks, 4+ days ago (73+ words) FireRedLID feature extractor and processor. - Raw waveform → 80-dim log-mel filterbank (via kaldi_native_fbank) The Processor wraps the Feature Extractor and a tokenizer. Extracts 80-dim log-mel filterbank features from raw waveforms, applies CMVN, and returns padded feature tensors…...
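The CMVN and padding steps named in this snippet can be sketched as follows. This is a per-utterance variant written with plain Python lists; the function names `cmvn` and `pad_batch` are hypothetical, and the actual processor may instead apply precomputed global statistics. Features are assumed to arrive as a list of frames, each a list of filterbank values.

```python
import math


def cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance over time."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    # eps avoids division by zero for a constant feature dimension.
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
            for f in frames]


def pad_batch(batch, pad_value=0.0):
    """Right-pad every utterance with constant frames up to the longest one."""
    max_len = max(len(utt) for utt in batch)
    dim = len(batch[0][0])
    return [utt + [[pad_value] * dim for _ in range(max_len - len(utt))]
            for utt in batch]
```

Normalizing before padding keeps the pad frames out of the mean and variance estimates.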

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > kernels > linear > mixed_precision > triton_w4a16

triton_w4a16

2+ weeks, 4+ days ago (169+ words) Triton-based W4A16 GEMM kernel for ROCm (MI300 and newer). Supports GPTQ-format int4 weights (uint4b8 symmetric, uint4 asymmetric) with grouped quantization. Weight tensors are transposed from the compressed-tensors checkpoint layout to the kernel's [K, N//8] layout. Fused W4A16 GEMM using GPTQ-packed int4 weights. Activation matrix [M, K], float16 or…...
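The N//8 in the layout above reflects eight 4-bit weights sharing one 32-bit word. A minimal sketch of that packing follows, under two assumptions not confirmed by the snippet: the lowest nibble holds the first weight, and "uint4b8" means an unsigned nibble with a bias of 8 (stored value minus 8 gives the signed weight). The function names are hypothetical and this is not the Triton kernel's code.

```python
NIBBLES_PER_WORD = 8  # eight 4-bit weights per 32-bit word
UINT4B8_BIAS = 8      # assumption: stored nibble = signed weight + 8


def pack_uint4b8(weights):
    """Pack eight signed weights in [-8, 7] into one 32-bit word, low nibble first."""
    assert len(weights) == NIBBLES_PER_WORD
    word = 0
    for i, w in enumerate(weights):
        assert -8 <= w <= 7
        word |= (w + UINT4B8_BIAS) << (4 * i)
    return word


def unpack_uint4b8(word):
    """Recover the eight signed weights from one packed 32-bit word."""
    word &= 0xFFFFFFFF
    return [((word >> (4 * i)) & 0xF) - UINT4B8_BIAS
            for i in range(NIBBLES_PER_WORD)]


def dequantize_word(word, group_scale):
    """Symmetric grouped dequantization: weight = signed value * group scale."""
    return [q * group_scale for q in unpack_uint4b8(word)]
```

With these assumptions, the weights -8 through -1 pack to 0x76543210, which makes the nibble order easy to check by eye.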

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > kernels > linear > nvfp4 > base

base - vLLM

2+ weeks, 5+ days ago (133+ words) Base class for NVFP4 quantized linear kernels. Each subclass implements a specific GEMM backend (CUTLASS, Marlin, etc). The kernel selection mechanism iterates over registered subclasses in priority order, calling is_supported and can_implement to find the best match for the current hardware. Run the…...
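The selection mechanism described in this snippet follows a common first-match pattern, sketched below. Only the method names `is_supported` and `can_implement` come from the snippet. Their signatures, the boolean return type, the example classes, the registry list, and the capability numbers are all hypothetical stand-ins, not vLLM's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical layer description used only for this sketch."""
    group_size: int


class KernelBase:
    @classmethod
    def is_supported(cls, capability: int) -> bool:
        """Can this backend run on the current hardware at all?"""
        raise NotImplementedError

    @classmethod
    def can_implement(cls, config: LayerConfig) -> bool:
        """Can this backend handle this particular layer configuration?"""
        raise NotImplementedError


class FastKernel(KernelBase):
    @classmethod
    def is_supported(cls, capability):
        return capability >= 100  # made-up hardware threshold

    @classmethod
    def can_implement(cls, config):
        return config.group_size == 16


class FallbackKernel(KernelBase):
    @classmethod
    def is_supported(cls, capability):
        return True

    @classmethod
    def can_implement(cls, config):
        return True


# Registered subclasses, highest priority first.
REGISTRY = [FastKernel, FallbackKernel]


def choose_kernel(capability: int, config: LayerConfig):
    """Return the first registered kernel that passes both checks."""
    for kernel in REGISTRY:
        if kernel.is_supported(capability) and kernel.can_implement(config):
            return kernel
    raise ValueError("no kernel supports this hardware and configuration")
```

Splitting the check in two lets a backend reject hardware cheaply before it inspects the layer configuration.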