News
GPU - vLLM-Omni
4+ mon, 1+ week ago (347+ words) vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations of the framework and models. vLLM-Omni is currently not natively supported on Windows. It's recommended to use uv, a very fast…
Token Embed
2+ day, 14+ hour ago (19+ words) vLLM Docs: ColQwen3 Token Embed Online, Jina Embeddings V4 Offline, Multi Vector Retrieval Offline, Multi Vector Retrieval Online…
FAQs " vllm-ascend
2+ day, 19+ hour ago (1048+ words) [v0.17.0rc1] FAQ & Feedback [v0.13.0] FAQ & Feedback Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2), Atlas 800I A2 Inference series (Atlas 800I A2), Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD), Atlas 800I A3 Inference series (Atlas 800I A3), [Experimental] Atlas 300I Inference series…
Versioning Policy
2+ day, 23+ hour ago (662+ words) Final releases: typically scheduled every three months, with careful alignment to the vLLM upstream release cycle and the Ascend software product roadmap. Pre-releases: typically issued on demand, labeled with rcN to indicate the Nth release candidate. They…
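The rcN tag convention described above can be sketched with a small, hypothetical parser; the regex and the function name are illustrative and not part of vllm-ascend:

```python
import re

# Illustrative only: final releases look like "v0.13.0",
# pre-releases append "rcN", e.g. "v0.17.0rc1".
_RELEASE_TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")

def parse_release_tag(tag: str):
    """Split a release tag into (major, minor, patch, rc), where rc is
    None for a final release and the candidate number for an rcN tag."""
    m = _RELEASE_TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a release tag: {tag!r}")
    major, minor, patch, rc = m.groups()
    return int(major), int(minor), int(patch), int(rc) if rc else None
```

For example, `parse_release_tag("v0.17.0rc1")` identifies the first release candidate of v0.17.0, while `parse_release_tag("v0.13.0")` reports no rc component at all.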
Adding a custom aclnn operation
2+ day, 22+ hour ago (137+ words) This document describes how to add a custom aclnn operation to vllm-ascend. Custom aclnn operations are built and installed into the vllm_ascend/cann_ops_custom directory during the vllm-ascend build process. The aclnn operators are then bound to the torch.ops._C_ascend module, enabling users to…
phi4mini_tool_parser
2+ day, 22+ hour ago (41+ words) vLLM Docs Tool call parser for phi-4-mini models, intended for use with the examples/tool_chat_template_llama.jinja template. Used when --enable-auto-tool-choice --tool-call-parser phi4_mini_json are both set. Extracts the tool calls from a complete model response…
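The core idea of a JSON tool-call parser can be sketched in a few lines. This is a deliberately minimal stand-in, not vLLM's actual phi4_mini_json parser: it assumes the model emits its tool calls as a JSON array of objects with "name" and "arguments" keys somewhere in the response text.

```python
import json
import re

def extract_tool_calls(text: str) -> list:
    """Minimal sketch: grab the first-to-last bracketed span in the model
    output and read it as a JSON list of {"name": ..., "arguments": ...}
    tool calls. Returns [] when no well-formed array is present."""
    m = re.search(r"\[.*\]", text, re.DOTALL)
    if m is None:
        return []
    try:
        calls = json.loads(m.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only entries that actually look like tool calls.
    return [c for c in calls if isinstance(c, dict) and "name" in c]
```

A real parser must also handle streaming (partial) output, malformed JSON recovery, and the model's exact prompt format, which is why the template and the `--tool-call-parser` flag travel together.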
ssm_conv_transfer_utils
3+ day, 17+ min ago (234+ words) With the DS conv state layout (dim, state_len), the x/B/C sub-projections are contiguous in memory. Each D rank reads its x, B, C slices via 3 separate RDMA transfers; no P-side permutation is needed. Per-rank byte sizes of x, B, C sub-projections in…
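The per-rank bookkeeping described above can be sketched as follows. With the conv state laid out as (dim, state_len), each sub-projection occupies a contiguous channel range, so a decode (D) rank's share of x, B, and C is three independent (offset, length) pairs. All dimension values and the function name here are illustrative assumptions, not vllm-ascend's actual configuration:

```python
def rank_transfer_bytes(d_x, d_b, d_c, state_len, world_size, rank, itemsize=2):
    """For one D rank, return {name: (byte_offset, nbytes)} per sub-projection,
    assuming each sub-projection splits evenly across ranks along the channel
    dimension and the state is stored row-major as (dim, state_len)."""
    row_bytes = state_len * itemsize  # one channel's conv state
    out = {}
    # Channel bases: x occupies [0, d_x), B follows x, C follows B.
    for name, dim, base in (("x", d_x, 0), ("B", d_b, d_x), ("C", d_c, d_x + d_b)):
        per_rank = dim // world_size       # channels owned by this rank
        start = base + rank * per_rank     # absolute channel offset
        out[name] = (start * row_bytes, per_rank * row_bytes)
    return out
```

Because each pair is contiguous, each of x, B, and C maps to a single RDMA read, which is the "3 separate RDMA transfers, no P-side permutation" property the preview mentions.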
chunking_moe_runner
3+ day, 2+ hour ago (102+ words) vLLM Docs MoE runner wrapper that adds chunked processing to any MoERunnerBase. This runner wraps an inner MoERunnerBase and overrides _forward_impl to process large batches by breaking them into smaller chunks. Each chunk is delegated…
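The wrapping-and-chunking pattern this runner uses can be shown with a framework-free sketch. The class and method names below are illustrative, not vLLM's actual MoERunnerBase API:

```python
class ChunkedRunner:
    """Wrap an inner runner and split large batches into fixed-size chunks,
    delegating each chunk to the inner runner and concatenating the results
    in order. A toy stand-in for the chunked-MoE-runner idea above."""

    def __init__(self, inner, chunk_size):
        self.inner = inner            # object exposing forward(batch) -> list
        self.chunk_size = chunk_size  # max rows handed to inner per call

    def forward(self, batch):
        out = []
        for start in range(0, len(batch), self.chunk_size):
            # Delegate one chunk at a time; output order matches input order.
            out.extend(self.inner.forward(batch[start:start + self.chunk_size]))
        return out
```

The wrapper changes peak working-set size without changing the result: running chunks of size 3 through an inner runner yields the same output as one large call, just with bounded intermediate buffers.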
Gemma 4 Usage Guide
1+ week, 1+ hour ago (476+ words) Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. Gemma 4 models support advanced capabilities including structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution…
ray_executor_v2
1+ week, 21+ hour ago (240+ words) Inherits from MultiprocExecutor to reuse the MQ-based control plane and NCCL data plane. Workers are Ray actors. Async scheduling is enabled, inherited from MultiprocExecutor; this is critical for Ray Executor V2 to be performant. Build a runtime_env dict for Ray…