glm47_moe - vLLM
4+ hour, 58+ min ago (45+ words) glm47_moe vLLM docs GLM-4.7 parser for reasoning and tool calls. GLM-4.7 uses XML-like tool calls. The function name can be followed directly by the first tag, and tool calls may have no arguments. GLM-4.7 parser backed by the declarative parser…
rms_norm_per_block_quant
3+ day, 8+ hour ago (66+ words) vLLM docs Pick the best pre-tuned config for the given input shape. - Find the closest hidden_size among available configs (exact match preferred). - Find the closest group_size among available configs (exact match preferred). - Among the num_tokens values tuned for that hidden_size and group_size, pick the…
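The first two selection steps in the snippet above can be sketched as follows. This is a minimal illustration, not vLLM's implementation: the `configs` layout, the helper names, the restriction of group_size candidates to the chosen hidden_size, and the tie-break toward the smaller value are all assumptions. The third step is truncated in the snippet, so the sketch stops at returning the num_tokens candidates.

```python
def closest(target, candidates):
    """Return the candidate nearest to target.

    An exact match always wins because its distance is 0.
    Ties go to the smaller value (an assumption, not from the docs).
    """
    return min(candidates, key=lambda c: (abs(c - target), c))


def candidate_configs(configs, hidden_size, group_size):
    """Narrow a table of pre-tuned configs to one (hidden_size, group_size) pair.

    configs is a hypothetical dict of the form
    {(hidden_size, group_size): {num_tokens: config}}.
    """
    # Step 1: closest hidden_size among the available configs.
    best_hidden = closest(hidden_size, {h for h, _ in configs})
    # Step 2: closest group_size among configs tuned for that hidden_size.
    best_group = closest(group_size, {g for h, g in configs if h == best_hidden})
    # Step 3 is elided in the source, so return the num_tokens table as-is.
    return best_hidden, best_group, configs[(best_hidden, best_group)]
```

Usage: with a table keyed by `(4096, 128)`, `(4096, 64)`, and `(8192, 128)`, a request for `hidden_size=5000, group_size=100` resolves to the `(4096, 128)` entry.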
force_first_config
3+ day, 10+ hour ago (23+ words) vLLM docs Skip Triton autotuning under VLLM_TRITON_FORCE_FIRST_CONFIG. Install the Autotuner.run replacement. Return whether the first-valid-config patch is currently installed…
nemotron_v3
3+ day, 15+ hour ago (75+ words) vLLM docs The Nemotron 3 Super model uses the same tool call and reasoning format as Qwen3 ( / + XML). This config reuses qwen3_config with a distinct name. When enable_thinking=False or force_nonempty_content=True and content is empty, reasoning and content are swapped. Nemotron V3 parser: same…
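The swap rule in the snippet above might look like the sketch below. The function name and signature are hypothetical, and the condition is read as "(enable_thinking is False or force_nonempty_content is True) and content is empty", which is one possible interpretation of the snippet's wording.

```python
def maybe_swap(reasoning, content, enable_thinking=True, force_nonempty_content=False):
    """Return (reasoning, content), swapped when the rule applies.

    Hypothetical helper: swaps the two fields when content is empty and
    either thinking is disabled or non-empty content is forced.
    """
    trigger = (not enable_thinking) or force_nonempty_content
    if trigger and not content:
        # Content is empty (None or ""), so surface the reasoning text as content.
        return content, reasoning
    return reasoning, content
```

Usage: `maybe_swap("final answer", "", enable_thinking=False)` returns `("", "final answer")`, while the defaults leave both fields untouched.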
diffusion - vLLM
1+ week, 18+ hour ago (108+ words) diffusion vLLM docs Configuration for discrete diffusion (dLLM) models. Configuration for discrete diffusion language models (dLLMs). dLLMs generate tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding. They reuse the speculative-decoding data path…
harmony - vLLM
1+ week, 1+ day ago (24+ words) harmony vLLM docs Parse Harmony output from token IDs. Tool calls are always extracted regardless of enable_auto_tools. Callers must decide whether to surface them…
per_token_group_fp8_quant
1+ week, 1+ day ago (66+ words) vLLM docs Pick the best pre-tuned config for the given input shape. - Find the closest hidden_size among available configs (exact match preferred). - Find the closest group_size among available configs (exact match preferred). - Among the num_tokens values tuned for that hidden_size and group_size, pick the…
Optimization and Tuning
7+ mon, 1+ week ago (1562+ words) This guide covers optimization strategies and performance tuning for vLLM V1. Running out of memory? Consult this guide on how to conserve memory. vLLM provides 4 optimization levels (-O0, -O1, -O2, -O3) that allow users to trade off startup time for performance: For more…
KV Offloading Usage Guide
1+ week, 3+ day ago (263+ words) The Offloading Connector currently supports CUDA, ROCm, and XPU only. Two specs are available, selected by the spec_name key in kv_connector_extra_config: Only the CPU primary tier has direct GPU access. Secondary tiers cannot read from or write to GPU memory; all GPU<->secondary…
fused_qk_norm_rope - vLLM
1+ week, 3+ day ago (93+ words) Fused QK-RMSNorm + (partial) RoPE + gate copy Triton kernel. Currently used by the Qwen3.5 attention path (attn_output_gate with NeoX-style partial RoPE). The unfused reference sequence is split -> Gemma RMSNorm -> RoPE -> gate chunk; this collapses it into a single Triton…