News
Customization Guide - Vime
1+ day, 19+ hour ago (779+ words) vime provides extensive customization capabilities through function path arguments. These allow you to inject custom logic at various stages of the training and rollout pipeline without modifying the core codebase. Below is a summary of all available customization interfaces and…
vllm.kernels.helion.ops.fused_qk_norm_rope
3+ day, 1+ hour ago (70+ words) vLLM docs Pick the best pre-tuned config for the given input shape. - Find the closest q_heads among available configs (exact match preferred). - Find the closest kv_heads among available configs (exact match preferred). - Among the num_tokens values tuned for that q_heads and kv_heads, pick the…
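The first two steps in the snippet above (closest match, exact match preferred) can be sketched as follows. This is only an illustration, not vLLM's code: the `configs` mapping, the helper names `closest` and `pick_heads`, the tie-break toward the smaller value, and the choice to restrict kv_heads to configs that share the chosen q_heads are all assumptions. The num_tokens rule is truncated in the snippet, so it is left out.

```python
def closest(target, available):
    """Return the value in `available` nearest to `target`.

    An exact match always wins because its distance is 0.
    Ties go to the smaller value (an assumption made for this sketch).
    """
    return min(available, key=lambda v: (abs(v - target), v))


def pick_heads(configs, q_heads, kv_heads):
    """Pick the (q_heads, kv_heads) key to look up in `configs`.

    `configs` is a hypothetical mapping {(q_heads, kv_heads): config}.
    """
    q = closest(q_heads, {key[0] for key in configs})
    # Assumption: only consider kv_heads values tuned together with the chosen q.
    kv = closest(kv_heads, {key[1] for key in configs if key[0] == q})
    return q, kv


configs = {(32, 8): "cfg_a", (64, 8): "cfg_b", (32, 4): "cfg_c"}
print(pick_heads(configs, 32, 8))
print(pick_heads(configs, 60, 2))
```

Sorting on the pair `(distance, value)` keeps the selection deterministic when two tuned values are equally far from the request.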
Quickstart - vLLM-Omni
7+ mon, 13+ hour ago (172+ words) This guide will help you quickly get started with vLLM-Omni to perform: For installation on GPU from source: For additional installation methods, please see the installation guide. It is important to install the same major & minor version of v…
protocol - vLLM
3+ day, 4+ hour ago (223+ words) Prompt token count for usage; defaults to 0 if omitted. Mirrors chat_request on DerenderChatRequest. Required by the parsing so parsers receive the full request context. One prompt token count per response; each defaults to 0 if omitted. Char-level (start, end) offsets…
vllm.model_executor.layers.attention.rswa_attention
3+ day, 7+ hour ago (86+ words) vLLM docs Attention layer that reports RSWASpec as its KV cache spec. Drop-in replacement for the standard Attention layer when the model is configured with Reference Sliding Window Attention (R-SWA, rswa_window > 0). The actual masking logic lives in the attention backend…
fused_ops - vLLM
3+ day, 8+ hour ago (98+ words) vLLM docs Fused ops for deepseek_v32 (eager / breakable-cudagraph path). These recover fusions that vLLM's torch.compile passes would normally do but that don't fire when running eager under the breakable CUDA graph. All-reduce + add residual + (standard) RMSNorm, fused via…
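For orientation, the three steps that the fused op combines (all-reduce, add residual, standard RMSNorm) can be written out unfused in plain Python. This is a reference sketch only, not the vLLM kernel: the function names, the list-of-lists stand-in for per-rank tensors, the `eps` default, and returning the updated residual alongside the normalized output are assumptions made for illustration.

```python
import math


def all_reduce_sum(per_rank):
    """Simulate a sum all-reduce: element-wise sum over each rank's vector."""
    return [sum(vals) for vals in zip(*per_rank)]


def add_residual_rmsnorm(x, residual, weight, eps=1e-6):
    """Add the residual, then apply standard RMSNorm.

    Returns (normalized output, updated residual).
    """
    h = [a + b for a, b in zip(x, residual)]
    # Root mean square over the hidden dimension, stabilized by eps.
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    out = [w * v / rms for v, w in zip(h, weight)]
    return out, h


# Two hypothetical ranks, each holding a partial result for one token.
reduced = all_reduce_sum([[1.0, 2.0], [2.0, 2.0]])
out, new_residual = add_residual_rmsnorm(reduced, [0.0, 0.0], [1.0, 1.0], eps=0.0)
print(reduced, new_residual, out)
```

Run unfused, each step reads and writes the full hidden vector; a fused kernel performs the same arithmetic in one pass.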
vllm.model_executor.models.openai_privacy_filter
3+ day, 7+ hour ago (46+ words) vLLM docs Inference-only OpenAI Privacy Filter model. gpt-oss reused as a bidirectional encoder for token classification: every layer runs non-causal attention with a banded sliding_window mask, and the LM head is replaced with a 33-class BIOES score head…
serving - vLLM
3+ day, 4+ hour ago (49+ words) Extract multimodal metadata from a rendered engine prompt. Returns None for text-only prompts. Validate the model and preprocess a chat completion request. Validate the model and preprocess a completion request. This is the authoritative implementation used directly by the GPU-less…
hpc_moe - vLLM
3+ day, 8+ hour ago (174+ words) vLLM docs MoE implementation powered by HPC. Only supported on NVIDIA Hopper GPUs (e.g. H20, H200), and currently limited to FP8 models such as Hy3-FP8, Qwen3-235B-A22B-FP8, etc. Compute the shapes for the temporary and final outputs of the two gemms workspace_shapes(M, N, K, topk,…
vllm.model_executor.warmup.qwen_triton_warmup
3+ day, 5+ hour ago (26+ words) vLLM docs Warm up Qwen Triton kernels from the loaded model's compile keys. Warm Qwen Triton kernels reported by the JIT monitor…