News

vLLM Docs
docs.vllm.ai > projects > vllm-omni > en > latest > getting_started > installation > gpu

GPU - vLLM-Omni

4+ months, 1+ week ago  (347+ words) vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations for framework and models. vLLM-Omni is currently not natively supported on Windows. It's recommended to use uv, a very fast…
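The snippet's recommendation to use uv can be followed with the usual uv workflow. This is a sketch only: the package name vllm-omni is an assumption based on the project path above, not confirmed by the truncated snippet.

```shell
# Hedged sketch of an install with uv, as the snippet recommends.
# The package name "vllm-omni" is an assumption, not confirmed above.
uv venv --python 3.12          # create a virtual environment
source .venv/bin/activate      # activate it
uv pip install vllm-omni       # install into the active environment
```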

vLLM Docs
docs.vllm.ai > en > v0.18.1 > examples > pooling > token_embed

Token Embed

2+ days, 14+ hours ago  (19+ words) ColQwen3 Token Embed Online · Jina Embeddings V4 Offline · Multi Vector Retrieval Offline · Multi Vector Retrieval Online…

vLLM Docs
docs.vllm.ai > projects > ascend > zh-cn > v0.18.0 > faqs.html

FAQs - vllm-ascend

2+ days, 19+ hours ago  (1048+ words) [v0.17.0rc1] FAQ & Feedback · [v0.13.0] FAQ & Feedback · Atlas A2 series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2) · Atlas 800I A2 series (Atlas 800I A2) · Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD) · Atlas 800I A3 Inference series (Atlas 800I A3) · [Experimental] Atlas 300I Inference series…

vLLM Docs
docs.vllm.ai > projects > ascend > en > v0.18.0 > community > versioning_policy.html

Versioning Policy

2+ days, 23+ hours ago  (662+ words) Final releases: typically scheduled every three months, carefully aligned with the vLLM upstream release cycle and the Ascend software product roadmap. Pre-releases: typically issued on demand, labeled rcN to indicate the Nth release candidate. They…

vLLM Docs
docs.vllm.ai > projects > ascend > zh-cn > v0.18.0 > developer_guide > feature_guide > add_custom_aclnn_op.html

Adding a custom aclnn operation

2+ days, 22+ hours ago  (137+ words) This document describes how to add a custom aclnn operation to vllm-ascend. Custom aclnn operations are built and installed into the vllm_ascend/cann_ops_custom directory during the vllm-ascend build process. The aclnn operators are then bound to the torch.ops._C_ascend module, enabling users to…

vLLM Docs
docs.vllm.ai > en > v0.18.1 > api > vllm > tool_parsers > phi4mini_tool_parser

phi4mini_tool_parser

2+ days, 22+ hours ago  (41+ words) Tool call parser for phi-4-mini models, intended for use with the examples/tool_chat_template_llama.jinja template. Used when --enable-auto-tool-choice and --tool-call-parser phi4_mini_json are both set. Extracts the tool calls from a complete model response…
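Putting the flags from the snippet together, a server launch might look like the following sketch. The flags and template path come from the snippet; the model name is illustrative.

```shell
# Hedged sketch: serving a phi-4-mini model with the tool-call parser
# named in the snippet. The model identifier below is illustrative.
vllm serve microsoft/Phi-4-mini-instruct \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --chat-template examples/tool_chat_template_llama.jinja
```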

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > distributed > kv_transfer > kv_connector > v1 > ssm_conv_transfer_utils

ssm_conv_transfer_utils

3+ days, 17+ minutes ago  (234+ words) With the DS conv state layout (dim, state_len), the x/B/C sub-projections are contiguous in memory. Each D rank reads its x, B, C slices via 3 separate RDMA transfers; no P-side permutation is needed. Per-rank byte sizes of the x, B, C sub-projections in…
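The slice arithmetic the snippet describes can be sketched as follows. This is a hypothetical illustration, not vLLM's actual utility: the function name, the even sharding of each sub-projection across ranks, and the 2-byte element size are all assumptions.

```python
# Hypothetical sketch: per-rank byte offsets and sizes of the x/B/C
# sub-projections in a contiguous (dim, state_len) conv-state layout,
# as the ssm_conv_transfer_utils snippet describes. All names and the
# even-sharding assumption are illustrative, not vLLM's actual API.
def rank_slice_bytes(x_dim, b_dim, c_dim, state_len, tp_size, rank, itemsize=2):
    """Return [(offset, nbytes), ...] for this rank's x, B and C slices,
    assuming each sub-projection is sharded evenly across tp_size ranks.
    Three entries = the three separate RDMA transfers per D rank."""
    slices = []
    base = 0  # row where the current sub-projection starts
    for dim in (x_dim, b_dim, c_dim):
        per_rank = dim // tp_size
        offset = (base + rank * per_rank) * state_len * itemsize
        nbytes = per_rank * state_len * itemsize
        slices.append((offset, nbytes))
        base += dim
    return slices
```

Because the layout keeps x, B and C contiguous, each slice is one dense region, which is what makes the three independent RDMA reads possible without any sender-side permutation.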

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > runner > chunking_moe_runner

chunking_moe_runner

3+ days, 2+ hours ago  (102+ words) MoE runner wrapper that adds chunked processing to any MoERunnerBase. This runner wraps an inner MoERunnerBase and overrides _forward_impl to process large batches by breaking them into smaller chunks. Each chunk is delegated…
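The wrapping pattern the snippet describes can be sketched in a few lines. The function below is illustrative, not vLLM's actual class: it shows the generic "split, delegate, concatenate" shape of chunked forward processing.

```python
# Hypothetical sketch of the chunked-processing pattern described in the
# chunking_moe_runner snippet; names are illustrative, not vLLM's API.
from typing import Callable, List


def forward_in_chunks(
    inner_forward: Callable[[List[float]], List[float]],
    batch: List[float],
    chunk_size: int,
) -> List[float]:
    """Break a large batch into chunks of at most chunk_size items,
    delegate each chunk to the inner runner's forward, and concatenate
    the per-chunk outputs in order."""
    out: List[float] = []
    for start in range(0, len(batch), chunk_size):
        chunk = batch[start:start + chunk_size]
        out.extend(inner_forward(chunk))
    return out
```

Since each chunk goes through the unmodified inner forward, the wrapper bounds peak activation memory without changing the result.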

vLLM Docs
docs.vllm.ai > projects > recipes > en > latest > Google > Gemma4.html

Gemma 4 Usage Guide

1+ week, 1+ hour ago  (476+ words) Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. Gemma 4 models support advanced capabilities including structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution…

vLLM Docs
docs.vllm.ai > en > latest > api > vllm > v1 > executor > ray_executor_v2

ray_executor_v2

1+ week, 21+ hours ago  (240+ words) Inherits from MultiprocExecutor to reuse the MQ-based control plane and NCCL data plane. Workers are Ray actors. Async scheduling, inherited from MultiprocExecutor, is enabled; this is critical for RayExecutorV2 to be performant. Build a runtime_env dict for Ray…