News
GPU - vLLM-Omni
4+ mon, 1+ week ago (347+ words) vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations of the framework and models. vLLM-Omni is currently not natively supported on Windows. It's recommended to use uv, a very fast…
Token Embed
2+ day, 14+ hour ago (19+ words) vLLM Docs: ColQwen3 Token Embed Online, Jina Embeddings V4 Offline, Multi Vector Retrieval Offline, Multi Vector Retrieval Online…
FAQs " vllm-ascend
2+ day, 19+ hour ago (1048+ words) [v0.17.0rc1] FAQ & Feedback [v0.13.0] FAQ & Feedback Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2), Atlas 800I A2 Inference series (Atlas 800I A2), Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD), Atlas 800I A3 Inference series (Atlas 800I A3), [Experimental] Atlas 300I Inference series…
Versioning Policy
2+ day, 23+ hour ago (662+ words) Final releases: typically scheduled every three months, with careful alignment to the vLLM upstream release cycle and the Ascend software product roadmap. Pre-releases: typically issued on demand, labeled with rcN to indicate the Nth release candidate. They…
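The rcN tag convention described above can be sketched with a small, hypothetical parser; the regex and the function name are illustrative and not part of vllm-ascend:

```python
import re

# Illustrative only: final releases look like "v0.13.0",
# pre-releases append "rcN", e.g. "v0.17.0rc1".
_RELEASE_TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")

def parse_release_tag(tag: str):
    """Split a release tag into (major, minor, patch, rc), where rc is
    None for a final release and the candidate number for an rcN tag."""
    m = _RELEASE_TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a release tag: {tag!r}")
    major, minor, patch, rc = m.groups()
    return int(major), int(minor), int(patch), int(rc) if rc else None
```

For example, `parse_release_tag("v0.17.0rc1")` identifies the first release candidate of v0.17.0, while `parse_release_tag("v0.13.0")` reports no rc component at all.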
Adding a custom aclnn operation
2+ day, 22+ hour ago (137+ words) This document describes how to add a custom aclnn operation to vllm-ascend. Custom aclnn operations are built and installed into the vllm_ascend/cann_ops_custom directory during the vllm-ascend build process. The aclnn operators are then bound to the torch.ops._C_ascend module, enabling users to…
phi4mini_tool_parser
2+ day, 22+ hour ago (41+ words) vLLM Docs Tool call parser for phi-4-mini models, intended for use with the examples/tool_chat_template_llama.jinja template. Used when --enable-auto-tool-choice --tool-call-parser phi4_mini_json are both set. Extracts the tool calls from a complete model response…
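The core idea of a JSON tool-call parser can be sketched in a few lines. This is a deliberately minimal stand-in, not vLLM's actual phi4_mini_json parser: it assumes the model emits its tool calls as a JSON array of objects with "name" and "arguments" keys somewhere in the response text.

```python
import json
import re

def extract_tool_calls(text: str) -> list:
    """Minimal sketch: grab the first-to-last bracketed span in the model
    output and read it as a JSON list of {"name": ..., "arguments": ...}
    tool calls. Returns [] when no well-formed array is present."""
    m = re.search(r"\[.*\]", text, re.DOTALL)
    if m is None:
        return []
    try:
        calls = json.loads(m.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only entries that actually look like tool calls.
    return [c for c in calls if isinstance(c, dict) and "name" in c]
```

A real parser must also handle streaming (partial) output, malformed JSON recovery, and the model's exact prompt format, which is why the template and the `--tool-call-parser` flag travel together.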
ssm_conv_transfer_utils
3+ day, 17+ min ago (234+ words) With the DS conv state layout (dim, state_len), the x/B/C sub-projections are contiguous in memory. Each D rank reads its x, B, C slices via 3 separate RDMA transfers; no P-side permutation is needed. Per-rank byte sizes of x, B, C sub-projections in…
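The per-rank bookkeeping described above can be sketched as follows. With the conv state laid out as (dim, state_len), each sub-projection occupies a contiguous channel range, so a decode (D) rank's share of x, B, and C is three independent (offset, length) pairs. All dimension values and the function name here are illustrative assumptions, not vllm-ascend's actual configuration:

```python
def rank_transfer_bytes(d_x, d_b, d_c, state_len, world_size, rank, itemsize=2):
    """For one D rank, return {name: (byte_offset, nbytes)} per sub-projection,
    assuming each sub-projection splits evenly across ranks along the channel
    dimension and the state is stored row-major as (dim, state_len)."""
    row_bytes = state_len * itemsize  # one channel's conv state
    out = {}
    # Channel bases: x occupies [0, d_x), B follows x, C follows B.
    for name, dim, base in (("x", d_x, 0), ("B", d_b, d_x), ("C", d_c, d_x + d_b)):
        per_rank = dim // world_size       # channels owned by this rank
        start = base + rank * per_rank     # absolute channel offset
        out[name] = (start * row_bytes, per_rank * row_bytes)
    return out
```

Because each pair is contiguous, each of x, B, and C maps to a single RDMA read, which is the "3 separate RDMA transfers, no P-side permutation" property the preview mentions.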
chunking_moe_runner
3+ day, 2+ hour ago (102+ words) vLLM Docs MoE runner wrapper that adds chunked processing to any MoERunnerBase. This runner wraps an inner MoERunnerBase and overrides _forward_impl to process large batches by breaking them into smaller chunks. Each chunk is delegated…
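The wrapping-and-chunking pattern this runner uses can be shown with a framework-free sketch. The class and method names below are illustrative, not vLLM's actual MoERunnerBase API:

```python
class ChunkedRunner:
    """Wrap an inner runner and split large batches into fixed-size chunks,
    delegating each chunk to the inner runner and concatenating the results
    in order. A toy stand-in for the chunked-MoE-runner idea above."""

    def __init__(self, inner, chunk_size):
        self.inner = inner            # object exposing forward(batch) -> list
        self.chunk_size = chunk_size  # max rows handed to inner per call

    def forward(self, batch):
        out = []
        for start in range(0, len(batch), self.chunk_size):
            # Delegate one chunk at a time; output order matches input order.
            out.extend(self.inner.forward(batch[start:start + self.chunk_size]))
        return out
```

The wrapper changes peak working-set size without changing the result: running chunks of size 3 through an inner runner yields the same output as one large call, just with bounded intermediate buffers.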
Gemma 4 Usage Guide
1+ week, 1+ hour ago (476+ words) Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. Gemma 4 models support advanced capabilities including structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution…
ray_executor_v2
1+ week, 21+ hour ago (240+ words) Inherits from MultiprocExecutor to reuse the MQ-based control plane and NCCL data plane. Workers are Ray actors. Async scheduling is enabled, inherited from MultiprocExecutor; this is critical for Ray Executor V2 to be performant. Build a runtime_env dict for Ray…