News
Announcing vime: A Simple, Stable, and Efficient RL Framework for LLMs
2+ week, 4+ day ago (607+ words) We are excited to introduce vime, an LLM post-training framework within the vLLM ecosystem. Built on slime's training stack and data-generation design, vime connects Megatron and vLLM into a single RL pipeline so distributed training and inference can…
vLLM Semantic Router v0.3 Themis: From Signals to Stateful Production Routing
3+ week, 19+ hour ago (1636+ words) vLLM Semantic Router v0.3, codename Themis, is where semantic routing becomes stateful, observable, and production-ready for real AI traffic. The previous two releases set the stage. Iris made routing decisions composable. Athena rebuilt the model foundation and expanded the router…
Fast & Efficient LLM Inference with vLLM: A New Course with DeepLearning.AI
3+ week, 3+ day ago (492+ words) We're excited to announce, with Red Hat and Andrew Ng's DeepLearning.AI, a hands-on course that walks through LLM fundamentals and the full optimize, deploy, and benchmark AI deployment lifecycle using vLLM and its ecosystem of tools. It's…
Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation | vLLM Blog
2+ mon, 2+ week ago (1331+ words) TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MO...
EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec
1+ mon, 1+ day ago (439+ words) The EAGLE series, including EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE team, vLLM team, and TorchSpec…
DeepSeek V4 in vLLM: Efficient Long-context Attention
2+ mon, 3+ day ago (1400+ words) These models feature an efficient long-context attention mechanism, purpose-built for tasks involving up to one million tokens. While the new attention design may appear intricate on first reading, its underlying principles are straightforward once examined systematically. This blog post is…
Google News
3+ mon, 3+ day ago (11+ words) Model Runner V2: A Modular and Faster Core for vLLM - vllm.ai...