The big news from Nvidia this week, which made headlines across all forms of media, was the company’s announcement of its Vera Rubin GPU.
This week, Nvidia CEO Jensen Huang used his CES speech to highlight the new chip’s performance metrics. According to Huang, the Rubin GPU is capable of 50 PFLOPs of NVFP4 inference and 35 PFLOPs of NVFP4 training performance, which is 5x and 3.5x the performance of Blackwell.
But it won’t be available until the second half of 2026. So what should companies do now?
The current Nvidia GPU architecture is Blackwell, which was announced in 2024 as the successor to Hopper. Alongside that release, Nvidia emphasized that its product engineering journey was also about extracting as much performance as possible from the previous Grace Hopper architecture.
It’s a direction that will also apply to Blackwell, with Vera Rubin arriving later this year.
"We continue to optimize our inference and training stacks for the Blackwell architecture," Dave Salvator, director of accelerated computing products at Nvidia, told VentureBeat.
The same week that Vera Rubin was touted by Nvidia’s CEO as its most powerful GPU ever, the company released new research showing improved Blackwell performance.
Nvidia managed to increase per-GPU Blackwell inference performance by up to 2.8x in just three months.
The performance gains come from a series of innovations that have been added to the Nvidia TensorRT-LLM inference engine. These optimizations apply to existing hardware, allowing current Blackwell deployments to achieve higher throughput without hardware changes.
Performance gains are measured on DeepSeek-R1, a 671 billion parameter mixture of experts (MoE) model that activates 37 billion parameters per token.
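Those figures are worth a moment of arithmetic: on any given token, only a small slice of the model does work. A minimal sketch using just the numbers quoted above (the 2-FLOPs-per-parameter forward-pass rule is a common estimate, not an Nvidia figure):

```python
# Back-of-the-envelope MoE sparsity math, using only the figures quoted above.
total_params = 671e9   # DeepSeek-R1 total parameters
active_params = 37e9   # parameters activated per token by MoE routing

print(f"Active fraction per token: {active_params / total_params:.1%}")  # ~5.5%

# Rough per-token forward-pass compute, using the common ~2 FLOPs-per-parameter
# estimate (a rule of thumb, not an Nvidia number).
dense_flops = 2 * total_params   # an equally sized dense model
moe_flops = 2 * active_params    # the MoE model
print(f"Approx. per-token compute saving vs. dense: {dense_flops / moe_flops:.0f}x")
```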
Among the technical innovations that improve performance:
Programmatic Dependent Launch (PDL): The extended implementation reduces kernel launch latencies, thereby increasing throughput.
All-to-all communication: The new implementation of communication primitives eliminates an intermediate buffer, reducing memory overhead.
Multi-Token Prediction (MTP): Generates multiple tokens per forward pass rather than one at a time, increasing throughput across different sequence lengths.
NVFP4 format: A hardware-accelerated 4-bit floating-point format in Blackwell that reduces memory bandwidth requirements while preserving model accuracy.
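For readers unfamiliar with block-scaled 4-bit formats, the sketch below round-trips a weight matrix through a simplified NVFP4-style quantizer in NumPy. This is a conceptual illustration only, not Nvidia’s implementation: published descriptions of NVFP4 use 16-value blocks with an FP8 scale factor per block, while this sketch keeps the scale in full precision and uses simple nearest-value rounding.

```python
import numpy as np

# Magnitudes representable in the 4-bit E2M1 format that NVFP4 builds on:
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}, plus a sign bit.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Round-trip x through a simplified block-scaled FP4 quantizer."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    # One scale per block, chosen so the block maximum lands on 6.0.
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scale[scale == 0] = 1.0
    scaled = x / scale
    # Snap each value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    return (np.sign(scaled) * FP4_LEVELS[idx] * scale).reshape(orig_shape)

weights = np.random.randn(4, 16).astype(np.float32)
roundtrip = quantize_nvfp4_like(weights)
print("max abs round-trip error:", np.abs(weights - roundtrip).max())
```

At 4 bits per value instead of 16, a quantizer like this cuts weight memory traffic by roughly 4x versus FP16, which is where the bandwidth savings come from.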
The optimizations reduce the cost per million tokens and enable existing infrastructure to serve higher request volumes with lower latency. Cloud providers and enterprises can scale their AI services without an immediate hardware upgrade.
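As a concrete illustration of the serving economics, the sketch below applies the reported 2.8x gain to an assumed baseline. The dollars-per-GPU-hour and tokens-per-second figures are hypothetical placeholders, not quoted prices:

```python
# Hypothetical illustration: how a throughput multiplier lowers the cost of
# serving a million tokens on fixed hardware.
gpu_hour_cost = 3.00           # assumed fully loaded $/GPU-hour (hypothetical)
baseline_tok_per_sec = 1000.0  # assumed baseline throughput (hypothetical)
speedup = 2.8                  # software-only Blackwell gain reported by Nvidia

def cost_per_million_tokens(tok_per_sec: float) -> float:
    tokens_per_hour = tok_per_sec * 3600
    return gpu_hour_cost / tokens_per_hour * 1e6

before = cost_per_million_tokens(baseline_tok_per_sec)
after = cost_per_million_tokens(baseline_tok_per_sec * speedup)
print(f"${before:.2f} -> ${after:.2f} per million tokens ({before / after:.1f}x cheaper)")
```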
Blackwell is also widely used as the hardware foundation for training the largest large language models.
In this regard, Nvidia also reported significant gains for Blackwell when used for AI training.
Since its initial launch, the GB200 NVL72 system has delivered up to 1.4x better training performance on the same hardware, a 40% increase achieved in just five months without any hardware upgrades.
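The arithmetic on the training side is equally direct. A minimal sketch, assuming a hypothetical 30-day run, of what a software-only 1.4x gain does to wall-clock time on the same cluster:

```python
# What a 1.4x training-throughput gain means for wall-clock time on the same
# GB200 NVL72 hardware. The 30-day baseline is a hypothetical run length.
baseline_days = 30.0   # assumed pre-optimization run length (hypothetical)
speedup = 1.4          # software-only gain reported by Nvidia

optimized_days = baseline_days / speedup
print(f"Same run: {baseline_days:.0f} days -> {optimized_days:.1f} days "
      f"({1 - 1 / speedup:.0%} less wall-clock time)")
```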
The training gains came from a series of updates, including:
Optimized training recipes. Nvidia engineers have developed sophisticated training recipes that effectively leverage NVFP4 precision. Blackwell’s initial submissions used FP8 precision, but transitioning to recipes optimized for NVFP4 unlocked substantial additional performance from existing silicon.
Algorithmic refinements. Continuous software stack enhancements and algorithmic improvements have enabled the platform to extract more performance from the same hardware, demonstrating continued innovation beyond initial deployment.
Salvator noted that the high-end Blackwell Ultra is a market-leading platform specifically designed to run cutting-edge AI models and applications.
He added that the Nvidia Rubin Platform will extend the company’s market leadership and enable the next generation of MoE models, powering a new class of applications that push AI innovation even further.
Salvator explained that Vera Rubin is designed to meet the growing demand for compute created by the continued growth in model sizes and the generation of reasoning tokens by leading MoE models.
"Blackwell and Rubin can serve the same models, but the difference lies in the performance, efficiency and cost of the tokens." he said.
According to early test results from Nvidia, compared to Blackwell, Rubin can train large MoE models with a quarter of the GPUs, generate inference tokens with 10x higher throughput per watt, and run inference at 1/10th the cost per token.
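Taken at face value, those multipliers translate directly into capacity planning. A rough sketch using only the ratios Nvidia quotes; the Blackwell baseline values are hypothetical, normalized placeholders:

```python
# Rough Rubin-vs-Blackwell comparison using only the multipliers Nvidia
# quotes. All baseline values are hypothetical, normalized placeholders.
blackwell_training_gpus = 1024   # hypothetical cluster size
blackwell_tokens_per_watt = 1.0  # normalized baseline
blackwell_cost_per_token = 1.0   # normalized baseline

rubin_training_gpus = blackwell_training_gpus / 4        # a quarter of the GPUs
rubin_tokens_per_watt = blackwell_tokens_per_watt * 10   # 10x throughput per watt
rubin_cost_per_token = blackwell_cost_per_token / 10     # 1/10th the cost per token

print(f"Training GPUs:   {blackwell_training_gpus} -> {rubin_training_gpus:.0f}")
print(f"Tokens per watt: {blackwell_tokens_per_watt:.0f} -> {rubin_tokens_per_watt:.0f} (normalized)")
print(f"Cost per token:  {blackwell_cost_per_token:.2f} -> {rubin_cost_per_token:.2f} (normalized)")
```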
"Better token throughput performance and efficiency means new models can be built with more reasoning capabilities and faster agent-to-agent interaction, creating better intelligence at lower cost." Salvator said.
For companies deploying AI infrastructure today, current investments in Blackwell remain sensible despite the arrival of Vera Rubin later this year.
Organizations with existing Blackwell deployments can benefit immediately from the 2.8x inference improvement and 1.4x training gain by updating to the latest versions of TensorRT-LLM, delivering real savings without capital expenditure. For those planning new deployments in the first half of 2026, it makes sense to proceed with Blackwell: waiting six months means delaying AI initiatives and potentially falling behind competitors that are already deployed.
However, companies planning to build large-scale infrastructure in late 2026 and beyond should incorporate Vera Rubin into their roadmaps. The 10x improvement in throughput per watt and the 1/10th cost per token represent transformational savings for large-scale AI operations.
The smart approach is a phased deployment: leverage Blackwell for immediate needs while designing systems that can integrate Vera Rubin when available. Nvidia’s continuous optimization model means this isn’t a binary choice; businesses can maximize the value of current deployments without sacrificing long-term competitiveness.