For two years, the dominant logic in generative AI has been that of brute force: if we want better reasoning, we need a larger model.
While "little" Models (fewer than 10 billion parameters) have become competent conversationalists, but have historically broken down when asked to perform multi-step logical deduction or complex mathematical proofs.
Now the Technological Innovation Institute (TII) in Abu Dhabi is challenging this far-reaching law by the release of the Falcon H1R 7B.
By abandoning pure Transformer orthodoxy in favor of a hybrid architecture, TII claims to have built a 7-billion-parameter model that not only rivals but outperforms competitors nearly seven times its size, including the 32B and 47B variants of Alibaba's Qwen and Nvidia's Nemotron.
This release marks a significant shift in the open-weight ecosystem, moving the battlefield from raw parameter counts to architectural efficiency and inference-time scaling.
The full model is available now on Hugging Face and can be tested in a live demo on Falcon Chat, TII's chatbot interface. TII has also published a comprehensive technical report covering the Falcon H1R 7B training approach and methodology.
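For developers who want to try it locally, a minimal loading sketch with the Hugging Face transformers library might look like the following; the repository name is an assumption and should be checked against TII's Hugging Face organization page.

```python
# Minimal sketch for loading the model with the Hugging Face `transformers`
# library. The repo id below is an assumption; verify the exact name on
# TII's organization page before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1R-7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # let transformers pick the published precision
    device_map="auto",    # spread layers across available GPUs
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```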
The defining characteristic of the Falcon H1R 7B is its hybrid backbone. Most modern LLMs rely exclusively on the Transformer architecture, which scales predictably but suffers from high memory costs when processing long sequences.
Falcon H1R 7B integrates Mamba, a state space model (SSM) architecture, alongside standard Transformer attention layers.
Originally developed by researchers Albert Gu and Tri Dao of Carnegie Mellon University and Princeton University, Mamba was introduced in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," published on December 1, 2023.
The architecture treats data sequences differently than Transformers: while Transformers compare every token to every other token (quadratic scaling), Mamba processes tokens sequentially, allowing it to handle large amounts of information with linear scaling and significantly lower computational cost.
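A rough back-of-the-envelope calculation makes that difference concrete; the hidden size and state dimension below are illustrative assumptions, not Falcon H1R's actual configuration.

```python
# Illustrative scaling comparison (assumed sizes, not Falcon H1R's real config).
d_model = 4096   # assumed hidden size
d_state = 16     # assumed SSM state dimension

for seq_len in (1_000, 10_000, 100_000):
    attn_ops = seq_len ** 2 * d_model        # attention: O(L^2 * d)
    ssm_ops = seq_len * d_model * d_state    # selective SSM: O(L * d * n)
    print(f"L={seq_len:>7,}: attention ~{attn_ops:.1e} ops, "
          f"SSM ~{ssm_ops:.1e} ops, ratio ~{attn_ops / ssm_ops:,.0f}x")
```

The gap widens linearly with sequence length, which is exactly the regime long reasoning traces occupy.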
This combination addresses one of the most persistent bottlenecks in deploying reasoning models: the cost of "thought." Reasoning models must generate chains of thought, step-by-step internal monologues, before arriving at an answer. For standard Transformers, these long contexts cause compute costs to explode.
According to TII’s technical report, the hybrid approach allows the Falcon H1R 7B to maintain high throughput even as response lengths increase. With a batch size of 64, the model processes approximately 1,500 tokens per second per GPU, almost double the speed of the competing Qwen3 8B model.
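A figure like that can be approximated with a simple timing loop; the sketch below reuses the model and tokenizer from the earlier loading example, mirrors the batch size of 64, and treats everything else (prompt, generation length, single-GPU setup) as assumptions.

```python
# Rough tokens-per-second measurement; reuses `model` and `tokenizer` from the
# loading sketch above. Batch size 64 mirrors TII's reported setting.
import time
import torch

tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"   # decoder-only models batch better with left padding
prompts = ["Solve step by step: what is 12 * 7 + 5?"] * 64
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

torch.cuda.synchronize()          # assumes a single CUDA GPU
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Rough count: new tokens per sequence (upper bound if some finish early) times batch size.
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"~{new_tokens / elapsed:,.0f} generated tokens per second on this GPU")
```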
In the benchmarks published by TII, the disparity between the Falcon H1R 7B's size and its performance is striking. On AIME 2025, a rigorous test of mathematical reasoning, Falcon H1R 7B scored 83.1%, a result that disrupts the traditional hierarchy of model sizing.
While the 7B model naturally trails large proprietary frontier models like GPT-5.2 (99.0%) and Gemini 3 Flash (97.0%) on the separate Artificial Analysis index (run by the independent organization of the same name, which has not yet evaluated the Falcon H1R 7B), it has effectively narrowed the gap between "efficient" open-weight models and mid-tier proprietary systems.
Beat bigger "Thinkers": Falcon H1R 7B (83.1%) exceeds the 15 billion parameter Apriel-v1.6-Thinker (82.7%) and the parameter of 32 billion OLMo 3 Think (73.7%), validating TII’s assertion that hybrid architectures can outperform large transformers.
In pursuit of owner leaders: It is within striking distance of Claudius 4.5 Sonnet (88.0%) and Amazon Nova 2.0 Lite (88.7%), suggesting that for specific math-intensive workflows, this 7B model provides a viable, low-latency alternative to expensive commercial APIs.
Outperforming Legacy Giants: On this specific reasoning metric, it decisively beats widely performing but older architectures like Mistral Grand 3 (38.0%) and Flame 4 Maverick (19.3%), highlighting the extent to which training in specialized reasoning ("think deeply") has become more critical than the raw scale for logic tasks.
Other key gains include:
Coding: The model scored 68.6% on the LCB v6 benchmark, which TII says is the highest score among all models tested, including those four times larger.
General reasoning: Although it dominates in math and coding, its general reasoning score (49.48%) remains competitive, falling just below the 14B and 15B parameter models but comfortably ahead of comparable 8B models.
The Falcon H1R 7B's performance is not just architectural; according to the TII technical report on the model, it stems from a rigorous two-stage training pipeline designed to maximize reasoning density without inflating parameter count.
Step 1: Cold-start supervised fine-tuning (SFT). The model underwent "cold start" SFT on a curated dataset dominated by math (56.8% of tokens) and code (29.8%), with response lengths up to 48,000 tokens.
Difficulty-aware weighting: TII rejected the standard practice of treating all data equally. Instead, it applied a weighting scheme in which "hard" problems were upweighted by 1.25x to 1.75x, while easy problems were downweighted or removed entirely to avoid overfitting to trivial tasks.
Single-teacher consistency: Ablation studies revealed that mixing reasoning traces from multiple "teacher" models actually degraded performance due to conflicting reasoning styles. TII therefore opted for a single-teacher approach to maintain a consistent internal logic.
Token-balanced loss normalization: To handle the huge variation in sequence length (from short instructions to massive reasoning chains), the team introduced a data-parallel, token-balanced normalization strategy. This technique equalizes the gradient contribution of each token across GPUs, preventing ranks with shorter sequences from destabilizing the loss, a change that yielded a steady 4-10% accuracy gain during training.
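A minimal sketch of how such token-balanced normalization can be implemented in PyTorch follows; this is one plausible reading of the technique under DDP-style training, not TII's actual code.

```python
# Sketch: normalize the summed token loss by the GLOBAL token count across all
# data-parallel ranks, so short-sequence ranks do not over-contribute.
import torch
import torch.distributed as dist

def token_balanced_loss(per_token_loss: torch.Tensor,
                        loss_mask: torch.Tensor) -> torch.Tensor:
    """per_token_loss, loss_mask: (batch, seq_len); mask is 1 for real tokens."""
    local_sum = (per_token_loss * loss_mask).sum()
    global_tokens = loss_mask.sum()

    # Count contributing tokens across every data-parallel rank, not just ours.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(global_tokens, op=dist.ReduceOp.SUM)
        world_size = dist.get_world_size()
    else:
        world_size = 1

    # DDP averages gradients over ranks, so multiplying by world_size makes the
    # effective loss sum(all token losses) / sum(all tokens): every token then
    # contributes equally, regardless of how long the sequences on its rank were.
    return local_sum / global_tokens * world_size
```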
Step 2: Reinforcement learning via Group Relative Policy Optimization (GRPO). Following SFT, the model was refined with GRPO, a reinforcement learning algorithm that rewards correct answers without requiring a separate value model.
The "no-KL" change: Unlike standard RLHF, TII removed the KL divergence penalty entirely (beta = 0). This allowed the model to deviate significantly from its base SFT policy, encouraging aggressive exploration of new reasoning paths.
Math-only curriculum: Surprisingly, TII found that training exclusively on math problems during the RL phase led to better generalization across domains, including coding and science, than mixed strategies. Ablations showed that "code only" training improved coding scores but hurt general reasoning, while math-focused RL improved performance across the board.
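The shape of that update can be sketched as follows; the group-normalized advantages and the clipped surrogate are standard GRPO ingredients, while the reward and log-probability inputs are schematic assumptions rather than TII's implementation.

```python
# Schematic GRPO update with the KL penalty removed (beta = 0).
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for G completions of one prompt.
    GRPO normalizes within the group instead of using a learned value model."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate per completion, with no KL term (beta = 0)."""
    ratio = torch.exp(logprobs_new - logprobs_old)   # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # A standard GRPO objective would add `beta * KL(policy || reference)`;
    # with beta = 0 that term vanishes and the policy is free to drift.
    return -torch.min(unclipped, clipped).mean()
```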
TII optimized the model specifically for Test-Time Scaling (TTS), a technique in which a model generates multiple reasoning paths in parallel to find the best solution.
The model uses Deep Think with Confidence (DeepConf), which leverages the model’s internal confidence scores to dynamically prune low-quality reasoning traces.
Adaptive sampling: During generation, the system runs a "warm-up" phase of 16 traces to establish a confidence baseline. It then aggressively filters subsequent traces, terminating any chain whose confidence falls below the 10th percentile of that baseline.
Efficiency gains: This method creates a new Pareto frontier for deployment. In benchmark testing, the Falcon H1R 7B achieved 96.7% accuracy on AIME 25 while using 38% fewer tokens than the DeepSeek-R1-0528-Qwen3-8B baseline.
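A simplified, hypothetical version of this confidence-gated sampling loop is sketched below; the generate_trace and trace_confidence helpers are placeholders, not part of any published DeepConf API.

```python
# Hypothetical sketch of DeepConf-style pruning: 16 warm-up traces set a
# confidence floor, later traces are cut off when they fall below it.
import numpy as np

def deepconf_style_vote(generate_trace, trace_confidence,
                        n_warmup=16, n_extra=48, percentile=10):
    """generate_trace() returns a finished reasoning trace (or None if aborted
    early); trace_confidence(t) returns a scalar confidence score."""
    # Phase 1: run the warm-up traces to completion and set a confidence floor.
    warmup = [generate_trace() for _ in range(n_warmup)]
    floor = np.percentile([trace_confidence(t) for t in warmup], percentile)

    # Phase 2: stream additional traces, terminating any chain whose running
    # confidence drops below the floor (this is where tokens are saved).
    kept = list(warmup)
    for _ in range(n_extra):
        trace = generate_trace(stop_below=floor)   # assumed early-exit hook
        if trace is not None and trace_confidence(trace) >= floor:
            kept.append(trace)

    # The final answer would come from a (confidence-weighted) vote over `kept`.
    return kept
```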
TII released the Falcon H1R 7B under the custom Falcon LLM 1.0 License, which is based on Apache 2.0 but adds notable conditions, chiefly that users must always credit TII and must not initiate patent litigation against it.
For developers and startups, the license is largely permissive:
Royalty free: Users can run, modify, and distribute the model commercially without paying TII.
Attribution: Any derivative work (including modified versions) must clearly state: "[Name of work] is built using Falcon LLM technology from the Technology Innovation Institute".
However, unlike a pure Open Source Initiative (OSI) license, the Falcon license includes a strict acceptable use policy (AUP).
The license terminates automatically if the model is used to create a work inconsistent with the AUP or if the user initiates patent litigation against TII.
More specifically, the AUP prohibits the use of the Falcon H1R 7B or its derivatives for:
Violation of laws: any use that violates applicable national, federal, state, local, or international law or regulation.
Harm to minors or living beings: exploiting, harming, or attempting to exploit or harm minors or other living beings.
Disinformation: generating or disseminating verifiably false information with the intent of harming others.
Harassment: defaming, disparaging, or otherwise harassing others.
TII is not alone in betting on this hybrid future; the industry is increasingly moving toward architectures that combine the strengths of SSMs and Transformers.
Nvidia recently launched its Nemotron 3 family on December 15, 2025, which uses a hybrid mixture-of-experts (MoE) Mamba-Transformer design to power efficient agentic AI.
IBM launched its Granite 4.0 family on October 2, 2025, using a hybrid Mamba-Transformer architecture to reduce memory requirements by more than 70% while maintaining high performance on enterprise benchmarks.
AI21 has followed the same path with its Jamba models (Joint Attention and Mamba), launching the Jamba 1.5 family on August 22, 2024, to strengthen agentic AI capabilities through a hybrid SSM-Transformer approach.
Mistral entered the space early with Codestral Mamba on July 16, 2024, a model specifically optimized for faster, longer code generation.
The Falcon H1R 7B represents the latest evolution of this trend, specifically targeting dense reasoning tasks in a compact form factor.