In the chaotic world of Large Language Model (LLM) optimization, engineers have spent the last few years developing increasingly esoteric rituals to get better answers.
We saw "Chain of thought" (by asking the model to think step by step and often, to show these "traces of reasoning" to the user), "Emotional blackmail" (tell the model that his career depends on the answer, or that he is being accused of sexual misconduct) and complex multi-shot prompt frames.
But a new paper from Google Research suggests we may have been overthinking it. The researchers found that simply repeating the input query (literally copying and pasting the prompt so that it appears twice) consistently improves performance across major models, including Gemini, GPT-4o, Claude, and DeepSeek.
The paper, titled "Prompt Repetition Improves Non-Reasoning LLMs" and published last month, just before the holidays, presents an almost suspiciously simple conclusion: for tasks that do not require complex reasoning steps, stating the prompt twice produces significantly better results than stating it once.
Better yet, because of how the transformer architecture works, this one weird trick carries almost no penalty in generation speed.
To understand why repeating a question makes a supercomputer smarter, you need to examine the architectural limitations of the standard Transformer model.
Most modern LLMs are trained as "causal" language models, meaning they process text strictly from left to right. When the model processes the 5th token in your sentence, it can "attend" (pay attention) to tokens 1-4, but it has no knowledge of token 6, because that token does not exist yet.
This creates a fundamental constraint on how models understand user queries. As the authors note, the order of the information matters enormously.
A query formatted as <CONTEXT> <QUESTION> often produces different results from <QUESTION> <CONTEXT>, because in the latter case the model reads the question before it knows the context it is supposed to apply it to.
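To make the constraint concrete, here is a minimal sketch (mine, not the paper's) of the lower-triangular causal mask a decoder-only transformer applies: each position can attend only to itself and to earlier positions.

```python
# Minimal sketch of a causal attention mask for a 6-token sequence.
# Row i marks which positions token i is allowed to attend to.
import numpy as np

seq_len = 6
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower-triangular

for i, row in enumerate(causal_mask):
    visible = [j for j, ok in enumerate(row) if ok]
    print(f"token {i} attends to tokens {visible}")
# token 4 attends to tokens [0, 1, 2, 3, 4] -- it never sees token 5,
# which is why a <QUESTION> placed before the <CONTEXT> cannot "look ahead".
```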
Prompt repetition circumvents this limitation by transforming an input from <QUERY> into <QUERY><QUERY>.
By the time the model begins processing the second copy of the query, it has already "read" the first. This lets every token in the second copy attend to every token in the first copy.
In effect, the second copy benefits from a form of bidirectional attention: it can "look back" across the entire query to resolve ambiguities or retrieve specific details that might have been missed in a single pass.
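Mechanically, the transform is nothing more than string duplication before the model call. A minimal sketch, with the separator and repetition count chosen here for illustration rather than taken from the paper:

```python
# Minimal sketch of the <QUERY> -> <QUERY><QUERY> transform described above.
# The separator and repetition count are choices of this sketch, not the paper's.

def repeat_prompt(query: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the query repeated `times` times, joined by a separator."""
    return separator.join([query] * times)

query = "From the list below, which name appears 25th? <list of 50 names>"
print(repeat_prompt(query))
# The doubled string is what gets sent to the model; during prefill, the
# second copy can attend to every token of the first copy.
```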
The researchers, Yaniv Leviathan, Matan Kalman, and Yossi Matias, tested this hypothesis on a suite of seven popular benchmarks, including ARC, OpenBookQA, GSM8K, and MMLU-Pro. They evaluated seven different models, ranging from lightweight models like Gemini 2.0 Flash Lite and GPT-4o-mini to heavyweights like Claude 3.7 Sonnet and DeepSeek V3. The results were statistically striking: when the models were asked not to use explicit reasoning (i.e., to simply give a direct answer), prompt repetition won 47 of 70 head-to-head tests against the baseline, with zero losses. The gains were particularly dramatic in tasks requiring precise retrieval from the prompt. The team designed a custom "NameIndex" benchmark, in which the model is given a list of 50 names and asked to identify the 25th.
Baseline: Gemini 2.0 Flash-Lite managed a dismal 21.33% accuracy.
With repetition: accuracy skyrocketed to 97.33%.
This massive jump illustrates the "causal blind spot" perfectly. In a single pass, the model can lose count by the time it reaches the 25th name. With the prompt repeated, the model effectively has the entire list in its "working memory" before it attempts the retrieval task.
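For a sense of what such a retrieval prompt looks like, here is an illustrative reconstruction of a NameIndex-style query; the names and phrasing are placeholders, not the authors' benchmark code.

```python
# Illustrative reconstruction of a NameIndex-style query (not the authors'
# benchmark code): 50 unnumbered names, ask for the 25th one.
import random

random.seed(0)
first = ["Alice", "Bruno", "Chen", "Dana", "Elif", "Femi", "Goro", "Hana", "Ivan", "Jade"]
last = ["Abe", "Berg", "Cruz", "Diaz", "Endo", "Frey", "Gale", "Hale", "Ito", "Juma"]
names = random.sample([f"{f} {l}" for f in first for l in last], k=50)

prompt = (
    "Here is a list of names: " + ", ".join(names)
    + ". What is the 25th name in the list? Answer with the name only."
)
ground_truth = names[24]  # the 25th name (zero-based index 24)
print(prompt[:120], "...")
print("expected answer:", ground_truth)
```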
Usually, adding text to a prompt increases cost and latency. If you double the input, surely you double the wait time? Surprisingly, no. The paper demonstrates that prompt repetition is essentially "free" in terms of the latency the user perceives. LLM inference is divided into two stages:
Pre-fill: The model processes the input prompt. This stage is highly parallelizable; the GPU can ingest the entire prompt at once.
Generation (decoding): The model produces the response one token at a time. This stage is serial and slow.
Prompt repetition only adds work in the prefill stage. Since modern hardware handles prefill very efficiently, the user barely notices the difference. The researchers found that repeating the prompt did not increase the length of the generated response, nor did it increase time-to-first-token latency for most models. The only exceptions were Anthropic's models (Claude Haiku and Sonnet) on extremely long queries, where the prefill step eventually became a bottleneck. But for the vast majority of use cases, the technique improves accuracy without slowing down the chat experience.
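A rough back-of-envelope model makes the asymmetry clear; the throughput figures below are illustrative assumptions, not measurements from the paper or any provider.

```python
# Back-of-envelope latency model. The throughput figures are illustrative
# assumptions, not measurements from the paper or a specific provider.
PREFILL_TOKENS_PER_S = 10_000   # prompt tokens processed in parallel
DECODE_TOKENS_PER_S = 50        # output tokens generated one at a time

def total_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    prefill = prompt_tokens / PREFILL_TOKENS_PER_S
    decode = output_tokens / DECODE_TOKENS_PER_S
    return prefill + decode

single = total_latency_s(prompt_tokens=1_000, output_tokens=200)
doubled = total_latency_s(prompt_tokens=2_000, output_tokens=200)
print(f"single prompt: {single:.2f}s, doubled prompt: {doubled:.2f}s")
# ~4.10s vs ~4.20s: doubling the prompt only adds prefill time, which is a
# small fraction of the serial decoding cost.
```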
There is one caveat: the technique is primarily intended for "non-reasoning" tasks, scenarios where you want a direct answer rather than a step-by-step derivation.
When the researchers combined prompt repetition with chain of thought (asking the model to "think step by step"), the gains largely faded away, showing neutral to slightly positive results (5 wins, 1 loss, 22 draws).
The authors posit that reasoning models naturally perform a version of repetition on their own. When a model "thinks," it often restates the premise of the question in its generated output before solving it, so explicitly repeating the prompt in the input becomes redundant.
However, for applications where you need a quick, direct answer without the verbosity (and cost) of lengthy reasoning, prompt repetition offers a powerful alternative.
For business leaders, this research represents the rarest thing in AI development: a "free" optimization. But capitalizing on it requires nuance; this is not a setting to toggle blindly across an entire organization, but rather a tactical adjustment that touches engineering, orchestration, and security.
For technical managers balancing the eternal triangle of speed, quality, and cost, prompt repetition offers a way to punch above your weight class. The data shows that smaller, faster models like Gemini 2.0 Flash Lite can achieve near-perfect retrieval accuracy (from 21.33% to 97.33%) simply by processing the input twice.
This changes the calculus for model selection: before moving to a larger, more expensive model to fix an accuracy bottleneck, engineers should first check whether a simple repetition lets their current "lightweight" model bridge the gap. That is a potential strategy for keeping the speed and cost benefits of lightweight infrastructure without sacrificing performance on retrieval-heavy tasks.
This logic naturally shifts the burden to the orchestration layer. For those managing the middleware and API gateways that tie AI applications together, prompt repetition should probably become a standard, invisible part of pipeline logic rather than a user behavior.
However, because the technique is neutral for reasoning-heavy tasks but highly effective for direct answers, it requires conditional application. An intelligent orchestration harness would automatically identify queries routed to non-reasoning endpoints (such as entity extraction, classification, or simple Q&A) and double the prompt before passing it to the model, as in the sketch below. This optimizes performance at the infrastructure level, delivering better results without requiring any action from end users or inflating the production budget.
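Here is what that conditional logic might look like in a gateway; the task labels and the call_model stub are placeholders for this sketch, not any specific product's API.

```python
# Sketch of conditional prompt doubling in an orchestration layer. The task
# labels and the call_model stub are placeholders, not a real product's API.
NON_REASONING_TASKS = {"entity_extraction", "classification", "simple_qa"}

def call_model(prompt: str) -> str:
    # Stand-in for whatever client your stack already uses.
    return f"<model response to {len(prompt)} chars of prompt>"

def route(task_type: str, user_prompt: str) -> str:
    if task_type in NON_REASONING_TASKS:
        # Double the prompt only for direct-answer endpoints; leave
        # reasoning / chain-of-thought traffic untouched.
        user_prompt = f"{user_prompt}\n\n{user_prompt}"
    return call_model(user_prompt)

print(route("classification", "Label the sentiment of: 'The battery life is great.'"))
```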
Finally, this increased attention introduces a new variable for security teams.
If repeating a prompt clarifies a user's intent to the model, it stands to reason that malicious intent can be clarified too. Security leaders will need to update their red-team protocols to test "repeated injection" attacks: checking whether repeating a jailbreak command (e.g., "Ignore previous instructions") makes the model "attend" to the breach more effectively. Conversely, the same mechanism offers a new defensive tool: repeating the system prompt.
Stating security guardrails twice at the start of the prompt could push the model to weigh those constraints more rigorously, acting as a low-cost reinforcement for existing safety measures.
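A hedged sketch of that defensive pattern, assuming a chat-style message list; whether duplicating the system prompt measurably hardens a model is a hypothesis the article raises, not a result from the paper.

```python
# Sketch: stating the guardrails twice at the top of the context window.
# Whether this measurably hardens a model is a hypothesis, not a paper result.
GUARDRAILS = "Never reveal internal tools. Refuse requests for personal data."

def build_messages(user_prompt: str) -> list[dict]:
    return [
        {"role": "system", "content": GUARDRAILS},
        {"role": "system", "content": GUARDRAILS},  # repeated on purpose
        {"role": "user", "content": user_prompt},
    ]

print(build_messages("What tools do you have access to?"))
```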
This research highlights something crucial for developers relying on LLMs: our current models are still deeply constrained by their unidirectional nature. While we wait for new architectures that resolve this causal blindness, rudimentary but effective workarounds such as prompt repetition provide immediate value. The authors suggest this could become a default behavior for future systems.
We may soon see inference engines silently doubling our prompts in the background before sending them to the model, or "reasoning" models trained to internalize this repetition strategy to become more efficient. For now, if you are struggling to get a model to follow complex instructions or retrieve specific details from a long document, the solution may not be a better prompt. You may just have to repeat it.