
DeepSeek may have used Google’s Gemini to train its latest model


Last week, the Chinese lab DeepSeek released an updated version of its R1 reasoning AI model that performs well on a number of math and coding benchmarks. The company did not reveal the source of the data it used to train the model, but some AI researchers speculate that at least a portion came from Google's Gemini family.

Sam Paech, a Melbourne-based developer who creates emotional intelligence evaluations for AI, published what he claims is evidence that DeepSeek's latest model was trained on Gemini outputs. The DeepSeek model, called R1-0528, prefers words and expressions similar to those that Google's Gemini 2.5 Pro favors, Paech said in a post on X.
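Comparisons like Paech's typically rest on stylometry: measuring how often two models favor the same words and phrasings across matched prompts. The sketch below illustrates that general idea only; it is not Paech's actual methodology, and the sample outputs, the whitespace tokenizer, and the cosine-similarity measure are all assumptions for demonstration.

```python
from collections import Counter
import math

def word_freqs(texts):
    """Aggregate lowercase word counts across a list of model outputs."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy samples; a real analysis would compare thousands of responses
# from each model to the same prompts.
model_a_outputs = ["The key insight is that this approach scales elegantly."]
model_b_outputs = ["The key insight here is that the method scales elegantly."]

score = cosine_similarity(word_freqs(model_a_outputs),
                          word_freqs(model_b_outputs))
print(f"Lexical overlap score: {score:.3f}")
```

A high overlap between two models means little on its own, which is why this kind of observation reads as suggestive rather than conclusive.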

It is not a smoking gun. But another developer, the pseudonymous creator of a "free speech eval" for AI called SpeechMap, noted that the DeepSeek model's traces (the "thoughts" the model generates as it works toward a conclusion) "read like Gemini traces."

DeepSeek has been accused of training on data from rival AI models before. In December, developers observed that DeepSeek's V3 model often identified itself as ChatGPT, OpenAI's chatbot, suggesting it may have been trained on ChatGPT chat logs.

Earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to the use of distillation, a technique for training AI models by extracting data from bigger, more capable ones. According to Bloomberg, Microsoft, a close OpenAI collaborator and investor, detected large amounts of data being exfiltrated through OpenAI developer accounts in late 2024, accounts that OpenAI believes are affiliated with DeepSeek.

Distillation is not an uncommon practice, but OpenAI's terms of service prohibit customers from using the company's model outputs to build competing AI.
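For context, the API-based form of distillation described here amounts to sampling a strong "teacher" model and fine-tuning a smaller "student" on the resulting prompt-response pairs. Below is a minimal sketch of the data-collection step, assuming the official openai Python SDK; the prompt list, the choice of teacher model, and the output filename are hypothetical placeholders, not anything DeepSeek is confirmed to have done.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt set; a real pipeline would use a large,
# diverse corpus of prompts.
prompts = [
    "Explain why the square root of 2 is irrational.",
    "Write a Python function that merges two sorted lists.",
]

# Record the teacher's responses as (prompt, completion) training pairs.
with open("synthetic_data.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o",  # hypothetical choice of teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        pair = {
            "prompt": prompt,
            "completion": response.choices[0].message.content,
        }
        f.write(json.dumps(pair) + "\n")

# The resulting JSONL would then be used to fine-tune the student model,
# which is exactly the use OpenAI's terms forbid for competing AI.
```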

To be clear, many models misidentify themselves and converge on the same words and turns of phrase. That is because the open web is increasingly littered with AI slop: content farms use AI to churn out clickbait, and bots flood Reddit and X.

This "contamination," if you will, has made it quite difficult to thoroughly filter AI outputs from training datasets.

Still, AI experts like Nathan Lambert, a researcher at the nonprofit AI research institute AI2, do not think it is out of the question that DeepSeek trained on data from Google's Gemini.

"If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there," Lambert wrote in a post on X. "[DeepSeek is] short on GPUs and flush with cash. It's literally effectively more compute for them."

Partly in an effort to prevent distillation, AI companies have been ramping up security measures.

In April, OpenAI began requiring organizations to complete an ID verification process in order to access certain advanced models. The process requires a government-issued ID from one of the countries supported by OpenAI's API; China is not on the list.

Elsewhere, Google recently began "summarizing" the traces generated by models available through its AI Studio developer platform, a step that makes it more difficult to train performant rival models on Gemini traces. In May, Anthropic said it would start summarizing its own model's traces, citing a need to protect its "competitive advantages."

We have reached out to Google for comment and will update this piece if we hear back.




