It turns out you can train AI models without copyrighted material


AI companies claim their tools could not exist without training on copyrighted material. It turns out that they could, it’s just really hard. To prove it, AI researchers have trained a new model that is less powerful but much more ethical. That is because the LLM’s data set uses only public domain and openly licensed material.

The paper (via The Washington Post) was a collaboration between 14 different institutions. The authors represent universities such as MIT, Carnegie Mellon and the University of Toronto. Nonprofit organizations such as the Vector Institute and the Allen Institute for AI also contributed.

The group built an ethically sourced data set of 8 TB. Among the data was a set of 130,000 books in the Library of Congress. After inputting the material, they trained a seven-billion-parameter large language model (LLM) on this data. The result? It performed about as well as Meta’s similarly sized Llama 2-7B from 2023. The team did not publish benchmarks comparing its results to today’s top models.

Performance comparable to a two-year-old model was not the only drawback. The process of putting it all together was also a grind. A large part of the data could not be read by machines, so humans had to sift through it. “We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people,” co-author Stella Biderman told WaPo. “And that’s just really hard.” Figuring out the legal details also made the process difficult. The team had to determine which license applied to each website they scanned.

So what do you do with a less powerful LLM that is much harder to train? If nothing else, it can serve as a counterpoint.

In 2024, OpenAI told a British parliamentary committee that such a model essentially could not exist. The company said that it would be “impossible to train today’s leading AI models without using copyrighted materials.” Last year, an Anthropic expert witness added: “LLMs would likely not exist if AI firms were required to license the works in their training data sets.”

Of course, this study will not change the trajectory of AI companies. After all, more work to create less powerful tools does not align with their interests. But at least it punctures one of the industry’s common arguments. Don’t be surprised if you hear about this study again in legal cases and regulatory arguments.


