Wordfreq, a project designed to track the evolution of language usage in more than 40 languages, has been shut down in recent weeks, because the proliferation of content generated by AI language models over the past three years has compromised the data on which its research depends.
The project’s creator, Robyn Speer, announced on GitHub that Wordfreq would be abandoned because of the “pollution” of information caused by generative artificial intelligence. “I don’t think anyone has reliable information about how humans will use language after 2021,” Speer commented.
Wordfreq has been a valuable resource for academics and researchers for years. The system has analyzed millions of sources, including Wikipedia, movie and TV show subtitles, news articles, books, websites, Twitter, and Reddit, providing a detailed look at how language evolves: the emergence of new usages and the decline of old ones, the spread of new expressions and colloquial structures, and the reflection of cultural change in the way we communicate.
In freely scanning the web, Wordfreq has encountered a significant amount of “unhelpful” content over the last two years: real waste generated by large language models, text that was never actually written to communicate anything. Collecting this data compromises the reliability of word-frequency measurements. Moreover, this content is now everywhere online, effectively mimics real language, and is difficult to recognize and filter out. It is a completely different problem from spam, which has always been present on the web but in smaller amounts than genuine content, and which is much more easily identifiable.
Speer gave the example of ChatGPT's overuse of the English word “delve” (to investigate or research), which does not reflect people’s actual usage of that word. Nevertheless, it has changed the recorded frequency of that word, effectively polluting the data. Interestingly, the overuse of certain words is a phenomenon that a separate academic study has analyzed as a way to determine whether a text was written using generative AI.
The proliferation of AI has also brought a series of practical problems for the Wordfreq project: the tools the project uses to read large amounts of content are similar to those used by AI companies to train their language models. This has led to a certain amount of mistrust on the part of authors and content creators, who, when faced with a tool that actively collects text from books, articles, websites or publications, tend to think, quite understandably, that on the other end there is someone training a “copycat” AI, perhaps even for profit. The immediate consequence is difficulty accessing content sources, as many organizations begin to raise barriers to collecting data at scale, often by charging for access.
The Wordfreq creator ended her announcement with some bitterness, expressing her disappointment with the big tech companies involved in developing AI and emphasizing her desire to keep her research work from being confused in any way with the training of large language models.