Researchers have warned that AI systems could begin producing increasingly degraded and misleading content as the internet becomes flooded with AI-generated material.
The growing popularity of ChatGPT and similar tools has led many people to publish blog posts and other content created by these systems. Because the systems are trained on existing text scraped from the internet, this could create a feedback loop in which AI models learn from content that earlier AI models generated. The researchers’ warnings feed into a broader concern known as the “dead internet theory,” which holds that an ever-larger share of the internet is automated content, a self-consuming cycle sometimes likened to an ouroboros.
According to The Independent, the researchers “found that one system tested with text about medieval architecture only needed nine generations before the output was just a repetitive list of jackrabbits, for instance.”
“The issue needs to be taken seriously if the benefits of scraping large-scale data from the internet for training are to be sustained,” the researchers warn.
“A good analogy for this is when you take a photocopy of a piece of paper, and then you photocopy the photocopy – you start seeing more and more artifacts,” says Nicolas Papernot, an assistant professor at the University of Toronto and co-author of the paper “The Curse of Recursion: Training on Generated Data Makes Models Forget.” “Eventually, if you repeat that process many, many times, you will lose most of what was contained in that original piece of paper.”
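The dynamic behind that analogy can be reproduced in miniature. The Python sketch below is our illustration, not code from the paper: it treats simple word frequencies as the “model,” and each generation is trained only on text sampled from the previous generation. Rare words drift out of the distribution until a handful of common ones dominate, much like the repetitive output the researchers describe.

```python
import collections
import random

random.seed(42)

# "Human" corpus: 20 distinct words with uneven frequencies,
# so some words start out rare.
vocab = [f"word{i:02d}" for i in range(20)]
corpus = random.choices(vocab, weights=range(1, 21), k=300)

for generation in range(31):
    counts = collections.Counter(corpus)
    if generation % 5 == 0:
        print(f"gen {generation:2d}: {len(counts)} distinct words, "
              f"top 3: {[w for w, _ in counts.most_common(3)]}")
    # "Train" a unigram model (here, just the empirical word
    # frequencies), then generate the next corpus entirely from
    # that model's own output.
    words = list(counts)
    weights = [counts[w] for w in words]
    corpus = random.choices(words, weights=weights, k=300)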
“What we’re seeing in the paper is, essentially, that right now there is a fundamental issue with the way that models are trained, and that we won’t be able to rely so heavily on data from the internet to continue scaling the training of these models,” Papernot adds.