The headline. “The internet is full of AI slop, therefore future models will be stupid”
The truth. The open web can rot and still not be the source that matters. Model labs source data directly, collect feedback from their own (soon billions of) authenticated users, hire large teams to produce data and judge its cleanliness, generate controlled tasks, run evals, and build filters around the data they keep. What started as a giant scrape of the internet is, at the frontier, nothing like that anymore. It is an enormous industrial curation process, and “the open internet” is hardly involved in training new models.
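To make "filters around the data they keep" concrete, here is a toy sketch of stacked curation stages. This is purely illustrative, not any lab's actual pipeline; the heuristics (minimum length, link count) are invented for the example.

```python
# Toy curation pipeline: each stage shrinks the pool, and only
# survivors of every stage would ever reach training.
# The thresholds below are made-up illustrations, not real values.

def dedupe(samples):
    """Drop exact duplicates, keeping the first occurrence."""
    seen, kept = set(), []
    for s in samples:
        if s not in seen:
            seen.add(s)
            kept.append(s)
    return kept

def quality_filter(samples, min_len=20, max_links=2):
    """Hypothetical heuristic: discard very short or link-heavy text."""
    return [s for s in samples
            if len(s) >= min_len and s.count("http") <= max_links]

def curate(raw):
    """Chain the stages; real pipelines stack many more of these."""
    return quality_filter(dedupe(raw))

raw = [
    "Buy now http://a http://b http://c",  # spammy: too many links
    "A long, carefully written explanation of gradient descent.",
    "A long, carefully written explanation of gradient descent.",  # duplicate
    "hi",  # too short
]
print(curate(raw))  # only the one clean sample survives
```

The point of the sketch is structural: scraped volume is cheap, but every kept sample has to pass a series of gates, which is where the curation effort actually lives.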
Frontier AI providers no longer need to scrape the internet for data. The internet came to them.
That distinction matters for poisoned samples too. An outsider cannot simply “destroy” OpenAI’s dataset by publishing bad examples and hoping they get scraped. The samples have to be collected, survive filtering, matter enough to shift behavior, and remain undetected. Even then, adversarial examples are not pure contamination. Once detected, they become valuable and rare training material, precisely because they are hard to produce. Hackers who hand OpenAI such adversarial examples for free are doing it a favor.
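The gauntlet a poisoned sample must run can be sketched as a chain of conditions, including the twist that a caught sample gets reused rather than merely discarded. This is a toy model of the argument, not a description of any real detection system.

```python
# Toy model of the argument above: a poisoned sample only does damage
# if it clears every gate; if it is caught, it flips into an asset.

def poisoned_sample_outcome(collected, passes_filters, detected):
    if not collected:
        return "never enters the pipeline"
    if not passes_filters:
        return "dropped by curation filters"
    if detected:
        # A caught adversarial example is rare, already-labeled data,
        # reusable as a hard negative for future training.
        return "rerouted as labeled adversarial training data"
    return "reaches training undetected (the rare worst case)"

print(poisoned_sample_outcome(collected=True,
                              passes_filters=True,
                              detected=True))
```

Only one path through the chain harms the model; every other path either neutralizes the sample or turns it into training material.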
The headline is not “AI makes AGI impossible by ruining the internet.” The headline is “Humans are making parts of the internet less useful for humans.” That is the heart of the problem. Search quality, provenance, social trust, and online writing all get worse when humans flood the channel with generated output and poorly written bots. But that’s on the humans.
Still, a broken public internet does not automatically block intelligence. If humans can still be generally intelligent while navigating bad information environments, any other intelligent system can do it, too. A human doesn’t become less intelligent just because the internet is broken, and the same goes for AI. It just gets harder for everyone to find good information.