If generative AI is trained on the web, what happens when the web is full of AI text?

Covering culture, tech, luxury, travel, and media in 5 minutes, twice a week


As Pharrell Williams puts it, “this is not working. This is a dream.” His first Louis Vuitton menswear show is expected to be, in terms of attendance and square footage, the brand’s largest ever menswear show. (Julia Marino / British Vogue)

🔼 Berkshire Hathaway’s Japanese holdings. Warren Buffett’s firm bought more shares of Japan’s five biggest trading houses (conglomerates). Not coincidentally, Japan’s stock market is at a multi-year high »»

🔽 Traffic to various news and media websites. Many sites are reporting significantly lower traffic, dating back to February. Some point to a mysterious change in Facebook’s algorithm »»

💬 “If it is true I think it is a large part of the problem the industry has in general.” Cannes Lions jurors were reportedly told to dial down the politics, as they decide which ad campaigns deserve an annual award »»

🛫 5 things you should ask every time you check into a hotel »»

👗 Peek inside Louis Vuitton’s studio with Pharrell Williams, ahead of his menswear debut tonight »» More: A timeline of Pharrell’s fashion ascendance »»

💎 Piaget just unveiled a new limited edition Polo Field watch in dark emerald green. It’ll cost you just north of US$13,500 »»


Japanese mini trucks, called Kei trucks, are affordable, go-anywhere, 4×4 machines that are exploding in popularity across the English speaking world (The Rag Company / YouTube)

Heard of rich dad energy? Here’s how to spot menswear’s latest swagger signal »»

India’s IndiGo airline just placed the largest plane order in history. Airbus is celebrating »»

France is positioning itself as Europe’s AI capital »»

Domino’s is introducing “Pinpoint Delivery”. People in the US can now order a pizza to places without an address by adding a pin to a map »»

Inside the Latin American company teaching influencers how to get rich without going viral »»

A seed bank in Taiwan is home to more chili varieties than anywhere else on earth. In a warming world, we’re going to need them »»

The UK is a hot country. Time to build like it »»

How to tell when you’re getting good advice »»

These tiny Japanese pick-up trucks cost US$5,000 and used versions are winning serious fans in America »»

Chick-fil-A is testing a new grilled chicken sandwich »»

Nobu Hospitality signed a deal with a Thai real estate group to open a “Plaza Athenee Nobu Hotel and Spa” in Bangkok, and another in New York »»

That 360-degree camera your favorite YouTubers are using? It's worth it »»

BlackRock may have found a way to get SEC approval for a spot Bitcoin ETF »»

Publicly traded bottling company Coca-Cola Hellenic is buying the Finlandia vodka brand from current owners Brown-Forman (Jack Daniel's) »»

Yikes. Users have discovered a weird ChatGPT jailbreak that will “trick” the generative AI chatbot into returning things it shouldn’t, like mobile phone IMEI numbers »»

Two foods can significantly reduce jet lag, says Qantas »»

Attention SaaS people: a new-ish app called Olvy promises to curate customer feedback from across the web and summarize it for you, so you can generate support tickets and action fixes faster »»

A new startup called Rever has an interesting twist on the problem of e-commerce returns: they’ll offer you an instant cash refund through a buy now, pay later-style (BNPL) model. Rever pays, automates the processes of label generation and refunds —and also provides retailers with analytics on customer behavior and purchasing trends »»


If generative AI is trained on the Internet, what happens when the Internet is full of AI generated text?

Human writers wanted (Dall-E)


A new scientific paper attempts to prove a theory this newsletter has shared in the past:

ChatGPT, and generative AI models like it, is destined to eat its own tail.

As future versions of generative AI are trained on AI-generated content, their source material will degrade. Over the course of several generations, this will turn the AI-generated answers into nonsense and render the entire product unusable.

Wait, what? Really?

Yep. And the scientists have a name for it too. They call the phenomenon “model collapse,” and the scene they paint is chilling.


“Within a few generations, text becomes garbage.”

First, a quick sidebar: as I’ve written, generative AI is a computational system. AI is not intelligent. But it is a math whiz. See, AI chatbots like ChatGPT use statistics and probability to help them decide which words to string together, and in which order, as they write answers to your queries.

The math behind model collapse looks super complex to me, a non-scientist. So, here’s a short version of the paper’s conclusion: after a model is trained on its own output for a few generations, the probabilities it assigns to the next word (as it composes a sentence in answer to a query) drift away from the original, human-written distribution. Rare words and phrasings drop out first, and the output narrows toward a handful of likely choices.

This turns the text it generates into nonsense. In an example they share, an input about ancient masonry techniques and parish church tower designs in medieval England returns word salad about various jackrabbit populations.

That’s “model collapse.”
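
For the curious, here is a toy sketch of that feedback loop in Python. It is not the paper's experiment, just an illustration under loud assumptions: the "model" is nothing more than a table of word frequencies, the 50-word vocabulary and its Zipf-like weights are made up, and the names `train`, `generate`, and `simulate` are hypothetical. Each generation is trained only on text sampled from the previous generation.

```python
import random
from collections import Counter


def train(corpus):
    """'Train' a toy model: estimate each word's probability from its frequency."""
    counts = Counter(corpus)
    total = len(corpus)
    return {word: n / total for word, n in counts.items()}


def generate(model, n_words, rng):
    """Sample n_words from the toy model's word probabilities."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


def simulate(generations=30, corpus_size=200, seed=0):
    """Return the number of distinct words the model knows at each generation."""
    rng = random.Random(seed)
    # Assumed stand-in for human text: a few common words, a long tail of rare ones.
    vocab = [f"word{i}" for i in range(50)]
    weights = [1 / (i + 1) for i in range(50)]
    corpus = rng.choices(vocab, weights=weights, k=corpus_size)

    history = []
    for _ in range(generations):
        model = train(corpus)
        history.append(len(model))
        # The next generation sees only what this generation wrote.
        corpus = generate(model, corpus_size, rng)
    return history


if __name__ == "__main__":
    print(simulate())
```

The list it prints can only stay flat or shrink: once a rare word fails to show up in one generation's output, no later generation can ever produce it again. That is the "tail" disappearing, in miniature.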


The scientists built some theoretical intuition behind the phenomenon into their paper, and they’ve actually come to the conclusion that model collapse will happen to all learned generative models.

As mentioned, they use a whole lot of math themselves to come to the conclusion that generative AI models will cease to return valuable, comprehensible answers within a small handful of generations. (For context, GPT-3 was released in June 2020. GPT-4 came out in March 2023.)

The paper’s authors say that their work demonstrates that “model collapse” has to be taken seriously —“if we are to sustain the benefits of training from large-scale data scraped from the web,” that is.


Actually some have: the paper’s authors note that so-called “long-term poisoning attacks” on language models aren’t new. (“Long-term poisoning attacks” is really dramatic language that just means deliberately feeding bad or wrong information to language models.)

Long before ChatGPT, Google’s ubiquitous, mysterious search algorithm inspired the creation of “click, content, and troll farms.”

These operations were set up to mislead social media and search engine users by gaming algorithms in order to funnel traffic to specific sites.


Well, the negative effect that these poisoning attacks had on search results (i.e., filling them with clickbait, nonsense, lies, and spam) led to some pretty serious and substantive changes in search engines’ algorithms.

Specifically, Google tweaked its algorithm to downgrade articles it thought were “farmed,” while at the same time putting way more emphasis on content that was produced by trustworthy sources, like educational domains.

(Meanwhile, other, smaller quality- and privacy-obsessed search engines, like DuckDuckGo, removed content farm material from their search engine results pages altogether.)


Well, what’s different today is the scale at which such poisoning can happen. See, once language generation is automated —with Google’s Bard, ChatGPT, and the thousands upon thousands of “layer apps” built upon ChatGPT— anyone and everyone can generate anything and everything, at speed and at scale.

There is, however, a way out.


The scientists have a clear and simple answer to the danger of model collapse: make sure large language models don’t “forget” their original training set (aka the pre-ChatGPT, human-written Internet) and ensure that future generations of large language models are trained on fresh, real-life, human-written text. (Both of these things are easier said than done.)

Bottom line: guess what? As this newsletter has repeatedly predicted, human-written words are about to increase in value, not decrease.
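
Here is a rough Python sketch of the first half of that fix, keeping the original human data in the training mix. Same caveats as any toy: the "model" is just a word-frequency table, the vocabulary is invented, and `fit`, `simulate_with_human_data`, and the `human_share` parameter are hypothetical names for illustration, not anything from the paper.

```python
import random
from collections import Counter


def fit(corpus):
    """Toy 'training': word probabilities estimated from word frequencies."""
    counts = Counter(corpus)
    return {w: n / len(corpus) for w, n in counts.items()}


def simulate_with_human_data(human_share, generations=30, corpus_size=200, seed=0):
    """Track vocabulary size when a share of each training set is original human text."""
    rng = random.Random(seed)
    # Assumed stand-in for the human-written Internet: common words plus a rare tail.
    vocab = [f"word{i}" for i in range(50)]
    weights = [1 / (i + 1) for i in range(50)]
    human_corpus = rng.choices(vocab, weights=weights, k=corpus_size)

    n_human = round(human_share * corpus_size)
    corpus = human_corpus
    history = []
    for _ in range(generations):
        model = fit(corpus)
        history.append(len(model))
        words = list(model)
        probs = [model[w] for w in words]
        # Part of the next training set is machine-written...
        generated = rng.choices(words, weights=probs, k=corpus_size - n_human)
        # ...and part is drawn from the preserved human-written original.
        preserved = rng.sample(human_corpus, k=n_human)
        corpus = preserved + generated
    return history


if __name__ == "__main__":
    print("no human data:  ", simulate_with_human_data(0.0))
    print("half human data:", simulate_with_human_data(0.5))
```

With `human_share=0.0` the vocabulary can only shrink. With a share of the original text mixed back in, words that dropped out of the machine-written portion get a chance to return each generation. How much human data a real model would need is a question this sketch cannot answer.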

Or, as the paper’s authors put it in scientist-speak: “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”


Use ChatGPT while you can. The next version might start returning goop.

Oh, and don’t forget how to write things yourself. Human-written words may yet come from behind for the win.


The paper’s summary »»

The official abstract »»

The full scientific paper »»

A fascinating comment thread on the topic on YC »»

Written by Jon Kallus. Any feedback? Simply reply. Like this? Share it!
