If so, are these programs that claim to ‘poison’ the training datasets effective?

  • FaceDeer@fedia.io · 10 points · 10 hours ago

    Only in trivial cases where the training data isn’t being curated properly. There was a paper done on the subject a few years back where “model collapse” was demonstrated by repeatedly training generation after generation of models on the output of previous generations, and sure enough, the results were bad. This result gets paraded around every once in a while to “prove” that AI is doomed. However, in the real world this is not remotely close to how AI is actually trained. You can prevent model collapse simply by enriching the training data with good data - stuff that is already archived, that can’t be “contaminated.”
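
    A minimal sketch of both effects, using a 1-D Gaussian as a stand-in for a generative model (my toy construction, not the actual protocol from the paper): fit the “model” to data, sample from the fit, refit on the samples, and repeat. Pure self-training tends to collapse; mixing archived real data back in keeps it bounded.

        import numpy as np

        rng = np.random.default_rng(0)
        real_data = rng.normal(0.0, 1.0, size=100_000)  # the archived, "uncontaminated" corpus

        def generations(n_gens, n_train, real_fraction):
            """Fit a Gaussian, sample from the fit, refit on the samples, repeat."""
            mu, sigma = 0.0, 1.0
            for _ in range(n_gens):
                n_real = int(real_fraction * n_train)
                batch = np.concatenate([
                    rng.normal(mu, sigma, size=n_train - n_real),  # previous model's output
                    rng.choice(real_data, size=n_real),            # archived real data
                ])
                mu, sigma = batch.mean(), batch.std()
            return sigma

        # Pure self-training: the fitted spread typically decays across generations.
        print(generations(300, n_train=25, real_fraction=0.0))
        # Enriched with 20% archived data: the spread stays bounded away from zero.
        print(generations(300, n_train=25, real_fraction=0.2))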

    Indeed, the best models these days are trained largely on synthetic data - data that’s been pre-processed by other AIs to turn it into stuff that makes for better training material. For example, a textbook could be processed by an LLM to turn it into a conversation about the information in the textbook, with questions and answers, and the result is training data that produces an AI better at understanding and talking about the content than one that was just fed the raw text.
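
    A sketch of that kind of preprocessing pipeline. Everything here is hypothetical: llm_complete stands in for whatever completion API is available, and the prompt is illustrative, not taken from any actual training recipe.

        import json

        PROMPT = ("Rewrite the following textbook passage as a short dialogue "
                  "of questions and answers covering its key facts:\n\n{passage}")

        def passage_to_qa(passage, llm_complete):
            # llm_complete is assumed to be a callable that takes a prompt
            # string and returns the model's completion as a string.
            dialogue = llm_complete(PROMPT.format(passage=passage))
            # Keep the source text alongside so curation can trace every example.
            return {"source": passage, "dialogue": dialogue}

        def build_dataset(passages, llm_complete, out_path="synthetic_qa.jsonl"):
            with open(out_path, "w") as f:
                for p in passages:
                    f.write(json.dumps(passage_to_qa(p, llm_complete)) + "\n")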

    If so, are these programs that claim to ‘poison’ the training datasets effective?

    This is a separate issue from the usual “model collapse” argument. I assume you’re talking about tools like Nightshade, which claim to embed false patterns into images that cause AIs to miscategorize them. These techniques also only work in a “toy” environment: the adversarial patterns are tailored to affect specific AIs and won’t work on other AIs they weren’t specifically designed for. So, for example, you might “poison” an image so that a classifier based on Dall-E would be confused by it, while a GPT-Image classifier wouldn’t care. The most obvious illustration of this is that humans are a separate lineage of image classifier, and these “poisonings” have no effect on us.
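
    A deliberately stark toy of that model-specificity (my construction, not how Nightshade itself works): craft an FGSM-style perturbation against one linear classifier, then check it against a second, independently trained one. Here each model relies on a different half of the features, so the poison aimed at the first leaves the second untouched.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import SGDClassifier

        X, y = make_classification(n_samples=2000, n_features=100,
                                   n_informative=20, n_redundant=0,
                                   random_state=0)
        # Two "lineages" of classifier, each using a different half of the features.
        A, B = slice(0, 50), slice(50, 100)
        model_a = SGDClassifier(loss="log_loss", random_state=0).fit(X[:, A], y)
        model_b = SGDClassifier(loss="log_loss", random_state=1).fit(X[:, B], y)

        # FGSM-style poison crafted against model A: step each sample against
        # the sign of A's weights, in the direction that flips its label.
        eps = 0.5
        delta = np.zeros_like(X)
        delta[:, A] = eps * np.sign(model_a.coef_) * np.where(y == 1, -1, 1)[:, None]
        X_poisoned = X + delta

        print("A clean:   ", model_a.score(X[:, A], y))
        print("A poisoned:", model_a.score(X_poisoned[:, A], y))  # typically drops sharply
        print("B clean:   ", model_b.score(X[:, B], y))
        print("B poisoned:", model_b.score(X_poisoned[:, B], y))  # unaffected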

    There’s also the added problem that these adversarial patterns tend to be fragile: they break if you resample the image to resize or crop it. Since that’s usually a routine part of preparing training data for an image AI, it may end up making the poison ineffective even for the image AIs it was designed for.
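
    For illustration, a typical preparation step of that sort, using Pillow (the file path, target size, and JPEG quality are placeholders):

        from PIL import Image

        def prepare_for_training(path, size=512):
            img = Image.open(path).convert("RGB")
            # Resampling interpolates pixel values, smearing the high-frequency
            # perturbations that pixel-level poisons depend on.
            img = img.resize((size, size), Image.LANCZOS)
            # Lossy re-encoding quantizes away still more fine detail.
            img.save(path + ".train.jpg", "JPEG", quality=85)
            return img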

    Essentially, all these things are just added background noise of the sort that AI training operations already have mechanisms for dealing with. But they make people feel better, I suppose.

    • Helix 🧬@feddit.org · +2/−2 · 5 hours ago

      understanding

      This single word made me stop reading your text, which started with a somewhat good point about model collapse. LLMs are not “understanding” anything; they’re correlating tokens.

      Apart from this, do you mind sharing a link to the studies about model collapse that you mentioned had methodological errors?

      • FaceDeer@fedia.io · 5 points · 5 hours ago

        Semantic quibbling is one of the least interesting kinds of internet debate, so replace the word “understanding” with whatever word makes you happy. I continued with “and talking about” right afterwards so you can just delete the word entirely and the sentence still works fine. You could have just kept reading.

        Since you didn’t read the rest of my comment, I should note that the rest of it after that sentence is about the other issue that OP raised and not even about model collapse at all.

        Anyway. The article about model collapse that I still see crop up every once in a while is this one. It’s not that it has “methodological errors”, though; it’s just that it uses a very artificial training protocol to illustrate model collapse, one that doesn’t align with how LLMs are actually trained in real life. It’s like demonstrating the effects of inbreeding in animals by crossing brothers and sisters for twenty generations straight - you’ll almost certainly see some strong evidence, but it’s not a pattern of breeding that you’ll actually find in the wild.

        • Helix 🧬@feddit.org · +1/−1 · 35 minutes ago

          Semantic quibbling is one of the least interesting kinds of internet debate

          [xkcd 386, “Duty Calls”: Cueball types at his computer. “Are you coming to bed?” “I can’t. This is important. Someone is WRONG on the Internet.”]

          Why do you engage in it then?

          In my opinion, a debate about the semantics of understanding and intelligence in the context of AI is highly interesting, and a huge issue for worldwide politics and policies, but you do you.

        • Brummbaer@pawb.social · 1 point · 2 hours ago

          If I understand it right, you need to enrich and filter the data with human input so as not to collapse the model.

          Wouldn’t that imply that if the human enrichment emulates AI data too closely, it will still collapse the model, since it’s now just the human filtering that’s mimicking AI data?

      • Iconoclast@feddit.uk · 3 points · 4 hours ago

        We don’t even have a good definition for what “understanding” actually means. It’s like the word “intelligence” - there are dozens of dictionary definitions.

        I find it pretty ridiculous to dismiss a long, well-thought-out piece of writing in its entirety just because one word was used in a way you don’t like. Even if you disagree with how they used the term, you most likely still understand what they meant by it. LLMs aren’t generally intelligent, but they’re also not as dumb as people make them out to be. There’s clearly real information processing happening in the background that produces accurate answers way more often than pure chance would allow.