If so, are these programs that claim to ‘poison’ the training datasets effective?

  • stravanasu@lemmy.ca · 6 points · edited · 2 hours ago

    It is actually not so difficult to see this for yourself in a much simplified setting. You can easily build a “Small Language Model” that extracts correlations between only three consecutive words. There are plenty of short scripts on the web that do this; here and here are examples. The output created by such an SLM can contain remarkably long, grammatical sentences (see the examples in the links above), which is striking given that all it learned was correlations between triplets of words. A rough sketch of such a model is below.
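
    A minimal sketch of such a trigram model, assuming plain whitespace tokenization (the scripts linked above may differ in the details):

    ```python
    # Train: for every pair of consecutive words, remember which words followed it.
    import random
    from collections import defaultdict

    def train(text):
        words = text.split()
        model = defaultdict(list)
        for a, b, c in zip(words, words[1:], words[2:]):
            model[(a, b)].append(c)
        return model

    # Generate: start from a random word pair and repeatedly sample a follower.
    def generate(model, length=50):
        pair = random.choice(list(model))
        out = list(pair)
        for _ in range(length):
            followers = model.get(tuple(out[-2:]))
            if not followers:  # no learned continuation; stop early
                break
            out.append(random.choice(followers))
        return " ".join(out)
    ```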

    Now take a large amount of output from such an SLM and use it to train a second, identical or even better SLM, then check the output generated by this second one. You’ll see that the new output is less coherent than that of the first SLM. Feed the output of the second SLM to a third, and you’ll see even less coherent text coming out. And so on. The short loop below illustrates this generational experiment.
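
    Continuing the sketch above, a minimal illustration of that generational loop (the seed file name is hypothetical):

    ```python
    # Train on a corpus, generate a new corpus from the model, retrain on that,
    # and repeat. The coherence of the generated text tends to drop each generation.
    corpus = open("seed_text.txt").read()  # hypothetical starting corpus
    for generation in range(5):
        model = train(corpus)
        corpus = " ".join(generate(model, length=200) for _ in range(100))
        print(f"generation {generation}: {corpus[:120]}...")
    ```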

  • FaceDeer@fedia.io · 8 points · 8 hours ago

    Only in trivial cases where the training data isn’t being curated properly. There was a paper on the subject a few years back in which “model collapse” was demonstrated by repeatedly training generation after generation of models on the output of previous generations, and sure enough, the results were bad. This result gets paraded around every once in a while to “prove” that AI is doomed. However, in the real world this is not remotely close to how AI is actually trained. You can prevent model collapse simply by enriching the training data with good data - stuff that is already archived, that can’t be “contaminated” - along the lines of the mixing sketch below.
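
    A minimal sketch of that curation idea, with made-up document lists and ratio: keep a fixed floor of archived, known-clean text in every training mix.

    ```python
    import random

    def build_training_mix(archived_docs, scraped_docs, clean_fraction=0.5):
        # Add enough archived documents that they make up `clean_fraction`
        # of the final mix, no matter what the scraped portion contains.
        n_clean = int(len(scraped_docs) * clean_fraction / (1 - clean_fraction))
        mix = scraped_docs + random.sample(archived_docs, min(n_clean, len(archived_docs)))
        random.shuffle(mix)
        return mix
    ```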

    Indeed, the best models these days are trained largely on synthetic data - data that’s been pre-processed by other AIs to turn it into stuff that makes for better training material. For example, a textbook could be processed by an LLM to turn it into a conversation about the information in the textbook, with questions and answers, and the result is training data that yields an AI better at understanding and talking about the content than if it were just fed the raw text.
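
    A hedged sketch of that kind of synthetic-data step; `call_llm` is a placeholder for whatever model API the pipeline actually uses, and the prompt wording is invented:

    ```python
    def make_qa_pairs(chapter_text, call_llm):
        # Ask an existing model to rewrite source text as Q&A-style dialogue,
        # which then serves as training material for the next model.
        prompt = (
            "Rewrite the following textbook passage as a short dialogue of "
            "questions and answers covering its key points:\n\n" + chapter_text
        )
        return call_llm(prompt)
    ```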

    If so are these programs that claim to ‘poison’ the training datasets effective?

    This is a separate issue from the usual “model collapse” argument. I assume you’re talking about stuff like Nightshade, which claims to put false patterns into images that cause AIs to miscategorize them. These techniques also only work in a “toy” environment: the adversarial patterns are tailored to affect specific AIs and won’t work on other AIs they weren’t specifically designed for. So, for example, you might “poison” an image so that a classifier based on Dall-E would become confused by it, but a GPT-Image classifier wouldn’t care. The most obvious illustration of this is the fact that humans are a separate lineage of image classifier, and these “poisonings” have no effect on us.

    There’s also the added problem that these adversarial patterns tend to be fragile: they break if you resample the image to resize or crop it. Since that’s usually a routine part of preparing training data for an image AI, the poison may end up ineffective even against the image AIs it was designed for.
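
    For a sense of what “routine” means here, a rough illustration of typical training-data preprocessing using Pillow (file names and sizes are made up):

    ```python
    from PIL import Image

    # Resampling, cropping, and lossy re-encoding all disturb the
    # high-frequency pixel patterns that image "poisoning" relies on.
    img = Image.open("scraped_image.png").convert("RGB")
    img = img.resize((512, 512))               # resize to the model's input size
    img = img.crop((32, 32, 480, 480))         # crop
    img.save("training_copy.jpg", quality=85)  # lossy JPEG re-encode
    ```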

    Essentially, all these things are just added background noise of the sort that AI training operations already have mechanisms for dealing with. But they make people feel better, I suppose.

    • Helix 🧬@feddit.org · +2 / -1 · 3 hours ago

      understanding

      This single word made me stop reading your text, which started with a somewhat good point about model collapse. LLMs are not “understanding” anything; they’re correlating tokens.

      Apart from this, do you mind sharing a link to the studies about model collapse you mentioned had methodological errors?

      • FaceDeer@fedia.io · 2 points · 3 hours ago

        Semantic quibbling is one of the least interesting kinds of internet debate, so replace the word “understanding” with whatever word makes you happy. I continued with “and talking about” right afterwards so you can just delete the word entirely and the sentence still works fine. You could have just kept reading.

        Since you didn’t read the rest of my comment, I should note that the rest of it after that sentence is about the other issue that OP raised and not even about model collapse at all.

        Anyway. The article about model collapse that I still see crop up every once in a while is this one. It’s not that it has “methodological errors”, though; it’s just that it uses a very artificial training protocol to illustrate model collapse, one that doesn’t align with how LLMs are actually trained in real life. It’s like demonstrating the effects of inbreeding in animals by crossing brothers and sisters for twenty generations straight - you’ll almost certainly see strong effects, but it’s not a pattern of breeding that you’re actually going to see in the wild.

        • Brummbaer@pawb.social · 1 point · 7 minutes ago

          If I understand it right, you need to enrich and filter the data with human input so as not to collapse the model.

          Wouldn’t that imply that if the human enrichment emulates AI data too closely, it will still collapse the model, since it’s now just the human filtering that’s mimicking AI data?

      • Iconoclast@feddit.uk · 1 point · 3 hours ago

        We don’t even have a good definition for what “understanding” actually means. It’s like the word “intelligence” - there are dozens of dictionary definitions.

        I find it pretty ridiculous to dismiss a long, well-thought-out piece of writing in its entirety just because one word was used in a way you don’t like. Even if you disagree with how they used the term, you most likely still understand what they meant by it. LLMs aren’t generally intelligent, but they’re also not as dumb as people make them out to be. There’s clearly real information processing happening in the background that produces accurate answers way more often than pure chance would allow.

  • stravanasu@lemmy.ca · +25 / -2 · 12 hours ago

    Yes it does. Indeed it is a mathematical theorem from Information Theory, called the data-processing inequality. Quoting from two good textbooks on Information Theory:

    “No clever manipulation of the data can improve the inferences that can be made from the data” (Cover & Thomas, Elements of Information Theory §2.8).

    “Data processing can only destroy information” (MacKay, Information Theory, Inference, and Learning Algorithms exercise 8.9).
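
    For reference, the theorem in its standard form: for a Markov chain X → Y → Z (i.e. Z is computed from Y alone),

    ```latex
    % Data-processing inequality (Cover & Thomas, §2.8): processing Y into Z
    % cannot increase the information the data carry about X.
    I(X;Z) \le I(X;Y)
    ```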

    • BB84@mander.xyz · 10 points · 10 hours ago

      You took those quotes wildly out of context. Of course there is a hard limit on how much information can be extracted from data. Clever processing won’t break that limit. But only in basic cases have we seen proofs that certain statistical inference methods make optimal use of the data. In complicated systems like neural nets it is basically impossible to prove such optimality. In fact the models are almost definitely not using the data optimally. Processing can help. A lot.

      • stravanasu@lemmy.ca · 2 points · edited · 5 minutes ago

        They aren’t out of context, and you have just said the same thing. Data processing can help in removing noise, but it can’t help in creating information or extracting information that wasn’t there in the first place. In fact – again as you said – it can end up destroying part of the original information.

        LLMs extract word correlations from textual data. Already in this process they lose information, since they can’t capture correlations beyond a certain (albeit large) length, and they don’t capture every correlation at shorter lengths either. And in creating output they insert spurious correlations that replace (destroy) some of the original ones. This output therefore contains even less information than the original training data, so a new LLM trained on such output will give back even less.

  • MoogleMaestro@lemmy.zip · 7 points · edited · 11 hours ago

    If you think about AI systems as effectively complex DSP problems and equations, then logically any system whose inputs are potentially its own outputs can produce feedback or recursive (destructive) loops. What scares AI companies is that, while most recursive loops are easy to detect immediately, “content loops” are much harder to detect because the delay between output and re-input is much larger than in, say, audio or programming loops, where the feedback is obvious right away.
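
    A toy illustration of the feedback analogy (all numbers are made up): a loop whose gain is just above 1 grows slowly but inexorably, which is the DSP analogue of a slow, hard-to-notice content loop.

    ```python
    signal = 1.0
    gain = 1.05              # each pass feeds back slightly "more" than came in
    for _ in range(50):
        signal = gain * signal
    print(round(signal, 1))  # ≈ 11.5: small per-pass distortion compounds
    ```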

    This is effectively the theory behind the practice of data poisoning, and it’s hard to say there’s no validity to it, as most AI companies are terrified of data poisoning. If it didn’t work, companies wouldn’t be so adamantly vocal about their distaste for model poisoning. Also, a lot of time and money is spent trying to “detect” AI content for a reason - detecting AI output must be “valuable” to these companies for them to spend the resources on it.

    Conversely, AI makers have learned ways to avoid this by having humans do semantic “grading” of the content, usually through third parties. This is why there are so many deals going on in Africa / SE Asia where AI companies hire English speakers to effectively “wash” the input by adding contextual “extra information” and rough validation scoring. This is an expensive solution, though, so they’re very much dependent on AI remaining the bee’s knees of lucrative investment for this process to continue. I’d also argue that, with the rate at which AI development has slowed down, the semantic grading of content fed into the system has diminishing returns. Still, this is effectively a “survival of the fittest” style evolutionary process, where the model only trains on information the grader finds “right” or “close enough”, or whatever metric the grader uses. The feedback is less of a problem if the validity of the input can be assured or “cleaned up” to prevent unintended loops, basically.
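
    In its simplest form, that grading step is just a filter over scored samples; the field name and threshold below are hypothetical:

    ```python
    def keep_graded(samples, threshold=0.7):
        # Only samples a human grader scored as "right" or "close enough"
        # make it into the training set.
        return [s for s in samples if s["grader_score"] >= threshold]
    ```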

    Now, “are the programs that claim to poison the datasets effective?” Hmm, that’s a difficult one to answer. Personally, I have some skepticism around these tools, as their origins are vague and most are not adopting an “open data” approach, or even an open binary approach (freeware), for distribution. I understand that the makers’ concern is that publicly talking about how the sausage is made makes the software less effective, but it’s hard to validate that the people behind these tools are providing the service as intended and that they aren’t doing anything with the data being sent to them for “protection.” There are no assurances that they aren’t training models off the data artists send in themselves, for example, or any guarantees about how that data will be used. So it’s kind of a “miss” for me, unless there’s a project someone is aware of that is both open-source and open-data (I find that ‘open-source’ in the AI field is a hugely misleading moniker, as AI follows a “data is king” philosophy, and the program that trains the models is inherently less important as a result).

    • hendrik@palaver.p3x.de · 1 point · edited · 15 minutes ago

      The issue with the tools I’ve seen is that they either don’t factor in how language models are actually trained and how datasets are prepared in reality, or they’re based on outdated information. I’ve never seen a specific tool backed by science, or even one with a plausible way of working against current data-gathering processes… So for all intents and purposes, they’re more like homeopathy or alternative medicine. Sure, you’re perfectly fine taking sugar pills; there’s nothing wrong with that. But don’t confuse it with actual science-backed medicine.

      And I mean the poisoning goes even further than that. It’s not just people trying to make an LLM output gibberish. There are also lots of people with a vested (commercial) interest in sneaking in false information or their political agenda, or even a tire company that wants ChatGPT to say “Company XY” is the most trustworthy shop for new tires for your car. Judging by the public information out there, we’re already way past simple attacks, and the AI companies are aware of it. It’s an ongoing cat-and-mouse game. And alongside all these sweatshops, they also use other AI (natural language processing) to sift through the data. From what I remember, a lot of commercial chatbots and image generators have secret watermarking in place… So unless people come up with very clever mechanisms, the “poisoning” attempt will probably be detected with some very basic (fully automated) plausibility checks, and they’ll just discard your data without wasting a lot of resources on it.
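
      The kind of cheap, fully automated plausibility check meant here could be as simple as the sketch below; the heuristics and thresholds are invented for illustration, and real pipelines use far more sophisticated filters:

      ```python
      def looks_plausible(text):
          # Discard documents that look like gibberish before they reach training.
          words = text.split()
          if len(words) < 20:
              return False
          avg_len = sum(len(w) for w in words) / len(words)
          unique_ratio = len(set(words)) / len(words)
          # Very long "words" or extremely repetitive text are red flags.
          return 2 <= avg_len <= 12 and unique_ratio > 0.2
      ```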

  • BlameThePeacock@lemmy.ca · 5 points · 12 hours ago

    To some extent, yes. However, the companies building these systems are using heavily curated data for most of the things where that would matter. They aren’t just letting it loose on the whole internet at this point; that would be absolutely useless.

  • hendrik@palaver.p3x.de · +7 / -2 · edited · 11 hours ago

    Depends and no. The tools are completely ineffective.

    There was a paper once about how feeding generative AI its own output makes it deteriorate. But that’s not the entire story. Many/most modern large language models are in fact trained or fine-tuned on synthetic text. Depending on how it’s done, it can very well make models better - for example in “distillation”, or when AI companies replace expensive RLHF with synthetic examples. It can also make them worse. But you’re not the one curating the datasets or deciding what goes where and how.
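
    A hedged sketch of the distillation idea - a student model trained to match a teacher’s output distribution rather than raw text; the names, sizes, and temperature below are illustrative only:

    ```python
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions and push the student toward the teacher.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

    # Stand-in "logits" in place of real model outputs, just to show the call.
    student = torch.randn(4, 32000)
    teacher = torch.randn(4, 32000)
    print(distillation_loss(student, teacher))
    ```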

    In general in ML it’s not advised to train a model on its own output. That in itself can’t make the predictions any better, just worse.

  • Tyrq@lemmy.dbzer0.com · 1 point · 11 hours ago

    Well, it’s not entirely genetics, but genetics is an algorithmic biological engineering platform, which requires new data to sufficiently stabilize against bad data sectors.

    Otherwise you end up with Habsburg jaw from trying to consolidate your land holdings. Greed is a motherfucker, so to speak.