Yes, it does. It is in fact a mathematical theorem of information theory, called the data-processing inequality. Quoting from two good textbooks on information theory:
“No clever manipulation of the data can improve the inferences that can be made from the data” (Cover & Thomas, Elements of Information Theory §2.8).
“Data processing can only destroy information” (MacKay, Information Theory, Inference, and Learning Algorithms exercise 8.9).
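For what it's worth, the inequality those quotes state can be checked numerically on a toy Markov chain X → Y → Z, where each arrow is a processing step. The binary symmetric channels and flip probabilities below are illustrative assumptions, not anything from the textbooks; the point is only that I(X;Z) never exceeds I(X;Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    # I(A;B) = H(A) + H(B) - H(A,B), from the joint distribution
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint.ravel()))

def bsc(flip):
    # Binary symmetric channel: transition matrix P(out | in)
    return np.array([[1 - flip, flip], [flip, 1 - flip]])

px = np.array([0.5, 0.5])      # X uniform on {0, 1}
W1, W2 = bsc(0.1), bsc(0.2)    # X -> Y, then Y -> Z (made-up noise levels)

joint_xy = px[:, None] * W1    # P(x, y) = P(x) P(y|x)
# Z depends on X only through Y (Markov chain), so
joint_xz = joint_xy @ W2       # P(x, z) = sum_y P(x, y) P(z|y)

ixy = mutual_information(joint_xy)
ixz = mutual_information(joint_xz)
print(f"I(X;Y) = {ixy:.4f} bits, I(X;Z) = {ixz:.4f} bits")
assert ixz <= ixy + 1e-12      # the data-processing inequality
```

Whatever channels you substitute for `W1` and `W2`, the final assertion holds: the second processing step can only lose information about X, never add it.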
You took those quotes wildly out of context. Of course there is a hard limit on how much information can be extracted from data, and no amount of clever processing will break it. But optimality proofs, showing that a given statistical inference method actually extracts all the information the data contains, exist only for simple cases. For complicated systems like neural networks it is essentially impossible to prove such optimality, and the models are almost certainly not using the data optimally. Processing can help. A lot.
They aren’t out of context, and you have just said the same thing yourself. Data processing can help in removing noise, but it can’t create information, or extract information that wasn’t there in the first place. In fact, as you said, it can end up destroying part of the original information.
LLMs extract word correlations from textual data. Already in this step they lose information: they can’t capture correlations beyond a certain (albeit large) length, and even at shorter lengths they don’t capture every correlation. And in generating output they insert spurious correlations that replace (destroy) some of the original ones, so the output contains even less information than the original training data. A new LLM trained on such output will therefore give back less still.
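The generational degradation described above can be mimicked with a deliberately crude toy model: treat each retraining round as one more pass of the data through the same fixed lossy channel (a stand-in assumption, not a model of an actual LLM), and track how much mutual information each generation retains about the original data:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint.ravel()))

# Each "generation" re-encodes the data through the same lossy channel;
# the 5% flip probability is an arbitrary illustrative choice.
flip = 0.05
channel = np.array([[1 - flip, flip], [flip, 1 - flip]])

joint = np.diag([0.5, 0.5])    # generation 0: the data paired with itself
history = []
for gen in range(5):
    joint = joint @ channel    # one more round of lossy processing
    history.append(mutual_information(joint))

print([round(i, 4) for i in history])
# mutual information with the original data shrinks every generation
assert all(a > b for a, b in zip(history, history[1:]))
```

In this toy setting the decline is strictly monotone, which is the quantitative version of the claim that each retraining round "gives back even less".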