News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

Powderhorn@beehaw.org · 3 hours ago

News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

tal@lemmy.today · edit-2 2 hours ago

I’m very far from sure that this is an effective way to block AI crawlers from pulling stories for training, if that’s their actual concern. Like…the rate of new stories just isn’t that high. This isn’t, say, Reddit, where someone trying to crawl the thing at least has to generate some abnormal traffic. Yeah, okay, maybe a human wouldn’t read all stories, but I bet that many read a high proportion of what the media source puts out, so a bot crawling all articles isn’t far off looking like a human. All a bot operator need do is create a handful of paid accounts and then just pull partial content with each, and I think that a bot would just fade into the noise. And my guess is that it is very likely that AI training companies will do that or something similar if knowledge of current news events is of interest to people.

You could use a canary trap, and that might be more-effective:

https://en.wikipedia.org/wiki/Canary_trap

A canary trap is a method for exposing an information leak by giving different versions of a sensitive document to each of several suspects and seeing which version gets leaked. It could be one false statement, to see whether sensitive information gets out to other people as well. Special attention is paid to the quality of the prose of the unique language, in the hopes that the suspect will repeat it verbatim in the leak, thereby identifying the version of the document.

The term was coined by Tom Clancy in his novel Patriot Games,[1][non-primary source needed] although Clancy did not invent the technique. The actual method (usually referred to as a barium meal test in espionage circles) has been used by intelligence agencies for many years. The fictional character Jack Ryan describes the technique he devised for identifying the sources of leaked classified documents:

Each summary paragraph has six different versions, and the mixture of those paragraphs is unique to each numbered copy of the paper. There are over a thousand possible permutations, but only ninety-six numbered copies of the actual document. The reason the summary paragraphs are so lurid is to entice a reporter to quote them verbatim in the public media. If he quotes something from two or three of those paragraphs, we know which copy he saw and, therefore, who leaked it.

There, you generate slightly different versions of articles for different people. Say that you have 100 million subscribers. ln(100000000)/ln(2)=26.57... So you’re talking about 27 bits of information that need to go into the article to uniquely describe each. The AI is going to be lossy, I imagine, but you can potentially manage to produce 27 unique bits of information per article that can reasonably-reliably be remembered by an AI after training. That’s 27 different memorable items that need to show up in either Form A or Form B. Then you search to see what a new LLM knows about and ban the bot identified.

Cartographers have done that, introduced minor, intentional errors to see what errors maps used to see whether they were derived from their map.

https://en.wikipedia.org/wiki/Trap_street

In cartography, a trap street is a fictitious entry in the form of a misrepresented street on a map, often outside the area the map nominally covers, for the purpose of “trapping” potential plagiarists of the map who, if caught, would be unable to explain the inclusion of the “trap street” on their map as innocent. On maps that are not of streets, other “trap” features (such as nonexistent towns, or mountains with the wrong elevations) may be inserted or altered for the same purpose.[1]

https://en.wikipedia.org/wiki/Phantom_island

A phantom island is a purported island which has appeared on maps but was later found not to exist. They usually originate from the reports of early sailors exploring new regions, and are commonly the result of navigational errors, mistaken observations, unverified misinformation, or deliberate fabrication. Some have remained on maps for centuries before being “un-discovered”.

In some cases, cartographers intentionally include invented geographic features in their maps, either for fraudulent purposes or to catch plagiarists.[5][6]

That has weaknesses. It’s possible to defeat that by requesting multiple versions using different bot accounts and identifying divergences and maybe merging them. In the counterintelligence situation, where canary traps have been used, normally people only have access to one source, and it’d be hard for an opposing intelligence agency to get access to multiple sources, but it’s not hard here.

And even if you ban an account, it’s trivial to just create a new one, decoupled from the old one. Thus, there isn’t much that a media company can realistically do about it, as long as the generated material doesn’t rise to the level of a derived work and thus copyright infringement (and this is in the legal sense of derived — simply training something on something else isn’t sufficient to make it a derived work from a copyright law standpoint, any more than you reading a news report and then talking to someone else about it is).

Getting back to the citation issue…

Some news companies do keep archives (and often selling access to archives is a premium service), so for some, that might cover some of the “inability to cite” problem that not having Internet Archive archives produces, as long as the company doesn’t go under. It doesn’t help with a problem that many news companies have a tendency to silently modify articles without reliably listing errata, and that having an Internet Archive copy can be helpful. There are also some issues that I haven’t yet seen become widespread but worried about, like where a news source might provide different articles to people in different regions; there, having a trusted source like the Internet Archive can avoid that, and that could become a problem.

Powderhorn@beehaw.org · 31 minutes ago

When I took on a job as an editor for a paper in a tourist town, one of the first things I told the publisher was “what is going on with this map? This road doesn’t exist!”

Well, me being me, I decided we needed entirely new maps, and I was going to be the one who did them.

tal@lemmy.today · edit-2 2 hours ago

Actually, thinking about this…a more-promising approach might be deterrent via poisoning the information source. Not bulletproof, but that might have some potential.

So, the idea here is that what you’d do there is to create a webpage that looks, to a human, as if only the desired information shows up.

But you include false information as well. Not just an insignificant difference, as with a canary trap, or a real error intended to have minimal impact, only to identify an information source, as with a trap street. But outright wrong information, stuff where reliance on the stuff would potentially be really damaging to people relying on the information.

You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like lynx might see the poison, but you could probably make that work for most users. That has some nice characteristics:

You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not. It’s still gonna dick up responses being issued by the LLM.
Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.
The Internet Archive can still archive the pages.

Someone could make a bot that post-processes your page to strip out the poison, but you could sporadically change up your approach, change it over time, and the question for an AI company is whether it’s easier and safer to just license your content and avoid the risk of poison, or to risk poisoned content slipping into their model whenever a media company adopts a new approach.

I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.

shrugs

I imagine that there are probably already companies working on the problem, on both sides.

cmnybo@discuss.tchncs.de · 2 hours ago

Hidden junk that a person wouldn’t see would likely be picked up by a screen reader. That would make the site much harder to use for a visually impaired person.

Petter1@discuss.tchncs.de · edit-2 2 hours ago

But, AI chatbots aren’t trained on recent data, they just do a good old google style search and take the content of the top articles as reference while generating the response…

News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

Just a moment...