Off-and-on trying out an account over at @tal@oleo.cafe due to scraping bots bogging down lemmy.today to the point of near-unusability.

  • 2 Posts
  • 954 Comments
Joined 2 years ago
Cake day: October 4th, 2023

  • If it happens again and you have Magic Sysrq enabled, you can do Magic Sysrq-t, which may give you some idea of what the system is doing, since you’ll get stack traces. As long as the kernel can talk to the keyboard, it should be able to get that.

    https://en.wikipedia.org/wiki/Magic_sysrq

    You may not be able to see anything on your monitor, but if the system is working well enough to generate the stack traces and log them to the syslog on disk (like, your kernel, filesystem, and disk systems are still functional), you’ll be able to view them on reboot.
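
    It's worth checking ahead of time whether the keyboard route is actually enabled. A minimal sketch, assuming a Linux system that exposes /proc/sys/kernel/sysrq; sysrq_t_allowed is just a name made up for illustration. A value of 1 enables everything, and otherwise the value is a bitmask where bit 8 covers the process dumps that Sysrq-t produces.

    ```shell
    #!/bin/bash
    # sysrq_t_allowed VALUE: report whether a kernel.sysrq setting permits the
    # task dump (SysRq-t) from the keyboard. 1 enables every function; any
    # other value is a bitmask, and bit 8 covers debugging dumps of processes.
    sysrq_t_allowed() {
        local value=$1
        if (( value == 1 || (value & 8) != 0 )); then
            echo yes
        else
            echo no
        fi
    }

    # Check the running system, if the setting is readable.
    if [[ -r /proc/sys/kernel/sysrq ]]; then
        sysrq_t_allowed "$(cat /proc/sys/kernel/sysrq)"
    fi
    ```

    Independent of that setting, root can request the same dump with echo t > /proc/sysrq-trigger, which is a handy way to confirm beforehand that the traces really do land in your logs.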

    If it can’t even do that, you might be able to set up a serial console and then, using another system running screen or minicom or something like that linked up to the serial port, issue Magic Sysrq to that and view it on that machine.

    Some systems have hardware watchdogs, where if a process can’t constantly ping the thing, the system will reboot. That doesn’t solve your problem, but it may mitigate it if you just want it to reboot if things wedge up. The watchdog package in Debian has some software to make use of this.


  • You have to have a thin client device to access the servers out on the Internet, which is…kind of what a sub-$500 low-end PC or budget smartphone would be.

    I suspect that it’s more that a lot of people are going to defer upgrades at the low end of the scale, use an older device for longer than they otherwise would have.

    Might not be great for security; smartphone OSes won’t get security updates after N years, and Windows 10 is EOL.



  • I don’t know of a pre-wrapped utility to do that, but assuming that this is a Linux system, here’s a simple bash script that’d do it.

    #!/bin/bash
    
    # Set this.  Path to a new, not-yet-existing directory that will retain a copy of a list
    # of your files.  You probably don't actually want this in /tmp, or
    # it'll be wiped on reboot.
    
    file_list_location=/tmp/storage-history
    
    # Set this.  Path to location with files that you want to monitor.
    
    path_to_monitor=path-to-monitor
    
    # If the file list location doesn't yet exist, create it.
    if [[ ! -d "$file_list_location" ]]; then
        mkdir -p "$file_list_location"
        git -C "$file_list_location" init
        # Name the initial branch "master" regardless of init.defaultBranch.
        git -C "$file_list_location" symbolic-ref HEAD refs/heads/master
    fi
    
    # in case someone's checked out things at a different time
    # (skipped on the first run, when there is no commit to check out yet)
    if git -C "$file_list_location" rev-parse -q --verify HEAD >/dev/null; then
        git -C "$file_list_location" checkout master
    fi
    find "$path_to_monitor" | sort > "$file_list_location/files.txt"
    # The path given to "git add" is relative to the -C directory.
    git -C "$file_list_location" add files.txt
    git -C "$file_list_location" commit -m "Updated file list for $(date)"
    

    That’ll drop a text file at /tmp/storage-history/files.txt with a list of the files at that location, and create a git repo at /tmp/storage-history that will contain a history of that file.

    When your drive array kerplodes or something, your files.txt file will probably become empty if the mount goes away, but you’ll have a git repository containing a full history of your list of files, so you can go back to a list of the files there as they existed at any historical date.

    Run that script nightly out of your crontab or something ($ crontab -e to edit your crontab).

    As the script says, you need to choose a file_list_location (not /tmp, since that’ll be wiped on reboot), and set path_to_monitor to wherever the tree of files is that you want to keep track of (like, /mnt/file_array or whatever).

    You could save a bit of space by adding a line at the end to remove the current files.txt after generating the current git commit if you want. The next run will just regenerate files.txt anyway, and you can just use git to regenerate a copy of the file as of any historical day you want. If you’re not familiar with git, $ git log to find the hashref for a given day, $ git checkout <hashref> to move to where things were on that day.
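
    If you'd rather not move the checkout around at all, something like the following should print the list as of a given date. This is a sketch, not part of the script above: file_list_as_of is a hypothetical helper name, and it assumes the repository layout the script creates (a single files.txt at the top level).

    ```shell
    #!/bin/bash
    # file_list_as_of REPO_DIR DATE
    # Print files.txt as it was in the newest commit made at or before DATE.
    file_list_as_of() {
        local repo=$1 when=$2 commit
        # Newest commit touching files.txt with a commit date no later than DATE.
        commit=$(git -C "$repo" rev-list -1 --before="$when" HEAD -- files.txt)
        if [[ -z "$commit" ]]; then
            echo "no commit of files.txt at or before $when" >&2
            return 1
        fi
        # Show that version of the file without changing the working tree.
        git -C "$repo" show "$commit:files.txt"
    }
    ```

    For example, file_list_as_of /path/to/storage-history 2024-01-15 > old-files.txt (the path and date there are placeholders).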

    EDIT: Moved the git checkout up.



  • Is this worth the effort?

    In terms of electricity cost?

    I wouldn’t do it myself.

    If you want to know whether it’s going to save money, you want to see how much power it uses — you can use a wattmeter, or look up the maximum amount on the device ratings to get an upper end. Look up how much you’re paying per kWh in electricity. Price the hardware. Put a price on your labor. Then you can get an estimate.

    My guess, without having any of those numbers, is that it probably isn’t.
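
    The electricity part of that estimate is just arithmetic. A rough sketch, assuming the device draws a constant wattage around the clock; annual_cost is a made-up helper name and the numbers are placeholders, not measurements.

    ```shell
    #!/bin/bash
    # annual_cost WATTS PRICE_PER_KWH
    # Print the yearly electricity cost of a device drawing WATTS continuously.
    annual_cost() {
        awk -v w="$1" -v p="$2" 'BEGIN { printf "%.2f\n", w / 1000 * 24 * 365 * p }'
    }

    # Placeholder numbers: 50 W at 0.15 per kWh.
    annual_cost 50 0.15    # prints 65.70
    ```

    Compare the difference in that figure between the two setups against the hardware price plus whatever you value your labor at.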



  • You would typically want to use static ip addresses for servers (because if you use DHCP the IP is gonna change sooner or later, and it’s gonna be a pain in the butt).

    In this case, he controls the local DHCP server, which is gonna be running on the OpenWRT box, so he can set it to always assign whatever he wants to a given MAC.
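
    On OpenWRT, a fixed lease like that is normally a host entry in the dhcp config. A sketch that only prints the uci commands so they can be reviewed before being run on the router; print_static_lease is a hypothetical helper, and the name, MAC, and address are placeholders.

    ```shell
    #!/bin/bash
    # print_static_lease NAME MAC IP
    # Print (do not run) uci commands that pin a DHCP lease to a MAC address.
    print_static_lease() {
        local name=$1 mac=$2 ip=$3
        echo "uci add dhcp host"
        echo "uci set dhcp.@host[-1].name='$name'"
        echo "uci set dhcp.@host[-1].mac='$mac'"
        echo "uci set dhcp.@host[-1].ip='$ip'"
        echo "uci commit dhcp"
    }

    # Placeholder values; substitute the server's real MAC and desired address.
    print_static_lease server 00:11:22:33:44:55 192.168.1.10
    ```

    The DHCP service on the router will probably need a restart afterwards to pick up the change.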


  • tal@lemmy.today to Selfhosted@lemmy.world: [Solved] OpenWrt & fail2ban (8 days ago)

    except that all requests’ IP addresses are set to the router’s IP address (192.168.3.1), so I am unable to use proper rate limiting and especially fail2ban.

    I’d guess that however the network is configured, you have the router NATting traffic going from the LAN to the Internet (typical for a home broadband router) as well as from the home LAN to the server.

    That does provide security benefits in that you’ve basically “put the server on the Internet side of things”, and the server can’t just reach into the LAN, same as anything else on the Internet. The NAT table has to have someone on the LAN side opening a connection to establish a new entry.

    But…then all of those hosts on the LAN are going to have the same IP address from the server’s standpoint. That’s the experience that hosts on the Internet have towards the same hosts on your LAN.

    It sounds like you also want to use DHCP:

    Getting the router to actually assign an IP address to the server was quite a headache

    I’ve never used VLANs on Linux (or OpenWRT, and don’t know how it interacts with the router’s hardware).

    I guess what you want to do is to not NAT traffic going between the LAN (where most of your hardware lives) and the DMZ (where the server lives), but still to disallow the DMZ from initiating communication with the LAN.

    considers

    So, I don’t know whether the VLAN stuff is necessary on your hardware to prevent the router hardware from acting like a switch, moving Ethernet packets directly, without them going to Linux. Might be the case.

    I suppose what you might do — from a network standpoint, don’t know off-the-cuff how to do it on OpenWRT, though if you’re just using it as a generic Linux machine, without using any OpenWRT-specific stuff, I’m pretty sure that it’s possible — is to give the OpenWRT machine two non-routable IP addresses, something like:

    192.168.1.1 for the LAN

    and

    192.168.2.1 for the DMZ

    The DHCP server listens on 192.168.1.1 and serves DHCP responses for the LAN that tell it to use 192.168.1.1 as the default route. Ditto for hosts in the DMZ. It hands out addresses from the appropriate pool. So, for example, the server in the DMZ would maybe be assigned 192.168.2.2.

    Then the router needs routing table entries for both networks: 192.168.2.0/24 reachable out the interface holding 192.168.2.1, and vice versa, 192.168.1.0/24 out the interface holding 192.168.1.1, with IP forwarding enabled. Linux normally adds those routes itself when the addresses are assigned, and forwarding between them is standard IP routing stuff.

    When a LAN host initiates a TCP connection to a DMZ host, it’ll look up the destination IP address in its routing table, say “hey, that isn’t on the same network as me, send it to the default route”. That’ll go to 192.168.1.1, with a destination address of 192.168.2.2. The OpenWRT box, doing IP routing, forwards it out its 192.168.2.1 interface, saying “ah, that’s on my network, send it out the network port with VLAN tag whatever”, and the switch fabric is configured to segregate the ports based on VLAN tag, and only sends the packet out the port associated with the DMZ.

    The problem is that the reason that home users typically derive indirect security benefits from using NAT is that it intrinsically disallows incoming connections from the server to the LAN. This will make that go away: the LAN hosts and DMZ hosts will be on separate “networks”, so things like ARP requests and other stuff at the purely-Ethernet level won’t reach each other, but they can freely communicate with each other at the IP level, because the two 192.168.X.1 virtual addresses will route packets between the two networks. You’re going to need to firewall off incoming TCP connections (and maybe UDP and ICMP and whatever else you want to block) inbound on the 192.168.1.0/24 network from the 192.168.2.0/24 network. You can probably do that with iptables at the Linux level. OpenWRT may have some sort of existing firewall package that applies a set of iptables rules. I think that all the traffic should be reaching the Linux kernel in this scenario.
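
    As a rough sketch of what such rules could look like with plain iptables: the function below only prints the commands rather than running them, dmz_firewall_rules is a made-up name, and the networks default to the example addresses used here. OpenWRT's own firewall package may manage its rules differently, so treat this as an illustration of the idea, not a drop-in config.

    ```shell
    #!/bin/bash
    # dmz_firewall_rules [LAN_NET] [DMZ_NET]
    # Print (do not run) iptables rules that stop the DMZ from opening new
    # connections into the LAN while still letting replies through.
    dmz_firewall_rules() {
        local lan=${1:-192.168.1.0/24} dmz=${2:-192.168.2.0/24}
        # Allow packets that belong to connections a LAN host started.
        echo "iptables -A FORWARD -s $dmz -d $lan -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
        # Drop everything else going from the DMZ to the LAN.
        echo "iptables -A FORWARD -s $dmz -d $lan -j DROP"
    }

    dmz_firewall_rules
    ```

    Order matters: the ACCEPT for established connections has to come before the DROP, or replies to LAN-initiated connections get dropped too.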

    If you get that set up, hosts at 192.168.2.2, on the DMZ, should be able to see connections from 192.168.1.2, on the LAN, using its original IP address.

    That should work if what you had was a Linux box with three Ethernet cards (one for each of the Internet, LAN, and DMZ) and the VLAN switch hardware stuff wasn’t in the picture; you’d just not do any VLAN stuff then. I’m not 100% certain that the VLAN switching fabric stuff won’t muck that up; I’ve only very rarely touched VLANs myself, and never tried to do this, use VLANs to hack switch fabric attached directly to a router to act like independent NICs. But I can believe that it’d work.

    If you do set it up, I’d also fire up sudo tcpdump on the server. If things are working correctly, sudo ping -b 192.168.1.255 on a host on the LAN shouldn’t show up as reaching the server. However, ping 192.168.2.2 should.

    You’re going to want traffic that doesn’t match a NAT table entry and is coming in from the Internet to be forwarded to the DMZ vlan.

    That’s a high-level view of what I believe needs to happen. But I can’t give you a hand-holding walkthrough to configure it via off-the-cuff knowledge, because I haven’t needed to do a fair bit of this myself. Sorry about that.

    EDIT: This isn’t the question you asked, but I’d also add that what I’d probably do myself if I were planning to set something like this up is get a small, low power Linux machine with multiple NICs (well, okay, probably one NIC, multiple ports). That cuts the switch-level stuff that I think that you’d likely otherwise need to contend with out of the picture, and then I don’t think that you’d need to deal with VLANs, which is a headache that I wouldn’t want, especially if getting it wrong might have security implications. If you need more ports for the LAN, then just throw a regular old separate hardware Ethernet switch on the LAN port. You know that the switch can’t be moving traffic between the LAN and DMZ networks itself then, because it can’t touch the DMZ. But I don’t know whether that’d make financial sense in your case, if you’ve already got the router hardware.



  • Actually, thinking about this…a more-promising approach might be deterrent via poisoning the information source. Not bulletproof, but that might have some potential.

    So, the idea here is that you’d create a webpage that looks, to a human, as if only the desired information shows up.

    But you include false information as well. Not just an insignificant difference, as with a canary trap, or a real error intended to have minimal impact, only to identify an information source, as with a trap street. But outright wrong information, stuff that could be really damaging to anyone relying on it.

    You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like lynx might see the poison, but you could probably make that work for most users. That has some nice characteristics:

    • You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not. It’s still gonna dick up responses being issued by the LLM.

    • Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.

    • The Internet Archive can still archive the pages.

    Someone could make a bot that post-processes your page to strip out the poison, but you could sporadically change up your approach, change it over time, and the question for an AI company is whether it’s easier and safer to just license your content and avoid the risk of poison, or to risk poisoned content slipping into their model whenever a media company adopts a new approach.

    I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.

    shrugs

    I imagine that there are probably already companies working on the problem, on both sides.


  • I’m very far from sure that this is an effective way to block AI crawlers from pulling stories for training, if that’s their actual concern. Like…the rate of new stories just isn’t that high. This isn’t, say, Reddit, where someone trying to crawl the thing at least has to generate some abnormal traffic. Yeah, okay, maybe a human wouldn’t read all stories, but I bet that many read a high proportion of what the media source puts out, so a bot crawling all articles isn’t far off looking like a human. All a bot operator need do is create a handful of paid accounts and then just pull partial content with each, and I think that a bot would just fade into the noise. And my guess is that it is very likely that AI training companies will do that or something similar if knowledge of current news events is of interest to people.

    You could use a canary trap, and that might be more-effective:

    https://en.wikipedia.org/wiki/Canary_trap

    A canary trap is a method for exposing an information leak by giving different versions of a sensitive document to each of several suspects and seeing which version gets leaked. It could be one false statement, to see whether sensitive information gets out to other people as well. Special attention is paid to the quality of the prose of the unique language, in the hopes that the suspect will repeat it verbatim in the leak, thereby identifying the version of the document.

    The term was coined by Tom Clancy in his novel Patriot Games,[1][non-primary source needed] although Clancy did not invent the technique. The actual method (usually referred to as a barium meal test in espionage circles) has been used by intelligence agencies for many years. The fictional character Jack Ryan describes the technique he devised for identifying the sources of leaked classified documents:

    Each summary paragraph has six different versions, and the mixture of those paragraphs is unique to each numbered copy of the paper. There are over a thousand possible permutations, but only ninety-six numbered copies of the actual document. The reason the summary paragraphs are so lurid is to entice a reporter to quote them verbatim in the public media. If he quotes something from two or three of those paragraphs, we know which copy he saw and, therefore, who leaked it.

    There, you generate slightly different versions of articles for different people. Say that you have 100 million subscribers. ln(100000000)/ln(2)=26.57... So you’re talking about 27 bits of information that need to go into the article to uniquely describe each. The AI is going to be lossy, I imagine, but you can potentially manage to produce 27 unique bits of information per article that can reasonably-reliably be remembered by an AI after training. That’s 27 different memorable items that need to show up in either Form A or Form B. Then you search to see what a new LLM knows about and ban the bot identified.
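
    That bit count is easy to check for other subscriber numbers. A small sketch using integer arithmetic only; bits_needed is a name chosen for illustration.

    ```shell
    #!/bin/bash
    # bits_needed N: smallest number of bits that can tell N subscribers apart,
    # i.e. the smallest b with 2^b >= N.
    bits_needed() {
        local n=$1 bits=0
        while (( (1 << bits) < n )); do
            bits=$((bits + 1))
        done
        echo "$bits"
    }

    bits_needed 100000000    # prints 27
    ```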

    Cartographers have done something similar, introducing minor, intentional errors and then checking which of those errors turn up in other maps to see whether they were derived from theirs.

    https://en.wikipedia.org/wiki/Trap_street

    In cartography, a trap street is a fictitious entry in the form of a misrepresented street on a map, often outside the area the map nominally covers, for the purpose of “trapping” potential plagiarists of the map who, if caught, would be unable to explain the inclusion of the “trap street” on their map as innocent. On maps that are not of streets, other “trap” features (such as nonexistent towns, or mountains with the wrong elevations) may be inserted or altered for the same purpose.[1]

    https://en.wikipedia.org/wiki/Phantom_island

    A phantom island is a purported island which has appeared on maps but was later found not to exist. They usually originate from the reports of early sailors exploring new regions, and are commonly the result of navigational errors, mistaken observations, unverified misinformation, or deliberate fabrication. Some have remained on maps for centuries before being “un-discovered”.

    In some cases, cartographers intentionally include invented geographic features in their maps, either for fraudulent purposes or to catch plagiarists.[5][6]

    That has weaknesses. It’s possible to defeat that by requesting multiple versions using different bot accounts and identifying divergences and maybe merging them. In the counterintelligence situation, where canary traps have been used, normally people only have access to one source, and it’d be hard for an opposing intelligence agency to get access to multiple sources, but it’s not hard here.

    And even if you ban an account, it’s trivial to just create a new one, decoupled from the old one. Thus, there isn’t much that a media company can realistically do about it, as long as the generated material doesn’t rise to the level of a derivative work and thus copyright infringement (and this is in the legal sense of derivative: simply training something on something else isn’t sufficient to make it a derivative work from a copyright law standpoint, any more than you reading a news report and then talking to someone else about it is).

    Getting back to the citation issue…

    Some news companies do keep archives (and often selling access to archives is a premium service), so for some, that might cover some of the “inability to cite” problem that not having Internet Archive archives produces, as long as the company doesn’t go under. It doesn’t help with the problem that many news companies have a tendency to silently modify articles without reliably listing errata, which is where having an Internet Archive copy can be helpful. There are also some issues that I haven’t yet seen become widespread but am worried about, like a news source providing different articles to people in different regions; that could become a problem, and having a trusted source like the Internet Archive helps guard against it.


  • Yeah, that’s something that I’ve wondered about myself, what the long run is. Not principally “can we make an AI that is more-appealing than humans”, though I suppose that that’s a specific case, but…we’re only going to make more-compelling forms of entertainment, better video games. Recreational drugs aren’t going to become less addictive. If we get better at defeating the reward mechanisms in our brain that evolved to drive us towards advantageous activities…

    https://en.wikipedia.org/wiki/Wirehead_(science_fiction)

    In science fiction, wireheading is a term associated with fictional or futuristic applications[1] of brain stimulation reward, the act of directly triggering the brain’s reward center by electrical stimulation of an inserted wire, for the purpose of ‘short-circuiting’ the brain’s normal reward process and artificially inducing pleasure. Scientists have successfully performed brain stimulation reward on rats (1950s)[2] and humans (1960s). This stimulation does not appear to lead to tolerance or satiation in the way that sex or drugs do.[3] The term is sometimes associated with science fiction writer Larry Niven, who coined the term in his 1969 novella Death by Ecstasy[4] (Known Space series).[5][6] In the philosophy of artificial intelligence, the term is used to refer to AI systems that hack their own reward channel.[3]

    More broadly, the term can also refer to various kinds of interaction between human beings and technology.[1]

    Wireheading, like other forms of brain alteration, is often treated as dystopian in science fiction literature.[6]

    In Larry Niven’s Known Space stories, a “wirehead” is someone who has been fitted with an electronic brain implant known as a “droud” in order to stimulate the pleasure centers of their brain. Wireheading is the most addictive habit known (Louis Wu is the only given example of a recovered addict), and wireheads usually die from neglecting their basic needs in favour of the ceaseless pleasure. Wireheading is so powerful and easy that it becomes an evolutionary pressure, selecting against that portion of humanity without self-control.

    Now, of course, you’d expect that to be a powerful evolutionary selector, sure — if only people who are predisposed to avoid such things pass on offspring, that’d tend to rapidly increase the percentage of people predisposed to do so — but the flip side is the question of whether evolutionary pressure on the timescale of human generations can keep up with our technological advancement, which happens very quickly.

    There’s some kind of dark comic that I saw — I thought that it might be Saturday Morning Breakfast Cereal, but I’ve never been able to find it again, so maybe it was something else — which was a wordless comic that portrayed a society becoming so technologically advanced that it basically consumes itself, defeats its own essential internal mechanisms. IIRC it showed something like a society becoming a ring that was just stimulating itself until it disappeared.

    It’s a possible answer to the Fermi paradox:

    https://en.wikipedia.org/wiki/Fermi_paradox#It_is_the_nature_of_intelligent_life_to_destroy_itself

    The Fermi paradox is the discrepancy between the lack of conclusive evidence of advanced extraterrestrial life and the apparently high likelihood of its existence.[1][2][3]

    The paradox is named after physicist Enrico Fermi, who informally posed the question—remembered by Emil Konopinski as “But where is everybody?”—during a 1950 conversation at Los Alamos with colleagues Konopinski, Edward Teller, and Herbert York.

    Evolutionary explanations

    It is the nature of intelligent life to destroy itself

    This is the argument that technological civilizations may usually or invariably destroy themselves before or shortly after developing radio or spaceflight technology. The astrophysicist Sebastian von Hoerner stated that the progress of science and technology on Earth was driven by two factors—the struggle for domination and the desire for an easy life. The former potentially leads to complete destruction, while the latter may lead to biological or mental degeneration.[98] Possible means of annihilation via major global issues, where global interconnectedness actually makes humanity more vulnerable than resilient,[99] are many,[100] including war, accidental environmental contamination or damage, the development of biotechnology,[101] synthetic life like mirror life,[102] resource depletion, climate change,[103] or artificial intelligence. This general theme is explored both in fiction and in scientific hypotheses.[104]


  • Now some of those users gather on Discord and Reddit; one of the best-known groups, the subreddit r/MyBoyfriendIsAI, currently boasts 48,000 users.

    I am confident that one way or another, the market will meet demand if it exists, and I think that there is clearly demand for it. It may or may not be OpenAI, it may take a year or two or three for the memory market to stabilize, but if enough people want to basically have interactive erotic literature, it’s going to be available. Maybe someone else will take a model and provide it as a service, train it up on appropriate literature. Maybe people will run models themselves on local hardware; in 2026, that still requires some technical aptitude, but making a simpler-to-deploy software package or even distributing it as an all-in-one hardware package is very much doable.

    I’ll also predict that what males and females generally want in such a model probably differs, and that there will probably be services that specialize in that, much as how there are companies that make soap operas and romance novels that focus on women, which tend to differ from the counterparts that focus on men.

    I also think that there are still some challenges that remain in early 2026. For one, current LLMs still have a comparatively-constrained context window. Either their mutable memory needs to exist in a different form, or automated RAG needs to be better, or the hardware or software needs to be able to handle larger contexts.


  • If I’m traveling or I wipe my device or get a new one, I would have to add the new key to many servers as authorized keys,

    So, I don’t want to get into a huge argument over the best way to deal with things, since everyone has their own use cases, but if that’s your only concern, you have a list of hosts that you want to put the key on, and you still have a key for another device, that shouldn’t be terribly difficult. Generate your new keypair for your new device. Then on a Linux machine, something like:

    $ cat username-host-pairs.txt
    me@host1
    me@host2
    me@host3
    $ cat username-host-pairs.txt|xargs -n1 ssh-copy-id -i new-device-key-file-id_ed25519.pub
    

    That should use your other device’s private key to authenticate to the servers in question and copy the new device’s pubkey to the accounts on the host in question. Won’t need password access enabled.



  • So an internet

    The highest data rate that it looks like LoRa supports in North America is 21900 bits per second, so you’re talking about 21.9kbps, or about 2.7kBps, in a best-case scenario. That’s about half of what an analog telephone system modem could achieve.

    It’s going to be pretty bandwidth-constrained, limited in terms of routing traffic around.

    I think that the idea of a “public access, zero-admin mesh Internet over the air” isn’t totally crazy, but that it’d probably need to use something like laser links and hardware that can identify and auto-align to other links.




  • tal@lemmy.today to Mildly Infuriating@lemmy.world: He parked his car on the sidewalk (22 days ago)

    Google Maps

    This is New York City, and from the Google Street View image, it looks like there’s not a lot of street parking there.

    My guess is that a number of cities with a lot of density, like NYC, probably should mandate a certain amount of public parking garage space for users in an area. Multistory parking garage space isn’t cheap, but using up street space via committing space to street parking also has costs in terms of congestion, even if the business owner doesn’t bear the costs.

    EDIT: I also note, by way of driving the point home with a sledgehammer, that in my Google Street View image, there’s a different vehicle parked on the sidewalk in the same spot, a red sports car.