Off-and-on trying out an account over at @tal@oleo.cafe due to scraping bots bogging down lemmy.today to the point of near-unusability.

  • 3 Posts
  • 962 Comments
Joined 2 years ago
Cake day: October 4th, 2023

  • New York City is a port city. It has an effectively infinite supply of salt water, which you can use for evaporative cooling, albeit with some extra complications.

    EDIT: Hell, you can use the waste energy from an evaporative cooler to drive a distiller to generate fresh water from some of the evaporated salt water, if you want. Microsoft is doing that combined datacenter-nuclear-power-plant thing. IIRC, if I’m not combining two different cases of an AI datacenter using full output of a power plant, they have the entire output of a nuclear power plant never touching the grid (and thus avoiding any transmission cost overhead and as a bonus, avoiding regulatory requirements attached to transmission and distribution from power generation):

    https://arstechnica.com/ai/2024/09/re-opened-three-mile-island-will-power-ai-data-centers-under-new-deal/

    Re-opened Three Mile Island will power AI data centers under new deal

    Microsoft would claim all of the nuclear plant’s power generation for at least 20 years.

    From past reading, desalination via reverse osmosis has wound up being somewhat cheaper than via distillation, but combined generation-distillation using waste heat is a thing. IIRC Spain has some company that does combined generation-distillation facilities.

    And in a case like that, you have the waste heat from generation and the waste heat from use all in one spot, so you’ve got a lot of water vapor to condense.


  • tal@lemmy.today to Selfhosted@lemmy.world: Sftp client gor android?

    If you can use Termux, you can use the command-line lftp, which supports SFTP; I use this on Linux, so I’m familiar with it.

    $ pkg install lftp
    $ lftp sftp://foo.com
    

    I also use rsync in Termux after being exasperated over the lack of a reasonable F-Droid graphical client for that.
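
    For the rsync route, here's a minimal sketch of a wrapper, assuming Termux has the rsync and openssh packages installed (pkg install rsync openssh). The function name push_dir, the DRY_RUN switch, and the example paths and host are all made up for illustration:

    ```shell
    # Copy a local directory to a remote host over SSH with rsync.
    # With DRY_RUN=1 the command is only printed, not run.
    push_dir() {
        src=$1 user=$2 host=$3 dest=$4
        # -a keeps permissions and timestamps, -v is verbose,
        # --partial keeps half-transferred files so a retry can resume.
        set -- rsync -av --partial -e ssh "$src" "$user@$host:$dest"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    ```

    Something like DRY_RUN=1 push_dir /sdcard/DCIM/ me foo.com backup/ shows what would be run before you commit to a long transfer over a flaky mobile connection.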

    I wound up using some non-open-source graphical SCP or SFTP client out of the Google Play Store using Aurora Store’s anonymous login at one point, which worked but wasn’t what I wanted to use.



  • It wouldn’t be effective, because it’s trivial to bypass. There are many ways one can do a DNS lookup elsewhere and get access to the response, as the information isn’t considered secret. Once you’ve done that, you can reach a host. And any Computer A participating in a DDoS such that Computer B can see the traffic from the DDoS has already resolved the domain name anyway.

    It’s sometimes been used as a low-effort way for a network administrator to try to block Web browser users on that network from getting access to content, but it’s a really ineffective mechanism even for that. The only reason that I think it ever showed up is because it’s very easy to deploy in that role. Browsers often use DNS-over-HTTPS to an outside server today rather than plain DNS, so it won’t even affect users of browsers doing that at all.

    In general, if I can go to a website like this:

    https://mxtoolbox.com/DNSLookup.aspx

    And plonk in a hostname to get an IP address, I can then tell my system about that mapping so that it will never go to DNS again. On Linux and most Unixy systems, an easy way to do this would be in /etc/hosts:

    5.78.97.5 lemmy.today
    

    On Windows systems, the hosts file typically lives at C:\Windows\System32\drivers\etc\hosts
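
    If you want to script the /etc/hosts part, here's a rough sketch that appends a mapping only when the hostname isn't already listed. The function names are made up, and it takes the hosts file path as an argument so you can try it on a scratch copy first:

    ```shell
    # Print a hosts-file line for an IP/hostname pair.
    hosts_line() {
        printf '%s %s\n' "$1" "$2"
    }

    # Succeed if the given hosts file already maps the hostname.
    hosts_has() {
        file=$1 name=$2
        awk -v n="$name" '
            { sub(/#.*/, "") }   # ignore comments
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file"
    }

    # Append the mapping only if the hostname is not already present.
    pin_host() {
        file=$1 ip=$2 name=$3
        hosts_has "$file" "$name" || hosts_line "$ip" "$name" >> "$file"
    }
    ```

    Run as root, pin_host /etc/hosts 5.78.97.5 lemmy.today would add the line from above once and do nothing on later runs.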

    EDIT: Oh, maybe I misunderstood. You don’t mean as a mechanism to block Computer A from reaching Computer B itself, but just as a transport mechanism to hand information to routers? Like, have some way to trigger a router to do a DNS lookup for a given IP, the way we do a PTR lookup today to resolve an IP address to a hostname, but obtain blacklist information?

    That’s a thought. I haven’t spent a lot of time on DNSSEC, but it must have infrastructure to securely distribute information.

    DNS is public — I don’t know if that would be problematic or not, to expose to the Internet at large the list of blacklists going to a given host. It would mean that it could be easier to troubleshoot problems, since if I can’t reach host X, I can check to see whether it’s because that host has requested that my traffic be blacklisted.




  • The older headphones there don’t look like you can rotate the pads, yeah? I mean, it’s that rotating hinge which failed here.

    I guess one could say “well, I don’t want headphones with rotating pads”, but it’s that rotation that lets the XM5 headphones fit into a fairly-flat carrying case.

    I will say, though, that the XM5s probably weren’t going to last over 30 years, if for no other reason than because they use an internal battery…



  • Not what you asked, but regardless of whatever else you’re doing, I would take any really critical data you need, encrypt it, put it on a laptop or other portable device, and bring it with you. Trying to throw together some last-minute setup that you rely on and can’t easily resolve remotely is asking for trouble.

    Another fallback option, if you have a friend who you trust and can call and ask them to type stuff in – give 'em a key before you go and call 'em and ask 'em to type whatever you need if you get into trouble.


  • If it happens again and you have Magic Sysrq enabled, you can do Magic Sysrq-t, which may give you some idea of what the system is doing, since you’ll get stack traces. As long as the kernel can talk to the keyboard, it should be able to get that.

    https://en.wikipedia.org/wiki/Magic_sysrq

    You maybe can’t see anything on your monitor, but if the system is working enough to generate the stack traces and log them to the syslog on disk (like, your kernel filesystem and disk systems are still functional), you’ll be able to view them on reboot.

    If it can’t even do that, you might be able to set up a serial console and then, using another system running screen or minicom or something like that linked up to the serial port, issue Magic Sysrq to that and view it on that machine.

    Some systems have hardware watchdogs, where if a process can’t constantly ping the thing, the system will reboot. That doesn’t solve your problem, but it may mitigate it if you just want it to reboot if things wedge up. The watchdog package in Debian has some software to make use of this.
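
    For the Magic Sysrq part, here's a rough sketch of checking whether the keyboard combination is enabled at all. It assumes the bitmask meanings from the kernel's sysrq documentation (1 enables everything, bit 8 covers debugging dumps of processes); the function names are made up:

    ```shell
    # Succeed if a kernel.sysrq value permits the keyboard task dump (Sysrq-t).
    sysrq_allows_task_dump() {
        value=$1
        [ "$value" -eq 1 ] || [ $(( value & 8 )) -ne 0 ]
    }

    # Report on the running system's current setting (Linux only).
    check_sysrq() {
        if sysrq_allows_task_dump "$(cat /proc/sys/kernel/sysrq)"; then
            echo "Sysrq-t is available from the keyboard"
        else
            echo "Sysrq-t is disabled; try: sysctl -w kernel.sysrq=1"
        fi
    }

    # Root can always request the same dump without a keyboard:
    #   echo t > /proc/sysrq-trigger
    # and the stack traces land in the kernel log (dmesg, or journalctl -k).
    ```

    Some distributions ship a default like 176, which leaves the task dump off, so it's worth running check_sysrq before the next hang rather than after.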


  • You have to have a thin client device to access the servers out on the Internet, which is…kind of what a sub-$500 low-end PC or budget smartphone would be.

    I suspect that it’s more that a lot of people are going to defer upgrades at the low end of the scale, use an older device for longer than they otherwise would have.

    Might not be great for security; smartphone OSes won’t get security updates after N years, and Windows 10 is EOL.



  • I don’t know of a pre-wrapped utility to do that, but assuming that this is a Linux system, here’s a simple bash script that’d do it.

    #!/bin/bash
    
    # Set this.  Path to a new, not-yet-existing directory that will retain a copy of a list
    # of your files.  You probably don't actually want this in /tmp, or
    # it'll be wiped on reboot.
    
    file_list_location=/tmp/storage-history
    
    # Set this.  Path to location with files that you want to monitor.
    
    path_to_monitor=path-to-monitor
    
    # If the file list location doesn't yet exist, create it.
    if [[ ! -d "$file_list_location" ]]; then
        mkdir -p "$file_list_location"
        git -C "$file_list_location" init
        # Pin the branch name, since newer git setups may default to "main".
        git -C "$file_list_location" symbolic-ref HEAD refs/heads/master
    fi
    
    # in case someone's checked out things at a different time
    # (skipped on the first run, when there is no commit to check out yet)
    if git -C "$file_list_location" rev-parse --verify --quiet HEAD >/dev/null; then
        git -C "$file_list_location" checkout master
    fi
    find "$path_to_monitor" | sort > "$file_list_location/files.txt"
    # -C already runs git inside the repo, so a relative path is enough here.
    git -C "$file_list_location" add files.txt
    git -C "$file_list_location" commit -m "Updated file list for $(date)"
    

    That’ll drop a text file at /tmp/storage-history/files.txt with a list of the files at that location, and create a git repo at /tmp/storage-history that will contain a history of that file.

    When your drive array kerplodes or something, your files.txt file will probably become empty if the mount goes away, but you’ll have a git repository containing a full history of your list of files, so you can go back to a list of the files there as they existed at any historical date.

    Run that script nightly out of your crontab or something ($ crontab -e to edit your crontab).
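
    A crontab entry for that might look like the following, assuming the script was saved as /usr/local/bin/record-file-list.sh (a made-up path) and made executable:

    ```
    # m h dom mon dow  command
    0 3 * * * /usr/local/bin/record-file-list.sh
    ```

    That runs it at 03:00 every night.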

    As the script says, you need to choose a file_list_location (not /tmp, since that’ll be wiped on reboot), and set path_to_monitor to wherever the tree of files is that you want to keep track of (like, /mnt/file_array or whatever).

    You could save a bit of space by adding a line at the end to remove the current files.txt after generating the current git commit if you want. The next run will just regenerate files.txt anyway, and you can just use git to regenerate a copy of the file for any historical day you want. If you’re not familiar with git, $ git log to find the hashref for a given day, $ git checkout <hashref> to move to where things were on that day.
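
    If you'd rather not hunt through git log by hand, here's a small sketch that prints the file list as of a given date without touching the working tree. The function name files_as_of is made up; it relies on git rev-list --before, which goes by commit date:

    ```shell
    # Print files.txt as it was in the last commit made on or before a date.
    # Usage: files_as_of <repo> <date>
    files_as_of() {
        repo=$1 when=$2
        # --all looks across every branch, so the branch name doesn't matter.
        commit=$(git -C "$repo" rev-list -1 --before="$when" --all) || return 1
        [ -n "$commit" ] || { echo "no commit on or before $when" >&2; return 1; }
        git -C "$repo" show "$commit:files.txt"
    }
    ```

    For example, files_as_of /tmp/storage-history 2024-06-01 | grep -i holiday would check whether a file was present back then.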

    EDIT: Moved the git checkout up.



  • Is this worth the effort?

    In terms of electricity cost?

    I wouldn’t do it myself.

    If you want to know whether it’s going to save money, you want to see how much power it uses — you can use a wattmeter, or look up the maximum amount on the device ratings to get an upper end. Look up how much you’re paying per kWh in electricity. Price the hardware. Put a price on your labor. Then you can get an estimate.
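
    The arithmetic there is simple enough to sketch. The function names are made up, and the numbers in the usage note are placeholders rather than anything measured:

    ```shell
    # Yearly running cost: watts / 1000 * hours per year * price per kWh.
    yearly_power_cost() {
        watts=$1 price_per_kwh=$2
        awk -v w="$watts" -v p="$price_per_kwh" \
            'BEGIN { printf "%.2f\n", w / 1000 * 24 * 365 * p }'
    }

    # Years until an upfront cost (hardware plus your labor) is paid back
    # by the power saved when going from old_watts to new_watts.
    payback_years() {
        upfront=$1 old_watts=$2 new_watts=$3 price_per_kwh=$4
        awk -v u="$upfront" -v o="$old_watts" -v n="$new_watts" -v p="$price_per_kwh" \
            'BEGIN {
                saved = (o - n) / 1000 * 24 * 365 * p
                if (saved <= 0) { print "never"; exit }
                printf "%.1f\n", u / saved
            }'
    }
    ```

    A device drawing a constant 50 W at 0.15 per kWh comes out to yearly_power_cost 50 0.15, about 65.70 a year, so spending 200 to shave 50 W takes roughly three years to break even.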

    My guess, without having any of those numbers, is that it probably isn’t.



  • You would typically want to use static ip addresses for servers (because if you use DHCP the IP is gonna change sooner or later, and it’s gonna be a pain in the butt).

    In this case, he controls the local DHCP server, which is gonna be running on the OpenWRT box, so he can set it to always assign whatever he wants to a given MAC.
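
    On OpenWRT, a static lease like that is a host section in /etc/config/dhcp. A sketch, where the name, MAC address, and IP are placeholders:

    ```
    config host
            option name 'server'
            option mac '00:11:22:33:44:55'
            option ip '192.168.1.10'
    ```

    After editing, restart dnsmasq (/etc/init.d/dnsmasq restart) and have the server renew its lease.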


  • tal@lemmy.today to Selfhosted@lemmy.world: [Solved] OpenWrt & fail2ban

    except that all requests’ IP addresses are set to the router’s IP address (192.168.3.1), so I am unable to use proper rate limiting and especially fail2ban.

    I’d guess that however the network is configured, you have the router NATting traffic going from the LAN to the Internet (typical for a home broadband router) as well as from the home LAN to the server.

    That does provide security benefits in that you’ve basically “put the server on the Internet side of things”, and the server can’t just reach into the LAN, same as anything else on the Internet. The NAT table has to have someone on the LAN side opening a connection to establish a new entry.

    But…then all of those hosts on the LAN are going to have the same IP address from the server’s standpoint. That’s the experience that hosts on the Internet have towards the same hosts on your LAN.

    It sounds like you also want to use DHCP:

    Getting the router to actually assign an IP address to the server was quite a headache

    I’ve never used VLANs on Linux (or OpenWRT, and don’t know how it interacts with the router’s hardware).

    I guess what you want to do is to not NAT traffic going between the LAN (where most of your hardware lives) and the DMZ (where the server lives), but still to disallow the DMZ from communicating with the LAN.

    considers

    So, I don’t know whether the VLAN stuff is necessary on your hardware to prevent the router hardware from acting like a switch, moving Ethernet packets directly, without them going to Linux. Might be the case.

    I suppose what you might do — from a network standpoint, don’t know off-the-cuff how to do it on OpenWRT, though if you’re just using it as a generic Linux machine, without using any OpenWRT-specific stuff, I’m pretty sure that it’s possible — is to give the OpenWRT machine two non-routable IP addresses, something like:

    192.168.1.1 for the LAN

    and

    192.168.2.1 for the DMZ

    The DHCP server listens on 192.168.1.1 and serves DHCP responses for the LAN that tell hosts there to use 192.168.1.1 as the default route. Ditto for hosts in the DMZ, with 192.168.2.1. It hands out addresses from the appropriate pool. So, for example, the server in the DMZ would maybe be assigned 192.168.2.2.

    Then it should be possible to have routing table entries that send traffic for 192.168.2.0/24 out via 192.168.2.1 and, vice versa, traffic for 192.168.1.0/24 out via 192.168.1.1. Linux is capable of doing that, as that’s standard IP routing stuff.

    When a LAN host initiates a TCP connection to a DMZ host, it’ll look up its IP address in its routing table, say “hey, that isn’t on the same network as me, send it to the default route”. That’ll go to 192.168.1.1, with a destination address of 192.168.2.2. The OpenWRT box forwards it, doing IP routing, to 192.168.2.1, and then that box says “ah, that’s on my network, send it out the network port with VLAN tag whatever” and the switch fabric is configured to segregate the ports based on VLAN tag, and only sends the packet out the port associated with the DMZ.

    The problem is that the reason that home users typically derive indirect security benefits from using NAT is that it intrinsically disallows incoming connections from the server to the LAN. This will make that go away: the LAN hosts and DMZ hosts will be on separate “networks”, so things like ARP requests and other stuff at the purely-Ethernet level won’t reach each other, but they can freely communicate with each other at the IP level, because the two 192.168.X.1 virtual addresses will route packets between the two networks. You’re going to need to firewall off incoming TCP connections (and maybe UDP and ICMP and whatever else you want to block) inbound on the 192.168.1.0/24 network from the 192.168.2.0/24 network. You can probably do that with iptables at the Linux level. OpenWRT may have some sort of existing firewall package that applies a set of iptables rules. I think that all the traffic should be reaching the Linux kernel in this scenario.
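
    To make that concrete, here's a sketch for a generic Linux router using plain iptables, not OpenWRT's own firewall tooling. The function name is made up, and it only prints the commands so that you can read them over before running anything as root:

    ```shell
    # Print (not run) rules that let the LAN open connections to the DMZ
    # while refusing new connections from the DMZ into the LAN.
    dmz_rules() {
        lan=$1 dmz=$2
        # Let the kernel route between the two subnets at all.
        echo "sysctl -w net.ipv4.ip_forward=1"
        # LAN hosts may reach the DMZ freely.
        echo "iptables -A FORWARD -s $lan -d $dmz -j ACCEPT"
        # Replies to connections the LAN opened are allowed back in.
        echo "iptables -A FORWARD -s $dmz -d $lan -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
        # Anything else from the DMZ toward the LAN is dropped.
        echo "iptables -A FORWARD -s $dmz -d $lan -j DROP"
    }
    ```

    dmz_rules 192.168.1.0/24 192.168.2.0/24 prints the four commands. There's deliberately no MASQUERADE rule between the two subnets, which is what lets the server see the original LAN addresses.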

    If you get that set up, hosts at 192.168.2.2, on the DMZ, should be able to see connections from 192.168.1.2, on the LAN, using its original IP address.

    That should work if what you had was a Linux box with three Ethernet cards (one for each of the Internet, LAN, and DMZ) and the VLAN switch hardware stuff wasn’t in the picture; you’d just not do any VLAN stuff then. I’m not 100% certain whether any VLAN switching fabric stuff might muck that up; I’ve only very rarely touched VLANs myself, and never tried to do this, use VLANs to hack switch fabric attached directly to a router to act like independent NICs. But I can believe that it’d work.

    If you do set it up, I’d also fire up sudo tcpdump on the server. If things are working correctly, sudo ping -b 192.168.1.255 on a host on the LAN shouldn’t show up as reaching the server. However, ping 192.168.2.2 should.

    You’re going to want traffic that doesn’t match a NAT table entry and is coming in from the Internet to be forwarded to the DMZ vlan.

    That’s a high-level view of what I believe needs to happen. But I can’t give you a hand-holding walkthrough to configure it via off-the-cuff knowledge, because I haven’t needed to do a fair bit of this myself. Sorry on that.

    EDIT: This isn’t the question you asked, but I’d also add that what I’d probably do myself if I were planning to set something like this up is get a small, low power Linux machine with multiple NICs (well, okay, probably one NIC, multiple ports). That cuts the switch-level stuff that I think that you’d likely otherwise need to contend with out of the picture, and then I don’t think that you’d need to deal with VLANs, which is a headache that I wouldn’t want, especially if getting it wrong might have security implications. If you need more ports for the LAN, then just throw a regular old separate hardware Ethernet switch on the LAN port. You know that the switch can’t be moving traffic between the LAN and DMZ networks itself then, because it can’t touch the DMZ. But I don’t know whether that’d make financial sense in your case, if you’ve already got the router hardware.



  • Actually, thinking about this…a more-promising approach might be deterrent via poisoning the information source. Not bulletproof, but that might have some potential.

    So, the idea here is that what you’d do there is to create a webpage that looks, to a human, as if only the desired information shows up.

    But you include false information as well. Not just an insignificant difference, as with a canary trap, or a real error intended to have minimal impact, only to identify an information source, as with a trap street. But outright wrong information, stuff where reliance on the stuff would potentially be really damaging to people relying on the information.

    You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like lynx might see the poison, but you could probably make that work for most users. That has some nice characteristics:

    • You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not. It’s still gonna dick up responses being issued by the LLM.

    • Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.

    • The Internet Archive can still archive the pages.

    Someone could make a bot that post-processes your page to strip out the poison, but you could sporadically change up your approach, change it over time, and the question for an AI company is whether it’s easier and safer to just license your content and avoid the risk of poison, or to risk poisoned content slipping into their model whenever a media company adopts a new approach.

    I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.

    shrugs

    I imagine that there are probably already companies working on the problem, on both sides.


  • I’m very far from sure that this is an effective way to block AI crawlers from pulling stories for training, if that’s their actual concern. Like…the rate of new stories just isn’t that high. This isn’t, say, Reddit, where someone trying to crawl the thing at least has to generate some abnormal traffic. Yeah, okay, maybe a human wouldn’t read all stories, but I bet that many read a high proportion of what the media source puts out, so a bot crawling all articles isn’t far off looking like a human. All a bot operator need do is create a handful of paid accounts and then just pull partial content with each, and I think that a bot would just fade into the noise. And my guess is that it is very likely that AI training companies will do that or something similar if knowledge of current news events is of interest to people.

    You could use a canary trap, and that might be more-effective:

    https://en.wikipedia.org/wiki/Canary_trap

    A canary trap is a method for exposing an information leak by giving different versions of a sensitive document to each of several suspects and seeing which version gets leaked. It could be one false statement, to see whether sensitive information gets out to other people as well. Special attention is paid to the quality of the prose of the unique language, in the hopes that the suspect will repeat it verbatim in the leak, thereby identifying the version of the document.

    The term was coined by Tom Clancy in his novel Patriot Games,[1][non-primary source needed] although Clancy did not invent the technique. The actual method (usually referred to as a barium meal test in espionage circles) has been used by intelligence agencies for many years. The fictional character Jack Ryan describes the technique he devised for identifying the sources of leaked classified documents:

    Each summary paragraph has six different versions, and the mixture of those paragraphs is unique to each numbered copy of the paper. There are over a thousand possible permutations, but only ninety-six numbered copies of the actual document. The reason the summary paragraphs are so lurid is to entice a reporter to quote them verbatim in the public media. If he quotes something from two or three of those paragraphs, we know which copy he saw and, therefore, who leaked it.

    There, you generate slightly different versions of articles for different people. Say that you have 100 million subscribers. ln(100000000)/ln(2)=26.57... So you’re talking about 27 bits of information that need to go into the article to uniquely describe each. The AI is going to be lossy, I imagine, but you can potentially manage to produce 27 unique bits of information per article that can reasonably-reliably be remembered by an AI after training. That’s 27 different memorable items that need to show up in either Form A or Form B. Then you search to see what a new LLM knows about and ban the bot identified.
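
    A quick sketch of that arithmetic, plus one way to turn a subscriber number into a pattern of Form A/Form B choices. Both function names are made up, and real marker placement would obviously be more involved than a string of letters:

    ```shell
    # Smallest number of A/B choices that can distinguish n subscribers,
    # i.e. ceil(log2(n)).
    bits_needed() {
        n=$1 bits=0 capacity=1
        while [ "$capacity" -lt "$n" ]; do
            capacity=$(( capacity * 2 ))
            bits=$(( bits + 1 ))
        done
        echo "$bits"
    }

    # Map a subscriber ID to its pattern of variants, most significant bit first.
    variant_pattern() {
        id=$1 width=$2 pattern=
        i=0
        while [ "$i" -lt "$width" ]; do
            if [ $(( id % 2 )) -eq 0 ]; then
                pattern="A$pattern"
            else
                pattern="B$pattern"
            fi
            id=$(( id / 2 ))
            i=$(( i + 1 ))
        done
        echo "$pattern"
    }
    ```

    bits_needed 100000000 gives 27, matching the figure above, and variant_pattern 5 4 gives ABAB: subscriber 5 sees Form A of the first and third items and Form B of the second and fourth.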

    Cartographers have done that: introduced minor, intentional errors into their maps, then checked whether other maps reproduce those errors to see whether they were derived from theirs.

    https://en.wikipedia.org/wiki/Trap_street

    In cartography, a trap street is a fictitious entry in the form of a misrepresented street on a map, often outside the area the map nominally covers, for the purpose of “trapping” potential plagiarists of the map who, if caught, would be unable to explain the inclusion of the “trap street” on their map as innocent. On maps that are not of streets, other “trap” features (such as nonexistent towns, or mountains with the wrong elevations) may be inserted or altered for the same purpose.[1]

    https://en.wikipedia.org/wiki/Phantom_island

    A phantom island is a purported island which has appeared on maps but was later found not to exist. They usually originate from the reports of early sailors exploring new regions, and are commonly the result of navigational errors, mistaken observations, unverified misinformation, or deliberate fabrication. Some have remained on maps for centuries before being “un-discovered”.

    In some cases, cartographers intentionally include invented geographic features in their maps, either for fraudulent purposes or to catch plagiarists.[5][6]

    That has weaknesses. It’s possible to defeat that by requesting multiple versions using different bot accounts and identifying divergences and maybe merging them. In the counterintelligence situation, where canary traps have been used, normally people only have access to one source, and it’d be hard for an opposing intelligence agency to get access to multiple sources, but it’s not hard here.
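
    To illustrate how little effort that takes, here's a sketch that lists the lines differing between two saved copies of the same article. The function name is made up, and it assumes the copies were saved as plain text with one sentence or paragraph per line:

    ```shell
    # Print lines that appear in one fetched copy of an article but not the other.
    # Lines shared by both copies are assumed to carry no per-account marker.
    divergent_lines() {
        grep -Fxv -f "$2" "$1"
        grep -Fxv -f "$1" "$2"
        return 0
    }
    ```

    Anything it prints is a candidate marker, which is why a canary trap only holds up when the other side can't get hold of a second copy.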

    And even if you ban an account, it’s trivial to just create a new one, decoupled from the old one. Thus, there isn’t much that a media company can realistically do about it, as long as the generated material doesn’t rise to the level of a derivative work and thus copyright infringement (and this is in the legal sense of derivative; simply training something on something else isn’t sufficient to make it a derivative work from a copyright law standpoint, any more than you reading a news report and then talking to someone else about it is).

    Getting back to the citation issue…

    Some news companies do keep archives (and often selling access to archives is a premium service), so for some, that might cover some of the “inability to cite” problem that not having Internet Archive archives produces, as long as the company doesn’t go under. It doesn’t help with the problem that many news companies have a tendency to silently modify articles without reliably listing errata, where having an Internet Archive copy can be helpful. There are also some issues that I haven’t yet seen become widespread but am worried about, like where a news source might provide different articles to people in different regions; there, having a trusted source like the Internet Archive can avoid that, and that could become a problem.