Incoherent rant.

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I’ve decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”

From how I understand it, it works like a reverse proxy that sits in front of each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one has gotten through yet.

My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.

I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis Github, Anubis Website

Edit: Further elaboration for those who care, since I realized that might be important.

  • You don’t have to use caddy/nginx/whatever as your reverse proxy in the first place, it’s just how my setup works.
  • My Anubis instance lives inside the Caddy reverse proxy docker compose stack, between Caddy and my local server. So when a request is made, Caddy proxies it to Anubis from its Caddyfile, and Anubis decides whether to forward the request to the service or stop it in its tracks.
  • There are some minor issues, like it requiring javascript enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don’t do js at all, I believe this is a great tradeoff.
  • The most confusing part was the docs and understanding what it’s supposed to do in the first place.
  • There’s an option to apply your own rules via json/yaml, but I haven’t figured out how to do that properly in docker yet. As in, there’s a main configuration file you can override, but there’s apparently also a way to add additional bots to block in separate files in a subdirectory. I’m sure I’ll figure that out eventually.

Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn’t due to crawlbots, but something else entirely.
I’ve just spent maybe 14 hours troubleshooting this thing, since after a couple of minutes of running, lemmy-ui container healthcheck would show “unhealthy” and my instance couldn’t be accessed from anywhere (lemmy-ui, photon, jerboa, probably the api as well).
After some digging, I disabled Anubis to check whether it had anything to do with it. It didn’t. But I also noticed my host ulimit -n was set to like 1000… (I’ve been on the same install for years and swear an update must have changed it.)
After raising ulimit -n (nofile) and setting shm_size to 2G in docker compose, it hasn’t crashed yet. Fingers crossed.
Boss, I’m tired and I want to get off Mr. Bones’ wild ride.
I’m very sorry for not being able to reply to you all, but it’s been hectic.
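For anyone wanting to check the same thing, here is a minimal Python sketch that reads the open-file limit a process actually sees (the same number `ulimit -n` reports in a shell). The function names and the 4096 threshold are my own assumptions for illustration, not values from the Lemmy or Docker docs, and the `resource` module only exists on Unix-like systems.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def looks_too_low(soft, threshold=4096):
    """Flag a soft limit below an arbitrary, assumed threshold."""
    if soft == resource.RLIM_INFINITY:
        return False  # "unlimited" can never be too low
    return soft < threshold


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard} too_low={looks_too_low(soft)}")
```

Running it inside the container rather than on the host shows the limit the service itself gets, which may differ from the host's.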

Cheers and I really hope someone finds this as useful as I did.

  • blob42@lemmy.ml
    18 hours ago

    I am planning to try it out, but for caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.

    It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

    Now here’s the fun part: the defender plugin can respond with garbage, so when a request matches an AI crawler, the response poisons their training dataset.

    Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

    git.blob42.xyz {
        # Flag requests by Accept-Language or a bot-like User-Agent
        @bot <<CEL
            header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
        CEL

        # Close the connection for flagged requests without sending a response
        abort @bot

        # caddy-defender: answer requests from these ranges with garbage
        defender garbage {
            ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
        }

        # caddy-ratelimit: per-IP limit on GET requests
        rate_limit {
            zone dynamic_botstop {
                match {
                    method GET
                    # to use with defender
                    #header X-RateLimit-Apply true
                    #not header LetMeThrough 1
                }
                key {remote_ip}
                events 1500
                window 30s
                #events 10
                #window 1m
            }
        }

        reverse_proxy upstream.server:4242

        handle_errors 429 {
            respond "429: Rate limit exceeded."
        }
    }
    

    If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud
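To get a feel for what the `@bot` filter above would catch, here is a rough Python sketch that reuses the same User-Agent pattern. Caddy evaluates the pattern with Go's regexp engine, so treat this as an approximation. The helper name `is_flagged` and the sample User-Agent strings are made up for illustration.

```python
import re

# Pattern copied from the CEL filter; Python's `re` is only a stand-in for
# Go's regexp engine here, which is an assumption that holds for this pattern.
BOT_UA = re.compile(
    r"(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))"
)


def is_flagged(user_agent: str, accept_language: str = "") -> bool:
    """Approximate the matcher: exact 'zh-CN' Accept-Language OR a User-Agent hit."""
    return accept_language == "zh-CN" or BOT_UA.search(user_agent) is not None


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from real traffic logs.
    samples = [
        "GPTBot/1.0",
        "meta-externalagent/1.1",
        "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
        "Mozilla/5.0 (Linux; Android 10; CUBOT NOTE 7) AppleWebKit/537.36",
    ]
    for ua in samples:
        print(is_flagged(ua), ua)
```

The last sample shows the tradeoff: any User-Agent that merely contains "bot" (here a phone model name) gets flagged too, so a pattern this broad can also hit ordinary visitors.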

    • azertyfun@sh.itjust.works
      4 hours ago

      If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

      That’s an ARIN block according to Wikipedia, so North America, and it was under Northern Telecom until 2010. It does look like Alibaba operate many networks under that /8, but I very much doubt they hold the whole /8, which would be worth a lot: a /16 is apparently worth around $3-4M, and a /8 contains 256 of them, so it can be extrapolated to be worth somewhere around $0.8-1 billion! I doubt they put all their eggs into that particular basket. So you’re probably matching a lot of innocent North American IPs with this.
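The back-of-the-envelope numbers can be checked with Python's standard `ipaddress` module. This is only a sketch: the per-/16 price range is the rough figure quoted in the comment, not market data, and the function names are hypothetical.

```python
import ipaddress


def count_subnets(block: str, new_prefix: int) -> int:
    """Number of /new_prefix networks that fit inside `block`."""
    net = ipaddress.ip_network(block)
    return 2 ** (new_prefix - net.prefixlen)


def extrapolate_value(block: str, unit_prefix: int, unit_low: int, unit_high: int):
    """Scale a per-subnet price range up to the whole block (naive linear estimate)."""
    n = count_subnets(block, unit_prefix)
    return n * unit_low, n * unit_high


if __name__ == "__main__":
    block = "47.0.0.0/8"
    print(ipaddress.ip_network(block).num_addresses)  # 16777216 addresses
    print(count_subnets(block, 16))                   # 256 /16 networks
    low, high = extrapolate_value(block, 16, 3_000_000, 4_000_000)
    print(f"${low:,} to ${high:,}")                   # $768,000,000 to $1,024,000,000
```

So the blanket `47.0.0.0/8` rule covers about 16.8 million addresses, whoever actually holds them.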

      • blob42@lemmy.ml
        3 hours ago

        Right, I must have just blanket banned the whole /8 to be sure Alibaba Cloud is included. I did that some time ago, so I forgot.