I saw this post and I was curious what was out there.

https://neuromatch.social/@jonny/113444325077647843

I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?
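
For anyone mirroring datasets this way, it may help to record checksums before seeding so other mirrors can verify their copies. Below is a minimal sketch, assuming the data is already downloaded to a local directory. The function names (`sha256_of`, `build_manifest`, `write_manifest`) are made up for illustration and are not part of academictorrents or any torrent tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root, POSIX style) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(root: Path, out_file: Path) -> None:
    """Write 'digest  path' lines, the layout that `sha256sum -c` reads."""
    manifest = build_manifest(root)
    lines = [f"{digest}  {name}" for name, digest in manifest.items()]
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Keeping the manifest file outside the dataset directory avoids it being picked up as data on a later run.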

  • Otter@lemmy.ca (OP)
    14 hours ago

    One option that I’ve heard of in the past:

    https://archivebox.io/

    ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

    • tomtomtom@lemmy.world
      3 hours ago

      I am using ArchiveBox; it is pretty straightforward to self-host and use.

      However, it is very difficult to archive most news sites with it, and many other sites as well. Cookie banners and similar pop-ups will often render the archived page unusable, and frequently archiving won’t work at all because some bot protection (Cloudflare etc.) kicks in when ArchiveBox tries to access a site.

      If anyone else has more success using it, please let me know if I am doing something wrong…