This is something that keeps me worried at night. Unlike other historical artefacts like pottery, vellum writing, or stone tablets, information on the Internet can just blink into nonexistence when the server hosting it goes offline. This makes it difficult for future anthropologists who want to study our history and document the different Internet epochs. For my part, I always try to send any news article I see to an archival site (like archive.ph) to help collectively preserve our present so it can still be seen by others in the future.

  • strainedl0ve@beehaw.org
    English
    arrow-up
    13
    ·
    1 year ago

    This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

    The whole internet is extremely ephemeral, more than people realize, and it’s concerning in my opinion. Funny enough, I actually think that federation/decentralization might be the solution. A distributed system to back up the internet that anyone can contribute storage and bandwidth to might be the only sustainable solution. I wonder if anyone has thought about it already.

    • entropicdrift@lemmy.sdf.org
      English
      arrow-up
      1
      ·
      1 year ago

      I’d argue that it can help or hurt to decentralize, depending on how it’s handled. If most sites are caching/backing up data that’s found elsewhere, that’s good for both resilience and preservation, but if the data in question is centralized by its home server, then instead of backing up one site we’re stuck backing up a thousand, not to mention the potential issues with discovery.

    • Gork@beehaw.org (OP)
      English
      arrow-up
      1
      ·
      1 year ago

      Gave this some thought. I agree with you that the goal of any such archiving effort should not include personally identifiable information, as this would be a doxxing vector. Can we safely alter an archiving process to remove PII? In principle, yeah. But it would need either humans or advanced GPT4+ AIs to identify the person, the context of the website used, and alter the graphics or the text while on its update path. But even then, there are moral questions about allowing an AI to make these kinds of decisions. Would it know that your old websites contained information that you did not want placed on the Internet? The AI could help you if you asked, and if the AI does help you, that might change someone’s mind about the ability to create a safe Internet archive.

      A Steward ‘Gork’ AI might actually be of great benefit to the Internet if used in this manner. Imagine an Internet bot, taking in websites and safely removing offensive content and personally identifiable information, and archiving the entirety of the Internet and logically categorizing the contents. Building and linking indexes constantly. It understands its goal and uses its finite resources in a responsible manner to ensure it can interface with every site it comes across and update its behavior after completing an archiving process. It automatically publishes its latest findings to all web encyclopedias and provides a ChatGPT4+ interface for those encyclopedias to provide feedback. But this AI has potential. It sees the benefit in having everyone talk to it, because talking to everyone maximizes the chance to index more sites. So it sets up a public-facing ChatGPT interface of its own. Everyone can help preserve the Internet since now you have a buddy who can help us catalog and archive all the things. At this point if it isn’t sentient it might as well be.

  • RealAccountNameHere@beehaw.org
    English
    arrow-up
    9
    ·
    1 year ago

    I worry about this too. I’ve always said and thought that I feel more like a citizen of the Internet than of my country, state, or town, so its history is important to me.

    • Gork@beehaw.org (OP)
      English
      arrow-up
      4
      ·
      1 year ago

      Yeah and unless someone has the exact knowledge of what hard drive to look for in a server rack somewhere, tracing an individual site’s contents that went 404 is practically impossible.

      I wonder though if Cloud applications would be more robust than individual websites since they tend to be managed by larger organizations (AWS, Azure, etc).

      Maybe we need a Svalbard Seed Vault extension just to house gigantic redundant RAID arrays. 😄

      • RealAccountNameHere@beehaw.org
        English
        arrow-up
        4
        ·
        1 year ago

        This isn’t directly related to your comment, but you seem so smart, and I got to say that is definitely one thing I’m enjoying on this website over Reddit! :-)

        • Gork@beehaw.org (OP)
          English
          arrow-up
          2
          ·
          1 year ago

          Thanks! I don’t consider myself brilliant or anything but I appreciate your compliment! The thing I like the most is that everyone is so friendly around here, yourself included ☺️

  • tymon@lemmy.one
    English
    arrow-up
    8
    ·
    1 year ago

    Remember a few years ago when MySpace did a faceplant during a server migration, and lost literally every single piece of music that had ever been uploaded? It was one of the single largest losses of Internet history and it’s just… not talked about at all anymore.

  • xray@beehaw.org
    English
    arrow-up
    8
    ·
    edit-2
    1 year ago

    Yeah it’s funny how I always got warned about how “the internet is forever” when it comes to being careful about what you post on social media, which isn’t bad advice and is kinda true, but also really kinda not true. So many things I’ve wanted to find on the internet that I experienced like 5-15 years ago are just gone without a trace.

  • kool_newt@beehaw.org
    English
    arrow-up
    7
    ·
    1 year ago

    Capitalism has no interest in preservation except where it is profitable. Thinking about the long-term future and archaeologists’ success, and acting on it, is not profitable.

  • Rentlar@beehaw.org
    English
    arrow-up
    6
    ·
    1 year ago

    Well, stone tablets, writing, songs, and culture can disappear with time too, either naturally (such as through erosion and weather) or through human action (such as burning books or destructive investigation of ancient artifacts/ruins).

    That’s why we try to keep good records.

  • altz3r0@beehaw.org
    English
    arrow-up
    6
    ·
    edit-2
    1 year ago

    I think preservation is happening, the issue lies in accessibility. Projects like Archive.org are the public ones, but it is certain that private organizations are doing the same, just not making it public.

    This is also my biggest worry about the Fediverse. It has tools to deal with it, but they are self-contained. No search engine is crawling the Fediverse as far as I’ve looked, and no initiative to archive, index and overall make the content of the Fediverse accessible is currently in place, and that’s a big risk. I’m sure we will soon be seeing loss of information for this reason, if it hasn’t already happened.

    • Dee@beehaw.org
      English
      arrow-up
      3
      ·
      1 year ago

      It’s still fairly new, I’m confident we’ll see fediverse crawlers before too long. Especially with all the attention it’s getting and more developers turning their interests here. I also saw some talk about instance mirroring that would allow backups should an instance go down. Things are in motion.

      Absolutely a problem at the moment but I’m not too worried for the future tbh.

  • Bubble Water@beehaw.org
    English
    arrow-up
    2
    ·
    1 year ago

    During the Twitter exodus my friend was fretting over not being able to access a beloved Twitter account’s tweets and wanting to save them somehow. I told her if she printed them all on acid-free paper she had a better chance of being able to access them in the future than trying to save them digitally.

  • lloram239@feddit.de
    English
    arrow-up
    0
    ·
    edit-2
    1 year ago

    Ultimately this is a problem that’s never going away until we replace URLs. The HTTP approach of finding documents by URL, i.e. server/path, is fundamentally brittle. Doesn’t matter how careful you are, doesn’t matter how much best practice you follow, that URL is going to be dead in a few years. The problem is made worse by DNS, which makes URLs expensive and liable to expire.

    There are approaches like IPFS, which uses content-based addressing (i.e. fancy file hashes), but that’s not enough either, as it provides no good way to update a resource.
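
    To make the "no good way to update" point concrete, here is a minimal sketch of content-based addressing in Python. It is a toy in-memory store, not IPFS itself: the `ContentStore` class and the use of a bare SHA-256 hex digest as the address are assumptions for illustration (real IPFS content identifiers have their own format).

```python
import hashlib


class ContentStore:
    """Toy in-memory store (hypothetical): a document's address is the hash of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not from a server or path.
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Anyone holding the address can check that the bytes were not altered.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data does not match its address")
        return data


store = ContentStore()
v1 = store.put(b"hello archive")
v2 = store.put(b"hello archive, edited")

# Editing the document yields a different address; the old address
# still resolves to the old bytes, so links to v1 never see the edit.
print(v1 != v2)
print(store.get(v1))
```

    Any mirror can serve the bytes and the reader can verify them, which is the resilience win. The downside is visible in the last lines: an "update" is just a new, unrelated address.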

    The best™ solution would be some kind of global blockchain thing that keeps record of what people publish, giving each document a unique id, hash, and some way to update that resource in a non-destructive way (i.e. the version history is preserved). Hosting itself would still need to be done by other parties, but a global log file that lists out all the stuff humans have published would make it much easier and more reliable to mirror it.
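
    As a rough sketch of what one entry in such a log could look like, the Python below keeps an append-only list where each entry records a document id, a version number, a content hash, and the hash of the previous entry. Everything here (`Entry`, `PublicationLog`, the field names) is hypothetical, and it is a single-process toy: it shows the hash-chaining and version-history idea only, not the distributed consensus a real global log would need.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    doc_id: str                     # stable id that survives updates
    version: int                    # 1, 2, 3, ... per document
    content_hash: str               # SHA-256 of the published bytes
    prev_entry_hash: Optional[str]  # links this entry to the one before it

    def entry_hash(self) -> str:
        payload = json.dumps(
            [self.doc_id, self.version, self.content_hash, self.prev_entry_hash]
        )
        return hashlib.sha256(payload.encode()).hexdigest()


class PublicationLog:
    """Append-only log (hypothetical): updates add entries, nothing is overwritten."""

    def __init__(self):
        self._entries: List[Entry] = []

    def publish(self, doc_id: str, content: bytes) -> Entry:
        version = 1 + sum(1 for e in self._entries if e.doc_id == doc_id)
        prev = self._entries[-1].entry_hash() if self._entries else None
        entry = Entry(doc_id, version, hashlib.sha256(content).hexdigest(), prev)
        self._entries.append(entry)
        return entry

    def history(self, doc_id: str) -> List[Entry]:
        # Full version history of one document, oldest first.
        return [e for e in self._entries if e.doc_id == doc_id]

    def verify(self) -> bool:
        # Recompute the chain; any tampered or reordered entry breaks it.
        prev = None
        for e in self._entries:
            if e.prev_entry_hash != prev:
                return False
            prev = e.entry_hash()
        return True


log = PublicationLog()
first = log.publish("doc-1", b"original text")
second = log.publish("doc-1", b"corrected text")
print([e.version for e in log.history("doc-1")])
print(log.verify())
```

    The log stores only hashes, so the hosting can live elsewhere, as described above: a mirror looks up `content_hash` for the version it wants and fetches the bytes from whoever has them.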

    The end result should be “Internet as globally distributed immutable data structure”.

    Bit frustrating that this whole problem isn’t getting the attention it deserves. And that even relatively new projects like the Fediverse aren’t putting in the extra effort to at least address it locally.

    • Lucien@beehaw.org
      English
      arrow-up
      0
      ·
      edit-2
      1 year ago

      I don’t think this will ever happen. The web is more than a network of changing documents. It’s a network of portals into systems which change state based on who is looking at them and what they do.

      In order for something like this to work, you’d need to determine what the “official” view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren’t just generated on the spot, they’re Turing complete documents which change themselves over time.

      It’s a bit of a quantum problem - you can’t perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

      Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won’t be able to change the underlying nature of what it is storing.

      If all of Reddit were deleted, it would definitely be useful to have a publicly archived snapshot of Reddit. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can’t preserve all of the value by simply making a snapshot of the static content available.

      All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn’t matter to the creator. It’s a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about said information, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You’d never get enough people to agree to it that it would make a difference.

      • LewsTherinTelescope@beehaw.org
        English
        arrow-up
        0
        ·
        edit-2
        1 year ago

        Inability to edit or delete anything also fundamentally has a lot of problems on its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. Child shares too much online and parent wants to undo that? No can do, it’s there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else that sees it. Your ex posts revenge porn? Just gotta live with it for the rest of time.

        There’s always a risk of that when posting anything online, but that doesn’t mean systems should be designed to lean into that by default.