Hello fellow selfhosters, I was wondering how important it is to have ECC memory. I want a server that is really reliable, and ECC memory keeps popping up as one of the must-haves for reliability. But from my research it seems quite expensive to get a setup with ECC memory. How important is ECC memory for a server that I rely on?

So far I have been rocking a Raspberry Pi 4, which has ECC memory.

  • Hopfgeist@feddit.de
    English · 5 upvotes · 1 year ago

    For large storage, ECC helps a lot in avoiding storage corruption. In combination with a redundant architecture in ZFS it is almost bullet-proof. (Make no mistake, redundant storage is no substitute for backups! You still need those.)

    One option is to use comparatively old server hardware. I have some pretty old stuff (around 10 years) that uses DDR3 RAM, which is dirt cheap, even with ECC (somewhere around 1 €/GB). And it will be fast enough by far for most applications. The downside is higher power consumption for the same performance. The Dell T320 I have with eight 3.5" SAS disks and 32 GB RAM uses some 140 W of power, to give you a ballpark figure.

    • nevalem@programming.dev
      English · 3 upvotes · 1 year ago

      From the link:

      @PriorProjectEnglish7

      The answers in this thread are surprisingly complex, and though they contain true technical facts, their conclusions are generally wrong in terms of what it takes to maintain file integrity. The simple answer is that ECC RAM in a networked file server can only protect against memory corruption in the filesystem. Memory corruption can also occur in application code, and that’s enough to corrupt a file even if the file server faithfully records the broken bytestream produced by the app.

      If you run a Postgres container, and the non-ECC DB process bitflips a key or value, the ECC networked filesystem will faithfully record that corrupted key or value. If the DB bitflips a critical metadata structure in the DB file format, the DB file will get corrupted even though the ECC networked filesystem recorded those corrupt bits faithfully and even though the filesystem metadata is intact.
      If you run a video transcoding container and it experiences bitflips, that can result in visual glitches or in the video metadata being invalid… again even if the networked filesystem records those corrupt bits faithfully and the filesystem metadata is fully intact.

      ECC in the file server prevents complete filesystem loss due to corruption of key FS metadata structures (or at least loss caused by memory bit-flips… modern checksumming filesystems like ZFS already protect against bit-flips in the storage pretty well). And it protects from individual file loss due to bitflips in the file server. It does NOT protect from the app container corrupting the stream of bytes written to an individual file, which is opaque to the filesystem but which is nonetheless structured data that can be corrupted by the app. If you want ECC-levels of integrity you need to run ECC at all points in the pipeline that are writing data.
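
A minimal Python sketch of the point above, under loud assumptions: `store` and `verify` are hypothetical stand-ins for a checksumming file server (not ZFS's actual mechanism), and `flip_bit` simulates a bit flip in the app's non-ECC memory before the write. The storage-side checksum still verifies, because it was computed over the already-corrupted bytes.

```python
import hashlib
from typing import Tuple


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated memory bit flip)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def store(payload: bytes) -> Tuple[bytes, str]:
    """Hypothetical file server: keeps the bytes plus a checksum of what it received."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify(stored: bytes, checksum: str) -> bool:
    """Storage-side integrity check: do the stored bytes still match their checksum?"""
    return hashlib.sha256(stored).hexdigest() == checksum


intended = b'{"balance": 1000}'

# The flip happens in the application, before the data reaches the file server.
# Byte 12 is the character '1'; inverting its lowest bit turns it into '0'.
corrupted_in_app = flip_bit(intended, 8 * 12)

stored, checksum = store(corrupted_in_app)

storage_ok = verify(stored, checksum)   # storage faithfully kept what it was given
matches_intent = stored == intended     # but that is not what the app meant to write

print("storage checksum verifies:", storage_ok)
print("stored data matches intent:", matches_intent)
```

The checksum only proves that the bytes did not change after the server received them. It says nothing about whether they were already wrong on arrival.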

      That said, I’ve never run an ECC box in my homelab, have never knowingly experienced corruption due to bit flips, and have never knowingly had a file corruption that mattered despite storing and using many terabytes of data. If I care enough about integrity to care about ECC, I probably also care enough to run multiple pipelines on independent hardware and cross-check their results. It’s not something I would lose sleep over.

  • Krik@feddit.de
    English · 2 upvotes · 1 year ago

    According to source, ECC has to ‘kick in’ about 3700 times per year per DIMM. That’s about 10 times per day per DIMM.

    Depending on how important your server is to you, you’ll either need it (in case of important data you absolutely don’t want to lose) or can forget about it (just a hobby project, nothing serious).
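
The per-day figure is just the yearly number divided by 365. A small sketch of that arithmetic follows; the function name is made up for illustration, and scaling linearly with the number of DIMMs is an assumption, not something the cited figure states.

```python
# Figure quoted in the comment above: ECC corrections per DIMM per year.
CORRECTIONS_PER_DIMM_PER_YEAR = 3700
DAYS_PER_YEAR = 365


def corrections_per_day(per_dimm_per_year: float, dimms: int = 1) -> float:
    """Average corrections per day, assuming the rate scales linearly with DIMM count."""
    return per_dimm_per_year * dimms / DAYS_PER_YEAR


per_day_one_dimm = corrections_per_day(CORRECTIONS_PER_DIMM_PER_YEAR)
per_day_four_dimms = corrections_per_day(CORRECTIONS_PER_DIMM_PER_YEAR, dimms=4)

print(f"1 DIMM:  {per_day_one_dimm:.1f} corrections/day")
print(f"4 DIMMs: {per_day_four_dimms:.1f} corrections/day")
```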

  • nevalem@programming.dev
    English · 2 upvotes, 1 downvote · 1 year ago

    DDR5 has built-in data checking, which is ECC without the automatic correction, and that might be worthwhile depending on your setup.

    The ECC on your Pi, I believe, isn’t for the memory chip but for the ARM chip’s on-die cache.

    For me personally, if my racked server supports it, I get ECC. If it doesn’t, I don’t sweat it. Redundancy in drives, power, and networking is much more important to me, and those are orders of magnitude more likely to fail in my anecdotal experience. If I can save those dollars for another, higher-probability failure, I do that.

    DNS is a lynchpin of my network (and wife approval factor), so I splurge a bit on physical redundancy there: an identical mini computer runs it too, and the same IP fails over to it if the first box fails. Those considerations come way before whether the server has ECC. Just my $0.02.

    • freedomenjoyer@sh.itjust.worksOP
      English · 1 upvote · 1 year ago

      Thanks for the feedback! Yeah, I think ZFS redundancy + backups will do for my application then. From what I am reading here, ECC is less common than I imagined.

      • nevalem@programming.dev
        English · 1 upvote · 1 year ago

        It’s extremely common in enterprise, where the cost of a 100k+ server isn’t the most expensive part of running, maintaining, and servicing said server. If your home lab isn’t practicing 3-2-1 backups (at least three copies of your data, two local (on-site) but on different media/devices, and at least one copy off-site) yet, I’d spend money on that before ECC.
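
The 3-2-1 rule as worded above can be written down as a quick check. This is only a sketch: the `BackupCopy` class, its fields, and the example names are hypothetical, and it treats "different media/devices" as distinct `medium` labels among the on-site copies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackupCopy:
    name: str      # label for this copy (hypothetical)
    medium: str    # device or media type, e.g. "nas-hdd", "usb-ssd", "cloud"
    offsite: bool  # True if the copy lives away from the primary location


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """At least 3 copies, 2 on-site on different media/devices, and 1 off-site."""
    local_media = {c.medium for c in copies if not c.offsite}
    offsite_count = sum(1 for c in copies if c.offsite)
    return len(copies) >= 3 and len(local_media) >= 2 and offsite_count >= 1


good_plan = [
    BackupCopy("live data", "nas-hdd", offsite=False),
    BackupCopy("nightly snapshot", "usb-ssd", offsite=False),
    BackupCopy("weekly upload", "cloud", offsite=True),
]

# Three copies, but the two local ones share a device and nothing is off-site.
weak_plan = [
    BackupCopy("live data", "nas-hdd", offsite=False),
    BackupCopy("snapshot on same pool", "nas-hdd", offsite=False),
    BackupCopy("second snapshot", "nas-hdd", offsite=False),
]

print("good plan passes:", satisfies_3_2_1(good_plan))
print("weak plan passes:", satisfies_3_2_1(weak_plan))
```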