How do you observe your server functions?

Fermiverse@kbin.social · edited 3 years ago

SheeEttin@lemmy.world · 3 years ago

I’ll keep it very simple: I don’t.

If I’m trying to do something and I notice an issue, then I’ll investigate it. But if it’s not affecting anything, is it really a problem?

mea_rah@lemmy.world · 3 years ago

I was kind of the same, but I still collected metrics, because I just love graphs.

Over time I ended up setting alerts for failures I wish I had been aware of earlier. Some examples:

HDD monitoring - a drive usually shows signs of failure a couple of days before it fails, so I have time to shop around for a replacement. With no alert set, I’d probably only notice once both sides of a mirror had failed, which would mean a couple of days of downtime, a lot of work restoring from backup, and very little time to find a drive at a reasonable price.
Networking issues - especially VPN. It’s much better to know that it is broken before you leave the house.
Core services like DNS - with two AdGuard instances it’s much better to be alerted when one is down than to discover you have no DNS at all once both have failed and you can’t even google anything without messing with your connection settings.
SSD writes - same idea as HDDs, but here the alert fires at around 90% of the TBW lifetime declared by the manufacturer. I tend to replace them proactively, as they are usually used as an unmirrored system disk, which holds no valuable data but whose failure would again lead to extended unplanned downtime.
CPU usage being maxed out for a long time - I had one service fail in a way where it consumed 100% of all cores. This had no impact on other services because the process scheduler did its job, but I ended up burning kilowatt-hours of electricity as this continued unnoticed for weeks. This was before energy prices went up, but the power consumption was still noticeable. (I had a dual-CPU server back then, which consumed a lot of juice when maxed out.)
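
To make a few of the rules above concrete, here is a rough sketch in Python. It is not the actual alerting setup described in this comment; the names (`SsdStatus`, `ssd_wear_alert`, `cpu_saturated`, `instances_down`), the 95% CPU limit, and the 60-sample window are all assumptions made up for illustration. Only the roughly 90% TBW threshold comes from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class SsdStatus:
    """Hypothetical snapshot of one SSD's wear counters."""
    name: str
    bytes_written: float    # total host writes so far, in bytes
    rated_tbw_bytes: float  # manufacturer's declared TBW, in bytes


def ssd_wear_alert(ssd: SsdStatus, threshold: float = 0.90) -> bool:
    """True once the drive has used `threshold` of its declared TBW."""
    return ssd.bytes_written >= threshold * ssd.rated_tbw_bytes


def cpu_saturated(samples: Sequence[float],
                  limit: float = 95.0,
                  min_samples: int = 60) -> bool:
    """True when the most recent `min_samples` CPU readings (in percent)
    are all at or above `limit`, i.e. the CPU has been pegged for the
    whole window rather than just spiking briefly."""
    if len(samples) < min_samples:
        return False
    return all(s >= limit for s in samples[-min_samples:])


def instances_down(up: Dict[str, bool]) -> List[str]:
    """Names of redundant instances (e.g. two DNS servers) that are down,
    so you hear about the first failure instead of the second."""
    return sorted(name for name, ok in up.items() if not ok)
```

In practice the inputs would come from whatever collects your metrics, and the window length decides how long a runaway process can burn power before you hear about it.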

H2iK@lemmy.world · 3 years ago

What do you use to collect these metrics?

mea_rah@lemmy.world · 3 years ago

I use Telegraf for most of the metrics.