I run Prometheus + Grafana on a handful of machines we manage, with different software and setups. The cost of setting up a Prometheus instance and monitoring something is not too high, but the work around it depends on your actual setup:

- Service discovery can get tricky. In our case we manage the physical machines ourselves, so at the beginning we just had to write static configs, although by now we have automatic discovery (it took a few days of development).

- Depending on what you want to monitor, you might need to write your own exporters for Prometheus. However, mtail [1] has been really useful for creating metrics from logs without too much work. In any case, you'll have to put time into deploying and configuring those exporters.

- Dashboards and alerts. There are dashboards for a lot of exporters, and there are collections of alerts too [2], but you will need to put time and effort into modifying/creating dashboards and writing alerts. However, it's a productive effort because it helps you understand which metrics are important and how they relate to the workings of the software you use. Also, PromQL is a pretty nice query language for the purposes of Prometheus.

- Notification integrations. In our case we had to spend some time properly configuring a Microsoft Teams integration and a deadman switch channel, but in most cases it will be pretty straightforward.
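
To make the service discovery point a bit more concrete: Prometheus can read scrape targets from JSON files through file_sd_configs, so "automatic discovery" can be as simple as a script that regenerates such a file from your inventory. This is only a minimal sketch and not what we actually run; the inventory layout, the hostnames and the "role" label are made up for illustration, and port 9100 is just the usual node_exporter default.

```python
import json


def build_file_sd(inventory, port=9100):
    """Group hosts by role and return target groups in file_sd JSON shape."""
    groups = {}
    for host in inventory:
        groups.setdefault(host["role"], []).append(f'{host["name"]}:{port}')
    # One target group per role, each carrying a "role" label.
    return [
        {"targets": sorted(targets), "labels": {"role": role}}
        for role, targets in sorted(groups.items())
    ]


# Hypothetical inventory; in practice this would come from a CMDB, DNS, etc.
inventory = [
    {"name": "node-b.example.internal", "role": "storage"},
    {"name": "node-a.example.internal", "role": "storage"},
    {"name": "gw-1.example.internal", "role": "gateway"},
]

target_groups = build_file_sd(inventory)
# Write this output to a file that a file_sd_configs entry points at.
print(json.dumps(target_groups, indent=2))
```

Prometheus re-reads these files when they change, so the script can run from cron without restarting anything.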

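On writing your own exporters: what Prometheus scrapes is just plain text served over HTTP, so a custom exporter is less work than it sounds. Below is a rough sketch of rendering one metric in the text exposition format using only the standard library. The metric name, labels and values are hypothetical, and for anything real you would probably use the official client library instead, which also handles escaping and the HTTP endpoint.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here; a real exporter must escape
    backslashes, double quotes and newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with one label.
body = render_metric(
    "myapp_jobs_processed_total",
    "Jobs processed since start (hypothetical metric).",
    "counter",
    [({"queue": "ingest"}, 1027), ({"queue": "export"}, 64)],
)
print(body, end="")
```

Serve that string from an HTTP endpoint (conventionally /metrics) and add the host to your scrape config, and you have an exporter.
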
All in all, you'll need to invest some time in the integrations, but those are things that you need to do in any case. Prometheus itself is pretty easy to set up and maintain, and doesn't stand in your way. No tweaks, no undocumented settings, no bugs. I'm pretty happy in that regard: once you get it running, you don't have to worry about it. Storage usage is pretty low even with a large number of exporters and metrics per node, maybe around 10GB for 60 days of data from a single node? (I'm not sure, because Prometheus does some compression and usage is not exactly linear in time or in the number of nodes.)
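
If you want something better than my guess for your own setup, the Prometheus storage docs give a rule of thumb: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample after compression. Here is a back-of-the-envelope sketch; the series count and scrape interval are assumptions for illustration, not measurements from our machines.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=1.5):
    """Rule-of-thumb estimate: retention_seconds * samples/s * bytes/sample."""
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Assumed workload: 10,000 active series on one node, scraped every 15 s.
series = 10_000
scrape_interval_s = 15
samples_per_second = series / scrape_interval_s

estimate = estimate_disk_bytes(60, samples_per_second)
print(f"{estimate / 1e9:.1f} GB for 60 days")
```

With those assumed numbers it comes out at around 5 GB, so the same order of magnitude as what I remember seeing. The real figure depends on how well your series compress and how often they churn.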

And that relatively low investment pays off quickly. The machines we manage use various tools and programs to deal with quite a lot of data at high bandwidths, so performance problems and bugs can be difficult to debug. The Prometheus + Grafana setup has made debugging issues and performance problems several times easier, the alerting system helps us prevent outages, and we have even discovered issues we didn't know we had. For me, the moment you manage machines with even just a little bit of complexity in the software or setup, it's already worth looking into monitoring.

1: https://github.com/google/mtail

2: https://awesome-prometheus-alerts.grep.to/