We maintain monitoring solutions that track our validators’ uptime and detect whether any validator is missing blocks, jailed or tombstoned, on the following chains
We host our validators on our own servers, as it gives us more control over our infrastructure. Our servers are built specifically for our own needs: we don’t use enterprise solutions, preferring to build the servers ourselves. On all our servers, we use SSD and, in some cases, NVMe drives to ensure top performance. For public nodes, we usually use cloud-based solutions or dedicated servers spread across multiple hosting providers.
We use MikroTik routers to maintain a safe and stable setup. Some of our servers are exposed to the outside, while others are reachable only from the local network, to prevent unauthorized access.
We use Prometheus as a time-series DB to collect and store data on our validators. Grafana is used to visualise the metrics stored in Prometheus, and we have our own set of dashboards that give us a quick overview of everything we need to know. Additionally, we use Alertmanager to get notified about anything that doesn’t work as expected, with a large set of alerting rules that fire when something may be wrong. When an alert fires, Alertmanager notifies us depending on its severity: via Telegram if it’s not urgent, or via PagerDuty, which keeps sending us push notifications and calls until the problem is fixed.
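To illustrate the severity-based notification described above, here is a minimal sketch in Python. It is not our actual setup: in practice this routing is done by Alertmanager itself, and the severity labels ("warning", "critical"), the alert names and the pick_channel helper are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical channel names, used only for this illustration.
TELEGRAM = "telegram"
PAGERDUTY = "pagerduty"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # assumed label values: "warning" or "critical"


def pick_channel(alert: Alert) -> str:
    """Pick a notification channel from the alert's severity label."""
    # Urgent alerts go to PagerDuty, which escalates with pushes and calls.
    if alert.severity == "critical":
        return PAGERDUTY
    # Everything else is treated as non-urgent and goes to Telegram.
    return TELEGRAM


# Example alerts with made-up names.
alerts = [
    Alert("ValidatorMissingBlocks", "warning"),
    Alert("ValidatorJailed", "critical"),
]
for alert in alerts:
    print(f"{alert.name} -> {pick_channel(alert)}")
```

Sending anything that is not explicitly critical to Telegram is a design choice of this sketch; a real setup could just as well default unknown severities to the louder channel.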
In our infrastructure, we use multiple centralised logging solutions that collect and store data from all of our servers. Namely, we use the Elastic Stack (Elasticsearch + Kibana + Logstash + Filebeat + Auditbeat) for more complex overview and aggregation, and Loki, a simpler approach that also allows us to correlate validator metrics and logs.
We do our best to ensure the best security possible. Our firewalls prevent unauthorized people from getting access to our servers, and our servers follow security best practices. Moreover, we use hardware security modules (namely, YubiHSM 2) to ensure nobody can steal our keys even if they somehow get access to our servers, and cold wallets (namely, Ledger) to ensure no website or person can steal our mnemonics.