Hosting Service Manual
Monitoring & Alerts
Server Performance Metrics
Agent-based monitoring measures the utilisation of resources at server level, for example CPU, memory, disk and network usage. The agents also probe local applications and check that third-party dependencies are reachable, such as dependencies accessed over a site-to-site VPN.
Important metrics such as CPU load statistics are collected every minute. Alerting thresholds are configured for most metrics.
Zabbix is currently used for this purpose. Agents are deployed, and standardised templates are attached to hosts, automatically.
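As a rough illustration of threshold-based alerting on one-minute samples, the following Python sketch flags a metric only when several consecutive samples exceed a limit. It is not how our Zabbix triggers are defined: the function name, the threshold value and the choice of three consecutive samples are assumptions made for the example.

```python
def breaches_threshold(samples, threshold, consecutive=3):
    """Return True when the last `consecutive` samples all exceed `threshold`.

    `samples` is a list of readings, oldest first, one per collection
    interval (one minute in this manual). The `consecutive` default of 3
    is an illustrative assumption, not a documented setting.
    """
    if len(samples) < consecutive:
        # Not enough history yet to make a decision.
        return False
    return all(value > threshold for value in samples[-consecutive:])


# Example: CPU load readings against a hypothetical threshold of 5.0.
cpu_load = [1.0, 5.2, 5.5, 6.1]
alert = breaches_threshold(cpu_load, threshold=5.0)
```

Requiring several consecutive breaches is one common way to avoid alerting on a single short spike.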
Application Performance Metrics
We have started to introduce application-level monitoring by integrating OpenMetrics libraries into our own applications and releases. This includes, but is not limited to, HTTP response code metrics, request latency metrics and internal application statistics. This gives us the flexibility to narrow performance issues down to specific areas of our stack, or to specific templates in the website.
All metrics are collected, and alerting rules are evaluated, every 15 seconds. Prometheus and Thanos are currently used on the server side, and language-specific libraries are used on the client side.
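To show the kind of data an instrumented application exposes, here is a minimal Python sketch that counts HTTP responses by status code and template and renders them in the Prometheus text exposition format. In practice a language-specific client library does this work; the class name, metric name and label names below are hypothetical and chosen only for illustration.

```python
from collections import Counter


class RequestMetrics:
    """Hypothetical in-process counter keyed by status code and template."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, status_code, template):
        """Record one handled request."""
        self._counts[(str(status_code), template)] += 1

    def render(self):
        """Render the counters as Prometheus-style text for scraping.

        Label values are not escaped here, so this sketch assumes they
        contain no quotes, backslashes or newlines.
        """
        lines = [
            "# HELP http_requests_total Total HTTP requests handled.",
            "# TYPE http_requests_total counter",
        ]
        for (code, template), value in sorted(self._counts.items()):
            lines.append(
                f'http_requests_total{{code="{code}",template="{template}"}} {value}'
            )
        return "\n".join(lines) + "\n"


metrics = RequestMetrics()
metrics.observe(200, "home")
metrics.observe(200, "home")
metrics.observe(500, "checkout")
print(metrics.render())
```

Labelling by template is what makes it possible to attribute an error or latency increase to one part of the website rather than to the application as a whole.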
Cloud Native Metrics
For each cloud provider we use, we also configure monitoring and alerts in the provider's native monitoring service for aspects of the platform we cannot monitor ourselves. On AWS, CloudWatch is used.
External Monitoring
In addition to our own monitoring solutions, we also run external probes using StatusCake to provide independent assurance of availability. External probes run every minute.
Alerting
All environments generate alerts for configured thresholds around capacity, latency and availability. Some alerts trigger automated restarts of services in an attempt to resolve issues without manual intervention.
For Production environments, alerts are escalated automatically. After 10 minutes an SMS is sent to the on-call engineer, with reminders sent every 5 minutes until the alert is resolved or acknowledged. If the alert still has not been resolved or acknowledged after 25 minutes, an automated phone call and escalation to a backup engineer also take place.
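The Production escalation timeline can be sketched in Python as follows. This is an illustration of the timings stated above, not the configuration of our alerting tooling: the function and action names are hypothetical, and the sketch assumes that SMS reminders continue after the 25-minute escalation, which this manual does not specify.

```python
SMS_AFTER_MIN = 10        # first SMS to the on-call engineer
REMINDER_EVERY_MIN = 5    # interval between SMS reminders
CALL_AFTER_MIN = 25       # automated phone call and backup escalation


def escalation_events(minutes_unacknowledged):
    """Return (minute, action) pairs fired for an alert that has stayed
    unresolved and unacknowledged for the given number of minutes."""
    events = []
    if minutes_unacknowledged >= SMS_AFTER_MIN:
        events.append((SMS_AFTER_MIN, "sms"))
        minute = SMS_AFTER_MIN + REMINDER_EVERY_MIN
        # Assumption: reminders keep going after the 25-minute escalation.
        while minute <= minutes_unacknowledged:
            events.append((minute, "sms_reminder"))
            minute += REMINDER_EVERY_MIN
    if minutes_unacknowledged >= CALL_AFTER_MIN:
        events.append((CALL_AFTER_MIN, "call_and_escalate"))
    # Sort by minute so the timeline reads in order.
    return sorted(events)


for minute, action in escalation_events(25):
    print(f"{minute:>3} min: {action}")
```

An alert acknowledged at any point stops the sequence, so an engineer who responds to the first SMS never triggers the later steps.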
Client Alert and Metrics Access
We do not currently provide direct access to any of the metric data we collect, except for the external StatusCake availability dashboard.
All alerting is internal to us. We will inform you of any alert that has a significant impact on availability.