Issue 043

Hey folks, welcome to another installment of Monitoring Weekly! Did you write something about monitoring recently? Maybe got an idea rolling around in your head? Send it on over and let the community learn from you. :D

Monitoring News, Articles, and Blog posts
How To Establish a High Severity Incident Management Program

Monitoring and incident management go hand-in-hand, and this article from the great folks at Gremlin is pretty awesome. It is incredibly thorough in its approach, covering example severity levels and their meanings, incident lifecycles, the creation of severity levels for different kinds of products, and much more. Seriously great article.

Putting Monitoring and Alerting into Practice

This is quite a nice overview of monitoring at the conceptual/component level.

Google Cloud Platform Blog: An example escalation policy

This article from the folks at GCP walks us through what a typical escalation policy looks like. What’s interesting about their escalation policies is how detailed they are.

Sensu & InfluxDB: Storing Data from Metrics Collection Checks

Most Sensu deployments I’ve seen rely on Graphite as the TSDB so I’m pretty happy about this article that takes you through Sensu+InfluxDB.

Now Publish Log Files from Amazon RDS for MySQL and MariaDB to Amazon CloudWatch Logs

I’ve always thought it was dumb that I had to send my MySQL RDS logs to S3 before being able to move them elsewhere. Now you can send them to CloudWatch Logs. Still not great, but at least there’s better integration between CloudWatch Logs and third-party logging systems than there is with S3, so all told this is a pretty good improvement.

[Project STAR: Streamlining Our On-Call Process | LinkedIn Engineering](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)

At first, I expected this article to basically be about LinkedIn’s efforts at reworking an internal version of PagerDuty, but the more I read, the more interesting it got: in setting out to solve a standard scheduling problem, they found large organizational challenges, such as engineers misunderstanding the impact and importance of being on-call to LinkedIn’s mission. I love this particular bit: “In reality, however, Voyager On-Call is far more important than even the most important project. If the site goes down, even the greatest revenue-doubling project is dead in the water.”

Building a Distributed Log from Scratch, Part 4: Trade-Offs and Lessons Learned

Continuing the series, Part 4 talks about the actual product that spawned the series to begin with and the lessons learned so far. In essence, the CAP theorem is the bane of anyone trying to build a high-quality, distributed data store.

Key metrics for EC2 monitoring (part 1) & How to collect EC2 metrics (part 2)

Datadog recently published a deep-dive treatment of AWS EC2 metrics, covering both which metrics matter and what they mean, as well as how to actually collect them.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor