Issue 083
I’m going to be in Nuremberg and Munich, Germany all next week for the Open Source Monitoring Conference. I’d love to meet up with folks while I’m there! Shoot me an email?
This issue is sponsored by:
Move Faster, See Everything, and Deploy Confidently
Get real-time analytics and massive scale so your Dev and Ops teams can move faster on a stable cloud application estate. Use full stack monitoring to slash MTTR. Start your free 30-day trial today with Wavefront by VMware.
Latest Articles on monitoring.love
What driving an old jalopy taught me about monitoring
Driving old, beat-up cars is both a treat and a nightmare, especially when it comes to figuring out why they’ve stopped working (this time). In many ways, diagnosing issues with any old car feels not-at-all dissimilar to monitoring for and diagnosing failures in software.
From The Community
How many metrics should an application return?
The question sounds kinda weird to those well-versed in monitoring and observability, but underlying it is a point I find comes up often: many people tend to instrument too little rather than too much, so the advice in this article is a great starting point for level-setting. Think of it like negotiating a salary: if you think your worth is X, it would be a game-changer for someone knowledgeable to say “No, you should be expecting at least X+20%.”
Implementing SLOs using Prometheus and Grafana
Great explanations of SLOs, error budgets, and metrics, but also an awesome bonus: they’ve codified the explanation in some publicly-available Grafana dashboard definitions.
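To make the error-budget idea concrete, here’s a minimal Python sketch of my own (not taken from the linked dashboards). It assumes a request-based SLO, where the budget is the share of requests allowed to fail during the SLO window. The function name and the 99.9% target are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative once overspent).

    Assumes a request-based SLO: slo_target is the fraction of requests that
    must succeed over the window (e.g. 0.999 for "three nines").
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget: how many failures the SLO tolerates over this window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures,
# so 250 failures spends a quarter of the budget.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 3))  # prints 0.75
```

Alerting on how fast that remaining fraction is shrinking (the burn rate), rather than on individual errors, is what makes error budgets useful operationally.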
Working with Irregular Time Series
Sparse metrics (aka datapoints that don’t arrive at consistent intervals or that have large gaps between them) are a big pain to deal with in the time series world for a bunch of reasons. The folks at Influx have some suggestions on how to handle them properly so you don’t end up with incorrect answers.
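One common way to tame an irregular series is to average it into fixed windows and leave empty windows empty, rather than filling them with zeros or interpolated values that were never observed. Here’s a minimal Python sketch of that idea; it’s a generic illustration, not necessarily what the Influx article recommends, and the function name is hypothetical.

```python
from typing import Optional


def bucket_mean(
    points: list[tuple[float, float]], start: float, end: float, step: float
) -> list[tuple[float, Optional[float]]]:
    """Average irregular (timestamp, value) points into fixed windows [t, t + step).

    Windows with no datapoints come back as None instead of 0 or an
    interpolated value, so gaps stay visible to whatever consumes the result.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    n_buckets = int((end - start) // step)
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for ts, value in points:
        # Ignore points outside the requested range.
        if start <= ts < start + n_buckets * step:
            # Clamp guards against float rounding pushing an edge point past the last bucket.
            i = min(int((ts - start) // step), n_buckets - 1)
            sums[i] += value
            counts[i] += 1
    return [
        (start + i * step, sums[i] / counts[i] if counts[i] else None)
        for i in range(n_buckets)
    ]


# Two points in the first window, none in the second, one in the third.
print(bucket_mean([(0, 1.0), (4, 3.0), (25, 10.0)], start=0, end=30, step=10))
```

Whether a gap should later be filled (with the previous value, a zero, or nothing) depends on what the metric means, which is exactly why it’s safer to make that an explicit decision downstream.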
Best Practices for On-Call and Incident Response
Here’s some good insight into how New Relic handles on-call and incident response. I really like this pattern of companies posting their on-call/incident management methodologies publicly now. It’s silly for people to reinvent the wheel when so many companies have workable processes already.
Cron jobs execution monitoring in slack
Exactly what the title says, but it also introduced me to a new tool: slacktee, which works like the Linux tee command but posts to Slack instead of writing to files.
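slacktee itself is a bash script. As a rough illustration of the same tee-then-post idea (not slacktee’s actual implementation), here’s a minimal Python sketch. It assumes you’ve created a Slack incoming webhook of your own; the function names and the message formatting are hypothetical.

```python
import json
import sys
import urllib.request
from typing import Iterable, TextIO


def tee_lines(lines: Iterable[str], out: TextIO) -> list[str]:
    """Echo each line to `out` as it arrives (like tee) and return all lines collected."""
    collected = []
    for line in lines:
        out.write(line)
        collected.append(line)
    return collected


def slack_payload(lines: list[str], job_name: str) -> bytes:
    """Build the JSON body for a Slack incoming webhook, which reads the "text" field."""
    text = f"*{job_name}* output:\n" + "".join(lines)
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, body: bytes) -> None:
    """POST the payload to the webhook URL (supplying data makes urllib send a POST)."""
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

A tiny wrapper script would call `tee_lines(sys.stdin, sys.stdout)` on the cron job’s piped output, then `post_to_slack(...)` with your webhook URL. The job’s output still goes wherever cron normally sends it, and a copy lands in Slack.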
Parsing logs 230x faster with Rust
Perhaps surprisingly, one of the most challenging things about operating RubyGems.org is the logs. Unlike most Rails applications, RubyGems sees between 4,000 and 25,000 requests per second, all day long, every single day. As you can probably imagine, this creates… a lot of logs. A single day of request logs is usually around 500 gigabytes on disk. That’s, uhh, a lot of logs. There are some interesting comments on this over at Hacker News, mainly around how slow that parsing rate actually is (which is absolutely not the fault of the author) and why that might be.
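Those numbers are worth a quick back-of-the-envelope check, since they set the floor for any parser. Here’s a minimal Python sketch, assuming 500 GB means 500 × 10^9 bytes, one log line per request, and a request rate held at each end of the quoted range for the whole day (real traffic moves between the two).

```python
SECONDS_PER_DAY = 86_400
# Assumption: "500 gigabytes" is decimal, i.e. 500 * 10**9 bytes.
DAILY_LOG_BYTES = 500 * 10**9


def avg_bytes_per_request(requests_per_second: int) -> float:
    """Average log bytes per request if this rate held for the entire day."""
    return DAILY_LOG_BYTES / (requests_per_second * SECONDS_PER_DAY)


# Sustained volume a parser must keep up with, in MB/s (decimal megabytes).
write_rate_mb_s = DAILY_LOG_BYTES / SECONDS_PER_DAY / 10**6
print(f"sustained log volume: {write_rate_mb_s:.1f} MB/s")

# Bounds on the average log line size, one line per request assumed.
for rps in (4_000, 25_000):
    print(f"{rps:>6} req/s -> ~{avg_bytes_per_request(rps):.0f} bytes per log line")
```

Under those assumptions, a parser has to sustain roughly 5.8 MB/s on average just to keep pace with one day’s logs, and the average log line lands somewhere between about 231 and 1,447 bytes.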
When Baron Schwartz talks databases, you should be listening. This talk goes through his own framework for monitoring a database.
Now Open Source: Sematext Monitoring Agent
It feels like there’s been a recent uptick in monitoring companies open-sourcing stuff, which I quite like, for all the reasons Sematext lays out in this article, actually. Their agent is Java-based, has all the integrations you’d expect, and writes to InfluxDB.
Capacity planning for Etsy’s web and API clusters
Capacity planning is one of those things that everyone knows they should be doing but no one ever actually does, usually because it’s so complex and just a total pain in the ass. The folks at Etsy, in their usual way, wrote up a concise article on how they recently did their capacity planning exercise in the wake of their migration from their datacenters to Google Cloud Platform.
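Etsy’s write-up has the details of their actual process. As a generic illustration only (not Etsy’s method), the core arithmetic behind most capacity estimates fits in a few lines of Python. The parameter names, the utilization target, and the example numbers are all hypothetical.

```python
import math


def nodes_needed(
    peak_rps: float, per_node_rps: float, target_utilization: float, growth: float = 0.0
) -> int:
    """Estimate how many nodes serve a projected peak while staying under a utilization target.

    peak_rps: observed peak requests/second for the whole cluster
    per_node_rps: requests/second one node sustains at 100% of its measured capacity
    target_utilization: fraction of capacity you're willing to run at (0.7 keeps 30% headroom)
    growth: expected traffic growth over the planning horizon (0.2 means +20%)
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    projected_peak = peak_rps * (1 + growth)
    usable_per_node = per_node_rps * target_utilization
    # Round up: a fractional node still means buying a whole one.
    return math.ceil(projected_peak / usable_per_node)


print(nodes_needed(10_000, 450, 0.7, growth=0.2))
```

With those hypothetical numbers (10,000 req/s of peak traffic, 20% growth, and nodes that top out at 450 req/s kept at or below 70%), you’d need 39 nodes. The utilization target is the design choice that matters most: it’s the headroom you keep for node failures, deploys, and traffic spikes the forecast missed.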
This issue is sponsored by:
450PB of Storage Capacity!!!
Want to help us protect over a quarter of a million computers, provide network connectivity via half a million access points, and remotely monitor and manage 2.5 million endpoints? You’ve probably never heard of us, but we’d like to hear from you! Click here to learn about opportunities at Datto!
Events
Open Source Monitoring Conference - November 5-8, 2018 - Nuremberg, Germany
I’m speaking at OSMC in Nuremberg next week. Come on out if you’re in the area.
InfluxDays - November 7-8, 2018 - San Francisco, CA, USA
FOSDEM 2019: Monitoring Room CFP Open - February 2-3, 2019 - Brussels, Belgium
Jobs
Technical Evangelist - Wavefront - Location Flexible
I had the pleasure of speaking with the hiring manager recently and it sounds like a really awesome gig. If you’re into Ops/SRE/DevOps and love monitoring, click through to check it out.
Want your job listed here? Why not submit a post to the job board? It’s only $199/ad for 30 days.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor