Improve your monitoring in real, tangible steps

How do you improve monitoring, specifically? Where do you even start? Worse: how do you know you’re done? My new video course, Monitor Anything, teaches exactly this. This course is one of my standard consulting engagements, but in video form so you can work through it at your own pace. Pre-orders are open until July 17th. Learn more about the course.

Articles & News

Production Monitoring at Scale

The folks at Leanplum talk about their GCE-based infrastructure and what’s led to them doing a full-scale migration from monitoring with Google Stackdriver to Prometheus.

Incident Management at Netflix Velocity

David Hahn, one of the handful of people on the Netflix SRE (“CORE”) team, gave this talk last year at QCon SF, and it bubbled back up this week. It is a fantastic talk, not least because David is a wonderful speaker.

James Turnbull’s Monitoring With Prometheus is out!

And James was awesome enough to give a promo code for all Monitoring Weekly readers! Use MONWEEK at checkout for 25% off. Thanks James!

Hit the Ground Running with Distributed Tracing Core Concepts

This is an incredible, monster article on distributed tracing at a conceptual level (rather than implementation, which most articles are). To quote the article: “That is what this post is about: an attempt to boil down and demystify distributed tracing so its core concepts can be quickly absorbed by anybody, regardless of skill, experience, or interest level in [distributed tracing].”

I think it’s telling it still took 4300 words (that’s about 15 book pages!) to boil down the concepts to one article.

Logging Wisdom: How to Log

Logging is awesome, yo. …except when it’s not (but we don’t talk about those times.) Instead, let’s talk about doing it well, which this article makes a great read for. That said, I’ll make a minor disagreement with the author: you probably don’t want secrets/credentials in your logs, at all, ever.
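One way to act on that last point is to scrub secrets before a log line is ever written. Here’s a minimal sketch using a filter from Python’s standard logging module. The key names in SECRET_PATTERN and the names redact and RedactSecretsFilter are my own assumptions for illustration, not something from the article, and a regex like this will only catch secrets that show up as key=value pairs.

```python
import logging
import re

# Hypothetical pattern: key=value pairs whose key looks sensitive.
# Adjust the key list to whatever your own applications actually log.
SECRET_PATTERN = re.compile(
    r"\b(password|token|api_key|secret)=(\S+)", re.IGNORECASE
)


def redact(message: str) -> str:
    """Replace the value of any sensitive-looking key with a placeholder."""
    return SECRET_PATTERN.sub(r"\1=[REDACTED]", message)


class RedactSecretsFilter(logging.Filter):
    """Logging filter that scrubs the rendered message of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result,
        # so secrets passed as format arguments are caught too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # always keep the record, just with a scrubbed message
```

Attach it with handler.addFilter(RedactSecretsFilter()) on each handler you want protected. Treat it as a safety net rather than a guarantee: the better fix is still to never pass credentials to a log call in the first place.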

Overview of Monitoring in Azure

I knew the folks at Azure were doing some great stuff, but I hadn’t realized how well they’ve covered the map when it comes to monitoring tooling. Pretty awesome, if you ask me.

A Practical Introduction to Logstash

The folks at Elastic have written up a really good howto on using Logstash. If you’re not that familiar with Logstash and are considering an ELK stack, add this to your reading list.

Monitoring Microservices: Divide and Conquer

The folks at Salesforce engineering are talking about how they approach monitoring of their microservice architecture. The part I really like is the “contract”: every service must have an SLA (and therefore SLIs), every service must monitor how other services use it, and every service must monitor how it uses other services. There are some details in here, so check it out.

Free Ebook: Distributed Systems Observability

This minibook by Cindy Sridharan aka @copyconstruct is 10,000% awesome. Cindy expounds at great length and depth on observability. It’s a lot like some of the posts from her I’ve linked to before…except even better. Also, note: you don’t have to sign up for the marketing emails to get the ebook. It’s very much worth reading.

Monitoring, the Prometheus Way (video)

This is about a year old now, but totally new to me. It’s one of the best videos I’ve seen on Prometheus and how it works, as told by one of the co-creators of it.

Lessons from Building Observability Tools at Netflix

This monster post from Netflix explains a whole lot about how Netflix builds and has evolved their observability tools.

Cloudprober: open source black-box monitoring software

Monitoring black boxes is a real challenge and this tool from the folks at Google makes it a lot easier by mimicking requests within your infrastructure. The only downside here is that it uses synthetic requests to determine health, so it won’t be nearly as accurate as a shim/proxy that adds instrumentation to real requests. Then again, if you don’t have any monitoring around black boxes in your environment, this is a huge step forward.

No, seriously. Root Cause is a Fallacy.

The Five Whys are a useful tool, but they don’t magically make Root Cause Analysis (RCA) a viable methodology. I love this article.

A Deep Dive into Kubernetes Metrics Part 2

What I really love about this article is the explanation of USE vs RED vs Four Golden Signals models of instrumentation. Quoting Tom Wilkie: “The USE method is for resources and the RED method is for my services.”
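If RED is new to you, it stands for Rate, Errors, and Duration, measured per service. Here’s a rough sketch of what that boils down to for a batch of requests seen in one time window. The Request type, the red_summary function, and the choice to count only 5xx responses as errors are all my own assumptions for illustration; a real setup would pull these numbers from your metrics system rather than a Python list.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    duration_s: float
    status: int


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(requests)
    # Assumption: only 5xx responses count as service errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_s": median(r.duration_s for r in requests) if total else 0.0,
    }
```

In practice you’d usually look at duration as a distribution (percentiles or a histogram) rather than a single median, but the three questions stay the same: how many requests, how many failed, and how long did they take?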

What is “observability”? - Monitoring Weekly

There have been a lot of questions coming up about observability lately, thanks to all the articles being written about it. I compiled the most useful and informative articles into a single list to make it easier to share with your colleagues. Let me know what you think, and if there’s another topic you’d love to see this done for.

Nathaniel’s Quick And Dirty Python Logging Lesson

Python has a special place in my heart in that it’s the only language I actually know. If you’ve ever struggled to get logging working well in a Python app (I know I have!), this article is a great guide on making it work.
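For a taste of what “working well” can look like, here’s a minimal sketch of a logger setup using only the standard library. This is my own example rather than the article’s approach, and the build_logger name and the format string are just assumptions.

```python
import logging
import sys


def build_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger that writes timestamped lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Only attach a handler once, so calling this twice for the same name
    # doesn't produce duplicate log lines.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Don't also pass records up to the root logger's handlers.
    logger.propagate = False
    return logger
```

Usage is then log = build_logger("myapp") followed by log.info("started"). The duplicate-handler check is there because doubled log lines are one of the most common surprises when setup code runs more than once.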

Who monitors the monitoring systems?

Monitoring your monitoring systems isn’t as straightforward as you might think, and there are a lot of non-obvious challenges involved in it. This article talks about a few of those challenges, though we don’t actually have any solutions yet. Any takers?

The Mon-ifesto

A fantastic series about improving monitoring from the folks at Capital One, running the gamut of incident management, postmortems, metrics, graphing, and much more.

Comprehensive Container-Based Service Monitoring with Kubernetes and Istio

Ostensibly, this article is about Kubernetes and Istio, but there’s so much more here that applies to everyone: lots about SLAs/SLIs/SLOs, USE and RED Methods, and a whole lot more where that came from. Even if you’re not using k8s, you’ll get something useful out of this.

Elasticsearch Performance Tuning

If you run an ELK cluster, you’ve likely run into Elasticsearch performance challenges. This article makes a few suggestions for basic Elasticsearch performance improvements, such as choosing the optimal memory size to avoid heap issues.

rcoh/angle-grinder: Slice and dice log files on the command line

This is a super cool CLI-based log analyzer. I can’t really do it justice in text, but there’s a short demo gif in the docs. Check it out.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor