SPECIAL EDITION: Q3 2018 Best of
This issue is sponsored by:
Why Hosted Metric Analytics to Monitor Modern Cloud Applications & Infrastructure at Scale
Modern cloud application architectures require a modern monitoring and analytics approach. Find out why SaaS leaders like Workday, Intuit, Box, and Reddit chose hosted metric analytics for real-time insights across all their engineering teams.
From The Community
Be A Grafana/Graphite Power User
This is a list of really awesome tips for improving how you use Grafana. Seriously, it’s a great list.
Now available: The open source guide to DevOps monitoring tools
Dan Barker, one of the writers for OpenSource.com and organizer of DevOpsDays Kansas City and the DevOps Kansas City meetup, just released a fantastic guide to open source monitoring and observability tools. Highly recommend downloading the full guide.
The RED Method: How to Instrument Your Services
I’ve been teaching the USE and RED methods to clients lately and it’s been fascinating to see the impact it has on their monitoring. In my experience, the biggest challenge in monitoring isn’t alert fatigue, but knowing what needs to be instrumented and monitored to begin with. USE and RED make for a fantastic starting point.
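If you haven’t run into RED before, it stands for Rate, Errors, and Duration, tracked per service. Here’s a minimal Python sketch of what that boils down to. The `Request` record, the `red_summary` helper, and the choice to count HTTP 5xx responses as errors are assumptions made for illustration, not something taken from the article.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one service over a window."""
    total = len(requests)
    # Assumption: a 5xx status means the request failed.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_sec": total / window_seconds,              # Rate
        "error_ratio": errors / total if total else 0.0,     # Errors
        "median_ms": median(durations) if durations else None,  # Duration
        "max_ms": max(durations) if durations else None,
    }
```

In a real service you’d probably emit these as a counter and a histogram from your metrics library rather than computing them from raw records, but the three signals are the same.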
Observerless: The hottest new thing in monitoring you’re already doing
The (in)famous Corey Quinn of Last Week in AWS recently graced the website of Monitoring Weekly and wrote this gem on a new/old concept: Observerless. Now with video of this talk from ServerlessConf, available at A Cloud Guru.
10 monitoring talks that every developer should watch
Most of the time, these Top N lists are garbage, but this one… Well, this one is gold. It starts off with Coda Hale’s seminal Metrics, Metrics Everywhere talk from 2011 and only gets better from there.
Key metrics for AWS monitoring
Basically exactly what the title says: metrics you should absolutely be tracking if you’re running in AWS.
I like this straightforward explanation of log levels. For those who don’t know, these are a subset of the well-defined syslog severity levels, which are awesome and more people should use them.
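For reference, here’s a small Python sketch of the eight syslog severity levels (as defined in RFC 5424) and how a severity threshold works. The `should_log` helper is hypothetical and only there to show that lower numbers mean more severe.

```python
from enum import IntEnum


class SyslogSeverity(IntEnum):
    """Syslog severity levels from RFC 5424; lower value = more severe."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


def should_log(level, threshold):
    """Hypothetical helper: emit a message only if it is at least as severe as the threshold."""
    return level <= threshold
```

With a threshold of `WARNING`, an `ERROR` message gets through and a `DEBUG` message doesn’t.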
Google - Site Reliability Workbook
Everyone’s favorite book, the Google Site Reliability Engineering book, now has a companion book: The Site Reliability Workbook. This new book aims to be the practical application of the original book, which was a whole lot of theory. Looking at the table of contents, there’s a lot of great stuff about monitoring, incident management, and more.
Kubernetes Monitoring with Prometheus, the ultimate guide (part 1)
Need a bit more k8s and Prometheus in your life? Here’s a new, well-written walkthrough.
What is Cardinality in Monitoring?
With all this talk about “high cardinality” around monitoring lately, this article finally explains what it really means in concrete examples.
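To put rough numbers on it, here’s a minimal Python sketch. The worst-case number of time series for a single metric is the product of the distinct values each label can take. The label names, the value counts, and the `max_series` helper are all made up for illustration and don’t come from the article.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(set(values)) for values in label_values.values())


# Hypothetical labels on a single request-count metric.
labels = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
    "host": [f"web-{i}" for i in range(50)],
}
print(max_series(labels))  # 3 * 3 * 50 = 450

# Adding one unbounded-ish label (say, a user ID) multiplies everything.
labels["user_id"] = [str(i) for i in range(10_000)]
print(max_series(labels))  # 450 * 10,000 = 4,500,000
```

That last label is what people mean by “high cardinality”: one field with many distinct values multiplies the series count for the whole metric.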
Calculating a latency SLO is harder than it first seems, and you’re probably doing it wrong (I know I am).
Health Checks and Graceful Degradation in Distributed Systems
Cindy Sridharan/@copyconstruct is back with a new monster post on health checks and it’s great. Go read it.
Struggling with getting your company on board with a monitoring overhaul? (sponsored)
When I’m not writing this newsletter, I help companies overhaul their approach to monitoring. After working with and talking with tons of large companies, you know what they all have in common? Trouble getting buy-in from the teams they’re trying to help. Yep, you’re not alone. Want some expert help with the problem? Let’s chat.
Operational Logging at Lyft: Migrating from Splunk Cloud to Amazon ES to Self-Managed Elasticsearch
In other words, they traded an invoice large enough to give most CFOs a heart attack for a complexity level high enough to give most SREs a heart attack. Then again, the team managing their ES stack is probably larger than most of our entire SRE teams.
Grafana’s Explore UI: Taking a Deeper Dive into Data with Prometheus Queries
This upcoming feature in Grafana looks amazing and I can’t wait to see it out of beta. In essence, this allows you to take a query that’s defined in a dashboard and explore deeper in an ad hoc way by changing the query–without losing the dashboard config. Only Prometheus is supported in the beta and I’m looking forward to seeing more datasources supported on this.
What I Talk About When I Talk About Logging
Breaking down large problems into smaller problems is a tried-and-true method of solving problems and finding insight, and this article on logging does just that. The article makes the observation that logging is really five separate problems.
This isn’t strictly monitoring-related, but given what everyone reading this newsletter does for a living, I know you’ll be interested in it. For those not familiar with it, the annual State of DevOps Report is a tremendous work headed up by Dr. Nicole Forsgren every year using legit, rigorous research and statistical analysis methods. The results of this one are pretty neat to read.
You can’t debug systems with dashboards
This interview between A Cloud Guru and Charity Majors has some great stuff in it that will get you thinking about what your future in monitoring and observability could be.
Grafana as a Yet Another Tool for Technical Monitoring of Software Products We Build
It’s always interesting to get a peek into how other companies use tools. The folks at Logicify have gone into detail on how they use Grafana and the different use cases they have for it. I especially like the business intelligence use case.
How to write a status page update
Not sure what to write for your status page updates? Follow these instructions–they’re great. You may also be interested in the followup article, Status page updates: It’s all about timing.
A Primer on Building a Monitoring Strategy for Amazon RDS
Want more about monitoring RDS than you could possibly ask for? Here you go: a monster post about RDS covering metrics, logging, and even CloudTrail.
Infrastructure Monitoring with Mark Carter
Mark Carter, Product Manager for Stackdriver, talks to Software Engineering Daily about monitoring, tracing, observability and everything in between.
My thoughts on this are kinda tangential to the article, but everyone loves a good rant, right? To quote Jeff Hodges, “A systems engineer without a good startup idea inevitably winds up doing monitoring.” Building time series databases is a hard problem so maybe this post will head some startups off at the pass should they decide, “I know! We’ll build yet another damn monitoring service!”
Monitoring the awful horribleness that is the banking industry has always fascinated me (I’m a masochist, clearly), so this post from the folks at Plaid got my attention. They take us through how they chose the components for the next iteration of their monitoring platform and how it all fits together to monitor 9600+ banks.
One of the annoying things about monitoring is that it’s actually kinda hard to do at small scale. When you’ve got 100+ nodes, it makes sense to deploy a robust monitoring infrastructure, but that’d be dumb when you have one or two servers (such as a personal VPS). And yet, there aren’t a lot of good tools for monitoring that few systems well. This seems to be a solution to that problem area.
This issue is sponsored by:
Move Faster, See Everything, and Deploy Confidently
Get real-time analytics and massive scale so your Dev and Ops teams can move faster on a stable cloud application estate. Use full stack monitoring to slash MTTR. Start your free 30-day trial today with Wavefront by VMware.
Jobs
Technical Evangelist - Wavefront - Location Flexible
I had the pleasure of speaking with the hiring manager recently and it sounds like a really awesome gig. If you’re into Ops/SRE/DevOps and love monitoring, click through to check it out.
Want your job listed here? Why not submit a post to the job board? It’s only $199/ad for 30 days.