Issue 095
Did you know I have a podcast too? Check it out: Real World Devops
This issue is sponsored by:
Monitor What Matters Most and Diagnose Anomalies in a Matter of Seconds
When it’s time to troubleshoot an issue, are you providing the right monitoring signals to your team? SignalFx APM helps by providing full distributed tracing, anomaly detection, and predictive analytics – all right out of the box.
Latest on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Logs vs Structured Events by Charity Majors
Based on the title, this sounds like just another “structure your logs, m’kay” article, but it’s even better than that: it covers why your teammates might be against structured logging and how to convince them. I particularly like the observations on the philosophy of logging for monolithic applications vs distributed applications.
This is an awesome new podcast.
PagerDuty Postmortem Documentation
PagerDuty recently released their postmortem documentation/guides. It’s quite good.
The folks at Expedia have released some code they use for anomaly detection.
A Blueprint for Splunk ITSI Alerting
For the Splunk fans in the room, a five-part guide on alerting methodology and setup in ITSI.
Best Practices for Instrumenting Applications with OpenTracing
“The biggest risk in the process of getting started with distributed tracing is only doing partial instrumentation. Often, someone becomes interested in tracing, acts as a champion by convincing groups and teams to use it, but not everyone does. That means there’s incomplete data, so it’s hard to show value, the momentum drops off, and the tracing effort ends.” – This is really the biggest risk in any new initiative, whether you’re trying to change deployment practices, overhaul monitoring, or even just improve testing. The author’s point is well taken though: especially when you’re working on an initiative where the value-add isn’t clear upfront (e.g., tracing), it’s easy for the initiative to end prematurely because too few teams adopt it.
Security Information and Event Management (SIEM) versus a newer approach, Security Analytics. Goodness, I do not miss working with SIEM systems.
Tune up your SLI metrics: CRE life lessons
Quoth the article, “A good SLI will exhibit the following properties: It rises when your customers become happier; It falls when your customers become less happy; It shows materially different measurements during an outage as compared to normal operations; It oscillates within a narrow band (i.e., showing a low variance) during normal operations.”
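If you want to sanity-check a candidate SLI against the last two properties, here is a rough sketch. It's mine, not from the article: the function name `sli_report`, the sample values, and the idea of measuring the outage gap in standard deviations are all assumptions for illustration.

```python
from statistics import mean, pstdev


def sli_report(normal, outage):
    """Summarize how an SLI behaves in normal operation vs. during an outage."""
    normal_mean = mean(normal)
    normal_stdev = pstdev(normal)
    outage_mean = mean(outage)
    gap = abs(normal_mean - outage_mean)
    return {
        "normal_mean": normal_mean,
        # A small value here means the SLI stays in a narrow band normally.
        "normal_stdev": normal_stdev,
        "outage_mean": outage_mean,
        # How many normal-operation standard deviations away the outage sits.
        "separation": gap / normal_stdev if normal_stdev else float("inf"),
    }


# Hypothetical request success ratios, sampled once per minute.
normal = [0.999, 0.998, 0.999, 0.997, 0.999]
outage = [0.90, 0.85, 0.88]

report = sli_report(normal, outage)
print(report)
```

A low `normal_stdev` together with a large `separation` suggests the SLI is both quiet in normal operation and loud during an outage. What counts as "low" and "large" depends on your service, so treat any threshold as a judgment call.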
[Prometheus] Using tsdb analyze to investigate churn and cardinality
“When it comes to Prometheus resource usage and efficiency, the important questions are around cardinality and churn. That is how many time series you have, and how often the set of time series changes. I recently added a subcommand to the tsdb utility to help determine this for existing blocks.”
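To make those two terms concrete, here is a minimal sketch. It is not how `tsdb analyze` works internally; the helper names and the series strings are made up for illustration, and churn is simplified to "series that appeared or disappeared between two snapshots."

```python
def cardinality(series):
    """Number of distinct time series in one snapshot."""
    return len(set(series))


def churn(before, after):
    """Count of series that appeared or disappeared between two snapshots."""
    return len(set(before) ^ set(after))


# Hypothetical snapshots: a pod restart swaps one series for another.
t0 = {
    'http_requests_total{pod="web-1"}',
    'http_requests_total{pod="web-2"}',
}
t1 = {
    'http_requests_total{pod="web-2"}',
    'http_requests_total{pod="web-3"}',
}

print(cardinality(t1))  # 2
print(churn(t0, t1))    # 2 (web-1 disappeared, web-3 appeared)
```

Notice that cardinality stays flat at 2 while churn is non-zero, which is why the two numbers need to be looked at separately.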
AWS SLA: Are you able to keep your availability promise?
I’ll save you a click: probably not. (But you should still totally read the article anyway.)
This issue is sponsored by:
If you’ve never read this Post-Incident Review guide from O’Reilly, you’re missing out. It’s one of the best ones I’ve seen.
Jobs
Want your job listed here? Why not submit a post to the job board? It’s only $99/ad for 30 days.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor