Issue 143
I love talking tools, and this newsletter is chock full of them. Yeah, I know… tools don’t solve problems, but they’re still fun to play with. Enjoy! 🛠🌈🥇
This issue is sponsored by:
Start incident response with context to all your alerts in one view
Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards into a single incident view so you and your teams can take a more proactive approach to reduce MTTR. Sign up for the Moogsoft Free community plan today!
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Kubernetes Logging in Production
A thorough look at logging patterns for Kubernetes clusters, with a comprehensive rundown of the pros and cons of each approach.
Ok, this legit looks like a fun tool to use for debugging network activity. It’s great to see how quickly the tools and community are growing around eBPF and Cilium.
React faster: Use Alertmanager and MS Teams
I’m starting to see more adoption of Microsoft Teams across our industry. This article will show you how it’s possible to integrate with Alertmanager so you don’t miss out on any alerts.
Grafana 8.2.3 released with medium severity security fix: CVE-2021-41174 Grafana XSS
There’s a new XSS vulnerability in Grafana affecting versions 8.0.0-beta1 to 8.2.2. Although you should upgrade to the patched version at the earliest opportunity, there is a workaround available.
SLA vs. SLO vs. SLI: Understanding the Similarities and Differences
A good primer on how SLAs, SLOs, and SLIs relate and how to think about defining your own.
I always enjoy reading how other teams approach monitoring and reliability. Although we often end up at the same destination, many of us take very different journeys to get there.
How sparse histograms can improve efficiency, precision, and mergeability in Prometheus TSDB
Sparse histograms look like a vastly simpler approach to histograms in Prometheus. It sounds like there’s still a lot of work to be done, but the prototype results are promising.
Work. Without the hard work.
LogicMonitor empowers teams to spend less time troubleshooting and more time innovating with fully automated infrastructure monitoring and log analysis. AI-powered intelligence automatically detects monitoring resources, surfaces anomalies, and provides root cause analysis across your entire stack. Leave the manual configuration, expensive hardware, and long hours of troubleshooting behind with a free trial of LogicMonitor. (SPONSORED)
Docker: Run Telegraf as non-root
If you’re a Telegraf user, take note that the official Docker Hub image for release 1.20.3 runs the service as a non-root user.
Top key metrics for monitoring MySQL
An excellent walkthrough for monitoring your MySQL databases with the MySQL Prometheus Exporter, including some tips on which metrics to keep an eye on.
NGINX Monitoring: 7 Best Tools & Key Metrics to Measure
I don’t know many folks that are picking their monitoring stack based on how well it monitors NGINX specifically, but if you’re one of those people, I have just the article for you.
Zabbix 6.0 native High Availability
Zabbix 6.0 will be the first release with native high availability (read: no need for third-party clustering methods). This post does a solid job of pulling together all the steps needed to perform the upgrade, along with some gotchas to watch out for.
Tools
“prom2teams is an HTTP server built with Python that receives alert notifications from a previously configured Prometheus Alertmanager instance and forwards it to Microsoft Teams using defined connectors.”
“pwru is an eBPF-based tool for tracing network packets in the Linux kernel with advanced filtering capabilities. It allows fine-grained introspection of kernel state to facilitate debugging network connectivity issues.”
Job Opportunities
Sr. Site Reliability Engineer, Infrastructure at Flexe (US Remote)
Site Reliability Engineer, Infrastructure at Flexe (US Remote)
Systems Engineer - Database at Cloudflare (US Remote)
DevOps Engineer at Ring.io (Remote)
Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor