This week reminds me of what got me excited about monitoring in the first place… building reliable, scalable systems with the visibility and knowledge to maintain them. Some fun topics around time-series, alerting, and Kubernetes troubleshooting. Enjoy! 🎈🥂🌻

This issue is sponsored by:

Headed to Portland for Monitorama 2023 PDX? We sure are! Chronosphere’s Co-founder and CTO, Rob Skillington, will be speaking about cost-efficient metrics aggregation on Monday, June 26! Come check out his session, grab some swag, and enter for a chance to win a Mighty Bowser™ LEGO set! See what other activities we’re up to that week here.

Articles & News on

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Unveiling the Architectural Brilliance of Prometheus

As a fan of push-based metrics collection, I’m not sure I buy into the rhetoric here, but this is a very good look at Prometheus’ strengths and how to use its multitude of features.

Improved Alerting with Atlas Streaming Eval

Anyone who’s tried to alert on complex time-series queries can empathize with this one. More proof that Kyle Kingsbury’s Riemann was ahead of its time.

Demystifying OOM Killer in Kubernetes: Tracking Down Memory Issues

Kubernetes does a good job of managing resources, but it’s naive to think we won’t need to troubleshoot it like any other system from time to time. And knowing how to debug something is the first step to monitoring it effectively.

Quick start with SkyWalking Go Agent

Golang developers rejoice: the SkyWalking project has released a new auto-instrumenting agent specifically for Go applications. Looks like the older go2sky project will be deprecated in the not-too-distant future.

Mastering Kubernetes Troubleshooting: Best Practices and Tools for Effective Cluster Maintenance

Some foundational tips and techniques for Kubernetes debugging that should inform your monitoring strategy.

Laffer’s Curve and Reliability of Software Systems

Interesting discussion on the balance we strive for when building any reliable system. I’d posit that any team revisits this numerous times over a company’s growth.

Moogsoft logo

People, Process, Technology - How has your business changed?

The 2nd Annual State of Availability survey is out and we want to hear from you. Tell us how your business has changed over the last year around ITOps, DevOps & AIOps. Survey respondents will be entered to win a $100 Amazon Gift Card. (SPONSORED)

Thanos Ruler and Prometheus Rules — a match made in heaven

If you’re looking to minimize your Prometheus retention but need to support longer ranges on Thanos queries, you might want to check out the Ruler component. This post is a quick anecdotal look at one company’s need for it in light of evolving retention demands.

10 tips, practices, to handle major incident

We often focus on the processes and responsibilities during an incident response, but we tend to neglect that how we communicate can have a tremendous effect on our peers and customers.

Grafana security release: CVE-2023-2183 and CVE-2023-2801

Grafana has published security patch releases to address medium- and high-severity CVE advisories.



The Golang auto-instrument Agent for Apache SkyWalking, which provides the native tracing abilities for Golang projects.


Monitorama 2023 PDX

Just two weeks left until everyone’s favorite monitoring conference of the year. I’m super excited to see everyone back in Portland for another awesome lineup of speakers and plenty of fun activities. Hope to see you there!

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor