Issue 120

Better late than never, right? All kidding aside, I’m thrilled to bring back the Monitoring Weekly newsletter. I’m grateful for so many of you sticking around through our extended hiatus. I hope you’re all doing well and starting to resume some post-pandemic levels of socialization. For now, enjoy a few more moments of quiet time reading this week’s articles. 😍

Articles & News on monitoring.love

Monitoring Weekly newsletter is back!

If you’re reading this, you already know the punch line. Good to see you again!

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Profiling in production to detect server bottlenecks

A collection of production profiling examples from Miro, where they walk through some of their learnings as they build observability into their stack. And you have to respect their honest self-assessment that they’ve re-invented the wheel out of necessity.

The Future is Bright, the Future is Prometheus Remote Write

In one of the talks from the recent PromCon Online 2021 virtual event, Tom Wilkie delivers a good overview of Prometheus remote write and how it fits into the greater Prometheus and Thanos federation story. Tom covers some of remote write’s strengths and weaknesses, as well as how we might expect to see it evolve in the not-too-distant future.

Everything is broken, and it’s okay

Humans are flawed, our systems are flawed, and it’s ok. Although the article is focused on aspects of systems reliability, it also serves as a reminder that not all of the software we run is our own.

Open-source tools you should use on an on-prem Kubernetes cluster

This article is a mile wide and an inch deep, covering a variety of tools relevant to Kubernetes. I love it because even if you don’t pick any of the tools discussed, the author does a great job of covering a ton of different considerations for the average K8s administrator.

How to correlate Graphite metrics and Loki logs

An upcoming feature in Grafana 8.0 will let you map Graphite metrics so they can be easily correlated with Loki logs.

Reverse debugging at scale

At Facebook scale, it’s often impractical to capture, rewind, and replay crashes for debugging. This group is leveraging Intel Processor Trace (PT) and eBPF to identify and trigger these captures, which are later used with other profiling software to reconstruct the instructions and debug the issue.

Step by Step detailed guide to setup Apache Skywalking on kubernetes

I’m not sure why, but the Apache SkyWalking project has been flying under the radar outside of China. If you’re interested in running APM in-house, this looks like a straightforward guide to setting up SkyWalking with OpenTelemetry for your first attempt.

Tools

conprof/conprof

Conprof is a continuous profiling system built to work alongside Prometheus (by design; the project creator is also a Prometheus maintainer). Reading the project description, it almost reminds me of a modern take on Riemann, but with an emphasis on tracing.

Events

GrafanaCONline 2021 - June 7-17 (Virtual)

GrafanaCONline is back again this year, this time with two weeks of talks and workshops. The upcoming release of Grafana 8.0 looks to be a major theme, along with a variety of talks on Prometheus, Loki, Tempo, and more.

o11ycon + hnycon - June 9-10 (Virtual)

Honeycomb is bringing o11ycon back for 2021, this time as a two-day event. The first day will include multiple tracks of vendor-neutral observability talks, while the second day (“hnycon”) will focus on Honeycomb-specific workshops and customer presentations.

Monitorama PDX 2021 - September 13-15 (Portland, OR)

One of the first technical conferences to resume in-person events, Monitorama is returning to Portland, OR this fall. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!

Job Opportunities

Site Reliability Engineer (Observability) at Major League Baseball

Sr Software Engineer, Reliability (Agents of Webapp) at Slack

Cloud SRE (Reliability) at Elastic

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor