Issue 134

Tons of great logging and SLO content this week. Oh, and a fresh stack of new job postings. Enjoy this issue and have a great day! 😎

This issue is sponsored by:

Manage incidents directly from Slack

Rootly helps automate the tedious manual work like creating incident channels, searching for runbooks, documenting the postmortem timeline, and more. Teams sized 20 to 2000 manage hundreds of incidents daily and save thousands of engineering hours a year with Rootly. Get started in <5min or book a demo to learn more and get Starbucks ☕ on us!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Send your metrics to a Prometheus Remote Write endpoint without Prometheus – OpenTelemetry

If you haven’t heard about Prometheus remote write, this is a gentle introduction to the functionality. We used this at my last gig and it offers a lot of flexibility (compared to traditional Prometheus scraping) for increasingly diverse deployment scenarios.

Under Disk Pressure

A little heavy on the memes, but a fun Kubernetes debugging story nonetheless.

Data-driven negotiation with SLIs, SLOs and Error Budgets

An impressively thorough look at everything related to SLOs and error budgets. I would set aside a good half hour to read (and re-read) this two-part series.

Log Aggregation in Kubernetes, and Transporting Logs to Splunk for Analysis

If you can afford it, I have no doubt that Splunk-Connect is a hella useful integration for aggregating your Kubernetes container logs.

Making Your On-call and Incident Management Program Stick

It takes a lot of work to build, adopt, and maintain a healthy Incident Management program, but it’s worth the investment. This article is a nice introduction to many of the considerations and open questions you should start thinking about when developing your own IM strategy.

Logging with Loki

This might be the first article that successfully explained to me what Loki is all about. Excellent summary here, pass it along to your peers.

Raspberry Pi Monitoring using Telegraf, InfluxDB, and Grafana

I know it may seem hard to believe, but Prometheus isn’t the only metrics system out there today. This engineer believes they have a case for choosing Telegraf with InfluxDB (have they heard of remote write, I wonder…).

See ROI on cloud and cloud native monitoring in minutes, free.

Ready to see insights on your cloud infrastructure workloads in minutes? The OpsRamp free trial makes it easy. Set it up with no credit cards or commitments, onboard your resources with our wizard, and use out-of-the-box or custom dashboards to get the metrics that matter. We'll even supply the GCP resources if you just want to see how it works. Get started today. (SPONSORED)

Pinterest’s Analytics as a Platform on Druid (Part 1 of 3)

In the first entry of a three-part series, Pinterest engineers walk through their transition to Apache Druid for analytics data. Although it’s not a monitoring or observability story in the strictest sense, I feel like the lines are beginning to blur between high cardinality metrics and analytics systems.

5 steps to improve your application availability

Monitoring Weekly readers are probably not the intended audience here, but it wouldn’t hurt to bookmark this one for the next time you need to justify your paycheck to a pointy-haired boss.

Improving efficiency and reducing runtime using S3 read optimization

Tons of useful SLIs and considerations in here for optimizing your own S3 usage.

Grafana Tempo 1.1 released: New hedged requests reduce latency by 45%

Some nice performance improvements and bug fixes in this minor release. Make sure to read the release notes (duh); it looks like some block formats have been deprecated.

Tools

splunk/splunk-connect-for-kubernetes

“Splunk Connect for Kubernetes provides a way to import and search your Kubernetes logging, object, and metrics data in your Splunk platform deployment.”

Events

IBM PREVAIL Conference: October 19–21, 2021

“PREVAIL is a unique follow-the-sun virtual event devoted to IT resilience, performance, security, quality testing and Site Reliability Engineering.”

Job Opportunities

Hardware Infrastructure Engineer, Analytics at DigitalOcean (Remote)

Senior DevOps/SRE at Reify Health (Remote)

Senior SRE, Data Infrastructure at Reify Health (Remote)

Reliability Engineer - Observability at Two Sigma (NYC)

Software Engineer, Observability at The New York Times (Remote)

Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor