Issue 225

This week’s issue is heavy on AWS, serverless, and time series database topics. Looks like hiring for remote engineers is still going strong, with numerous job postings at the bottom. Enjoy! 😎🍻🚠

This issue is sponsored by:

Can you rely on your deployments?

In a recent Armory and Gartner report, 35% of respondents named reliability and consistency as their top pain point with app deployment. If you need help with consistent, reliable deployments, try Armory Continuous Deployment-as-a-Service. Check out more in the reports here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Visualising and Monitoring in Test Automation

An example of monitoring Selenium test results in Grafana using Prometheus and Pushgateway. If you’re already familiar with Prometheus, you can skip the first half of the article.

AWS Lambda Observability Best Practices

Some tips and considerations for folks new to Lambda observability practices. Note that some of these still require practical hands-on experience with your respective service(s) in order to configure things properly.

The Evolution of Serverless Monitoring Tools

This post is less about the evolution of serverless than it is an overview of insights and concerns for anyone looking to adopt and maintain serverless infrastructure. But yes, it concludes with a very brief summary of some of the more popular commercial serverless monitoring services.

Analyzing Time Series for Pinterest Observability

A wonderful article from Pinterest engineering discussing their use of time series, and how their unique needs drove the design of the current implementation.

Improving query performance in Grafana Mimir: Why we dropped mmap from the store gateway

If you’re a time series database geek, you might appreciate this post from Grafana engineers on how they reworked Mimir’s store-gateway to alleviate stalling issues with queries.

Guardian of the Functions: Keeping an Eye on your Galaxy of AWS Step Functions with Custom Metrics on CloudWatch

One of the more useful guides I’ve seen for crafting your own custom CloudWatch metrics and alarms.

Crash tolerance, and missing notifications from Alertmanager

Alertmanager is an excellent tool for routing alerts, but it can suffer from crashes like any other piece of software. This post explains how it aggregates alerts, what happens to them after a crash, and how to tune your setup to minimize unexpected behavior.

Spark clusters monitoring with Prometheus and Graphite Exporter

A nice writeup from engineers at QuintoAndar on how they leverage the Graphite exporter to collect metrics from Apache Spark into their existing Prometheus cluster.

Job Opportunities

Infrastructure Engineer at Platform Science (US Remote)

Senior/Staff Site Reliability Engineer at Platform Science (US Remote)

Principal DevOps Engineer, Automation at Calix (NA Remote)

Site Reliability Engineer at Teleport (EU Remote)

Senior Site Reliability Engineer at Teleport (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor