Issue 225
This week’s issue is heavy on AWS, serverless, and time series database topics. Looks like hiring for remote engineers is still going strong, with numerous job postings at the bottom. Enjoy! 😎🍻🚠
This issue is sponsored by:
Can you rely on your deployments?
In a recent Armory and Gartner report, 35% of respondents named reliability and consistency as their top pain point with app deployment. If you need help with consistent, reliable deployments, try Armory Continuous Deployment-as-a-Service. Check out more in the reports here.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Visualising and Monitoring in Test Automation
An example for monitoring Selenium test results in Grafana using Prometheus and Pushgateway. If you’re already familiar with Prometheus you can skip the first half of the article.
AWS Lambda Observability Best Practices
Some tips and considerations for folks new to Lambda observability practices. Note that some of these still require practical hands-on experience with your respective service(s) in order to configure things properly.
The Evolution of Serverless Monitoring Tools
This post is less about the evolution of serverless than it is an overview of insights and concerns for anyone looking to adopt and maintain serverless infrastructure. But yes, it concludes with a very brief summary of some of the more popular commercial serverless monitoring services.
Analyzing Time Series for Pinterest Observability
A wonderful article from Pinterest engineering discussing their use of time series, and how their unique needs drove the design of the current implementation.
Improving query performance in Grafana Mimir: Why we dropped mmap from the store gateway
If you’re a time series database geek you might appreciate this post from Grafana engineers on how they reworked Mimir’s store-gateway to alleviate stalling issues with queries.
One of the more useful guides I’ve seen for crafting your own custom CloudWatch metrics and alarms.
Crash tolerance, and missing notifications from Alertmanager
Alertmanager is an excellent tool for routing alerts, but it can suffer from crashes like any other piece of software. This post explains how it aggregates alerts, what happens to them after a crash, and how to optimize your usage to minimize unexpected behavior.
Spark clusters monitoring with Prometheus and Graphite Exporter
A nice writeup from engineers at QuintoAndar on how they leverage the Graphite exporter to collect metrics from Apache Spark into their existing Prometheus cluster.
Job Opportunities
Infrastructure Engineer at Platform Science (US Remote)
Senior/Staff Site Reliability Engineer at Platform Science (US Remote)
Principal DevOps Engineer, Automation at Calix (NA Remote)
Site Reliability Engineer at Teleport (EU Remote)
Senior Site Reliability Engineer at Teleport (US Remote)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor