Histograms are awesome. You should use them. This talk goes into more detail about how, why, and their different types.
I’m always a big fan of these types of talks and hearing how a team/company goes through a radical evolution of something. This one is the folks at Bloomberg, who overhauled and consolidated all of their metrics tools into a single company-wide platform. Definitely worth a watch.
This is a list of really awesome tips about improving how you use Grafana. Seriously, it’s a great list.
A neat visualization of different open-source APM tools and how they fit into a complete solution.
Got Spring Boot applications lying around? Considering/currently using Prometheus? This two-parter walks through two scenarios: instrumenting the code directly and monitoring black boxes (for when you can’t change the code).
Monitoring the awful horribleness that is the banking industry has always fascinated me (I’m a masochist, clearly), so this post from the folks at Plaid got my attention. They take us through how they chose the components for the next iteration of their monitoring platform and how it all fits together to monitor 9600+ banks.
Left without comment…maybe because I’m now trying to figure out how to make this a thing in other tools. I asked around my Splunk friends and this integration is apparently a few years old and built by Domino’s themselves. So that’s cool.
This idea made the rounds on Twitter this week: set up an alert on increased page views of your support site or status page as an early warning mechanism that your customers are experiencing something wrong. I love the idea, but be careful about the noise the alert may generate. Also: not a bad way to create a DoS on someone’s support team: just hit their status page with an automated curl script. :/
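To make the noise concern concrete, here’s a minimal sketch of how such a check might look. Everything in it is an assumption on my part, not something from the original thread: the function name, the per-minute view counts, the three-sigma threshold, and the "several consecutive readings" rule are all just one hedged way to keep a single spike (or a bored person with curl) from paging anyone.

```python
from statistics import mean, stdev


def should_alert(history, recent, sigmas=3.0, min_consecutive=3):
    """Decide whether status-page traffic looks abnormally high.

    history: per-minute page-view counts from a normal baseline period.
    recent:  the latest per-minute counts, oldest first.
    Both inputs, and the default thresholds, are hypothetical.
    """
    # Not enough data to build a baseline or to confirm a sustained rise.
    if len(history) < 2 or len(recent) < min_consecutive:
        return False

    # Anything above baseline mean + N standard deviations counts as "high".
    threshold = mean(history) + sigmas * stdev(history)

    # Only alert if the last few readings are ALL high, which filters out
    # one-off spikes.
    return all(count > threshold for count in recent[-min_consecutive:])


baseline = [10, 12, 11, 9, 10, 11, 12, 10]
sustained = should_alert(baseline, [40, 45, 50])
single_spike = should_alert(baseline, [11, 10, 50])
```

Requiring a sustained rise trades a few minutes of detection time for fewer false pages. It won’t stop a determined script that hammers the page continuously, so you’d probably still want to count unique visitors rather than raw hits.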
There’s one paragraph in this that’s super important and relevant for you folks (the one about metrics): error budgets are tied to customer-impacting errors, which means the metric(s) you use to determine errors must be an accurate portrayal of customer impact. But sometimes they aren’t, and you’ve got the wrong metric(s). This is harder than it sounds to do well when you’re running complex applications, and even harder when you’re trying to come up with SLOs and error budgets for internal systems. Even trickier is understanding what level of errors equals actual customer impact: if you drop one request, do customers notice? What if it’s ten? A thousand? Do you have data to back that up? Moral of the story: arbitrary SLOs are bad, m’kay.
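If the error budget arithmetic is fuzzy, here’s a rough sketch of it. The function name and the numbers are made up for illustration, and it assumes the simplest possible model: a request-based SLO where every failed request counts equally against the budget. Whether a "failed request" actually reflects customer pain is exactly the hard part the article is talking about, and no code will answer that for you.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_budget) for a request-based SLO.

    slo_target is a fraction such as 0.999 for "three nines".
    All names and inputs here are hypothetical.
    """
    # The budget is whatever fraction of requests the SLO lets you fail.
    allowed_failures = (1 - slo_target) * total_requests

    # A negative remainder means the budget is already blown.
    remaining_budget = allowed_failures - failed_requests
    return allowed_failures, remaining_budget


# One million requests at a 99.9% target leaves room for about 1,000 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 250)
```

The math is the easy bit. The 0.999 is the number you need data to justify.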
If you’re into reading academic papers, here’s one I found via John Allspaw this weekend: a diagnosis of why “too much data” is such a difficult problem to solve and why it seems to just be getting worse.
There are a couple of really cool bits in this release: alerting for Elasticsearch datasources and native Grafana builds for ARM. That second one is something I’ve been waiting on: I run Grafana on ARM devices all the time, so I love that there’s now a native build for it.
If you’re looking for ways to improve your logging with Java applications, this logging library from Google might do the trick. They explain more about it in the readme, but the gist is that they’ve combined all of their Java logging libraries into one standard that aims to solve most, if not all, of their own pain points with logging in Java.
If you’re in the Stuttgart, Germany area, there’s an Icinga meetup coming up soon.
Sensu has graciously offered a discount code for all Monitoring Weekly readers! Use MonitoringWeekly at checkout for $50 off the early bird ticket.
See you next week!
— Mike (@mike_julian) Monitoring Weekly Editor