Reading time: 10 minutes

Applying MicroProfile Metrics to your application is easy (as usual for MicroProfile and Jakarta EE). But do you really understand all the numbers that it returns? Join Captain Freelants on his 10-minute mission to explore not-so-strange, new technologies.



It always starts with a distress call.

“…requesting assistance with our hyperdrive, please respond! Repeat: This is Starship Supersonic, we are requesting assistance with…“

Captain Freelants turned down the volume of his music synthesizer (currently playing “Raise Your Hands”, which he had heard in some ancient movie), and responded.

“This is Captain Freelants speaking. My ship’s distance from yours is about 2 million kilometers, so I can be with you shortly. What’s the problem?”

“We are trapped within the gravitational field of a black hole, and something’s wrong with our Quarkus drive! It’s too slow to get us out of here, and in a few minutes, we’re gonna turn into spaghetti! There are a bunch of blinking red lights and strange numbers in the diagnostics console, but no quick fix suggestion!”

“Let me speak with your engineer, please.”

“I AM the engineer! I even read the ‘Getting Started’ page before installing the Quarkus drive. Perhaps the problem’s got something to do with our monolithic energy core…“

“Tell me exactly what you see on the diagnostics console.”

“Way too much, but nothing I could understand! Like, there is one line labeled with ‘p99’ and the value ‘33643’, and…“

“Hold on. This means that 99 percent of your boost requests to the quarkus drive have a duration of 33643 seconds, which is way too long. You have to increase the power input to your drive immediately.”

“Are you really sure? Don’t you want to take a look at the other metrics as well?”

“Trust me, I know what I’m doing.”

“All right, increasing power input. The numbers are changing. Now they…“

In this moment, the other starship’s image on the Captain’s radar display turned into a cloud of debris, which was rapidly sucked in by the adjacent black hole.

“What the hell has happened?”

A translucent, hovering text box appeared in front of his head.

Freelancer license examination failed (miserably).

Primary reason: Wrong understanding of percentiles and metric units.

Consequence: Your client has turned into charcoal.

Insert coin to try again (in 30 solar days).

“Alright, perhaps I should spend a little more time preparing for the next attempt. I’ll go directly to the source of truth - the Hall of Specifications!”

After a short travel through hyperspace, during which he kept taking nervous looks at his diagnostics console, Captain Freelants finally arrived at a huge, white hall with just a single desk in its center. Sitting behind the desk, there was a pale, old android with a grumpy-looking face and a sour voice.

"I am the Keeper of Specifications. Who are you, and what do you want?"

“I am Captain Freelants, and…”

"Comic zone is next door. Besides, you misspelled your own name. Goodbye, Captain Freelants."

“‘Captain Freelance’ was already taken by some other guy. How can you hear that I misspelled it, anyway?”

"My superior mind allows me to peek into higher dimensions."

“Then you should also know that I’d like to learn more about MicroProfile Metrics, …”

The Captain read the small name tag on the Android’s robe.

“…Mr. Speck.”

"This is what I've been built for, so let's get this over with. Choose the type of metric that you want me to explain:"
  • @Counted
  • @SimplyTimed
  • @Gauge
  • @ConcurrentGauge
  • @Metered
  • @Timed

“I’m mostly interested in the meaning of ‘p99’, wherever this is buried. But it’s probably a good idea to learn about all of this stuff.”

Mr. Speck rolled his eyes.

"You know that it says '10 minute read' at the top of this blog entry?"

“And I didn’t have lunch yet, so hurry up, please. Perhaps I can speed up things for the first three types, which are quite easy: A Counter just counts the number of invocations of a method. A SimpleTimer measures the durations of a method’s invocations, and returns their sum. And a Gauge returns any number that you want it to, such as the current CPU temperature. So far, so good?”

"Yes, apart from the numerous omissions and imprecisions."

“But what on earth is a concurrent gauge? Concurrent CPU temperatures?”

"Wrong. An incrementally increasing or decreasing numeric value (e.g. number of parallel invocations of a method). This type exposes three values: current count, highest count within the previous full minute and lowest count within the previous full minute. [1]"

“Alright, now I get it. It’s a gauge measuring the number of concurrent invocations.”

"Isn't that what I just said?"

“We’re making excellent progress. Now what’s a Meter?”

"A metric which tracks mean throughput and one-, five-, and fifteen-minute exponentially-weighted moving average throughput. [1]"

”???”

"Meter is a complex metric type comprised of multiple key/values. [1]"

“Could you explain this in a more understandable way?”

"???"

"Perhaps I can help you."

Startled, Captain Freelants turned around to the source of this warm, comforting voice. Next to him, there was another desk he hadn’t noticed before, with a friendly-looking, blue-skinned woman sitting behind it. And she had cute antennas, too.

"Please excuse Mr. Speck. He is such a valuable source of information, but sometimes, it is hard for him to explain certain basics that he takes for granted. This is usually where my assistance is required. My name is Tute Oriel, nice to meet you!"

“Likewise. How did you get here so fast?”

"Materialized through auto-suggest. This is a sci-fi story, after all."

“OK. Could you start by explaining what exactly ‘mean througput’ means?”

"Of course. This simply tells you (for the method or class that you have annotated with @Metered) how many requests per second have been processed on average since the Meter was created. So for instance, if your server has been running for 100 seconds, and has processed 1000 requests so far, you get a mean throughput of 10 requests per second."

“And what is an ‘exponentially-weighted moving average throughput’?”

"The problem with the normal, mean throughput is that in many cases it can't really tell you how your server is doing right now, because the server has already been running for quite some time. If your server's throughput was fine for the first 10 days, but started getting worse in the last 5 minutes, you won't see this in the average number: The computation for the simple average treats all observations equally, regardless of when they were made."

“So it would be nice to have some other kind of average that focuses on the recent past.”

"Right, and this is exactly what the weighted moving average does: Newer observations get a higher 'weight' than older ones. Roughly speaking, the one-minute moving average has its focus on the last minute, the five-minute on the last five minutes, and so on. You can imagine a window of focus that keeps moving in time to keep up with the newest observations. And finally, it is called 'exponential' because this is the way how older observations lose their importance as time progresses. If you google for 'exponential decay', you'll see some nice diagrams that will give you the basic idea of how this looks like."

“That’s great, I think I got this one, too. And what about @Timed?”

"A metric which aggregates timing durations and provides duration statistics, plus throughput statistics. [1]"

"Thank you, Mr. Speck. First, a Timer comes with the same data as a Meter."

“So this means that I don’t need @Metered at all when I use @Timed?”

"Yes. In addition, it tells you the minimum processing duration that was observed since the server was started, as well as the maximum, and the average. These ones are quite easy to understand, I suppose?"

"Of course they are! How could anybody not understand these?"

"A Timer also gives you the standard deviation, which tells you how your durations tend to deviate from the average. In the simplest case, if all your processing durations are exactly 50 milliseconds, you also get an average duration of 50 milliseconds, with a standard deviation of 0. In another example, if half of your durations are 40 milliseconds, and the other half are 60 milliseconds, you get an average of 50 milliseconds again. But this time, the standard deviation is 10 milliseconds, meaning that the values usually deviate 10 milliseconds from the average."

“Once you get it, it’s quite easy to understand. What about p99, when do we get to that?”

"That's part of the last group of metrics returned by a Timer: the percentiles. If the 99th percentile (p99) for your durations is 50 milliseconds, it means that 99% of all observed durations are below 50 milliseconds. High percentiles can quickly tell you whether the majority of your requests has a decent duration, although you shouldn't neglect the last percent, either. And besides, you should also check what the API mentions as default unit. In case of the @Timed annotation, it is nanoseconds."

“That’s why the simulated client’s ship exploded.”

"And you should never rely on just a single metric at a single point in time. The usual approach is to continually collect ("scrape") metrics, and display them on a neat timeline using Prometheus and Grafana. This way, you can easily spot the time of events that were causing problems."

“Wow, thank you - that was really helpful! However, now I feel a little lost, with all these different kinds of metrics…”

"I COULD provide you with a summary of Tute Oriel's explanations. Unofficially."

“Thank you, too, Mr. Speck. Now that I understand this stuff a little better, I might even consult you for reference from time to time.”

"One day, you MIGHT become worthy."

“After all this learning, what would you think about having lunch together, Tute Oriel? I’ve got a starship, you know?”

"Thank you for the invitation. But this is just a simulation, you know?"

“Oh, right…”

A little disappointed, Captain Freelants took off his VR helmet and put it back on his desk, next to the model of the starship he would buy after finally passing the freelancer examination.

“Time to go to lunch. And then I’ll take a look at Mr. Speck’s summary.”



Mr. Speck's Unofficial Summary

@Counted

  • total number of processed requests
{
  "package.class.method": 42
}

@SimplyTimed

  • also includes data from @Counted
  • processing duration (sum)
  • default unit: nanoseconds
{
  "package.class.method": {
    "count": 42,
    "elapsedTime": 30338.0
  }
}

@Gauge

  • example: CPU temperature
{
  "package.class.method": 42
}

@ConcurrentGauge

  • parallel invocations:
    • current number
    • minimum number in the previous minute
    • maximum number in the previous minute
{
  "package.class.method": {
    "current": 2,
    "min": 0,
    "max": 7
  }
}

@Metered

  • also includes data from @Counted
  • mean rate: average number of requests per second (since the meter was created)
  • 1m / 5m / 15m rate per second
    • exponentially-weighted moving average with focus on the given time interval
    • includes values since the meter was created (older values continuously getting weaker)
  • default unit: per second
{
  "package.class.method": {
    "count": 85457,
    "meanRate": 88.02073105290496,
    "oneMinRate": 89.56418908089488,
    "fiveMinRate": 85.79830589848899,
    "fifteenMinRate": 85.48272286429467
  }
}

@Timed

  • also includes data from @Metered
  • minimum duration so far
  • maximum duration so far
  • average duration so far
  • standard deviation
    • i.e., deviation from the average
  • percentiles: p50/75/95/98/99/999
    • x% of the durations are below the returned value
  • default unit: nanoseconds
{
  "package.class.method": {
    "count": 85457,
    "meanRate": 88.02073105290496,
    "oneMinRate": 89.56418908089488,
    "fiveMinRate": 85.79830589848899,
    "fifteenMinRate": 85.48272286429467,
    "min": 10037187,
    "max": 12008665,
    "mean": 10176720.438832652,
    "stddev": 155191.9130158552,
    "p50": 10113833,
    "p75": 10157627,
    "p95": 1.041846E+7,
    "p98": 10489736,
    "p99": 1.064567E+7,
    "p999": 12008665
  }
}