Exposing Metrics and Health Checks in Quarkus Services

Logs explain what happened. Metrics explain how the application is behaving now.

A Java service that runs in production should expose measurable signals about resource usage, call volume, timing, health, and business activity. Those signals let the platform alert when something goes wrong and let the team understand whether a release improved or damaged the system.

Quarkus and MicroProfile-style APIs make this practical by exposing metrics and health checks through HTTP endpoints.

The Problem

A production service can fail slowly.

It may still answer requests, but memory usage grows. Thread count rises. A database connection becomes unstable. A payment endpoint becomes slower. A traffic spike prevents the service from accepting more work.

Logs may show symptoms, but they do not provide an easy real-time snapshot.

Logs:
Sequential events produced by the application

Metrics:
Current or aggregated values exposed by the application

Health checks:
Simple up or down signals used by platforms

A good monitoring design uses all three. This post focuses on metrics and health checks.

Metrics versus Logs

A log entry is pushed somewhere: a console, file, database, or external system.

A metric is exposed by the application and pulled by another system. A collector typically reads the values at a regular interval and stores them over time, which turns individual snapshots into a trend.

Application
  |
  v
/metrics endpoint
  |
  v
collector pulls values
  |
  v
time series storage
  |
  v
dashboard and alerts

Metrics show snapshots and trends. They can answer questions such as:

How many calls has this endpoint received?
How long does this operation take?
How many threads are currently active?
How much heap memory is used?
How many payment transactions are being processed?

The chapter describes metrics as telemetry. That term is useful because the application is instrumented to expose operational signals.

Core Metrics Types

MicroProfile Metrics can expose metrics through annotations. Two practical examples are counters and timers.

A counter tracks how many times something has happened. A timer tracks how long an operation took.

@GET
@Path("/hello")
@Produces(MediaType.TEXT_PLAIN)
// Counts every invocation of this endpoint.
@Counted(name = "callsNumber", description = "How many calls received.")
// Records how long each invocation takes.
@Timed(name = "callsTimer", description = "Time for each call", unit = MetricUnits.MILLISECONDS)
public String hello() throws InterruptedException {
    // Simulate variable latency: sleep between 0 and 2900 ms.
    int rand = (int) (Math.random() * 30);
    Thread.sleep(rand * 100);
    return "Hello RESTEasy";
}

The method exposes a simple REST endpoint. The annotations describe metrics to collect from that endpoint.

Depending on the implementation, a timer can report derived values such as call count, minimum, maximum, mean, and other timing summaries.
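To make those derived values concrete, here is a minimal framework-free sketch of the bookkeeping behind a counter and a timer. It is not the MicroProfile Metrics implementation. The class name CallStats and its methods are hypothetical and exist only for illustration.

```java
import java.util.concurrent.TimeUnit;

// Illustration only: tracks call count plus min, max, and mean duration.
final class CallStats {
    private long count;
    private long totalNanos;
    private long minNanos = Long.MAX_VALUE;
    private long maxNanos;

    // Records one completed call and its duration in nanoseconds.
    synchronized void record(long durationNanos) {
        count++;
        totalNanos += durationNanos;
        minNanos = Math.min(minNanos, durationNanos);
        maxNanos = Math.max(maxNanos, durationNanos);
    }

    synchronized long count() {
        return count;
    }

    // Returns 0 until at least one call has been recorded.
    synchronized long minMillis() {
        return count == 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(minNanos);
    }

    synchronized long maxMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxNanos);
    }

    synchronized double meanMillis() {
        return count == 0 ? 0.0 : (totalNanos / (double) count) / 1_000_000.0;
    }

    // Times a unit of work and records the elapsed time, even if the work throws.
    void time(Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            record(System.nanoTime() - start);
        }
    }
}
```

Wrapping a call as stats.time(() -> handleRequest()), where handleRequest is some hypothetical unit of work, updates the count and the timing summary together. That is roughly the effect the two annotations achieve declaratively.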

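A simplified metrics output for this endpoint can look like the following. The it_test_MetricsTest part of each name comes from the class that declares the method, and the timer appears in seconds because this text format reports time values in base units, even though the annotation declares milliseconds: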

# HELP application_it_test_MetricsTest_callsNumber_total How many calls received.
# TYPE application_it_test_MetricsTest_callsNumber_total counter
application_it_test_MetricsTest_callsNumber_total 4.0

# HELP application_it_test_MetricsTest_callsTimer_seconds Time for each call
# TYPE application_it_test_MetricsTest_callsTimer_seconds summary
application_it_test_MetricsTest_callsTimer_seconds_count 4.0
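The layout of the counter lines above is simple enough to reproduce by hand. The following sketch renders one counter in the same HELP, TYPE, sample order. ExpositionFormat and renderCounter are hypothetical names, and the sketch skips details a real exporter has to handle, such as escaping special characters in the description.

```java
// Illustration only: renders a single counter in the text layout shown above.
final class ExpositionFormat {

    // Produces the HELP line, the TYPE line, and the sample line, each newline-terminated.
    static String renderCounter(String name, String help, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " counter\n"
             + name + " " + Double.toString(value) + "\n";
    }

    public static void main(String[] args) {
        System.out.print(renderCounter(
                "application_it_test_MetricsTest_callsNumber_total",
                "How many calls received.",
                4));
    }
}
```

Because the description travels with every scrape, a vague description is visible to every dashboard and query tool that reads the endpoint.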

The important design habit is to name metrics clearly. A metric name and description should be useful to a person reading a dashboard during a production problem.

Metrics Endpoints

Metrics need stable endpoints so external systems can collect them.

The chapter highlights these endpoint groups:

/metrics/application
  application-defined metrics

/metrics/vendor
  vendor-specific metrics

/metrics/base
  predefined platform metrics

/metrics
  all available metrics

Quarkus serves these endpoints under its /q prefix, so application metrics are exposed at /q/metrics/application.

Base metrics can include useful JVM-level values, such as heap memory usage and live thread count.

base_memory_usedHeap_bytes
base_thread_count

These values are not business features, but they are critical during troubleshooting. A sudden increase in response time may correlate with memory pressure, thread growth, or other runtime symptoms.
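For a feel of what such values look like at the source, the standard java.lang.management API can report used heap and live thread count from inside any JVM. This is a plain-JDK sketch, not a description of how Quarkus produces its base metrics. JvmSnapshot is a hypothetical class name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

// Illustration only: reads two JVM-level values comparable to the base metrics above.
final class JvmSnapshot {

    // Bytes of heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    // Number of live threads, counting both daemon and non-daemon threads.
    static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("used heap bytes: " + usedHeapBytes());
        System.out.println("live threads:    " + liveThreadCount());
    }
}
```

A single reading says little on its own. The value becomes useful once it is sampled repeatedly and compared with response times over the same period.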

Collecting Metrics with Prometheus

Exposing metrics is only the first step. The data needs to be collected and stored.

Prometheus is a widely used monitoring system for this role. It scrapes metrics from systems that expose compatible endpoints, stores them in a time series database, supports querying, and can trigger alerts. It also ships with a built-in web interface and integrates with Grafana.

A practical architecture looks like this:

Quarkus service
  |
  v
/q/metrics/application
  |
  v
Prometheus
  |
  v
Time series database
  |
  v
Grafana dashboard or alert

The service owns instrumentation. Prometheus owns collection and time-based storage. Grafana or a similar frontend owns the dashboard presentation.
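The collection step can be pictured as reading the endpoint's text and turning each sample line into a name and a number. The sketch below shows only that parsing step under simplifying assumptions: it ignores optional timestamps and special values such as +Inf, and ScrapeParser is a hypothetical name, not part of Prometheus.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: parses "name value" sample lines from a scraped metrics body.
final class ScrapeParser {

    // Skips blank lines and comment lines (# HELP, # TYPE); keeps samples in order.
    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // The value is everything after the last space; the rest is the sample name.
            int space = trimmed.lastIndexOf(' ');
            if (space < 0) {
                continue; // not a "name value" pair
            }
            String name = trimmed.substring(0, space);
            double value = Double.parseDouble(trimmed.substring(space + 1));
            samples.put(name, value);
        }
        return samples;
    }
}
```

A real collector would attach the scrape time to each parsed value before storing it, which is what makes the data a time series rather than a snapshot.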

Add Custom Business Metrics

Technical metrics are useful for scaling and troubleshooting. Business metrics can be just as important.

Examples for a payment platform include:

payments_created_total
payment_authorization_time
active_payment_users
failed_payment_attempts
transaction_amount_total

The source chapter mentions that custom metrics can track use-case-specific figures such as the number of payments or transactions.

A practical rule is to expose metrics that someone can act on.

Useful metric:
payment_authorization_time

Why:
It can be connected to user experience, service performance, and downstream issues.

Weak metric:
internalCounter42

Why:
It has no clear owner, meaning, or action.

Metrics should support decisions, not only fill dashboards.
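One way to enforce that rule in code is to refuse metrics that arrive without a description. The following framework-free sketch assumes a hypothetical BusinessMetrics registry. It is not the MicroProfile Metrics API, and the description text is invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustration only: a tiny registry that insists on a description per metric.
final class BusinessMetrics {
    private final Map<String, String> descriptions = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Registers a counter; a blank description is rejected so every metric stays explainable.
    void register(String name, String description) {
        if (description == null || description.isBlank()) {
            throw new IllegalArgumentException("metric " + name + " needs a description");
        }
        descriptions.put(name, description);
        counters.putIfAbsent(name, new LongAdder());
    }

    // Increments a registered counter; unknown names fail fast instead of creating stray metrics.
    void increment(String name) {
        LongAdder counter = counters.get(name);
        if (counter == null) {
            throw new IllegalStateException("unregistered metric: " + name);
        }
        counter.increment();
    }

    // Returns the current total, or 0 for a name that was never registered.
    long value(String name) {
        LongAdder counter = counters.get(name);
        return counter == null ? 0 : counter.sum();
    }
}
```

With this shape, a call such as register("payments_created_total", "Payments accepted for processing") succeeds, while a name like internalCounter42 with an empty description is rejected at startup rather than discovered on a dashboard later.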

Health Checks

Health checks are a special kind of monitoring signal. They do not expose trends. They answer a simpler question: is this application healthy enough for the platform to keep it running and send it traffic?

In cloud and PaaS environments, especially Kubernetes, health checks can drive self-healing and traffic routing.

The chapter describes three health check concepts:

Check      Practical meaning
Liveness   Is the application up and running?
Readiness  Is the application ready to receive requests?
Startup    Has the first startup phase been completed?

These checks may sound similar, but they are used differently.

A liveness failure can tell the platform to restart the process.

A readiness failure can tell the platform to stop sending traffic temporarily.

A startup check can allow a slow application to complete initialization before liveness or readiness decisions become strict.
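The three reactions described above can be summarized as a small decision function. This is a deliberately simplified sketch: real platforms also apply timeouts and failure thresholds, and a startup check that never succeeds usually leads to a restart as well. ProbePolicy and its Action values are hypothetical names.

```java
// Illustration only: maps the three check results to a simplified platform reaction.
final class ProbePolicy {

    enum Action { KEEP_WAITING, RESTART, STOP_TRAFFIC, ROUTE_TRAFFIC }

    static Action decide(boolean started, boolean live, boolean ready) {
        if (!started) {
            // Startup not finished: liveness and readiness are not judged yet.
            return Action.KEEP_WAITING;
        }
        if (!live) {
            // The process is considered broken; restarting is the remedy.
            return Action.RESTART;
        }
        if (!ready) {
            // The process is fine but cannot serve correctly right now.
            return Action.STOP_TRAFFIC;
        }
        return Action.ROUTE_TRAFFIC;
    }
}
```

The ordering matters: a service that is still starting is not restarted for failing liveness, and a live service that is not ready is kept running while traffic is held back.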

Quarkus Health Check Endpoints

With the SmallRye Health extension installed, Quarkus exposes health probes by default under these paths:

/q/health/live
/q/health/ready
/q/health/started

The responses are formatted as JSON.

A basic liveness check can be implemented with the @Liveness annotation:

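import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

// The string passed to up() is the check name shown in the JSON response.
@Liveness
public class MyLiveHealthCheck implements HealthCheck {
    @Override
    public HealthCheckResponse call() {
        return HealthCheckResponse.up("Everything works");
    }
}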

A readiness check can use @Readiness:

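import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Readiness;

// Always reports ready; a production check would test required dependencies.
@Readiness
public class MyReadyHealthCheck implements HealthCheck {
    @Override
    public HealthCheckResponse call() {
        return HealthCheckResponse.up("Ready to take calls");
    }
}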

A startup check can use @Startup:

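import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Startup;

// This is MicroProfile Health's @Startup, not io.quarkus.runtime.Startup.
@Startup
public class MyStartedHealthCheck implements HealthCheck {
    @Override
    public HealthCheckResponse call() {
        return HealthCheckResponse.up("Startup completed");
    }
}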

The API is intentionally simple. The real value comes from deciding what each check should test.

What Should a Health Check Verify?

A trivial check that always returns up is useful only to prove that the framework endpoint works. A production check should inspect something meaningful.

Possible checks include:

Liveness:
The process event loop or main runtime is not stuck.

Readiness:
The service can reach the required dependencies.

Startup:
Initial configuration and warmup have been completed.

For example, readiness may check whether a database connection is available. If the database is temporarily unavailable, the service may still be alive but not ready to process requests correctly.

Health checks can also report a negative result by returning a down response. When several checks are registered for the same probe, the overall status is up only when every check is up. Each response can carry metadata, such as the name of the failing dependency, to make it more useful during troubleshooting.
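The aggregation rule and the role of metadata can be sketched without any framework. In the following example a check that throws is reported as down with the error message attached, and the overall status is up only if every result is up. HealthAggregate, Result, and the "error" key are hypothetical and do not mirror the MicroProfile Health API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Illustration only: runs named checks and combines them into one overall status.
final class HealthAggregate {

    static final class Result {
        final String name;
        final boolean up;
        final Map<String, String> data;

        Result(String name, boolean up, Map<String, String> data) {
            this.name = name;
            this.up = up;
            this.data = data;
        }
    }

    // Runs one check; an exception becomes a down result carrying the error message.
    static Result run(String name, BooleanSupplier probe) {
        try {
            return new Result(name, probe.getAsBoolean(), Map.of());
        } catch (RuntimeException e) {
            return new Result(name, false, Map.of("error", String.valueOf(e.getMessage())));
        }
    }

    // The overall status is up only when every individual result is up.
    static boolean overallUp(List<Result> results) {
        return results.stream().allMatch(r -> r.up);
    }
}
```

A readiness probe built this way would wrap the database test in run("database", ...), so an unreachable database turns the overall answer to down while the attached message tells the operator which dependency failed.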

Practical Workflow

Identify the most important user-facing operations.
Add counters to important calls.
Add timers where latency matters.
Expose custom metrics for core business activity.
Keep metric names readable and stable.
Collect metrics through Prometheus or a similar collector.
Build dashboards around symptoms the team can act on.
Implement liveness, readiness, and startup checks separately.
Let readiness fail when required dependencies are unavailable.
Use alerts sparingly and connect them to response actions.

Common Mistakes

The first mistake is collecting metrics with no purpose. Every important metric should support troubleshooting, scaling, reporting, or business insight.

The second mistake is treating health checks as a single endpoint. Liveness, readiness, and startup checks exist for different reasons.

The third mistake is making liveness depend on every external system. If a database fails, every instance fails its liveness check at once, and restarting them all does not bring the database back. A failed dependency is usually a readiness concern rather than a liveness one.

The fourth mistake is exposing business metrics without ownership. If no one knows what a metric means, no one can act on it.

The fifth mistake is alerting on everything. Too many alerts train teams to ignore the monitoring system.

Checklist

Application metrics are exposed through stable endpoints.
Metrics include names and descriptions.
Endpoint call counts and timings are measured where useful.
Base JVM or runtime metrics are collected.
Custom business metrics are defined for important flows.
Prometheus or another collector stores time-based metrics.
Dashboards connect metrics to real operations.
Liveness, readiness, and startup checks are implemented separately.
Health checks test meaningful conditions.
Down responses include enough context to troubleshoot.
Alerts are tied to clear actions.

Conclusion

Metrics and health checks turn a Java service from a black box into an observable production component.

Use metrics to measure behavior over time. Use health checks to let the platform decide whether an instance should receive traffic, keep running, or finish startup. In Quarkus, MicroProfile-style annotations and endpoints make the mechanics simple. The architecture work is choosing the right signals and connecting them to operational decisions.