3 minute read

Background

A significant part of system stability is supported by monitoring. Large companies usually have well-established monitoring and operations teams to build the monitoring infrastructure. From a layered perspective, monitoring generally includes the following aspects:

Monitoring Dimension Middleware Selection Reason for Selection
Metric Monitoring Prometheus + Grafana Supports multiple Exporters, rich ecosystem, easy to configure alerts and visualizations
Log Monitoring Loki + Promtail/Fluent Bit Lightweight log aggregation solution, seamlessly integrates with Grafana
Distributed Tracing OpenTelemetry + Jaeger Open standard for distributed tracing, supports multiple languages
Database Monitoring Exporter (e.g., MySQL Exporter) Prometheus maintained by official or community, supports mainstream databases
Network Monitoring Blackbox Exporter Supports multi-protocol health checks like HTTP, TCP
Alerting and Notification Alertmanager Supports multi-channel notifications (email, Slack, Webhook, SMS, etc.)

Best Practices for Selection

Small and medium-sized companies can quickly build a monitoring system that suits their business characteristics. Prometheus has already become the standard for real-time monitoring. We can quickly set up our own monitoring system based on Prometheus:

Monitoring Dimension Middleware Selection Reason for Selection
Metric Monitoring Prometheus + Grafana Supports multiple Exporters, rich ecosystem, easy to configure alerts and visualizations
Log Monitoring Loki + Promtail/Fluent Bit Lightweight log aggregation solution, seamlessly integrates with Grafana
Distributed Tracing OpenTelemetry + Jaeger Open standard for distributed tracing, supports multiple languages
Database Monitoring Exporter (e.g., MySQL Exporter, Redis Exporter) Prometheus maintained by official or community, supports mainstream databases
Network Monitoring Blackbox Exporter Supports multi-protocol health checks like HTTP, TCP
Alerting and Notification Alertmanager Supports multi-channel notifications (email, Slack, Webhook, SMS, etc.)

System Architecture Design

graph TD; A[Prometheus] --> B[Exporters] A --> C[Blackbox Exporter] A --> D[Alertmanager] B --> E[Grafana] C --> E D --> E F[Loki] --> G[Promtail/Fluent Bit] G --> E H[OpenTelemetry] --> I[Jaeger] I --> E

Defining Refined Monitoring Metrics

JVM Monitoring

JVM monitoring is used to track important JVM metrics, including GC (Garbage Collection) instant metrics, heap memory metrics, non-heap memory metrics, metaspace metrics, direct buffer metrics, JVM thread count, etc. This section introduces JVM monitoring and how to view JVM monitoring metrics.

JVM monitoring can track the following metrics:

  • GC (Garbage Collection) instant and cumulative details
    • FullGC count
    • YoungGC count
    • FullGC duration
    • YoungGC duration
  • Heap Memory Details
    • Total heap memory
    • Old generation heap memory size
    • Young generation Survivor area size
    • Young generation Eden area size
  • Metaspace

    Metaspace size

  • Non-Heap Memory
    • Maximum non-heap memory size
    • Used non-heap memory size
  • Direct Buffer
    • Total DirectBuffer size (bytes)
    • Used DirectBuffer size (bytes)
  • JVM Thread Count
    • Total number of threads
    • Number of deadlocked threads
    • Number of newly created threads
    • Number of blocked threads
    • Number of runnable threads
    • Number of terminated threads
    • Number of threads in timed wait
    • Number of threads in waiting state
mindmap root((Java Process Memory Usage)) JVM Memory Heap Memory Young Generation Old Generation Non-Heap Memory Metaspace Compressed Class Space Virtual Machine Thread Stack Native Thread Stack Code Cache Direct Buffers Non-JVM Memory Native Runtime Libraries JNI Native Code

Host Monitoring

Host monitoring tracks various metrics such as CPU, memory, disk, load, network traffic, and network packet metrics. This section introduces host monitoring and how to view host monitoring metrics.

Host monitoring can track the following metrics:

  • CPU
    • Total CPU usage
    • System CPU usage
    • User CPU usage
    • CPU usage waiting for I/O completion
  • Physical Memory
    • Total system memory
    • Free system memory
    • Used system memory
    • Memory in PageCache
    • Memory in BufferCache
  • Disk
    • Total system disk size
    • Free system disk size
    • Used system disk size
  • Load

    System load average

  • Network Traffic
    • Network received bytes
    • Network sent bytes
  • Network Packets
    • Number of received packets per minute
    • Number of sent packets per minute
    • Number of network errors per minute
    • Number of dropped packets per minute

SQL Call Analysis

View SQL call analysis to understand SQL call patterns in applications.

Error Code Monitoring

For core business systems, such as payment systems, error code monitoring is essential.

Here’s how to install Prometheus step by step in English. If you use the docker you can use this to setup, this is the easy way. springboot-promethenus-grafana.

Install Article list