The monitoring paradox
Every startup that reaches out to us after a major incident says the same thing: "We knew we should have set up monitoring, but we never got around to it." The problem is that monitoring feels low-priority when everything is working. And by the time you need it, you're debugging blind in the middle of a crisis.
Here's the thing: a basic monitoring and alerting stack takes a single afternoon to set up. You don't need Datadog's enterprise plan or a dedicated SRE team. Open-source tools like Prometheus and Grafana are production-ready, free, and battle-tested at scale far beyond what most startups need.
Let's set it up.
The monitoring stack
We're building with four components:
- Prometheus — collects and stores time-series metrics from your application and infrastructure
- Grafana — visualizes metrics and lets you build dashboards
- prom-client — instruments your Node.js application to expose metrics
- pino — structured logging so you can actually search and analyze your logs
Docker Compose — the monitoring infrastructure
Start by standing up Prometheus and Grafana alongside your application:
```yaml
# docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
Prometheus needs to know what to scrape and where to send alerts:

```yaml
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "app"
    metrics_path: "/metrics"
    scrape_interval: 10s
    static_configs:
      # host.docker.internal resolves out of the box on Docker Desktop; on
      # Linux, add extra_hosts: ["host.docker.internal:host-gateway"] to the
      # prometheus service, or point at your app container's name instead.
      - targets: ["host.docker.internal:3000"]
        labels:
          service: "api"
          environment: "production"
```
Alertmanager routes alerts to Slack, with critical alerts going to a separate channel and repeating more often:

```yaml
# monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "slack-notifications"
  routes:
    - match:
        severity: critical
      receiver: "slack-critical"
      repeat_interval: 5m

receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
  - name: "slack-critical"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts-critical"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
```
Start it up:
```bash
# Create the directory structure
mkdir -p monitoring/{prometheus,grafana/provisioning/datasources,alertmanager}

# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d

# Verify everything is running
docker compose -f docker-compose.monitoring.yml ps
```
Grafana will be available at http://localhost:3001 and Prometheus at http://localhost:9090. Change the default Grafana password immediately.
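The compose file mounts a `monitoring/grafana/provisioning` directory, but nothing is in it yet. A minimal datasource file (the filename itself is arbitrary) wires Grafana to Prometheus automatically, so you skip the click-through datasource setup — a sketch:

```yaml
# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Container name from the compose file; Grafana reaches it on the
    # compose network, not via localhost.
    url: http://prometheus:9090
    isDefault: true
```

Grafana reads everything under `/etc/grafana/provisioning` at startup, so the datasource appears as soon as the container boots.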
Application metrics with prom-client
Now instrument your Node.js application to expose the metrics Prometheus will scrape:
```bash
npm install prom-client pino pino-http
```
```typescript
// src/metrics.ts
import client from "prom-client";

// Create a custom registry
const register = new client.Registry();

// Add default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({
  register,
  prefix: "app_",
});

// HTTP request duration histogram
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"] as const,
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// HTTP requests total counter
export const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"] as const,
  registers: [register],
});

// Active connections gauge
export const activeConnections = new client.Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [register],
});

// Business metrics
export const userRegistrations = new client.Counter({
  name: "user_registrations_total",
  help: "Total number of user registrations",
  labelNames: ["source"] as const,
  registers: [register],
});

export const paymentProcessed = new client.Counter({
  name: "payments_processed_total",
  help: "Total number of payments processed",
  labelNames: ["status", "provider"] as const,
  registers: [register],
});

export const dbQueryDuration = new client.Histogram({
  name: "db_query_duration_seconds",
  help: "Duration of database queries in seconds",
  labelNames: ["operation", "table"] as const,
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
  registers: [register],
});

export { register };
```
```typescript
// src/middleware/metrics.ts
import type { Request, Response, NextFunction } from "express";
import {
  httpRequestDuration,
  httpRequestsTotal,
  activeConnections,
} from "../metrics";

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  // Skip the metrics endpoint itself to avoid recursion
  if (req.path === "/metrics") {
    next();
    return;
  }

  activeConnections.inc();
  const startTime = process.hrtime.bigint();

  res.on("finish", () => {
    activeConnections.dec();
    const duration = Number(process.hrtime.bigint() - startTime) / 1e9;

    // Normalize route to avoid high-cardinality labels
    const route = normalizeRoute(req.route?.path || req.path);

    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode.toString(),
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestsTotal.inc(labels);
  });

  next();
}

function normalizeRoute(path: string): string {
  // Replace dynamic segments to keep cardinality manageable
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "/:id")
    .replace(/\/\d+/g, "/:id")
    .replace(/\/[A-Za-z0-9_-]{20,}/g, "/:id");
}
```
```typescript
// src/routes/metrics.ts — expose the /metrics endpoint
import { Router } from "express";
import { register } from "../metrics";

const router = Router();

router.get("/metrics", async (_req, res) => {
  try {
    res.set("Content-Type", register.contentType);
    res.end(await register.metrics());
  } catch (error) {
    res.status(500).end(String(error));
  }
});

export default router;
```
The route normalization in the metrics middleware is critical. Without it, endpoints like `/users/12345` and `/users/67890` become separate metric label values, which explodes Prometheus's cardinality and memory usage. Normalizing dynamic segments to `/:id` keeps your metrics clean and your Prometheus server stable.
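To see the normalization in action, here is a standalone sketch using the same regexes as `normalizeRoute` above (the example paths are illustrative):

```typescript
// Standalone copy of the normalization logic, for illustration only.
function normalizeRoute(path: string): string {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "/:id")
    .replace(/\/\d+/g, "/:id")
    .replace(/\/[A-Za-z0-9_-]{20,}/g, "/:id");
}

// Numeric IDs, UUIDs, and long opaque tokens all collapse to /:id
console.log(normalizeRoute("/users/12345"));
// -> /users/:id
console.log(normalizeRoute("/orders/3fa85f64-5717-4562-b3fc-2c963f66afa6/items/7"));
// -> /orders/:id/items/:id
```

Note the order matters: the UUID pattern runs first so its digit groups aren't partially mangled by the numeric rule.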
The business metrics (registrations, payments) are just as important as the technical ones. When your payment success rate drops, you want to know before your revenue dashboard shows a dip.
Alert rules — know when things break
Define the conditions that should wake someone up:
```yaml
# monitoring/prometheus/alert-rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "More than 5% of requests are returning 5xx errors for the last 2 minutes. Current rate: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "95th percentile latency is above 2 seconds for the last 5 minutes. Current p95: {{ $value }}s"

      - alert: HighMemoryUsage
        expr: |
          (app_process_resident_memory_bytes / 1024 / 1024) > 512
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Application memory usage is above 512MB. Current: {{ $value }}MB"

      - alert: DatabaseSlowQueries
        expr: |
          histogram_quantile(0.95,
            sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries detected"
          description: "95th percentile database query duration is above 1 second. Operation: {{ $labels.operation }}"

      - alert: ServiceDown
        expr: up{job="app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"
          description: "The application has been unreachable for more than 1 minute."

      - alert: HighEventLoopLag
        expr: app_nodejs_eventloop_lag_p99_seconds > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High event loop lag"
          description: "Node.js event loop p99 lag is above 500ms, indicating potential CPU saturation."
```
These rules follow the principle of alerting on symptoms, not causes. You alert on "error rate is high" (a symptom), not "CPU is at 80%" (a cause); high CPU might be fine if your error rate and latency are normal. The `for` clause prevents one-off spikes from triggering pages: a 2-minute sustained error rate is a real problem, while a 5-second spike usually isn't.
Health check endpoint
Every service needs a health check that your load balancer, orchestrator, and monitoring system can hit:
```typescript
// src/routes/health.ts
import os from "os";
import { Router } from "express";
import { Pool } from "pg";

const router = Router();

interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latencyMs?: number; error?: string };
    memory: { status: string; usedMb: number; totalMb: number };
  };
}

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
});

router.get("/health", async (_req, res) => {
  const health: HealthStatus = {
    status: "healthy",
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: "ok" },
      memory: {
        status: "ok",
        usedMb: Math.round(process.memoryUsage().rss / 1024 / 1024),
        totalMb: Math.round(os.totalmem() / 1024 / 1024),
      },
    },
  };

  // Check database connectivity
  try {
    const start = Date.now();
    await pool.query("SELECT 1");
    health.checks.database.latencyMs = Date.now() - start;
  } catch (error) {
    health.status = "unhealthy";
    health.checks.database = {
      status: "error",
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }

  // Check memory pressure
  const memoryUsage = process.memoryUsage();
  const heapUsedPercent = memoryUsage.heapUsed / memoryUsage.heapTotal;
  if (heapUsedPercent > 0.9) {
    health.status = health.status === "unhealthy" ? "unhealthy" : "degraded";
    health.checks.memory.status = "warning";
  }

  const statusCode = health.status === "unhealthy" ? 503 : 200;
  res.status(statusCode).json(health);
});

// Lightweight liveness probe for orchestrators
router.get("/health/live", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

// Readiness probe — is the service ready to accept traffic?
router.get("/health/ready", async (_req, res) => {
  try {
    await pool.query("SELECT 1");
    res.status(200).json({ status: "ready" });
  } catch {
    res.status(503).json({ status: "not ready" });
  }
});

export default router;
```
Three endpoints serve different purposes. The main `/health` endpoint returns detailed status for dashboards and debugging. The `/health/live` endpoint is a simple liveness check for Kubernetes or ECS; it just confirms the process is running. The `/health/ready` endpoint confirms the service can actually handle requests (database is reachable, etc.). Your load balancer should use the readiness probe to decide whether to route traffic to an instance.
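On Kubernetes, wiring these endpoints into probes looks roughly like this (container name, port, and timings are assumptions to adapt to your deployment):

```yaml
# Deployment spec excerpt — names and ports are illustrative
containers:
  - name: api
    ports:
      - containerPort: 3000
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      periodSeconds: 10
      failureThreshold: 3
```

A failing liveness probe restarts the container; a failing readiness probe just removes it from the service's endpoints until it recovers — which is exactly the split the two endpoints are designed for.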
Structured logging with pino
Structured logs (JSON) are searchable and analyzable. Unstructured logs (`console.log` strings) are not. The difference matters the moment you need to debug a production issue:
```typescript
// src/logger.ts
import pino from "pino";
import pinoHttp from "pino-http";

export const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  transport:
    process.env.NODE_ENV === "development"
      ? { target: "pino-pretty", options: { colorize: true } }
      : undefined,
  base: {
    service: "api",
    environment: process.env.NODE_ENV || "development",
  },
  serializers: {
    err: pino.stdSerializers.err,
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
  },
  redact: {
    paths: [
      "req.headers.authorization",
      "req.headers.cookie",
      "body.password",
      "body.token",
      "body.creditCard",
    ],
    censor: "[REDACTED]",
  },
});

export const httpLogger = pinoHttp({
  logger,
  customLogLevel: (_req, res, error) => {
    if (res.statusCode >= 500 || error) return "error";
    if (res.statusCode >= 400) return "warn";
    return "info";
  },
  customSuccessMessage: (req, res) => {
    return `${req.method} ${req.url} ${res.statusCode}`;
  },
  customErrorMessage: (req, _res, error) => {
    return `${req.method} ${req.url} failed: ${error.message}`;
  },
});

// Usage examples:
// logger.info({ userId: "123", action: "login" }, "User logged in");
// logger.error({ err, orderId: "456" }, "Payment processing failed");
// logger.warn({ queueDepth: 1500 }, "Job queue depth exceeding threshold");
```
```typescript
// src/app.ts — add to your Express app
import express from "express";
import { httpLogger, logger } from "./logger";

const app = express();
app.use(express.json()); // needed so req.body is populated below
app.use(httpLogger);

// Example: structured error logging in a route handler.
// createOrder is your application's own service function.
app.post("/api/orders", async (req, res) => {
  try {
    const order = await createOrder(req.body);
    logger.info(
      { orderId: order.id, amount: order.amount },
      "Order created successfully"
    );
    res.json(order);
  } catch (error) {
    logger.error(
      { err: error, body: req.body },
      "Failed to create order"
    );
    res.status(500).json({ error: "Failed to create order" });
  }
});
```
The `redact` configuration is essential — it automatically censors sensitive fields so passwords and tokens never appear in your logs. In development, pino-pretty gives you readable colored output. In production, raw JSON streams into your log aggregation system (CloudWatch, Datadog, ELK) where it's fully searchable.
The key discipline is always logging with context: include the userId, orderId, or whatever identifiers let you trace a request through your system. When you're debugging a production issue at midnight, "Payment processing failed" is useless. "Payment processing failed for orderId=456, userId=123, provider=stripe" tells you exactly where to look.
The bottom line
Monitoring and alerting are not a luxury for companies with SRE teams. They're a baseline requirement for any service running in production. The stack we've walked through — Prometheus, Grafana, structured logging, and alert rules — takes an afternoon to set up and costs nothing to run.
Set it up before your first incident, not after. The best time to install a smoke detector is before the fire.