The monitoring paradox
Every startup that reaches out to us after a major incident says the same thing: "We knew we should have set up monitoring, but we never got around to it." The problem is that monitoring feels low-priority when everything is working. And by the time you need it, you're debugging blind in the middle of a crisis.
Here's the thing: a basic monitoring and alerting stack takes a single afternoon to set up. You don't need Datadog's enterprise plan or a dedicated SRE team. Open-source tools like Prometheus and Grafana are production-ready, free, and battle-tested at scale far beyond what most startups need.
Let's set it up.
The monitoring stack
We're building with four components:
- Prometheus — collects and stores time-series metrics from your application and infrastructure
- Grafana — visualizes metrics and lets you build dashboards
- prom-client — instruments your Node.js application to expose metrics
- pino — structured logging so you can actually search and analyze your logs
Docker Compose — the monitoring infrastructure
Start by standing up Prometheus and Grafana alongside your application:
```yaml
# docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
Prometheus needs to know what to scrape and where to send alerts:

```yaml
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "app"
    metrics_path: "/metrics"
    scrape_interval: 10s
    static_configs:
      # host.docker.internal resolves out of the box on Docker Desktop; on
      # Linux, add extra_hosts: ["host.docker.internal:host-gateway"] to the
      # prometheus service, or point at your app container's name instead.
      - targets: ["host.docker.internal:3000"]
        labels:
          service: "api"
          environment: "production"
```
Alertmanager routes alerts to Slack, with critical alerts going to a separate channel and repeating more often:

```yaml
# monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "slack-notifications"
  routes:
    - match:
        severity: critical
      receiver: "slack-critical"
      repeat_interval: 5m

receivers:
  - name: "slack-notifications"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
  - name: "slack-critical"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts-critical"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
```
Start it up:
```bash
# Create the directory structure
mkdir -p monitoring/{prometheus,grafana/provisioning/datasources,alertmanager}

# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d

# Verify everything is running
docker compose -f docker-compose.monitoring.yml ps
```
Grafana will be available at http://localhost:3001 and Prometheus at http://localhost:9090. Change the default Grafana password immediately.
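The compose file mounts a `monitoring/grafana/provisioning` directory, but nothing is in it yet. A minimal datasource file (the filename itself is arbitrary) wires Grafana to Prometheus automatically, so you skip the click-through datasource setup — a sketch:

```yaml
# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Container name from the compose file; Grafana reaches it on the
    # compose network, not via localhost.
    url: http://prometheus:9090
    isDefault: true
```

Grafana reads everything under `/etc/grafana/provisioning` at startup, so the datasource appears as soon as the container boots.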
Application metrics with prom-client
Now instrument your Node.js application to expose the metrics Prometheus will scrape:
```bash
npm install prom-client pino pino-http
```
```typescript
// src/metrics.ts
import client from "prom-client";

// Create a custom registry
const register = new client.Registry();

// Add default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({
  register,
  prefix: "app_",
});

// HTTP request duration histogram
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"] as const,
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// HTTP requests total counter
export const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"] as const,
  registers: [register],
});

// Active connections gauge
export const activeConnections = new client.Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [register],
});

// Business metrics
export const userRegistrations = new client.Counter({
  name: "user_registrations_total",
  help: "Total number of user registrations",
  labelNames: ["source"] as const,
  registers: [register],
});

export const paymentProcessed = new client.Counter({
  name: "payments_processed_total",
  help: "Total number of payments processed",
  labelNames: ["status", "provider"] as const,
  registers: [register],
});

export const dbQueryDuration = new client.Histogram({
  name: "db_query_duration_seconds",
  help: "Duration of database queries in seconds",
  labelNames: ["operation", "table"] as const,
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
  registers: [register],
});

export { register };
```
```typescript
// src/middleware/metrics.ts
import type { Request, Response, NextFunction } from "express";
import {
  httpRequestDuration,
  httpRequestsTotal,
  activeConnections,
} from "../metrics";

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  // Skip the metrics endpoint itself to avoid recursion
  if (req.path === "/metrics") {
    next();
    return;
  }

  activeConnections.inc();
  const startTime = process.hrtime.bigint();

  res.on("finish", () => {
    activeConnections.dec();
    const duration = Number(process.hrtime.bigint() - startTime) / 1e9;

    // Normalize route to avoid high-cardinality labels
    const route = normalizeRoute(req.route?.path || req.path);

    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode.toString(),
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestsTotal.inc(labels);
  });

  next();
}

function normalizeRoute(path: string): string {
  // Replace dynamic segments to keep cardinality manageable
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "/:id")
    .replace(/\/\d+/g, "/:id")
    .replace(/\/[A-Za-z0-9_-]{20,}/g, "/:id");
}
```
```typescript
// src/routes/metrics.ts — expose the /metrics endpoint
import { Router } from "express";
import { register } from "../metrics";

const router = Router();

router.get("/metrics", async (_req, res) => {
  try {
    res.set("Content-Type", register.contentType);
    res.end(await register.metrics());
  } catch (error) {
    res.status(500).end(String(error));
  }
});

export default router;
```
The route normalization in the metrics middleware is critical. Without it, endpoints like `/users/12345` and `/users/67890` become separate metric label values, which explodes Prometheus's cardinality and memory usage. Normalizing dynamic segments to `/:id` keeps your metrics clean and your Prometheus server stable.
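To see the normalization in action, here is a standalone sketch using the same regexes as `normalizeRoute` above (the example paths are illustrative):

```typescript
// Standalone copy of the normalization logic, for illustration only.
function normalizeRoute(path: string): string {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "/:id")
    .replace(/\/\d+/g, "/:id")
    .replace(/\/[A-Za-z0-9_-]{20,}/g, "/:id");
}

// Numeric IDs, UUIDs, and long opaque tokens all collapse to /:id
console.log(normalizeRoute("/users/12345"));
// -> /users/:id
console.log(normalizeRoute("/orders/3fa85f64-5717-4562-b3fc-2c963f66afa6/items/7"));
// -> /orders/:id/items/:id
```

Note the order matters: the UUID pattern runs first so its digit groups aren't partially mangled by the numeric rule.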
The business metrics (registrations, payments) are just as important as the technical ones. When your payment success rate drops, you want to know before your revenue dashboard shows a dip.
Alert rules — know when things break
Define the conditions that should wake someone up:
```yaml
# monitoring/prometheus/alert-rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "More than 5% of requests are returning 5xx errors for the last 2 minutes. Current rate: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "95th percentile latency is above 2 seconds for the last 5 minutes. Current p95: {{ $value }}s"

      - alert: HighMemoryUsage
        expr: |
          (app_process_resident_memory_bytes / 1024 / 1024) > 512
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Application memory usage is above 512MB. Current: {{ $value }}MB"

      - alert: DatabaseSlowQueries
        expr: |
          histogram_quantile(0.95,
            sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries detected"
          description: "95th percentile database query duration is above 1 second. Operation: {{ $labels.operation }}"

      - alert: ServiceDown
        expr: up{job="app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application is down"
          description: "The application has been unreachable for more than 1 minute."

      - alert: HighEventLoopLag
        expr: app_nodejs_eventloop_lag_p99_seconds > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High event loop lag"
          description: "Node.js event loop p99 lag is above 500ms, indicating potential CPU saturation."
```
These rules follow the principle of alerting on symptoms, not causes. You alert on "error rate is high" (a symptom), not "CPU is at 80%" (a cause); high CPU might be fine if your error rate and latency are normal. The `for` clause prevents one-off spikes from triggering pages: a 2-minute sustained error rate is a real problem, while a 5-second spike usually isn't.
Health check endpoint
Every service needs a health check that your load balancer, orchestrator, and monitoring system can hit:
```typescript
// src/routes/health.ts
import os from "os";
import { Router } from "express";
import { Pool } from "pg";

const router = Router();

interface HealthStatus {
  status: "healthy" | "degraded" | "unhealthy";
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latencyMs?: number; error?: string };
    memory: { status: string; usedMb: number; totalMb: number };
  };
}

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
});

router.get("/health", async (_req, res) => {
  const health: HealthStatus = {
    status: "healthy",
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: "ok" },
      memory: {
        status: "ok",
        usedMb: Math.round(process.memoryUsage().rss / 1024 / 1024),
        totalMb: Math.round(os.totalmem() / 1024 / 1024),
      },
    },
  };

  // Check database connectivity
  try {
    const start = Date.now();
    await pool.query("SELECT 1");
    health.checks.database.latencyMs = Date.now() - start;
  } catch (error) {
    health.status = "unhealthy";
    health.checks.database = {
      status: "error",
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }

  // Check memory pressure
  const memoryUsage = process.memoryUsage();
  const heapUsedPercent = memoryUsage.heapUsed / memoryUsage.heapTotal;
  if (heapUsedPercent > 0.9) {
    health.status = health.status === "unhealthy" ? "unhealthy" : "degraded";
    health.checks.memory.status = "warning";
  }

  const statusCode = health.status === "unhealthy" ? 503 : 200;
  res.status(statusCode).json(health);
});

// Lightweight liveness probe for orchestrators
router.get("/health/live", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

// Readiness probe — is the service ready to accept traffic?
router.get("/health/ready", async (_req, res) => {
  try {
    await pool.query("SELECT 1");
    res.status(200).json({ status: "ready" });
  } catch {
    res.status(503).json({ status: "not ready" });
  }
});

export default router;
```
Three endpoints serve different purposes. The main `/health` endpoint returns detailed status for dashboards and debugging. The `/health/live` endpoint is a simple liveness check for Kubernetes or ECS; it just confirms the process is running. The `/health/ready` endpoint confirms the service can actually handle requests (database is reachable, etc.). Your load balancer should use the readiness probe to decide whether to route traffic to an instance.
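On Kubernetes, wiring these endpoints into probes looks roughly like this (container name, port, and timings are assumptions to adapt to your deployment):

```yaml
# Deployment spec excerpt — names and ports are illustrative
containers:
  - name: api
    ports:
      - containerPort: 3000
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      periodSeconds: 10
      failureThreshold: 3
```

A failing liveness probe restarts the container; a failing readiness probe just removes it from the service's endpoints until it recovers — which is exactly the split the two endpoints are designed for.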
Structured logging with pino
Structured logs (JSON) are searchable and analyzable. Unstructured logs (`console.log` strings) are not. The difference matters the moment you need to debug a production issue:
```typescript
// src/logger.ts
import pino from "pino";
import pinoHttp from "pino-http";

export const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  transport:
    process.env.NODE_ENV === "development"
      ? { target: "pino-pretty", options: { colorize: true } }
      : undefined,
  base: {
    service: "api",
    environment: process.env.NODE_ENV || "development",
  },
  serializers: {
    err: pino.stdSerializers.err,
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
  },
  redact: {
    paths: [
      "req.headers.authorization",
      "req.headers.cookie",
      "body.password",
      "body.token",
      "body.creditCard",
    ],
    censor: "[REDACTED]",
  },
});

export const httpLogger = pinoHttp({
  logger,
  customLogLevel: (_req, res, error) => {
    if (res.statusCode >= 500 || error) return "error";
    if (res.statusCode >= 400) return "warn";
    return "info";
  },
  customSuccessMessage: (req, res) => {
    return `${req.method} ${req.url} ${res.statusCode}`;
  },
  customErrorMessage: (req, _res, error) => {
    return `${req.method} ${req.url} failed: ${error.message}`;
  },
});

// Usage examples:
// logger.info({ userId: "123", action: "login" }, "User logged in");
// logger.error({ err, orderId: "456" }, "Payment processing failed");
// logger.warn({ queueDepth: 1500 }, "Job queue depth exceeding threshold");
```
```typescript
// src/app.ts — add to your Express app
import express from "express";
import { httpLogger, logger } from "./logger";

const app = express();
app.use(express.json()); // needed so req.body is populated below
app.use(httpLogger);

// Example: structured error logging in a route handler.
// createOrder is your application's own service function.
app.post("/api/orders", async (req, res) => {
  try {
    const order = await createOrder(req.body);
    logger.info(
      { orderId: order.id, amount: order.amount },
      "Order created successfully"
    );
    res.json(order);
  } catch (error) {
    logger.error(
      { err: error, body: req.body },
      "Failed to create order"
    );
    res.status(500).json({ error: "Failed to create order" });
  }
});
```
The `redact` configuration is essential — it automatically censors sensitive fields so passwords and tokens never appear in your logs. In development, pino-pretty gives you readable colored output. In production, raw JSON streams into your log aggregation system (CloudWatch, Datadog, ELK) where it's fully searchable.
The key discipline is always logging with context: include the userId, orderId, or whatever identifiers let you trace a request through your system. When you're debugging a production issue at midnight, "Payment processing failed" is useless. "Payment processing failed for orderId=456, userId=123, provider=stripe" tells you exactly where to look.
The bottom line
Monitoring and alerting are not a luxury for companies with SRE teams. They're a baseline requirement for any service running in production. The stack we've walked through — Prometheus, Grafana, structured logging, and alert rules — takes an afternoon to set up and costs nothing to run.
Set it up before your first incident, not after. The best time to install a smoke detector is before the fire.