Most developers add monitoring after their first production incident. A user files a support ticket at 2 AM. You spend three hours grepping through logs on a server that may or may not still be running. You fix the issue, then promise yourself you'll add proper monitoring tomorrow.
Tomorrow rarely comes until the next incident.
The best time to set up application monitoring is before you have paying users. The second best time is right now. This monitoring guide walks through a practical observability stack for web applications — from infrastructure metrics to application-level errors to uptime checks. You can implement it incrementally, starting with what your platform already provides for free.
The Three Pillars of Observability
Observability is the ability to understand the internal state of your application from its external outputs. Without it, production systems are black boxes. You see symptoms but not causes. You react instead of anticipate.
Observability is commonly broken into three pillars:
Metrics answer the question: what is happening? Metrics are numerical measurements over time — CPU usage at 87%, request rate at 340 req/s, error rate at 2.3%. They tell you the current state of your system and let you spot trends before they become problems.
Logs answer the question: why did it happen? Logs are discrete events with context — a user authentication failure at 14:32:07, a database timeout on query ID 4821, a payment webhook received for order 9923. They provide the narrative thread you follow to diagnose specific failures.
Traces answer the question: how did it happen? Traces follow a single request as it flows through multiple services — from the API gateway through the authentication service to the database and back. They reveal latency at each step and make distributed systems debuggable.
You don't need all three on day one. A well-run monitoring setup often starts with metrics and logs, then adds traces once distributed systems make single-service debugging insufficient.
The sections below follow a five-level progression. Start at Level 1. Each level adds meaningful production visibility without requiring the previous level to be perfect.
Level 1: Platform Metrics (Built-In)
Before you write a single line of instrumentation code, your deployment platform provides a substantial foundation for production monitoring. On Out Plane, this visibility is automatic — no configuration required.
Infrastructure Metrics
The Out Plane metrics dashboard captures the following for every application:
- CPU usage per pod: Percentage of allocated CPU consumed by each running instance. Spikes here indicate compute-bound workloads or unexpected processing loops.
- Memory usage (working set): Active memory consumption in bytes. A steadily rising line that never drops is a memory leak. A flat line near the ceiling means you need more instances or a larger instance size.
- Network ingress and egress: Bytes per second flowing in and out of your application. Unusual spikes in ingress can indicate traffic attacks or webhook floods. Egress spikes may reveal unintended data exports or logging loops.
- Disk read/write operations: I/O activity across storage. High disk reads with slow response times often point to missing database indexes or cache misses.
The dashboard retains 7 days of data and supports multi-app comparison — useful when you're debugging whether an issue is isolated to one service or affecting the entire platform.
HTTP Request Logs
Every incoming HTTP request is logged automatically. This means you have a complete audit trail from the moment you deploy, covering:
- Status code distribution: Filter by 2xx (success), 3xx (redirects), 4xx (client errors), and 5xx (server errors). A rising 5xx rate is the most direct signal that something is broken.
- Response time per request: Per-request latency lets you identify slow endpoints without APM tools. Sort by response time to find your worst offenders.
- Method and path filtering: Narrow down to specific routes. When a user reports a problem submitting a form, filter to POST requests on that path and see exactly what happened.
HTTP logs give you the ability to answer "did this request succeed?" without touching application code. That's valuable during incidents when you need to move fast.
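To make the status-code signal concrete, here is a hypothetical `status_class_rates` helper (not part of the platform; a sketch over log entries you have already parsed into dicts) that collapses a window of requests into per-class shares:

```python
from collections import Counter

def status_class_rates(entries):
    """Collapse parsed request-log entries into per-class shares (2xx..5xx)."""
    if not entries:
        return {}
    # 200-299 -> "2xx", 500-599 -> "5xx", and so on
    counts = Counter(f"{entry['status'] // 100}xx" for entry in entries)
    total = len(entries)
    return {cls: count / total for cls, count in counts.items()}
```

Watching `rates.get("5xx", 0)` climb over successive windows is exactly the "rising 5xx rate" signal described above.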
Runtime Logs
Application stdout and stderr are captured and streamed in real time. The interface provides:
- Severity filtering: Filter by Error, Warning, Info, Debug, and Trace levels. During incidents, filter to Error and Warning to cut noise.
- Source filtering: Separate logs by application component or deployment.
- Time range selection: View Live (streaming), last 1 hour, 6 hours, 24 hours, or 7 days. Live mode with pause/resume is useful during active debugging sessions.
Runtime logs are your first line of defense. Even before you implement structured logging, every console.error(), print(), or log.Fatal() call lands here immediately.
Level 2: Application Logging (Your Code)
Platform logs capture what happened. Application logs capture why it happened. The difference is context. A platform log says a 500 error occurred at 14:32:07. A well-written application log says which user triggered it, which function failed, and what the input values were.
Structured Logging
Plain text logs are difficult to query at scale. Structured logging formats log output as JSON, making every field filterable and parseable by log management tools.
A minimal structured log entry should include:
- timestamp: ISO 8601 format for unambiguous time comparison
- level: severity classification (error, warn, info, debug)
- message: human-readable description of the event
- context fields: application-specific data relevant to the event
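The fields above can be produced with nothing but the standard library. This is an illustrative formatter (it assumes context fields arrive on a `context` attribute of the log record); the dedicated libraries below do the same thing more robustly:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge in application-specific context fields, if any were attached
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter())` on a `StreamHandler` and every log line becomes filterable JSON.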
Node.js with Winston:

```javascript
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Log a business event with context
logger.info('User created', {
  userId: user.id,
  email: user.email,
  plan: user.plan,
});

// Log an error with stack trace
logger.error('Payment processing failed', {
  orderId: order.id,
  amount: order.total,
  error: err.message,
  stack: err.stack,
});
```

Python with structlog:

```python
import structlog

logger = structlog.get_logger()

# Bind context that applies to all subsequent log calls
log = logger.bind(request_id=request_id, user_id=user_id)

log.info("user_created", email=user.email, plan=user.plan)
log.error("payment_failed", order_id=order.id, error=str(e))
```

Go with zerolog:

```go
import "github.com/rs/zerolog/log"

log.Info().
	Str("user_id", user.ID).
	Str("email", user.Email).
	Msg("user created")

log.Error().
	Err(err).
	Str("order_id", order.ID).
	Float64("amount", order.Total).
	Msg("payment processing failed")
```

All three produce machine-readable JSON output. That output lands in your platform's runtime logs and can be forwarded to any log management service.
What to Log
Focus on events that matter for debugging and auditing:
- Request and response metadata: Method, path, status code, response time, user ID. Not the full request body.
- Authentication events: Login attempts (success and failure), logout, token refresh, password changes.
- Business-critical actions: Payments processed, subscriptions created or cancelled, user signups, email sends.
- Error details: Exception message, stack trace, relevant request context. Log at ERROR level so they're easy to find.
- Performance-sensitive operations: Database queries over 100ms, external API calls, file operations. Log duration with context.
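The last item, logging operations that exceed a duration threshold, can be sketched as a small context manager. This assumes your logger accepts an `extra` mapping as the standard library's does; the helper name is illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def log_if_slow(logger, operation, threshold_ms=100):
    """Log a structured warning when the wrapped operation exceeds the threshold."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms >= threshold_ms:
            # One warning per slow operation, with name and duration as context
            logger.warning("slow_operation", extra={"context": {
                "operation": operation,
                "duration_ms": round(elapsed_ms, 1),
            }})
```

Usage: `with log_if_slow(logger, "orders_query"): run_query()`. Fast operations stay silent; only the outliers reach your logs.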
What NOT to Log
Some data should never appear in logs, regardless of how useful it seems during debugging:
- Personal data: Names, email addresses, IP addresses (beyond truncated versions), and any data covered by GDPR or similar regulations. Log user IDs instead — you can look up the associated data separately if needed.
- Credentials: Passwords, API keys, OAuth tokens, session identifiers. A compromised log file should not be a compromised system.
- Full request bodies: Most request bodies contain sensitive data. Log field names and lengths instead of values.
- High-volume debug logs in production: Debug logging at high request rates produces gigabytes of logs per day. Use the `LOG_LEVEL` environment variable to control verbosity per environment.
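One way to enforce the rules above is to scrub payloads before they reach the logger. This is a minimal sketch with an assumed key list; production redaction usually also handles nested structures and pattern-based detection:

```python
# Assumed set of sensitive keys; extend to match your application's payloads
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "email"}

def scrub(payload):
    """Replace sensitive values with a length hint before logging."""
    return {
        key: f"<redacted:{len(str(value))} chars>"
        if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }
```

Logging `scrub(request_fields)` instead of the raw dict keeps field names and value lengths (useful for debugging) while keeping the values themselves out of your log files.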
Level 3: Error Tracking
Platform logs capture errors when they happen. Error tracking services aggregate, deduplicate, and prioritize those errors so you can focus on what matters.
The critical difference from logs: error tracking tells you "this error has occurred 847 times in the last hour, affecting 34 distinct users" instead of requiring you to count manually.
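The aggregation these services perform amounts to fingerprinting: similar errors are grouped by exception type and code location, then ranked by frequency. A toy version of that grouping (illustrative only; real trackers use stack-trace-based fingerprints) looks like:

```python
from collections import Counter

def group_errors(events):
    """Group error events by fingerprint (type + location), most frequent first."""
    fingerprints = Counter(
        (event["type"], event["module"], event["line"]) for event in events
    )
    # The ordering is the prioritization: the top entry is your biggest issue
    return fingerprints.most_common()
```

Instead of 847 individual log lines, you get one issue with a count of 847, which is the form a human can act on.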
Sentry Integration
Sentry is the most widely used error tracking service for web applications. It captures unhandled exceptions with full stack traces, groups similar errors into issues, and tracks which errors are new versus recurring.
Node.js:

```javascript
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // 10% of requests for performance monitoring
});

// Sentry automatically captures unhandled exceptions.
// For manual capture:
try {
  await processPayment(order);
} catch (err) {
  Sentry.captureException(err, {
    extra: { orderId: order.id, amount: order.total },
  });
  throw err;
}
```

Python:

```python
import os

import sentry_sdk

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ.get("ENVIRONMENT", "production"),
    traces_sample_rate=0.1,
)
```

Set `SENTRY_DSN` as an environment variable in the Out Plane console. Sentry starts capturing exceptions immediately — no code changes beyond initialization.
Alternatives to Sentry
- Bugsnag: Similar feature set, strong mobile SDK support
- Rollbar: Good for teams that want real-time alerting on new error types
- Honeybadger: Simpler interface, lower cost at moderate error volumes
- GlitchTip: Self-hosted, Sentry-compatible API. Deploy it on Out Plane using the GlitchTip template for complete data ownership.
Error tracking is the single highest-return investment in this entire monitoring guide. It converts silent failures into actionable notifications. You find out about errors before users report them.
Level 4: Uptime Monitoring
Uptime monitoring answers one question: is your application reachable right now? Everything else in this guide assumes the application is running. Uptime monitoring catches when it isn't.
External uptime checks run from locations outside your infrastructure every 1 to 5 minutes. When your application fails to respond, you receive an alert within minutes — not after the first user complaint.
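A single probe of that kind is simple enough to sketch with the standard library. The `check_url` helper below is hypothetical and deliberately minimal; hosted monitors add multi-region checks, retries, and alert routing on top of this:

```python
import urllib.error
import urllib.request

def check_url(url, timeout=10.0):
    """One uptime probe: reachable with a 2xx response counts as up."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"up": 200 <= resp.status < 300, "status": resp.status}
    except urllib.error.HTTPError as exc:
        # Server answered with 4xx/5xx; real monitors let you choose which codes fail
        return {"up": False, "status": exc.code}
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, connection refused, or timeout: unreachable
        return {"up": False, "status": None}
```

Run on a schedule from a machine outside your infrastructure, a result of `{"up": False}` is the alert condition.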
Uptime Monitoring Services
- UptimeRobot: Free tier covers 50 monitors at 5-minute intervals. Sufficient for most production applications.
- Better Uptime: More detailed incident management, status pages, and on-call scheduling. Appropriate for teams with SLAs.
- Pingdom: Enterprise-focused with global check locations and detailed performance analytics.
Health Check Endpoint
Uptime monitors need a URL to check. A basic HTTP 200 response confirms your server is running. A health check endpoint that verifies critical dependencies confirms your application is actually functional.
```javascript
// Express.js
app.get('/health', async (req, res) => {
  try {
    // Verify database connectivity
    await db.query('SELECT 1');
    res.status(200).json({
      status: 'ok',
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
    });
  } catch (error) {
    // Return 503 so uptime monitors treat this as a failure
    res.status(503).json({
      status: 'error',
      message: 'Database unreachable',
      timestamp: new Date().toISOString(),
    });
  }
});
```

```python
# FastAPI
from datetime import datetime, timezone

from fastapi import HTTPException

@app.get("/health")
async def health_check():
    try:
        await database.execute("SELECT 1")
        return {
            "status": "ok",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    except Exception:
        raise HTTPException(
            status_code=503,
            detail={"status": "error", "message": "Database unreachable"},
        )
```

Point your uptime monitor at `/health`. A 503 response triggers an alert. Configure the monitor to check from at least two geographic locations to avoid false positives from regional network issues.
Keep /health lightweight. It should not run expensive queries or trigger side effects. Its purpose is fast confirmation that critical dependencies are reachable.
Level 5: Performance Monitoring (APM)
Application Performance Monitoring captures detailed timing data for every operation your application performs — database queries, external API calls, cache lookups, and rendering. It answers the question: "The endpoint is slow. Which part is slow?"
APM is appropriate when you have performance issues that logs alone can't diagnose. If you're seeing slow response times in your HTTP logs but can't determine whether the bottleneck is the database, an external API, or application code, APM provides that answer.
OpenTelemetry
OpenTelemetry is the vendor-neutral standard for application instrumentation. It provides auto-instrumentation for common frameworks and a consistent API for manual spans. You instrument once and can send data to any compatible backend — Grafana Tempo, Jaeger, Honeycomb, Datadog, or others.
Node.js auto-instrumentation:

```javascript
// tracing.js — load before your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Auto-instrumentation captures HTTP requests, database queries (PostgreSQL, MySQL, MongoDB, Redis), and outbound HTTP calls without manual span creation. For custom operations, wrap them in manual spans:
```javascript
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      const result = await doWork(orderId);
      span.setAttribute('order.status', result.status);
      return result;
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

When to Add APM
}When to Add APM
Add APM when you have performance problems you can't diagnose with HTTP logs and application logging. A p95 latency of 4 seconds with no obvious error trail is the signal. Don't add APM preemptively — it adds overhead and operational complexity. Start with it when you need it.
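Whether you have crossed that line can be read directly from your HTTP logs: a nearest-rank percentile over recent response times is enough to establish the p95. A stdlib sketch (not an APM replacement, just the measurement that tells you whether you need one):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value >= p percent of the sample."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]
```

If `percentile(recent_latencies_ms, 95)` sits in the thousands while your error logs are quiet, that is the "slow but not failing" profile APM is built to explain.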
The Monitoring Stack by Stage
Different application stages need different monitoring depth. This table maps each stage to the appropriate observability setup:
| Stage | What to Monitor | Tools |
|---|---|---|
| MVP / Side Project | Uptime + Platform metrics | UptimeRobot + Out Plane built-in |
| Early Production | + Error tracking + Structured logging | + Sentry + Winston / structlog |
| Growing Product | + APM + Custom metrics | + OpenTelemetry + Grafana |
| Scale | + Distributed tracing + Alerting | + Grafana / Datadog + PagerDuty |
The key principle: each level builds on the previous. Don't skip directly to Datadog for a side project. Don't run a product at $10,000 MRR without error tracking. Match the investment to the risk and scale.
Setting Up Alerts
Monitoring without alerts is incomplete. You can't stare at dashboards all day. Alerts tell you when something needs attention.
Alert on Symptoms, Not Causes
The most effective alerts describe user-visible problems, not internal system states. Examples:
- Error rate above 5%: Users are seeing failures. Investigate immediately.
- p95 response time above 2 seconds: Most users are experiencing slow responses.
- Uptime check failure: Application is unreachable.
- Memory usage above 90% for 10 consecutive minutes: Instance is approaching OOM. Scale or investigate.
Avoid alerting on causes like "CPU usage above 70%." High CPU may be completely normal during a traffic spike. You want to know when users are affected, not when infrastructure is busy.
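The "above a threshold for N consecutive minutes" rule from the list above can be sketched as a tiny evaluator. This is illustrative only; alerting platforms express the same idea declaratively (for example, Grafana's "for" duration on alert rules):

```python
def sustained_breach(samples, threshold, consecutive):
    """True if the window ends with `consecutive` samples above `threshold`."""
    run = 0
    for value in samples:
        # Reset the streak whenever a sample drops back under the threshold
        run = run + 1 if value > threshold else 0
    # Fires only while the breach is still ongoing at the end of the window
    return run >= consecutive
```

Evaluated against per-minute memory readings, `sustained_breach(readings, 90, 10)` is the "above 90% for 10 consecutive minutes" condition, and the streak reset is what keeps one transient spike from paging anyone.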
Avoiding Alert Fatigue
An alert that fires daily becomes background noise. Background noise becomes ignored noise. An ignored alert during a real incident is a missed incident.
Keep alert counts low. Five well-defined alerts that always indicate real problems are more valuable than fifty alerts that fire on any anomaly. Review alert history monthly. If an alert fired without requiring action, either raise its threshold or remove it.
The best alert is one that tells you about a problem before your users do. That's the standard worth measuring against.
Alert Delivery
- Uptime failures: Immediate SMS or phone call. Every minute of downtime is user impact.
- Error rate spikes: Slack or email with 5-minute response expectation during business hours.
- Performance degradation: Slack message. Important, but not emergency-level.
- Capacity warnings: Email. Proactive signals that need attention but not immediate response.
Monitoring Checklist for Production
Use this checklist before launching any application into production. Items are ordered by impact:
- Platform metrics enabled (automatic on Out Plane — verify dashboard is accessible)
- Uptime monitoring configured with at least one external service
- Health check endpoint implemented and tested with a 503 response when the database is down
- Error tracking service configured (Sentry DSN set as environment variable)
- Structured logging implemented in application code
- Alert thresholds defined for error rate, response time, and uptime
- Alert delivery configured and tested (send a test notification)
- Log retention policy confirmed (Out Plane retains 7 days of metrics; confirm your log service retention)
- No sensitive data in logs (audit logging calls for PII, passwords, tokens)
- `LOG_LEVEL` environment variable set per environment (debug in staging, info or warn in production)
Run through this checklist for every new application before it receives production traffic. The cost of setting up monitoring is 2 to 4 hours. The cost of missing a production incident because you had no visibility is measured in user trust and engineering time.
Summary
Application monitoring is not a single tool or a single decision. It's a stack of complementary capabilities that you build incrementally as your application and team mature.
Start with what your platform already provides. On Out Plane, you have infrastructure metrics, HTTP request logs, and runtime log streaming from the moment you deploy — no configuration required. That baseline gives you the signal to diagnose most common issues.
Layer in uptime monitoring and structured logging before your first users arrive. Add error tracking as soon as the application has value worth protecting. Build toward APM and distributed tracing when your scale makes lower-level tools insufficient.
The objective throughout is the same: reduce the time between when a problem occurs and when you know about it. Every level of this monitoring guide moves that time from hours to minutes to seconds.
For related topics, the zero downtime deployment guide covers how to deploy changes without service interruptions, and the Docker deployment guide walks through containerizing applications before deploying to production. If you're running multiple application instances, the horizontal scaling guide for Node.js covers the infrastructure side of growing past a single instance.
Ready to deploy a monitored production application? Out Plane provides built-in metrics, real-time logs, and per-second billing with no infrastructure to manage. Start at console.outplane.com.