Most developers add monitoring after their first production incident. A user files a support ticket at 2 AM. You spend three hours grepping through logs on a server that may or may not still be running. You fix the issue, then promise yourself you'll add proper monitoring tomorrow.
Tomorrow rarely comes until the next incident.
The best time to set up application monitoring is before you have paying users. The second best time is right now. This monitoring guide walks through a practical observability stack for web applications — from infrastructure metrics to application-level errors to uptime checks. You can implement it incrementally, starting with what your platform already provides for free.
The Three Pillars of Observability
Observability is the ability to understand the internal state of your application from its external outputs. Without it, production systems are black boxes. You see symptoms but not causes. You react instead of anticipate.
Observability is commonly broken into three pillars:
Metrics answer the question: what is happening? Metrics are numerical measurements over time — CPU usage at 87%, request rate at 340 req/s, error rate at 2.3%. They tell you the current state of your system and let you spot trends before they become problems.
Logs answer the question: why did it happen? Logs are discrete events with context — a user authentication failure at 14:32:07, a database timeout on query ID 4821, a payment webhook received for order 9923. They provide the narrative thread you follow to diagnose specific failures.
Traces answer the question: how did it happen? Traces follow a single request as it flows through multiple services — from the API gateway through the authentication service to the database and back. They reveal latency at each step and make distributed systems debuggable.
You don't need all three on day one. A well-run monitoring setup often starts with metrics and logs, then adds traces once distributed systems make single-service debugging insufficient.
The sections below follow a five-level progression. Start at Level 1. Each level adds meaningful production visibility without requiring the previous level to be perfect.
Level 1: Platform Metrics (Built-In)
Before you write a single line of instrumentation code, your deployment platform provides a substantial foundation for production monitoring. On Out Plane, this visibility is automatic — no configuration required.
Infrastructure Metrics
The Out Plane metrics dashboard captures the following for every application:
- CPU usage per pod: Percentage of allocated CPU consumed by each running instance. Spikes here indicate compute-bound workloads or unexpected processing loops.
- Memory usage (working set): Active memory consumption in bytes. A steadily rising line that never drops is a memory leak. A flat line near the ceiling means you need more instances or a larger instance size.
- Network ingress and egress: Bytes per second flowing in and out of your application. Unusual spikes in ingress can indicate traffic attacks or webhook floods. Egress spikes may reveal unintended data exports or logging loops.
- Disk read/write operations: I/O activity across storage. High disk reads with slow response times often point to missing database indexes or cache misses.
The dashboard retains 7 days of data and supports multi-app comparison — useful when you're debugging whether an issue is isolated to one service or affecting the entire platform.
HTTP Request Logs
Every incoming HTTP request is logged automatically. This means you have a complete audit trail from the moment you deploy, covering:
- Status code distribution: Filter by 2xx (success), 3xx (redirects), 4xx (client errors), and 5xx (server errors). A rising 5xx rate is the most direct signal that something is broken.
- Response time per request: Per-request latency lets you identify slow endpoints without APM tools. Sort by response time to find your worst offenders.
- Method and path filtering: Narrow down to specific routes. When a user reports a problem submitting a form, filter to POST requests on that path and see exactly what happened.
HTTP logs give you the ability to answer "did this request succeed?" without touching application code. That's valuable during incidents when you need to move fast.
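To make the status-code signal concrete, here is a hypothetical `status_class_rates` helper (not part of the platform; a sketch over log entries you have already parsed into dicts) that collapses a window of requests into per-class shares:

```python
from collections import Counter

def status_class_rates(entries):
    """Collapse parsed request-log entries into per-class shares (2xx..5xx)."""
    if not entries:
        return {}
    # 200-299 -> "2xx", 500-599 -> "5xx", and so on
    counts = Counter(f"{entry['status'] // 100}xx" for entry in entries)
    total = len(entries)
    return {cls: count / total for cls, count in counts.items()}
```

Watching `rates.get("5xx", 0)` climb over successive windows is exactly the "rising 5xx rate" signal described above.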
Runtime Logs
Application stdout and stderr are captured and streamed in real time. The interface provides:
- Severity filtering: Filter by Error, Warning, Info, Debug, and Trace levels. During incidents, filter to Error and Warning to cut noise.
- Source filtering: Separate logs by application component or deployment.
- Time range selection: View Live (streaming), last 1 hour, 6 hours, 24 hours, or 7 days. Live mode with pause/resume is useful during active debugging sessions.
Runtime logs are your first line of defense. Even before you implement structured logging, every console.error(), print(), or log.Fatal() call lands here immediately.
Level 2: Application Logging (Your Code)
Platform logs capture what happened. Application logs capture why it happened. The difference is context. A platform log says a 500 error occurred at 14:32:07. A well-written application log says which user triggered it, which function failed, and what the input values were.
Structured Logging
Plain text logs are difficult to query at scale. Structured logging formats log output as JSON, making every field filterable and parseable by log management tools.
A minimal structured log entry should include:
- timestamp: ISO 8601 format for unambiguous time comparison
- level: severity classification (error, warn, info, debug)
- message: human-readable description of the event
- context fields: application-specific data relevant to the event
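The fields above can be produced with nothing but the standard library. This is an illustrative formatter (it assumes context fields arrive on a `context` attribute of the log record); the dedicated libraries below do the same thing more robustly:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge in application-specific context fields, if any were attached
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter())` on a `StreamHandler` and every log line becomes filterable JSON.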
Node.js with Winston:

```javascript
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Log a business event with context
logger.info('User created', {
  userId: user.id,
  email: user.email,
  plan: user.plan,
});

// Log an error with stack trace
logger.error('Payment processing failed', {
  orderId: order.id,
  amount: order.total,
  error: err.message,
  stack: err.stack,
});
```

Python with structlog:

```python
import structlog

logger = structlog.get_logger()

# Bind context that applies to all subsequent log calls
log = logger.bind(request_id=request_id, user_id=user_id)

log.info("user_created", email=user.email, plan=user.plan)
log.error("payment_failed", order_id=order.id, error=str(e))
```

Go with zerolog:

```go
import "github.com/rs/zerolog/log"

log.Info().
	Str("user_id", user.ID).
	Str("email", user.Email).
	Msg("user created")

log.Error().
	Err(err).
	Str("order_id", order.ID).
	Float64("amount", order.Total).
	Msg("payment processing failed")
```

All three produce machine-readable JSON output. That output lands in your platform's runtime logs and can be forwarded to any log management service.
What to Log
Focus on events that matter for debugging and auditing:
- Request and response metadata: Method, path, status code, response time, user ID. Not the full request body.
- Authentication events: Login attempts (success and failure), logout, token refresh, password changes.
- Business-critical actions: Payments processed, subscriptions created or cancelled, user signups, email sends.
- Error details: Exception message, stack trace, relevant request context. Log at ERROR level so they're easy to find.
- Performance-sensitive operations: Database queries over 100ms, external API calls, file operations. Log duration with context.
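The last item, logging operations that exceed a duration threshold, can be sketched as a small context manager. This assumes your logger accepts an `extra` mapping as the standard library's does; the helper name is illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def log_if_slow(logger, operation, threshold_ms=100):
    """Log a structured warning when the wrapped operation exceeds the threshold."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms >= threshold_ms:
            # One warning per slow operation, with name and duration as context
            logger.warning("slow_operation", extra={"context": {
                "operation": operation,
                "duration_ms": round(elapsed_ms, 1),
            }})
```

Usage: `with log_if_slow(logger, "orders_query"): run_query()`. Fast operations stay silent; only the outliers reach your logs.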
What NOT to Log
Some data should never appear in logs, regardless of how useful it seems during debugging:
- Personal data: Names, email addresses, IP addresses (beyond truncated versions), and any data covered by GDPR or similar regulations. Log user IDs instead — you can look up the associated data separately if needed.
- Credentials: Passwords, API keys, OAuth tokens, session identifiers. A compromised log file should not be a compromised system.
- Full request bodies: Most request bodies contain sensitive data. Log field names and lengths instead of values.
- High-volume debug logs in production: Debug logging at high request rates produces gigabytes of logs per day. Use the `LOG_LEVEL` environment variable to control verbosity per environment.
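One way to enforce the rules above is to scrub payloads before they reach the logger. This is a minimal sketch with an assumed key list; production redaction usually also handles nested structures and pattern-based detection:

```python
# Assumed set of sensitive keys; extend to match your application's payloads
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "email"}

def scrub(payload):
    """Replace sensitive values with a length hint before logging."""
    return {
        key: f"<redacted:{len(str(value))} chars>"
        if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }
```

Logging `scrub(request_fields)` instead of the raw dict keeps field names and value lengths (useful for debugging) while keeping the values themselves out of your log files.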
Level 3: Error Tracking
Platform logs capture errors when they happen. Error tracking services aggregate, deduplicate, and prioritize those errors so you can focus on what matters.
The critical difference from logs: error tracking tells you "this error has occurred 847 times in the last hour, affecting 34 distinct users" instead of requiring you to count manually.
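The aggregation these services perform amounts to fingerprinting: similar errors are grouped by exception type and code location, then ranked by frequency. A toy version of that grouping (illustrative only; real trackers use stack-trace-based fingerprints) looks like:

```python
from collections import Counter

def group_errors(events):
    """Group error events by fingerprint (type + location), most frequent first."""
    fingerprints = Counter(
        (event["type"], event["module"], event["line"]) for event in events
    )
    # The ordering is the prioritization: the top entry is your biggest issue
    return fingerprints.most_common()
```

Instead of 847 individual log lines, you get one issue with a count of 847, which is the form a human can act on.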
Sentry Integration
Sentry is the most widely used error tracking service for web applications. It captures unhandled exceptions with full stack traces, groups similar errors into issues, and tracks which errors are new versus recurring.
Node.js:

```javascript
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // 10% of requests for performance monitoring
});

// Sentry automatically captures unhandled exceptions.
// For manual capture:
try {
  await processPayment(order);
} catch (err) {
  Sentry.captureException(err, {
    extra: { orderId: order.id, amount: order.total },
  });
  throw err;
}
```

Python:

```python
import os

import sentry_sdk

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ.get("ENVIRONMENT", "production"),
    traces_sample_rate=0.1,
)
```

Set `SENTRY_DSN` as an environment variable in the Out Plane console. Sentry starts capturing exceptions immediately — no code changes beyond initialization.
Alternatives to Sentry
- Bugsnag: Similar feature set, strong mobile SDK support
- Rollbar: Good for teams that want real-time alerting on new error types
- Honeybadger: Simpler interface, lower cost at moderate error volumes
- GlitchTip: Self-hosted, Sentry-compatible API. Deploy it on Out Plane using the GlitchTip template for complete data ownership.
Error tracking is the single highest-return investment in this entire monitoring guide. It converts silent failures into actionable notifications. You find out about errors before users report them.
Level 4: Uptime Monitoring
Uptime monitoring answers one question: is your application reachable right now? Everything else in this guide assumes the application is running. Uptime monitoring catches when it isn't.
External uptime checks run from locations outside your infrastructure every 1 to 5 minutes. When your application fails to respond, you receive an alert within minutes — not after the first user complaint.
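A single probe of that kind is simple enough to sketch with the standard library. The `check_url` helper below is hypothetical and deliberately minimal; hosted monitors add multi-region checks, retries, and alert routing on top of this:

```python
import urllib.error
import urllib.request

def check_url(url, timeout=10.0):
    """One uptime probe: reachable with a 2xx response counts as up."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"up": 200 <= resp.status < 300, "status": resp.status}
    except urllib.error.HTTPError as exc:
        # Server answered with 4xx/5xx; real monitors let you choose which codes fail
        return {"up": False, "status": exc.code}
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, connection refused, or timeout: unreachable
        return {"up": False, "status": None}
```

Run on a schedule from a machine outside your infrastructure, a result of `{"up": False}` is the alert condition.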
Uptime Monitoring Services
- UptimeRobot: Free tier covers 50 monitors at 5-minute intervals. Sufficient for most production applications.
- Better Uptime: More detailed incident management, status pages, and on-call scheduling. Appropriate for teams with SLAs.
- Pingdom: Enterprise-focused with global check locations and detailed performance analytics.
Health Check Endpoint
Uptime monitors need a URL to check. A basic HTTP 200 response confirms your server is running. A health check endpoint that verifies critical dependencies confirms your application is actually functional.
```javascript
// Express.js
app.get('/health', async (req, res) => {
  try {
    // Verify database connectivity
    await db.query('SELECT 1');
    res.status(200).json({
      status: 'ok',
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
    });
  } catch (error) {
    // Return 503 so uptime monitors treat this as a failure
    res.status(503).json({
      status: 'error',
      message: 'Database unreachable',
      timestamp: new Date().toISOString(),
    });
  }
});
```

```python
# FastAPI
from datetime import datetime, timezone

from fastapi import HTTPException

@app.get("/health")
async def health_check():
    try:
        await database.execute("SELECT 1")
        return {
            "status": "ok",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    except Exception:
        raise HTTPException(
            status_code=503,
            detail={"status": "error", "message": "Database unreachable"},
        )
```

Point your uptime monitor at `/health`. A 503 response triggers an alert. Configure the monitor to check from at least two geographic locations to avoid false positives from regional network issues.
Keep /health lightweight. It should not run expensive queries or trigger side effects. Its purpose is fast confirmation that critical dependencies are reachable.
Level 5: Performance Monitoring (APM)
Application Performance Monitoring captures detailed timing data for every operation your application performs — database queries, external API calls, cache lookups, and rendering. It answers the question: "The endpoint is slow. Which part is slow?"
APM is appropriate when you have performance issues that logs alone can't diagnose. If you're seeing slow response times in your HTTP logs but can't determine whether the bottleneck is the database, an external API, or application code, APM provides that answer.
OpenTelemetry
OpenTelemetry is the vendor-neutral standard for application instrumentation. It provides auto-instrumentation for common frameworks and a consistent API for manual spans. You instrument once and can send data to any compatible backend — Grafana Tempo, Jaeger, Honeycomb, Datadog, or others.
Node.js auto-instrumentation:

```javascript
// tracing.js — load before your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Auto-instrumentation captures HTTP requests, database queries (PostgreSQL, MySQL, MongoDB, Redis), and outbound HTTP calls without manual span creation. For custom operations, wrap them in manual spans:
```javascript
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      const result = await doWork(orderId);
      span.setAttribute('order.status', result.status);
      return result;
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

When to Add APM
}When to Add APM
Add APM when you have performance problems you can't diagnose with HTTP logs and application logging. A p95 latency of 4 seconds with no obvious error trail is the signal. Don't add APM preemptively — it adds overhead and operational complexity. Start with it when you need it.
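Whether you have crossed that line can be read directly from your HTTP logs: a nearest-rank percentile over recent response times is enough to establish the p95. A stdlib sketch (not an APM replacement, just the measurement that tells you whether you need one):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value >= p percent of the sample."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]
```

If `percentile(recent_latencies_ms, 95)` sits in the thousands while your error logs are quiet, that is the "slow but not failing" profile APM is built to explain.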
The Monitoring Stack by Stage
Different application stages need different monitoring depth. This table maps each stage to the appropriate observability setup:
| Stage | What to Monitor | Tools |
|---|---|---|
| MVP / Side Project | Uptime + Platform metrics | UptimeRobot + Out Plane built-in |
| Early Production | + Error tracking + Structured logging | + Sentry + Winston / structlog |
| Growing Product | + APM + Custom metrics | + OpenTelemetry + Grafana |
| Scale | + Distributed tracing + Alerting | + Grafana / Datadog + PagerDuty |
The key principle: each level builds on the previous. Don't skip directly to Datadog for a side project. Don't run a product at $10,000 MRR without error tracking. Match the investment to the risk and scale.
Setting Up Alerts
Monitoring without alerts is incomplete. You can't stare at dashboards all day. Alerts tell you when something needs attention.
Alert on Symptoms, Not Causes
The most effective alerts describe user-visible problems, not internal system states. Examples:
- Error rate above 5%: Users are seeing failures. Investigate immediately.
- p95 response time above 2 seconds: Most users are experiencing slow responses.
- Uptime check failure: Application is unreachable.
- Memory usage above 90% for 10 consecutive minutes: Instance is approaching OOM. Scale or investigate.
Avoid alerting on causes like "CPU usage above 70%." High CPU may be completely normal during a traffic spike. You want to know when users are affected, not when infrastructure is busy.
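The "above a threshold for N consecutive minutes" rule from the list above can be sketched as a tiny evaluator. This is illustrative only; alerting platforms express the same idea declaratively (for example, Grafana's "for" duration on alert rules):

```python
def sustained_breach(samples, threshold, consecutive):
    """True if the window ends with `consecutive` samples above `threshold`."""
    run = 0
    for value in samples:
        # Reset the streak whenever a sample drops back under the threshold
        run = run + 1 if value > threshold else 0
    # Fires only while the breach is still ongoing at the end of the window
    return run >= consecutive
```

Evaluated against per-minute memory readings, `sustained_breach(readings, 90, 10)` is the "above 90% for 10 consecutive minutes" condition, and the streak reset is what keeps one transient spike from paging anyone.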
Avoiding Alert Fatigue
An alert that fires daily becomes background noise. Background noise becomes ignored noise. An ignored alert during a real incident is a missed incident.
Keep alert counts low. Five well-defined alerts that always indicate real problems are more valuable than fifty alerts that fire on any anomaly. Review alert history monthly. If an alert fired without requiring action, either raise its threshold or remove it.
The best alert is one that tells you about a problem before your users do. That's the standard worth measuring against.
Alert Delivery
- Uptime failures: Immediate SMS or phone call. Every minute of downtime is user impact.
- Error rate spikes: Slack or email with 5-minute response expectation during business hours.
- Performance degradation: Slack message. Important, but not emergency-level.
- Capacity warnings: Email. Proactive signals that need attention but not immediate response.
Monitoring Checklist for Production
Use this checklist before launching any application into production. Items are ordered by impact:
- Platform metrics enabled (automatic on Out Plane — verify dashboard is accessible)
- Uptime monitoring configured with at least one external service
- Health check endpoint implemented and tested with a 503 response when the database is down
- Error tracking service configured (Sentry DSN set as environment variable)
- Structured logging implemented in application code
- Alert thresholds defined for error rate, response time, and uptime
- Alert delivery configured and tested (send a test notification)
- Log retention policy confirmed (Out Plane retains 7 days of metrics; confirm your log service retention)
- No sensitive data in logs (audit logging calls for PII, passwords, tokens)
- `LOG_LEVEL` environment variable set per environment (debug in staging, info or warn in production)
Run through this checklist for every new application before it receives production traffic. The cost of setting up monitoring is 2 to 4 hours. The cost of missing a production incident because you had no visibility is measured in user trust and engineering time.
Summary
Application monitoring is not a single tool or a single decision. It's a stack of complementary capabilities that you build incrementally as your application and team mature.
Start with what your platform already provides. On Out Plane, you have infrastructure metrics, HTTP request logs, and runtime log streaming from the moment you deploy — no configuration required. That baseline gives you the signal to diagnose most common issues.
Layer in uptime monitoring and structured logging before your first users arrive. Add error tracking as soon as the application has value worth protecting. Build toward APM and distributed tracing when your scale makes lower-level tools insufficient.
The objective throughout is the same: reduce the time between when a problem occurs and when you know about it. Every level of this monitoring guide moves that time from hours to minutes to seconds.
For related topics, the zero downtime deployment guide covers how to deploy changes without service interruptions, and the Docker deployment guide walks through containerizing applications before deploying to production. If you're running multiple application instances, the horizontal scaling guide for Node.js covers the infrastructure side of growing past a single instance.
Ready to deploy a monitored production application? Out Plane provides built-in metrics, real-time logs, and per-second billing with no infrastructure to manage. Start at console.outplane.com.