Most production AI applications never touch a GPU. They call LLM APIs from OpenAI or Anthropic, store embeddings in a vector database, and serve results through a REST endpoint. Deploying them should be just as simple: you need a platform that handles the infrastructure so you can focus on the product.
This guide walks through building and deploying a complete AI application: a FastAPI service with LLM-powered chat, embedding storage in PostgreSQL with pgvector, and semantic search. You will go from code to production in under 10 minutes.
AI Application Architecture Patterns
Before writing code, it helps to understand the two main deployment patterns for AI applications.
Pattern 1: API-calling applications. Your application sends requests to hosted LLM providers (OpenAI, Anthropic, Cohere) and processes the responses. The LLM runs on the provider's infrastructure. Your server handles request routing, context assembly, prompt management, and result storage. This is how the vast majority of production AI applications work today.
Pattern 2: Self-hosted model inference. You run the model weights directly on your own hardware. This requires GPU instances, significant memory, and careful optimization. This pattern makes sense for fine-tuned models, strict data residency requirements, or extreme latency sensitivity.
This guide covers Pattern 1. It is the right choice for most teams building AI products. The architecture looks like this:
```
Client Request
      |
      v
FastAPI Server (Out Plane)
      |
      +--> OpenAI/Anthropic API (chat completion)
      |
      +--> PostgreSQL + pgvector (Out Plane Managed DB)
              |
              +--> Store embeddings
              +--> Semantic similarity search
```
Your FastAPI server acts as the orchestration layer. It receives user queries, retrieves relevant context from pgvector, sends augmented prompts to the LLM API, and returns structured responses. This is the core of Retrieval Augmented Generation (RAG).
What You'll Need
Before starting, make sure you have:
- Python 3.11+ installed on your machine
- A GitHub account
- An OpenAI API key (or Anthropic API key)
- Basic familiarity with FastAPI
No GPU required. No CUDA setup. No model weight downloads.
Building the AI Application
Here is the complete application. Create a project directory and add these files.
Create main.py:
```python
import json
import os
import time
from contextlib import asynccontextmanager

import asyncpg
import httpx
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

# --- Configuration ---
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
DATABASE_URL = os.environ.get("DATABASE_URL")
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o"
EMBEDDING_DIMENSIONS = 1536

# --- Database setup ---
pool: asyncpg.Pool | None = None

async def _init_connection(conn: asyncpg.Connection):
    # Encode/decode jsonb as Python dicts so the endpoints can pass
    # metadata dicts directly instead of JSON strings.
    await conn.set_type_codec(
        "jsonb", encoder=json.dumps, decoder=json.loads, schema="pg_catalog"
    )

async def init_db():
    global pool
    pool = await asyncpg.create_pool(
        DATABASE_URL, min_size=2, max_size=10, init=_init_connection
    )
    async with pool.acquire() as conn:
        await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        await conn.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(1536),
                metadata JSONB DEFAULT '{}',
                created_at TIMESTAMPTZ DEFAULT NOW()
            )
        """)
        await conn.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)

@asynccontextmanager
async def lifespan(app: FastAPI):
    await init_db()
    yield
    if pool:
        await pool.close()

# --- App ---
app = FastAPI(
    title="AI Application",
    description="RAG-powered AI service with pgvector",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# --- Request/Response models ---
class ChatRequest(BaseModel):
    message: str
    use_rag: bool = True

class ChatResponse(BaseModel):
    reply: str
    sources: list[str]
    latency_ms: float

class EmbedRequest(BaseModel):
    content: str
    metadata: dict = {}

class EmbedResponse(BaseModel):
    id: int
    dimensions: int

class SearchRequest(BaseModel):
    query: str
    limit: int = 5

class SearchResult(BaseModel):
    content: str
    similarity: float
    metadata: dict

# --- OpenAI helpers ---
async def get_embedding(text: str) -> list[float]:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            json={"input": text, "model": EMBEDDING_MODEL},
            timeout=30.0,
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

async def chat_completion(messages: list[dict]) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            json={"model": CHAT_MODEL, "messages": messages, "temperature": 0.7},
            timeout=60.0,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

# --- Endpoints ---
@app.get("/health")
async def health():
    return {"status": "healthy", "database": pool is not None}

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    start = time.monotonic()
    sources = []
    if request.use_rag and pool:
        # Retrieve relevant context from pgvector
        query_embedding = await get_embedding(request.message)
        async with pool.acquire() as conn:
            rows = await conn.fetch(
                """
                SELECT content, 1 - (embedding <=> $1::vector) AS similarity
                FROM documents
                WHERE 1 - (embedding <=> $1::vector) > 0.7
                ORDER BY embedding <=> $1::vector
                LIMIT 3
                """,
                str(query_embedding),
            )
        sources = [row["content"][:200] for row in rows]
        # Build augmented prompt
        context = "\n\n".join(row["content"] for row in rows)
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using the provided context. "
                    "If the context doesn't contain relevant information, "
                    "say so and answer from your general knowledge.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": request.message},
        ]
    else:
        messages = [{"role": "user", "content": request.message}]
    reply = await chat_completion(messages)
    latency = (time.monotonic() - start) * 1000
    return ChatResponse(reply=reply, sources=sources, latency_ms=round(latency, 2))

@app.post("/embed", response_model=EmbedResponse)
async def embed(request: EmbedRequest):
    if not pool:
        raise HTTPException(status_code=503, detail="Database not connected")
    embedding = await get_embedding(request.content)
    async with pool.acquire() as conn:
        doc_id = await conn.fetchval(
            """
            INSERT INTO documents (content, embedding, metadata)
            VALUES ($1, $2::vector, $3::jsonb)
            RETURNING id
            """,
            request.content,
            str(embedding),
            request.metadata,
        )
    return EmbedResponse(id=doc_id, dimensions=len(embedding))

@app.post("/search", response_model=list[SearchResult])
async def search(request: SearchRequest):
    if not pool:
        raise HTTPException(status_code=503, detail="Database not connected")
    query_embedding = await get_embedding(request.query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """
            SELECT content, metadata,
                   1 - (embedding <=> $1::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> $1::vector
            LIMIT $2
            """,
            str(query_embedding),
            request.limit,
        )
    return [
        SearchResult(
            content=row["content"],
            similarity=round(float(row["similarity"]), 4),
            metadata=dict(row["metadata"]) if row["metadata"] else {},
        )
        for row in rows
    ]
```

This gives you three working endpoints: `/chat` for LLM-powered conversation with optional RAG, `/embed` for storing documents with their vector embeddings, and `/search` for semantic similarity queries against your document store.
Create requirements.txt:
```text
fastapi==0.115.6
uvicorn[standard]==0.32.0
asyncpg==0.30.0
httpx==0.28.1
pydantic==2.10.4
pgvector==0.3.6
```

The Dockerfile
A multi-stage build keeps the production image small. Create Dockerfile:
```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
```

The multi-stage build separates dependency installation from the runtime image. The final image contains only the compiled packages and your application code.
Push these files to a GitHub repository.
Step 1: Set Up Your Database
Your AI application needs PostgreSQL with the pgvector extension for storing and querying embeddings. Out Plane provides managed PostgreSQL with pgvector pre-installed.
- Go to console.outplane.com
- Navigate to Databases in the sidebar
- Click Create Database
- Select PostgreSQL and your preferred region
- Copy the connection URL
The connection URL looks like this:
```text
postgresql://user:password@host:5432/database
```

The pgvector extension is available on all Out Plane managed PostgreSQL instances. The application code above runs `CREATE EXTENSION IF NOT EXISTS vector` on startup, so it activates automatically.
Step 2: Configure Environment Variables
Your AI application needs two environment variables: the OpenAI API key and the database connection URL. In the Out Plane console, environment variables are masked by default, so your API keys stay protected.
You will add these variables during application creation in the next step:
OPENAI_API_KEY=sk-your-openai-api-key
```text
DATABASE_URL=postgresql://user:password@host:5432/database
```

If you use Anthropic instead of OpenAI, swap the API helper functions and set `ANTHROPIC_API_KEY` instead. The deployment process is identical.
Step 3: Deploy Your AI Application
- Go to console.outplane.com
- Sign in with your GitHub account
- Select your AI application repository
- Set the build method to Dockerfile
- Set the port to `8080`
- Add your environment variables (`OPENAI_API_KEY`, `DATABASE_URL`)
- Choose an instance type (start with op-30 for 1 vCPU, 1 GB RAM)
- Click Deploy Application
The build process takes about 60 seconds:
- Queued - Waiting for resources
- Building - Running multi-stage Docker build
- Deploying - Starting FastAPI with Uvicorn workers
- Ready - Your AI application is live
Once deployed, your endpoints are available at:
- `https://your-app.outplane.app/docs` - Interactive API documentation
- `https://your-app.outplane.app/chat` - LLM chat endpoint
- `https://your-app.outplane.app/embed` - Document embedding endpoint
- `https://your-app.outplane.app/search` - Semantic search endpoint
Every subsequent git push triggers an automatic redeployment.
Adding Semantic Search with pgvector
With your application deployed, you can start building a knowledge base. First, embed some documents:
```bash
curl -X POST https://your-app.outplane.app/embed \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Out Plane provides managed PostgreSQL with pgvector support for AI embedding storage. Databases include automatic backups and monitoring.",
    "metadata": {"source": "docs", "topic": "databases"}
  }'
```

Add several documents to build your knowledge base. Then query with natural language:
```bash
curl -X POST https://your-app.outplane.app/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I store vectors in my database?",
    "limit": 3
  }'
```

The response returns documents ranked by cosine similarity to your query:
```json
[
  {
    "content": "Out Plane provides managed PostgreSQL with pgvector support...",
    "similarity": 0.8734,
    "metadata": {"source": "docs", "topic": "databases"}
  }
]
```

This is the foundation of RAG. The `/chat` endpoint already uses this: it retrieves relevant documents before sending the augmented prompt to the LLM, grounding responses in your actual data rather than relying solely on the model's training data.
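The `similarity` score is exactly what the SQL expression `1 - (embedding <=> $1::vector)` computes: pgvector's `<=>` operator returns cosine distance, so subtracting it from 1 gives cosine similarity. A minimal pure-Python sketch of the same quantity:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 for identical
    direction, 0.0 for orthogonal, -1.0 for opposite. Equivalent to
    1 - cosine_distance, which pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction, magnitude ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

Because cosine similarity ignores magnitude, documents are ranked purely by the direction of their embedding vectors, which is what embedding models are trained for.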
Scaling AI Workloads
AI applications have a distinctive scaling challenge. LLM API calls take anywhere from 500 ms to 5 seconds depending on prompt length and response size, so each request stays in flight far longer than a typical web request.
Out Plane's auto-scaling handles this automatically. When concurrent requests increase, new instances spin up to maintain response times. When traffic drops, instances scale back down. You can also configure scale-to-zero for development or low-traffic applications, so you only pay for actual compute time with per-second billing.
For production AI applications, consider these instance configurations:
| Traffic Level | Instance | Workers | Handles |
|---|---|---|---|
| Development | op-20 (0.5 vCPU, 512 MB) | 2 | 5-10 concurrent requests |
| Production | op-30 (1 vCPU, 1 GB) | 4 | 15-25 concurrent requests |
| High Traffic | op-40 (2 vCPU, 2 GB) | 8 | 50-80 concurrent requests |
Because the endpoints are async and use httpx's async client, each Uvicorn worker can hold many LLM requests in flight concurrently; the bottleneck is the LLM provider's response time and rate limits, not local CPU. Scale horizontally by increasing instance count through auto-scaling rather than vertically increasing instance size.
What About Self-Hosted Models?
A common question when deciding how to deploy an LLM app: should you self-host the model?
Use API providers (OpenAI, Anthropic) when:
- You want to start fast and iterate on your product
- You need access to frontier models (GPT-4o, Claude, etc.)
- Your team doesn't have ML infrastructure experience
- Data privacy requirements are met by the provider's terms
Consider self-hosted models when:
- You have fine-tuned a model for your specific domain
- Regulatory requirements prohibit external API calls
- You need sub-50ms inference latency
- You are processing millions of requests per day and cost is a primary factor
Self-hosted inference requires GPU instances, which is a different deployment story. Out Plane focuses on CPU-based workloads, which is exactly what API-calling AI applications need. The compute-intensive work (model inference) happens on the LLM provider's GPU clusters. Your application handles the lightweight but critical orchestration: prompt assembly, context retrieval, response processing, and user management.
For most teams building AI products, the API-calling pattern is the right starting point. Ship your product first. Optimize the inference layer later if the economics demand it.
Monitoring AI Applications
AI applications have specific monitoring needs beyond standard web metrics. After deploying, track these in the Out Plane dashboard:
Response Latency. LLM API calls introduce variable latency. Monitor your P50 and P95 response times in the metrics dashboard. If P95 exceeds your SLA, consider caching frequent queries or reducing prompt length.
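If you also want to compute percentiles yourself from logged latencies (rather than reading them off the dashboard), a nearest-rank sketch is enough; the sample numbers below are illustrative:

```python
import math

def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are <= it."""
    ranked = sorted(latencies_ms)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

# Latencies (ms) with a long tail from slow LLM responses
latencies = [820, 900, 950, 980, 1050, 1100, 1200, 1300, 4800, 5100]
print(percentile(latencies, 50))  # 1050 — the typical request
print(percentile(latencies, 95))  # 5100 — the tail dominated by slow LLM calls
```

The gap between P50 and P95 is the signature of LLM-backed services: median requests are fine while the tail is dominated by long generations.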
Error Rates. LLM providers have rate limits and occasional outages. Your application should handle these gracefully. The httpx client in our code uses timeouts, but you should also monitor 429 (rate limit) and 5xx responses in your application logs.
Token Usage. Track token consumption at the application level by logging the usage field from OpenAI responses. This data helps you forecast costs and optimize prompts.
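A minimal sketch of that logging: OpenAI chat completion responses carry a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`, which you can accumulate per process (the response fragments below are illustrative):

```python
from collections import Counter

token_totals = Counter()

def record_usage(response_json: dict) -> None:
    """Accumulate the usage block that OpenAI chat completion
    responses include alongside the choices."""
    usage = response_json.get("usage", {})
    for field in ("prompt_tokens", "completion_tokens", "total_tokens"):
        token_totals[field] += usage.get(field, 0)

# Fragments shaped like the real API's usage field
record_usage({"usage": {"prompt_tokens": 412, "completion_tokens": 88, "total_tokens": 500}})
record_usage({"usage": {"prompt_tokens": 300, "completion_tokens": 150, "total_tokens": 450}})
print(dict(token_totals))  # {'prompt_tokens': 712, 'completion_tokens': 238, 'total_tokens': 950}
```

You would call `record_usage` wherever `chat_completion` parses the response, and periodically log or export the totals.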
Add structured logging to your FastAPI application for better observability:
```python
import json
import logging

logger = logging.getLogger("ai_app")

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    start = time.monotonic()
    # ... existing code ...
    latency = (time.monotonic() - start) * 1000
    logger.info(json.dumps({
        "endpoint": "/chat",
        "latency_ms": round(latency, 2),
        "rag_enabled": request.use_rag,
        "sources_found": len(sources),
    }))
    return ChatResponse(reply=reply, sources=sources, latency_ms=round(latency, 2))
```

These logs appear in real time in the Out Plane logs viewer, giving you immediate visibility into your AI application's behavior.
Production Hardening
Before taking your AI application to production traffic, add these safeguards:
Rate limiting. Protect your LLM API budget with request limits per user or API key. Use FastAPI middleware or a library like slowapi.
Input validation. LLM prompts should be bounded. Set maximum character limits on user input to control token usage and prevent prompt injection:
```python
from pydantic import BaseModel, field_validator

class ChatRequest(BaseModel):
    message: str
    use_rag: bool = True

    @field_validator("message")
    @classmethod
    def validate_message_length(cls, v: str) -> str:
        if len(v) > 4000:
            raise ValueError("Message must be under 4000 characters")
        return v
```

Caching. Identical queries produce identical embeddings. Cache embedding results for repeated content to reduce API calls and latency. A simple in-memory cache or Redis works well.
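A minimal sketch of such a cache, keyed by a hash of the input text. The `EmbeddingCache` class and its method names are our own illustration, wrapping whatever embedding function you already have (such as the `get_embedding` helper above); swap the dict for Redis when running multiple instances:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by a SHA-256 hash of the text."""

    def __init__(self):
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    async def get_or_compute(self, text: str, compute) -> list[float]:
        # `compute` is any async function text -> embedding vector
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = await compute(text)
        self._store[key] = embedding
        return embedding

# Usage inside an endpoint:
#   cache = EmbeddingCache()
#   embedding = await cache.get_or_compute(request.message, get_embedding)
```

Repeated queries then skip the embeddings API call entirely, which cuts both latency and cost for hot queries.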
Custom domains. For production AI APIs, set up a custom domain through the Out Plane console. Navigate to Domains, click Map Domain, and add your DNS records. SSL certificates are provisioned automatically.
Next Steps
Your AI application is deployed and serving production traffic. Here is where to go from here:
- Build your knowledge base: Ingest documents through the `/embed` endpoint to improve RAG quality
- Add authentication: Protect your endpoints with API keys or OAuth
- Explore the architecture: Read about cloud native patterns for small teams to design your system for growth
- Deploy supporting services: Add a frontend or worker service using the same Docker-based deployment workflow
- Review your infrastructure costs: Out Plane's per-second billing means you pay only for what you use
Summary
Deploying an AI application to production does not require GPU servers, complex ML pipelines, or weeks of infrastructure setup. Most AI applications call LLM APIs and store embeddings. The deployment process is straightforward:
- Build a FastAPI application with LLM API integration and pgvector storage
- Set up managed PostgreSQL with the pgvector extension
- Configure environment variables for API keys and database connection
- Deploy via GitHub push with automatic builds and SSL
The entire process takes under 10 minutes. Auto-scaling handles traffic spikes from variable LLM response times, and per-second billing keeps costs predictable.
Ready to deploy your AI application? Get started with Out Plane and receive $20 in free credit.