Most production AI applications never touch a GPU. They call LLM APIs from OpenAI or Anthropic, store embeddings in a vector database, and serve results through a REST endpoint. Deploying them should be just as simple: you need a platform that handles the infrastructure so you can focus on the product.
This guide walks through building and deploying a complete AI application: a FastAPI service with LLM-powered chat, embedding storage in PostgreSQL with pgvector, and semantic search. You will go from code to production in under 10 minutes.
AI Application Architecture Patterns
Before writing code, it helps to understand the two main deployment patterns for AI applications.
Pattern 1: API-calling applications. Your application sends requests to hosted LLM providers (OpenAI, Anthropic, Cohere) and processes the responses. The LLM runs on the provider's infrastructure. Your server handles request routing, context assembly, prompt management, and result storage. This is how the vast majority of production AI applications work today.
Pattern 2: Self-hosted model inference. You run the model weights directly on your own hardware. This requires GPU instances, significant memory, and careful optimization. This pattern makes sense for fine-tuned models, strict data residency requirements, or extreme latency sensitivity.
This guide covers Pattern 1. It is the right choice for most teams building AI products. The architecture looks like this:
```
Client Request
      |
      v
FastAPI Server (Out Plane)
      |
      +--> OpenAI/Anthropic API (chat completion)
      |
      +--> PostgreSQL + pgvector (Out Plane Managed DB)
              |
              +--> Store embeddings
              +--> Semantic similarity search
```
Your FastAPI server acts as the orchestration layer. It receives user queries, retrieves relevant context from pgvector, sends augmented prompts to the LLM API, and returns structured responses. This is the core of Retrieval Augmented Generation (RAG).
What You'll Need
Before starting, make sure you have:
- Python 3.11+ installed on your machine
- A GitHub account
- An OpenAI API key (or Anthropic API key)
- Basic familiarity with FastAPI
No GPU required. No CUDA setup. No model weight downloads.
Building the AI Application
Here is the complete application. Create a project directory and add these files.
Create main.py:
```python
import json
import os
import time
from contextlib import asynccontextmanager

import asyncpg
import httpx
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

# --- Configuration ---
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
DATABASE_URL = os.environ.get("DATABASE_URL")
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o"
EMBEDDING_DIMENSIONS = 1536

# --- Database setup ---
pool: asyncpg.Pool | None = None

async def _init_connection(conn: asyncpg.Connection):
    # Encode/decode jsonb as Python dicts so the endpoints can pass
    # metadata dicts directly instead of JSON strings.
    await conn.set_type_codec(
        "jsonb", encoder=json.dumps, decoder=json.loads, schema="pg_catalog"
    )

async def init_db():
    global pool
    pool = await asyncpg.create_pool(
        DATABASE_URL, min_size=2, max_size=10, init=_init_connection
    )
    async with pool.acquire() as conn:
        await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        await conn.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(1536),
                metadata JSONB DEFAULT '{}',
                created_at TIMESTAMPTZ DEFAULT NOW()
            )
        """)
        await conn.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)

@asynccontextmanager
async def lifespan(app: FastAPI):
    await init_db()
    yield
    if pool:
        await pool.close()

# --- App ---
app = FastAPI(
    title="AI Application",
    description="RAG-powered AI service with pgvector",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# --- Request/Response models ---
class ChatRequest(BaseModel):
    message: str
    use_rag: bool = True

class ChatResponse(BaseModel):
    reply: str
    sources: list[str]
    latency_ms: float

class EmbedRequest(BaseModel):
    content: str
    metadata: dict = {}

class EmbedResponse(BaseModel):
    id: int
    dimensions: int

class SearchRequest(BaseModel):
    query: str
    limit: int = 5

class SearchResult(BaseModel):
    content: str
    similarity: float
    metadata: dict

# --- OpenAI helpers ---
async def get_embedding(text: str) -> list[float]:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            json={"input": text, "model": EMBEDDING_MODEL},
            timeout=30.0,
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

async def chat_completion(messages: list[dict]) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            json={"model": CHAT_MODEL, "messages": messages, "temperature": 0.7},
            timeout=60.0,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

# --- Endpoints ---
@app.get("/health")
async def health():
    return {"status": "healthy", "database": pool is not None}

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    start = time.monotonic()
    sources = []
    if request.use_rag and pool:
        # Retrieve relevant context from pgvector
        query_embedding = await get_embedding(request.message)
        async with pool.acquire() as conn:
            rows = await conn.fetch(
                """
                SELECT content, 1 - (embedding <=> $1::vector) AS similarity
                FROM documents
                WHERE 1 - (embedding <=> $1::vector) > 0.7
                ORDER BY embedding <=> $1::vector
                LIMIT 3
                """,
                str(query_embedding),
            )
        sources = [row["content"][:200] for row in rows]
        # Build augmented prompt
        context = "\n\n".join(row["content"] for row in rows)
        messages = [
            {
                "role": "system",
                "content": (
                    "Answer the user's question using the provided context. "
                    "If the context doesn't contain relevant information, "
                    "say so and answer from your general knowledge.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": request.message},
        ]
    else:
        messages = [{"role": "user", "content": request.message}]
    reply = await chat_completion(messages)
    latency = (time.monotonic() - start) * 1000
    return ChatResponse(reply=reply, sources=sources, latency_ms=round(latency, 2))

@app.post("/embed", response_model=EmbedResponse)
async def embed(request: EmbedRequest):
    if not pool:
        raise HTTPException(status_code=503, detail="Database not connected")
    embedding = await get_embedding(request.content)
    async with pool.acquire() as conn:
        doc_id = await conn.fetchval(
            """
            INSERT INTO documents (content, embedding, metadata)
            VALUES ($1, $2::vector, $3::jsonb)
            RETURNING id
            """,
            request.content,
            str(embedding),
            request.metadata,
        )
    return EmbedResponse(id=doc_id, dimensions=len(embedding))

@app.post("/search", response_model=list[SearchResult])
async def search(request: SearchRequest):
    if not pool:
        raise HTTPException(status_code=503, detail="Database not connected")
    query_embedding = await get_embedding(request.query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """
            SELECT content, metadata,
                   1 - (embedding <=> $1::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> $1::vector
            LIMIT $2
            """,
            str(query_embedding),
            request.limit,
        )
    return [
        SearchResult(
            content=row["content"],
            similarity=round(float(row["similarity"]), 4),
            metadata=dict(row["metadata"]) if row["metadata"] else {},
        )
        for row in rows
    ]
```

This gives you three working endpoints: `/chat` for LLM-powered conversation with optional RAG, `/embed` for storing documents with their vector embeddings, and `/search` for semantic similarity queries against your document store.
Create requirements.txt:
```text
fastapi==0.115.6
uvicorn[standard]==0.32.0
asyncpg==0.30.0
httpx==0.28.1
pydantic==2.10.4
pgvector==0.3.6
```

The Dockerfile
A multi-stage build keeps the production image small. Create Dockerfile:
```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
```

The multi-stage build separates dependency installation from the runtime image. The final image contains only the compiled packages and your application code.
Push these files to a GitHub repository.
Step 1: Set Up Your Database
Your AI application needs PostgreSQL with the pgvector extension for storing and querying embeddings. Out Plane provides managed PostgreSQL with pgvector pre-installed.
- Go to console.outplane.com
- Navigate to Databases in the sidebar
- Click Create Database
- Select PostgreSQL and your preferred region
- Copy the connection URL
The connection URL looks like this:
```text
postgresql://user:password@host:5432/database
```

The pgvector extension is available on all Out Plane managed PostgreSQL instances. The application code above runs `CREATE EXTENSION IF NOT EXISTS vector` on startup, so it activates automatically.
Step 2: Configure Environment Variables
Your AI application needs two environment variables: the OpenAI API key and the database connection URL. In the Out Plane console, environment variables are masked by default, so your API keys stay protected.
You will add these variables during application creation in the next step:
OPENAI_API_KEY=sk-your-openai-api-key
```text
DATABASE_URL=postgresql://user:password@host:5432/database
```

If you use Anthropic instead of OpenAI, swap the API helper functions and set `ANTHROPIC_API_KEY` instead. The deployment process is identical.
Step 3: Deploy Your AI Application
- Go to console.outplane.com
- Sign in with your GitHub account
- Select your AI application repository
- Set the build method to Dockerfile
- Set the port to `8080`
- Add your environment variables (`OPENAI_API_KEY`, `DATABASE_URL`)
- Choose an instance type (start with op-30 for 1 vCPU, 1 GB RAM)
- Click Deploy Application
The build process takes about 60 seconds:
- Queued - Waiting for resources
- Building - Running multi-stage Docker build
- Deploying - Starting FastAPI with Uvicorn workers
- Ready - Your AI application is live
Once deployed, your endpoints are available at:
- `https://your-app.outplane.app/docs` - Interactive API documentation
- `https://your-app.outplane.app/chat` - LLM chat endpoint
- `https://your-app.outplane.app/embed` - Document embedding endpoint
- `https://your-app.outplane.app/search` - Semantic search endpoint
Every subsequent git push triggers an automatic redeployment.
Adding Semantic Search with pgvector
With your application deployed, you can start building a knowledge base. First, embed some documents:
```bash
curl -X POST https://your-app.outplane.app/embed \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Out Plane provides managed PostgreSQL with pgvector support for AI embedding storage. Databases include automatic backups and monitoring.",
    "metadata": {"source": "docs", "topic": "databases"}
  }'
```

Add several documents to build your knowledge base. Then query with natural language:
```bash
curl -X POST https://your-app.outplane.app/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I store vectors in my database?",
    "limit": 3
  }'
```

The response returns documents ranked by cosine similarity to your query:
```json
[
  {
    "content": "Out Plane provides managed PostgreSQL with pgvector support...",
    "similarity": 0.8734,
    "metadata": {"source": "docs", "topic": "databases"}
  }
]
```

This is the foundation of RAG. The `/chat` endpoint already uses this: it retrieves relevant documents before sending the augmented prompt to the LLM, grounding responses in your actual data rather than relying solely on the model's training data.
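The `similarity` score is exactly what the SQL expression `1 - (embedding <=> $1::vector)` computes: pgvector's `<=>` operator returns cosine distance, so subtracting it from 1 gives cosine similarity. A minimal pure-Python sketch of the same quantity:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 for identical
    direction, 0.0 for orthogonal, -1.0 for opposite. Equivalent to
    1 - cosine_distance, which pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction, magnitude ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

Because cosine similarity ignores magnitude, documents are ranked purely by the direction of their embedding vectors, which is what embedding models are trained for.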
Scaling AI Workloads
AI applications have a distinctive scaling challenge. LLM API calls take anywhere from 500 ms to 5 seconds depending on prompt length and response size, so each request stays in flight far longer than a typical web request.
Out Plane's auto-scaling handles this automatically. When concurrent requests increase, new instances spin up to maintain response times. When traffic drops, instances scale back down. You can also configure scale-to-zero for development or low-traffic applications, so you only pay for actual compute time with per-second billing.
For production AI applications, consider these instance configurations:
| Traffic Level | Instance | Workers | Handles |
|---|---|---|---|
| Development | op-20 (0.5 vCPU, 512 MB) | 2 | 5-10 concurrent requests |
| Production | op-30 (1 vCPU, 1 GB) | 4 | 15-25 concurrent requests |
| High Traffic | op-40 (2 vCPU, 2 GB) | 8 | 50-80 concurrent requests |
Because the endpoints are async and use httpx's async client, each Uvicorn worker can hold many LLM requests in flight concurrently; the bottleneck is the LLM provider's response time and rate limits, not local CPU. Scale horizontally by increasing instance count through auto-scaling rather than vertically increasing instance size.
What About Self-Hosted Models?
A common question when deciding how to deploy an LLM app: should you self-host the model?
Use API providers (OpenAI, Anthropic) when:
- You want to start fast and iterate on your product
- You need access to frontier models (GPT-4o, Claude, etc.)
- Your team doesn't have ML infrastructure experience
- Data privacy requirements are met by the provider's terms
Consider self-hosted models when:
- You have fine-tuned a model for your specific domain
- Regulatory requirements prohibit external API calls
- You need sub-50ms inference latency
- You are processing millions of requests per day and cost is a primary factor
Self-hosted inference requires GPU instances, which is a different deployment story. Out Plane focuses on CPU-based workloads, which is exactly what API-calling AI applications need. The compute-intensive work (model inference) happens on the LLM provider's GPU clusters. Your application handles the lightweight but critical orchestration: prompt assembly, context retrieval, response processing, and user management.
For most teams building AI products, the API-calling pattern is the right starting point. Ship your product first. Optimize the inference layer later if the economics demand it.
Monitoring AI Applications
AI applications have specific monitoring needs beyond standard web metrics. After deploying, track these in the Out Plane dashboard:
Response Latency. LLM API calls introduce variable latency. Monitor your P50 and P95 response times in the metrics dashboard. If P95 exceeds your SLA, consider caching frequent queries or reducing prompt length.
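If you also want to compute percentiles yourself from logged latencies (rather than reading them off the dashboard), a nearest-rank sketch is enough; the sample numbers below are illustrative:

```python
import math

def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are <= it."""
    ranked = sorted(latencies_ms)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

# Latencies (ms) with a long tail from slow LLM responses
latencies = [820, 900, 950, 980, 1050, 1100, 1200, 1300, 4800, 5100]
print(percentile(latencies, 50))  # 1050 — the typical request
print(percentile(latencies, 95))  # 5100 — the tail dominated by slow LLM calls
```

The gap between P50 and P95 is the signature of LLM-backed services: median requests are fine while the tail is dominated by long generations.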
Error Rates. LLM providers have rate limits and occasional outages. Your application should handle these gracefully. The httpx client in our code uses timeouts, but you should also monitor 429 (rate limit) and 5xx responses in your application logs.
Token Usage. Track token consumption at the application level by logging the usage field from OpenAI responses. This data helps you forecast costs and optimize prompts.
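A minimal sketch of that logging: OpenAI chat completion responses carry a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`, which you can accumulate per process (the response fragments below are illustrative):

```python
from collections import Counter

token_totals = Counter()

def record_usage(response_json: dict) -> None:
    """Accumulate the usage block that OpenAI chat completion
    responses include alongside the choices."""
    usage = response_json.get("usage", {})
    for field in ("prompt_tokens", "completion_tokens", "total_tokens"):
        token_totals[field] += usage.get(field, 0)

# Fragments shaped like the real API's usage field
record_usage({"usage": {"prompt_tokens": 412, "completion_tokens": 88, "total_tokens": 500}})
record_usage({"usage": {"prompt_tokens": 300, "completion_tokens": 150, "total_tokens": 450}})
print(dict(token_totals))  # {'prompt_tokens': 712, 'completion_tokens': 238, 'total_tokens': 950}
```

You would call `record_usage` wherever `chat_completion` parses the response, and periodically log or export the totals.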
Add structured logging to your FastAPI application for better observability:
```python
import json
import logging

logger = logging.getLogger("ai_app")

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    start = time.monotonic()
    # ... existing code ...
    latency = (time.monotonic() - start) * 1000
    logger.info(json.dumps({
        "endpoint": "/chat",
        "latency_ms": round(latency, 2),
        "rag_enabled": request.use_rag,
        "sources_found": len(sources),
    }))
    return ChatResponse(reply=reply, sources=sources, latency_ms=round(latency, 2))
```

These logs appear in real time in the Out Plane logs viewer, giving you immediate visibility into your AI application's behavior.
Production Hardening
Before taking your AI application to production traffic, add these safeguards:
Rate limiting. Protect your LLM API budget with request limits per user or API key. Use FastAPI middleware or a library like slowapi.
Input validation. LLM prompts should be bounded. Set maximum character limits on user input to control token usage and prevent prompt injection:
```python
from pydantic import BaseModel, field_validator

class ChatRequest(BaseModel):
    message: str
    use_rag: bool = True

    @field_validator("message")
    @classmethod
    def validate_message_length(cls, v: str) -> str:
        if len(v) > 4000:
            raise ValueError("Message must be under 4000 characters")
        return v
```

Caching. Identical queries produce identical embeddings. Cache embedding results for repeated content to reduce API calls and latency. A simple in-memory cache or Redis works well.
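A minimal sketch of such a cache, keyed by a hash of the input text. The `EmbeddingCache` class and its method names are our own illustration, wrapping whatever embedding function you already have (such as the `get_embedding` helper above); swap the dict for Redis when running multiple instances:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by a SHA-256 hash of the text."""

    def __init__(self):
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    async def get_or_compute(self, text: str, compute) -> list[float]:
        # `compute` is any async function text -> embedding vector
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = await compute(text)
        self._store[key] = embedding
        return embedding

# Usage inside an endpoint:
#   cache = EmbeddingCache()
#   embedding = await cache.get_or_compute(request.message, get_embedding)
```

Repeated queries then skip the embeddings API call entirely, which cuts both latency and cost for hot queries.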
Custom domains. For production AI APIs, set up a custom domain through the Out Plane console. Navigate to Domains, click Map Domain, and add your DNS records. SSL certificates are provisioned automatically.
Next Steps
Your AI application is deployed and serving production traffic. Here is where to go from here:
- Build your knowledge base: Ingest documents through the `/embed` endpoint to improve RAG quality
- Add authentication: Protect your endpoints with API keys or OAuth
- Explore the architecture: Read about cloud native patterns for small teams to design your system for growth
- Deploy supporting services: Add a frontend or worker service using the same Docker-based deployment workflow
- Review your infrastructure costs: Out Plane's per-second billing means you pay only for what you use
Summary
Deploying an AI application to production does not require GPU servers, complex ML pipelines, or weeks of infrastructure setup. Most AI applications call LLM APIs and store embeddings. The deployment process is straightforward:
- Build a FastAPI application with LLM API integration and pgvector storage
- Set up managed PostgreSQL with the pgvector extension
- Configure environment variables for API keys and database connection
- Deploy via GitHub push with automatic builds and SSL
The entire process takes under 10 minutes. Auto-scaling handles traffic spikes from variable LLM response times, and per-second billing keeps costs predictable.
Ready to deploy your AI application? Get started with Out Plane and receive $20 in free credit.