A couple of years ago I was hired as the solo backend engineer to build the production backend for an AI product company from scratch. No legacy code, no existing schema, no team to defer to — just a frontend team waiting on an API and a product team waiting on inference. This post is about the architecture decisions I made, what worked, and what I'd do differently if I were starting today.
The shape of the problem
The product needed five things working together:
A typed API for the frontend
LLM inference with reasonable UX (streaming, cost controls, retries)
Vector search over user-supplied content
Authentication and user management
Subscription billing
Standard stuff individually. The interesting part was keeping the operational footprint small enough that one person could run the whole thing without it falling over.
Why FastAPI + Postgres + pgvector
The temptation when building an AI product is to reach for a Lambda-and-microservices architecture from day one. I went the other direction: one FastAPI service, one Postgres database, pgvector for embeddings inside that same Postgres.
The reasoning:
One database, one source of truth. User data, subscription state, and vector embeddings live in the same Postgres. No sync issues between Postgres and a separate vector DB. Backups are one operation. Migrations are one tool (Alembic).
FastAPI's typing pays for itself. Pydantic schemas become the contract with the frontend. OpenAPI generation is free. Async support means LLM calls and other external API calls don't block the event loop, as long as they go through async clients.
One service is faster to ship and easier to debug. When something breaks at 2am, you don't have to figure out which of seven services owns the bug.
The cost: at very high scale (millions of users, billions of embeddings), pgvector and a monolithic FastAPI service would need rethinking. For an early-stage product, this was the right tradeoff.
LLM integration: the part people underestimate
Connecting to an LLM API is one line of code. Making it production-grade is most of the engineering work. The pieces:
Streaming responses. Users will not wait 15 seconds for a wall of text. That means server-sent events from FastAPI through to the frontend, with backpressure handling so a slow client doesn't leave tokens piling up in server memory.
Token accounting per user. Every request logs prompt tokens, completion tokens, and cost. Without this you have no idea who's burning your budget.
Soft caps + fallback models. Hitting a per-user daily limit doesn't return an error — it switches to a cheaper model and continues. Users get a worse experience, not a broken one.
Retry with backoff that respects rate limits. Upstream providers throttle. Naive retry loops make it worse. Exponential backoff plus reading the Retry-After header is the minimum.
Prompt templates separated from application code. Templates live in their own module, versioned, with the application code only choosing which one to invoke. Changes to wording don't require touching business logic.
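Two of the pieces above, the soft cap with a fallback model and the backoff that honors Retry-After, are small enough to sketch. This is an illustration rather than the code from this project: the model names, the daily cap, and the function names are all hypothetical, and only the delay-in-seconds form of Retry-After is handled.

```python
import random
from typing import Optional

# Hypothetical model names and cap; the post does not specify either.
PRIMARY_MODEL = "primary-model"
FALLBACK_MODEL = "cheaper-model"
DAILY_TOKEN_CAP = 200_000


def choose_model(tokens_used_today: int, cap: int = DAILY_TOKEN_CAP) -> str:
    """Soft cap: past the daily limit, degrade to the cheaper model
    instead of returning an error."""
    return PRIMARY_MODEL if tokens_used_today < cap else FALLBACK_MODEL


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, max_delay: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A Retry-After header given as whole seconds wins over the computed
    backoff. Otherwise use exponential backoff with full jitter so that
    many clients do not retry in lockstep.
    """
    if retry_after is not None:
        try:
            return float(max(int(retry_after), 0))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    ceiling = min(max_delay, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

A caller would sleep for retry_delay(attempt, response_headers.get("Retry-After")) between attempts and give up after a fixed number of tries, or when the server asks for a longer wait than the request budget allows.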
Vector search inside Postgres
Choosing pgvector is one of the more underrated decisions you can make. It's a Postgres extension that adds a vector type and similarity-search operators. You get nearest-neighbor queries with the same SQL you use for everything else, the same transaction guarantees, and the same backup story.
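To make that concrete, here is roughly what the SQL can look like, held in Python strings. Treat it as a sketch under assumptions: the table and column names, the 1536 dimension, and the psycopg-style %(name)s placeholders are all illustrative choices, not the schema from this project.

```python
from typing import Sequence

# Hypothetical schema. Match the vector dimension to your embedding model.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    user_id   bigint NOT NULL,
    content   text   NOT NULL,
    embedding vector(1536) NOT NULL
);

-- Approximate nearest-neighbor index for cosine distance.
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator (smaller means closer).
# The WHERE clause is ordinary SQL, which is the point of keeping
# vectors next to the rest of the data.
NEAREST_SQL = """
SELECT id, content, embedding <=> %(query)s::vector AS distance
FROM chunks
WHERE user_id = %(user_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a driver that supports named parameters, the query would be executed with something like {"query": to_vector_literal(embedding), "user_id": 42, "k": 5}.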
The gotchas worth knowing:
Indexes matter a lot. A sequential scan over a few hundred thousand vectors will be slow. IVFFlat or HNSW indexes are not optional past a certain table size. Both are approximate indexes: they trade a little recall for a lot of speed.
Chunking is where retrieval quality lives. Embedding a 50-page document as one vector is useless. Chunk strategy (size, overlap, semantic boundaries) affects results more than the embedding model choice.
Re-embedding is expensive. When you change embedding models or chunking strategy, you re-embed everything. Design the ingestion pipeline so this isn't catastrophic.
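The simplest version of the chunking knobs mentioned above (size and overlap) is a sliding window. The sketch below is deliberately naive: it counts characters rather than tokens and ignores semantic boundaries, and the function name and default sizes are my own illustrative choices.

```python
from typing import List


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into windows of at most `size` characters, each sharing
    `overlap` characters with the window before it."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For example, chunk_text("abcdefghij", size=4, overlap=1) returns ["abcd", "defg", "ghij"]. Storing the chunking parameters and the embedding model name alongside each row is one way to make a later re-embed a filtered batch job instead of a full rebuild.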
Subscription billing: webhooks are the product
Stripe handles the payment UI and the regulatory complexity. Your job is to receive its webhooks and turn them into entitlements in your database.
The non-negotiables:
Webhook handlers must be idempotent. Stripe can deliver the same event more than once. If your handler applies it twice, you may grant access twice or revoke it incorrectly.
Webhook signature verification on every request. Without it, anyone can POST a fake "customer.subscription.created" event to your endpoint.
Entitlements as a separate layer. Don't check Stripe's subscription status in every API handler. Maintain a local entitlements table updated by webhooks; check that. It's faster, more reliable, and survives Stripe outages.
Reconciliation jobs. Webhooks fail. A nightly job that pulls the truth from Stripe and reconciles your local state catches the gaps before users notice.
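The first three points fit in a short sketch. Assumptions to flag: SQLite stands in for Postgres so the example is self-contained, the table and function names are hypothetical, and the signature check is a hand-written version of Stripe's documented scheme (HMAC-SHA256 over "timestamp.payload"). In a real service you would more likely call stripe.Webhook.construct_event from the official library than verify by hand.

```python
import hashlib
import hmac
import sqlite3
import time
from typing import Optional


def verify_signature(payload: bytes, header: str, secret: str,
                     tolerance: int = 300, now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header of the form 't=<unix ts>,v1=<hex hmac>'."""
    parts = [p.strip().split("=", 1) for p in header.split(",") if "=" in p]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False  # stale timestamp: reject to limit replay attacks
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected.encode(), c.encode())
               for c in candidates)


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS processed_events "
                 "(event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS entitlements "
                 "(customer_id TEXT PRIMARY KEY, status TEXT NOT NULL)")


def handle_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply an event at most once. Returns False for a duplicate delivery."""
    with conn:  # one transaction: the dedupe row and the entitlement commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # this event ID was already processed
        sub = event["data"]["object"]
        conn.execute(
            "INSERT INTO entitlements (customer_id, status) VALUES (?, ?) "
            "ON CONFLICT(customer_id) DO UPDATE SET status = excluded.status",
            (sub["customer"], sub["status"]),
        )
    return True
```

API handlers then read only the entitlements table. This sketch does not handle out-of-order delivery, which is one more gap the reconciliation job has to cover.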
Auth: boring is correct
Email/password with hashed credentials, OAuth for social sign-in, JWT for sessions with refresh tokens, role-based access control on routes. None of this is novel and none of it should be. The mistake people make on AI products is wanting auth to be clever. It shouldn't be. Pick well-tested libraries and move on.
What I'd do differently today
A few things have changed in the two to three years since:
Structured outputs. Most LLM providers now support enforced JSON schema responses. When I built this, you parsed JSON out of completions and prayed. Today I'd use structured outputs everywhere they're supported and delete a lot of validation code.
Streaming standards. The streaming UX patterns are much more mature now. Vercel AI SDK and similar libraries handle a lot of what I wrote by hand.
Embedding model choices. The embedding landscape moved fast. I'd benchmark a few current options against the actual retrieval task instead of defaulting to OpenAI embeddings.
Observability from day one. I added logging and error reporting, but I'd build proper LLM-aware observability (token usage, latency percentiles, prompt versions, retrieval quality) into the request path from the first commit.
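For the observability point, the record I have in mind is small. The sketch below is hypothetical throughout: the field names, the model names, and especially the prices are placeholders, and retrieval quality is left out because it needs labeled data rather than a log line.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List, Tuple

# Hypothetical prices in USD per million tokens: (prompt, completion).
PRICES_PER_MTOK: Dict[str, Tuple[float, float]] = {
    "primary-model": (3.00, 15.00),
    "cheaper-model": (0.25, 1.25),
}


@dataclass(frozen=True)
class LLMCallRecord:
    """One row per LLM request, written in the request path."""
    user_id: str
    model: str
    prompt_version: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        prompt_price, completion_price = PRICES_PER_MTOK[self.model]
        return (self.prompt_tokens * prompt_price
                + self.completion_tokens * completion_price) / 1_000_000


def latency_percentile(records: List[LLMCallRecord], pct: int) -> float:
    """Return the pct-th latency percentile (1..99). Needs at least two records."""
    cuts = quantiles([r.latency_ms for r in records], n=100, method="inclusive")
    return cuts[pct - 1]
```

Grouping these records by user_id gives the per-user token accounting, and grouping by prompt_version shows whether a wording change moved cost or latency.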
Takeaways
If you're building an AI product backend:
One service and one database will take you further than you think
Most of the engineering work is around the LLM call, not the LLM call itself
pgvector is a serious option if you're already on Postgres
Webhook idempotency and entitlement layers will save you from a class of bugs that are hard to debug after the fact
Happy to dig into any of these in the comments — especially if you've made different architectural choices and have war stories to share.