
Shipping LLMs: the ops stack we actually use.

Every LLM product that survives contact with users runs on the same small set of infrastructure. Here is the stack we use across Tech Nerve engagements — what we chose, why, and what we changed our minds about.

Tech Nerve Engineering · Backend & DevOps · 8 min

An LLM feature is only as good as the infrastructure around it. Here is the stack we run for every engagement that touches GenAI — minus the client-specific bits. Assume the feature is already designed. This is what lives under it.

Gateway

Never call vendor APIs directly from your app. Put a gateway in front: LiteLLM, Portkey, or something built in-house. It routes between providers, adds retry-and-fallback, enforces per-tenant rate limits, redacts PII on the way out, and gives you a single place to rotate secrets. Five days of work that saves months.
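
The routing logic is the heart of it. A minimal sketch of retry-and-fallback in Python, where `call_provider`, the provider names, and the simulated failure are all stand-ins for real vendor SDKs:

```python
import time

PROVIDERS = ["primary", "fallback"]  # hypothetical provider names

class ProviderError(Exception):
    """Transient vendor failure: rate limit, timeout, 5xx."""

def call_provider(name: str, prompt: str) -> str:
    """Stand-in for a vendor SDK call; simulates the primary being down."""
    if name == "primary":
        raise ProviderError("rate limited")
    return f"{name}: completion for {prompt!r}"

def complete(prompt: str, retries: int = 2, backoff: float = 0.0) -> str:
    """Try each provider in order, retrying transient failures with backoff."""
    last_err = None
    for provider in PROVIDERS:
        for attempt in range(retries):
            try:
                return call_provider(provider, prompt)
            except ProviderError as err:
                last_err = err
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers failed: {last_err}")
```

A real gateway hangs the rate limiter and PII redactor on the same choke point, which is why it pays for itself so quickly.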

Retrieval

Hybrid by default. BM25 (OpenSearch or Tantivy) for recall on specific terms; dense retrieval (pgvector, Qdrant, or Vespa) for semantic match. Rerank on top (Cohere rerank-english, or a fine-tuned cross-encoder when we need on-prem). Never just one: single-retriever systems fail on the queries that matter most.
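
The post doesn't pin down how the two rankings are merged before the reranker; one common choice is reciprocal rank fusion. A minimal sketch, with made-up doc ids:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids via reciprocal rank fusion:
    each list contributes 1/(k + rank) per document, summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["d3", "d1", "d7"]   # lexical ranking (e.g. OpenSearch)
dense_hits = ["d1", "d9", "d3"]   # semantic ranking (e.g. Qdrant)
fused = rrf([bm25_hits, dense_hits])
```

Documents that appear in both lists float to the top, which is exactly the behavior you want feeding a cross-encoder reranker.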

Orchestration

We avoid heavyweight frameworks (LangChain in production, specifically). The pattern that ships is a thin state machine written in TypeScript or Python — a few hundred lines we own — calling typed tools. Temporal or Inngest if the workflow spans hours or needs durable state. LangGraph for complex agents where we want their graph semantics without the other thousand abstractions.
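
A thin state machine of this shape, sketched in Python: each step is a plain function that mutates state and names the next step. The `retrieve` and `generate` steps here are placeholders, not a real pipeline:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    step: str = "retrieve"
    context: dict = field(default_factory=dict)

def retrieve(state: State) -> str:
    state.context["docs"] = ["doc-1", "doc-2"]   # stand-in for real retrieval
    return "generate"

def generate(state: State) -> str:
    state.context["answer"] = f"answer from {len(state.context['docs'])} docs"
    return "done"

# Typed tool table: step name -> handler
STEPS: dict[str, Callable[[State], str]] = {
    "retrieve": retrieve,
    "generate": generate,
}

def run(state: State, max_steps: int = 10) -> State:
    """Advance the machine until the terminal step; cap iterations to avoid loops."""
    for _ in range(max_steps):
        if state.step == "done":
            return state
        state.step = STEPS[state.step](state)
    raise RuntimeError("state machine did not terminate")
```

A few hundred lines of this that you own is easier to debug than a framework's call stack, which is the whole argument.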

Storage

  • Postgres for transactional state. Always.
  • pgvector for the first 100k documents. Qdrant once that stops scaling or when we need metadata filters to be fast.
  • Redis for session state, rate limits, and caches of embeddings and rerank scores.
  • Object storage (S3 or R2) for raw documents and generated artefacts.

Observability

Langfuse or Braintrust for LLM-specific traces — every prompt, every completion, every token, every latency, correlated by session id. OpenTelemetry for the rest of the app, with trace context propagated through to the gateway so an LLM call appears as a span inside the parent request. Sampling rule: keep everything at first, downsample once you know what normal looks like.
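
The downsampling rule can be made deterministic per session, so a kept session keeps all of its traces. A sketch of that idea (the hash-bucket scheme is our illustration, not a Langfuse or OpenTelemetry feature):

```python
import hashlib

def keep_session(session_id: str, rate: float) -> bool:
    """Deterministic per-session sampling: hash the session id into [0, 1)
    and keep it if it falls under the rate. Every trace in a session gets
    the same decision, so sessions are never half-sampled."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Start with `rate=1.0` (keep everything), then dial it down once you know what normal looks like; the decision stays stable for any given session id.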

Evaluation

Offline evals (gold set + LLM judge + deterministic checks) run in CI on every PR. Online evals sample 1–2% of live traffic, score it with the same rubric, and feed a dashboard. Regressions page the on-call. The eval set itself is version-controlled alongside the prompts.
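
A sketch of the deterministic-check half of the offline eval: a gold set of (query, must-contain) cases and a pass rate the CI job can threshold. The gold cases and the `model` callable are placeholders:

```python
from typing import Callable

# Version-controlled alongside the prompts; cases here are invented examples.
GOLD_SET = [
    {"query": "refund window?", "must_contain": ["30 days"]},
    {"query": "support email?", "must_contain": ["@"]},
]

def deterministic_check(answer: str, must_contain: list[str]) -> bool:
    """Cheapest eval tier: exact substrings the answer must include."""
    return all(needle in answer for needle in must_contain)

def run_offline_eval(model: Callable[[str], str]) -> float:
    """Return the pass rate over the gold set; CI fails below a threshold."""
    passed = sum(
        deterministic_check(model(case["query"]), case["must_contain"])
        for case in GOLD_SET
    )
    return passed / len(GOLD_SET)
```

The LLM-judge tier slots in the same way: another scorer over the same gold set, so offline and online evals share one rubric.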

Safety

Input side: prompt-injection detector, PII redactor, jailbreak filter. Output side: critique pass, toxicity filter, PII re-redactor. None of these are a regex. All of them are testable. None of them are optional once you are shipping to real users.

Changed our minds recently

  1. We used to run one vector index per tenant. Now we run one index with tenant-scoped filters — operationally much simpler, and Qdrant/pgvector can handle it.
  2. We used to cache at the prompt level. Now we cache at the (query, retrieved-chunks) level — invalidates correctly, saves 40–60% of latency on hot queries.
  3. We used to treat the gateway as optional. Now it is the first thing we ship.
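
The second change is just a cache-key change. A sketch: key generations by the query and the exact chunk ids retrieved, so a reindex that changes retrieval invalidates the cache on its own (the key scheme is invented for illustration):

```python
import hashlib

def cache_key(query: str, chunk_ids: list[str]) -> str:
    """Key completions by the query AND the exact chunks retrieved, in order.
    If the index changes what comes back, the key changes and the stale
    completion is simply never read again."""
    payload = query + "\x00" + "\x00".join(chunk_ids)
    return "gen:" + hashlib.sha256(payload.encode()).hexdigest()
```

Contrast with a prompt-level key: the full prompt embeds the chunk text, so trivial formatting changes bust the cache while genuinely stale retrievals can slip through.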
Tagged
  • LLMOps
  • Infrastructure
  • Observability
  • Engineering