How to Deploy an Open Source LLM Reliably (on Kubernetes)

Originally published at heyyayush.hashnode.dev · 6 min read

Everyone is talking about LLMs, but very few people actually run one.

Most developers rely on APIs from OpenAI or Anthropic. It works great until you start thinking about cost, control, or what happens when that API isn’t available. That’s when a more interesting question shows up: can you run your own LLM reliably?

Not just something that works once on your laptop, but something that behaves like real infrastructure. Something observable, restartable, and usable through an actual interface. That’s what this guide is about. We’ll deploy an open-source model (TinyLlama) on Kubernetes using Kind, serve it with Ollama, build a simple chatbot UI, and monitor everything with Prometheus and Grafana. By the end, you won’t just have a working setup; you’ll understand how LLMs behave as systems.


Understanding What We’re Actually Building

Before running any commands, it’s worth understanding the shape of the system. This setup is small but complete: a Kubernetes cluster running an LLM service, connected to a UI, with a monitoring layer observing everything. When you send a message in the chatbot, it flows through multiple layers before coming back as a response.

User → Chat UI → LLM API → Kubernetes → Metrics → Grafana

That flow is the entire point. Once you understand this, you stop seeing LLMs as magic models and start seeing them as workloads running on infrastructure.


Why Not Just Use Gemini or Any Commercial LLM?

At this point, a fair question comes up: why go through all this effort (setting up Kubernetes, deploying an LLM, adding monitoring) when you can just open a browser, go to Google Gemini or use APIs from OpenAI, and get answers in seconds?

And honestly, for many use cases, you should. Commercial LLMs are fast, powerful, and incredibly easy to use. They handle scaling, reliability, and updates for you. If your goal is to get high-quality answers quickly, they are the best choice.

But that’s not the whole picture. Running your own LLM changes what you control. When it runs inside your infrastructure, there’s no external dependency, no per-request cost, and full control over data and behavior. The trade-off is simple: you gain control, but you take on responsibility.

This is where open-source LLMs make sense, especially for internal tools, privacy-sensitive use cases, or cost-heavy workloads at scale. But smaller models like TinyLlama still lag behind commercial models in reasoning and depth. So this isn’t about replacing commercial LLMs. It’s about understanding when to own the system, and when not to.

| Aspect | Open Source LLM (TinyLlama on K8s) | Commercial LLM (Gemini / OpenAI / Anthropic) |
| --- | --- | --- |
| Setup | Complex (infra + deployment) | Instant (UI / API) |
| Cost | No per-request cost (infra only) | Pay-per-use |
| Latency | Very low (local) | Depends on network/API |
| Quality | Moderate | High |
| Control | Full control over system | Limited |
| Privacy | Fully local | Data leaves your system |
| Reliability | Depends on your setup | Managed by provider |

The real decision isn’t which one is better; it’s which trade-offs you’re willing to make for your use case.

Clone the Project (Start Here)

Before we go deep into Kubernetes, LLMs, or observability, let’s get the system running locally. Everything in this guide is based on this repo, so clone it first:

git clone https://github.com/Ayushmore1214/llm-k8s-deployment.git
cd llm-k8s-deployment

Starting Clean

A lot of Kubernetes frustration comes from leftover state. Old configs, broken kubeconfigs, or leftovers from previously installed clusters (such as k3s/Rancher files in /etc/rancher) can silently interfere with your setup. So instead of debugging weird issues later, we start clean.

sudo rm -rf /etc/rancher
unset KUBECONFIG

Now we create a cluster using Kind (install it from kind.sigs.k8s.io if you don’t have it). Kind is perfect here because it runs Kubernetes inside Docker, making it fast and reproducible without any cloud dependency.

kind create cluster --name drdroid-llm
kind export kubeconfig --name drdroid-llm

At this point, you already have something powerful: a working Kubernetes cluster on your machine. Everything after this builds on top of it.


Deploying the LLM

Now we deploy the actual LLM infrastructure using a Kubernetes manifest.

kubectl apply -f llm-stack.yaml

This single command creates a deployment to keep the model running, a service to expose it inside the cluster, and the container configuration that defines how the model behaves.
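The repo’s llm-stack.yaml contains the actual definitions. As a rough sketch of what such a manifest typically looks like for Ollama (the image, labels, and service name here are illustrative assumptions that match the commands used later in this guide, not a verbatim copy of the repo’s file):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1                      # one pod is enough for a local demo
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama                # the label the kubectl commands below select on
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama's default API port
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service             # exposed inside the cluster only
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```

The important pattern is the pairing: the Deployment keeps the pod alive and restarts it on failure, while the Service gives it a stable in-cluster address.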

You can watch Kubernetes doing its job in real time:

kubectl get pods -w

Once the pod is running, we still need to load the model itself. Until this step, the system is alive—but empty.

kubectl exec -it $(kubectl get pods -l app=ollama -o name) -- ollama pull tinyllama

Now the model is actually usable. This is the moment where infrastructure turns into intelligence.


Observability Stack

Most guides stop after it works. That’s not enough. If you can’t see what your system is doing, you don’t really control it. That’s why we add observability using Prometheus and Grafana.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install obs prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.resources.requests.memory=300Mi \
  --set grafana.adminPassword=admin

This adds a full monitoring layer to your cluster. Prometheus collects metrics, and Grafana lets you visualize them. Instead of guessing what’s happening, you can now see CPU usage, memory consumption, and pod health in real time.


Accessing the System

By default, Kubernetes services are internal. To interact with them locally, we forward ports. Each port-forward blocks its terminal, so run them in separate terminals (or background them).

kubectl port-forward svc/ollama-service 11434:11434
kubectl port-forward svc/obs-grafana 3000:80
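Before wiring up a UI, you can sanity-check the forwarded API directly. This is a minimal sketch using only the Python standard library; the endpoint and JSON shape follow Ollama’s documented non-streaming /api/generate API, while the helper names are mine:

```python
import json
import urllib.request

# The port-forwarded Ollama endpoint from the commands above.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "tinyllama") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def extract_answer(response_json: dict) -> str:
    """Pull the generated text out of Ollama's non-streaming response."""
    return response_json.get("response", "")

def ask(prompt: str) -> str:
    """Send one prompt to the local Ollama service and return the answer."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_answer(json.load(resp))
```

With the port-forward active, `ask("Which model are you?")` should return the model’s reply as a plain string; if it raises a connection error, the port-forward or the pod is the first place to look.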

Now your LLM API is available on port 11434, and Grafana on port 3000. At this point everything is running, but it’s still not very interactive. That’s where the UI comes in.


Adding a Chatbot UI

Instead of calling APIs manually, we build a simple chatbot interface using Streamlit.

pip install streamlit requests
python3 -m streamlit run app.py

Open the UI in your browser and start asking questions. Things like:

  • “Which model are you?”

  • “What’s your latest knowledge?”

  • “Explain Kubernetes simply”

This is where the system finally feels complete. You’re interacting with your own LLM, running on your own infrastructure.


Watching the System Under Load

Now open Grafana at http://localhost:3000 and log in using admin/admin. Navigate to the Kubernetes dashboards and select the Ollama pod.

Go back to your chatbot and start sending prompts. You’ll notice something interesting immediately. Every request creates visible activity—CPU spikes, memory usage increases, and resource patterns start forming. The model isn’t just answering questions. It’s consuming compute. This is the shift in understanding that matters. LLMs are not just software. They are workloads.
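If you want to go beyond the prebuilt dashboards, you can run the same signals as raw queries in Prometheus or Grafana’s Explore view. These PromQL examples use the standard cAdvisor metric names scraped by kube-prometheus-stack; the pod-name regex is an assumption based on the deployment name used here:

```promql
# CPU used by the Ollama pod, in cores, averaged over 5 minutes
rate(container_cpu_usage_seconds_total{pod=~"ollama.*"}[5m])

# Current working-set memory of the Ollama pod, in bytes
container_memory_working_set_bytes{pod=~"ollama.*"}
```

Send a few prompts while these queries are open and you can watch the CPU rate jump with every inference.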


What This Demo Actually Teaches You

This project might look simple, but it introduces some core ideas that show up in real systems:

  • Kubernetes ensures your service stays alive even if it crashes

  • LLM inference consumes real, measurable resources

  • Observability is what turns debugging into something practical

Once you see these pieces working together, the abstraction breaks in a good way. You stop treating LLMs as black boxes and start understanding how they behave under the hood.


Conclusion

Running an LLM locally is cool. Running it on Kubernetes with a full observability stack? That’s how you actually learn how AI behaves in the real world. This setup is just the beginning. It’s containerized, orchestrated, and observable; this is the exact pattern that powers production AI at scale. Once you see the CPU spikes in Grafana and manage your own cluster, you stop looking at AI as a magic API and start seeing it as workload engineering.

The best way to learn isn’t by reading; it’s by breaking stuff. I built this project to be a playground. Go ahead: delete a pod, change things in llm-stack.yaml, or try to crash the model with a massive prompt. See if your Grafana dashboard catches it.

If this guide helped you make sense of the chaos, give the repo a ⭐, and let’s connect on LinkedIn. I’m always looking for people who want to build, ship, and break things. Thank You for Reading!!!
