How to Deploy an Open Source LLM Reliably (on Kubernetes)

Originally published at heyyayush.hashnode.dev · 6 min read

Everyone is talking about LLMs, but very few people actually run one.

Most developers rely on APIs from OpenAI or Anthropic. It works great until you start thinking about cost, control, or what happens when that API isn’t available. That’s when a more interesting question shows up: can you run your own LLM reliably?

Not just something that works once on your laptop, but something that behaves like real infrastructure. Something observable, restartable, and usable through an actual interface. That’s what this guide is about. We’ll deploy an open-source model (TinyLlama) on Kubernetes using Kind, serve it with Ollama, build a simple chatbot UI, and monitor everything with Prometheus and Grafana. By the end, you won’t just have a working setup; you’ll understand how LLMs behave as systems.


Understanding What We’re Actually Building

Before running any commands, it’s worth understanding the shape of the system. This setup is small but complete: a Kubernetes cluster running an LLM service, connected to a UI, with a monitoring layer observing everything. When you send a message in the chatbot, it flows through multiple layers before coming back as a response.

User → Chat UI → LLM API → Kubernetes → Metrics → Grafana

That flow is the entire point. Once you understand this, you stop seeing LLMs as magic models and start seeing them as workloads running on infrastructure.


Why Not Just Use Gemini or Any Commercial LLM?

At this point, a fair question comes up: why go through all this effort (setting up Kubernetes, deploying an LLM, adding monitoring) when you can just open a browser, go to Google Gemini or use APIs from OpenAI, and get answers in seconds?

And honestly, for many use cases, you should. Commercial LLMs are fast, powerful, and incredibly easy to use. They handle scaling, reliability, and updates for you. If your goal is to get high-quality answers quickly, they are the best choice.

But that’s not the whole picture. Running your own LLM changes what you control. When it runs inside your infrastructure, there’s no external dependency, no per-request cost, and full control over data and behavior. The trade-off is simple: you gain control, but you take on responsibility.

This is where open-source LLMs make sense, especially for internal tools, privacy-sensitive use cases, or cost-heavy workloads at scale. But smaller models like TinyLlama still lag behind commercial models in reasoning and depth. So this isn’t about replacing commercial LLMs. It’s about understanding when to own the system, and when not to.

| Aspect | Open Source LLM (TinyLlama on K8s) | Commercial LLM (Gemini / OpenAI / Anthropic) |
| --- | --- | --- |
| Setup | Complex (infra + deployment) | Instant (UI / API) |
| Cost | No per-request cost (infra only) | Pay-per-use |
| Latency | Very low (local) | Depends on network/API |
| Quality | Moderate | High |
| Control | Full control over system | Limited |
| Privacy | Fully local | Data leaves your system |
| Reliability | Depends on your setup | Managed by provider |

The real decision isn’t which one is better; it’s which trade-offs you’re willing to make for your use case.

Clone the Project (Start Here)

Before we go deep into Kubernetes, LLMs, or observability, let’s get the system running locally. Everything in this guide is based on this repo, so clone it first:

git clone https://github.com/Ayushmore1214/llm-k8s-deployment.git
cd llm-k8s-deployment

Starting Clean

A lot of Kubernetes frustration comes from leftover state. Old configs, broken kubeconfigs, or leftovers from previously installed clusters (such as k3s/Rancher files in /etc/rancher) can silently interfere with your setup. So instead of debugging weird issues later, we start clean.

sudo rm -rf /etc/rancher
unset KUBECONFIG

Now we create a cluster using Kind (install it from kind.sigs.k8s.io if you don’t have it). Kind is perfect here because it runs Kubernetes inside Docker, making it fast and reproducible without any cloud dependency.

kind create cluster --name drdroid-llm
kind export kubeconfig --name drdroid-llm

At this point, you already have something powerful: a working Kubernetes cluster on your machine. Everything after this builds on top of it.


Deploying the LLM

Now we deploy the actual LLM infrastructure using a Kubernetes manifest.

kubectl apply -f llm-stack.yaml

This single command creates a deployment to keep the model running, a service to expose it inside the cluster, and the container configuration that defines how the model behaves.
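The repo’s llm-stack.yaml contains the actual definitions. As a rough sketch of what such a manifest typically looks like for Ollama (the image, labels, and service name here are illustrative assumptions that match the commands used later in this guide, not a verbatim copy of the repo’s file):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1                      # one pod is enough for a local demo
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama                # the label the kubectl commands below select on
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama's default API port
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service             # exposed inside the cluster only
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```

The important pattern is the pairing: the Deployment keeps the pod alive and restarts it on failure, while the Service gives it a stable in-cluster address.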

You can watch Kubernetes doing its job in real time:

kubectl get pods -w

Once the pod is running, we still need to load the model itself. Until this step, the system is alive—but empty.

kubectl exec -it $(kubectl get pods -l app=ollama -o name) -- ollama pull tinyllama

Now the model is actually usable. This is the moment where infrastructure turns into intelligence.


Observability Stack

Most guides stop after it works. That’s not enough. If you can’t see what your system is doing, you don’t really control it. That’s why we add observability using Prometheus and Grafana.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install obs prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.resources.requests.memory=300Mi \
  --set grafana.adminPassword=admin

This adds a full monitoring layer to your cluster. Prometheus collects metrics, and Grafana lets you visualize them. Instead of guessing what’s happening, you can now see CPU usage, memory consumption, and pod health in real time.


Accessing the System

By default, Kubernetes services are internal. To interact with them locally, we forward ports. Each port-forward blocks its terminal, so run them in separate terminals (or background them).

kubectl port-forward svc/ollama-service 11434:11434
kubectl port-forward svc/obs-grafana 3000:80
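Before wiring up a UI, you can sanity-check the forwarded API directly. This is a minimal sketch using only the Python standard library; the endpoint and JSON shape follow Ollama’s documented non-streaming /api/generate API, while the helper names are mine:

```python
import json
import urllib.request

# The port-forwarded Ollama endpoint from the commands above.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "tinyllama") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def extract_answer(response_json: dict) -> str:
    """Pull the generated text out of Ollama's non-streaming response."""
    return response_json.get("response", "")

def ask(prompt: str) -> str:
    """Send one prompt to the local Ollama service and return the answer."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_answer(json.load(resp))
```

With the port-forward active, `ask("Which model are you?")` should return the model’s reply as a plain string; if it raises a connection error, the port-forward or the pod is the first place to look.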

Now your LLM API is available on port 11434, and Grafana on port 3000. At this point everything is running, but it’s still not very interactive. That’s where the UI comes in.


Adding a Chatbot UI

Instead of calling APIs manually, we build a simple chatbot interface using Streamlit.

pip install streamlit requests
python3 -m streamlit run app.py

Open the UI in your browser and start asking questions. Things like:

  • “Which model are you?”

  • “What’s your latest knowledge?”

  • “Explain Kubernetes simply”

This is where the system finally feels complete. You’re interacting with your own LLM, running on your own infrastructure.


Watching the System Under Load

Now open Grafana at http://localhost:3000 and log in using admin/admin. Navigate to the Kubernetes dashboards and select the Ollama pod.

Go back to your chatbot and start sending prompts. You’ll notice something interesting immediately. Every request creates visible activity—CPU spikes, memory usage increases, and resource patterns start forming. The model isn’t just answering questions. It’s consuming compute. This is the shift in understanding that matters. LLMs are not just software. They are workloads.
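If you want to go beyond the prebuilt dashboards, you can run the same signals as raw queries in Prometheus or Grafana’s Explore view. These PromQL examples use the standard cAdvisor metric names scraped by kube-prometheus-stack; the pod-name regex is an assumption based on the deployment name used here:

```promql
# CPU used by the Ollama pod, in cores, averaged over 5 minutes
rate(container_cpu_usage_seconds_total{pod=~"ollama.*"}[5m])

# Current working-set memory of the Ollama pod, in bytes
container_memory_working_set_bytes{pod=~"ollama.*"}
```

Send a few prompts while these queries are open and you can watch the CPU rate jump with every inference.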


What This Demo Actually Teaches You

This project might look simple, but it introduces some core ideas that show up in real systems:

  • Kubernetes ensures your service stays alive even if it crashes

  • LLM inference consumes real, measurable resources

  • Observability is what turns debugging into something practical

Once you see these pieces working together, the abstraction breaks in a good way. You stop treating LLMs as black boxes and start understanding how they behave under the hood.


Conclusion

Running an LLM locally is cool. Running it on Kubernetes with a full observability stack? That’s how you actually learn how AI behaves in the real world. This setup is just the beginning. It’s containerized, orchestrated, and observable; this is the exact pattern that powers production AI at scale. Once you see the CPU spikes in Grafana and manage your own cluster, you stop looking at AI as a magic API and start seeing it as workload engineering.

The best way to learn isn’t by reading; it’s by breaking stuff. I built this project to be a playground. Go ahead: delete a pod, change things in llm-stack.yaml, or try to crash the model with a massive prompt. See if your Grafana dashboard catches it.

If this guide helped you make sense of the chaos, give the repo a ⭐, and let’s connect on LinkedIn. I’m always looking for people who want to build, ship, and break things. Thank You for Reading!!!
