Everyone is talking about LLMs, but very few people actually run one.
Most developers rely on APIs from OpenAI or Anthropic. It works great until you start thinking about cost, control, or what happens when that API isn’t available. That’s when a more interesting question shows up: can you run your own LLM reliably?
Not just something that works once on your laptop, but something that behaves like real infrastructure. Something observable, restartable, and usable through an actual interface. That’s what this guide is about. We’ll deploy an open-source model (TinyLlama) on Kubernetes using Kind, serve it using Ollama, build a simple chatbot UI, and monitor everything using Prometheus and Grafana. By the end, you won’t just have a working setup; you’ll understand how LLMs behave as systems.
Understanding What We’re Actually Building
Before running any commands, it helps to understand what we’re building. This setup is a small but complete system: a Kubernetes cluster running an LLM service, connected to a UI, with a monitoring layer observing everything. When you send a message in the chatbot, it flows through multiple layers before coming back as a response.
User → Chat UI → LLM API → Kubernetes → Metrics → Grafana

That flow is the entire point. Once you understand this, you stop seeing LLMs as magic models and start seeing them as workloads running on infrastructure.
Why Not Just Use Gemini or Any Commercial LLM?
At this point, a fair question comes up: why go through all the effort of setting up Kubernetes, deploying an LLM, and adding monitoring when you can just open a browser, go to Google Gemini, or call an API from OpenAI and get answers in seconds?

And honestly, for many use cases, you should. Commercial LLMs are fast, powerful, and incredibly easy to use. They handle scaling, reliability, and updates for you. If your goal is to get high-quality answers quickly, they are the best choice.

But that’s not the whole picture. Running your own LLM changes what you control. When it runs inside your infrastructure, there’s no external dependency, no per-request cost, and full control over data and behavior. The trade-off is simple: you gain control, but you take on responsibility.
This is where open-source LLMs make sense, especially for internal tools, privacy-sensitive use cases, or cost-heavy workloads at scale. But smaller models like TinyLlama still lag behind commercial models in reasoning and depth. So this isn’t about replacing commercial LLMs. It’s about understanding when to own the system, and when not to.
| Aspect | Open Source LLM (TinyLlama on K8s) | Commercial LLM (Gemini / OpenAI / Anthropic) |
| --- | --- | --- |
| Setup | Complex (infra + deployment) | Instant (UI / API) |
| Cost | No per-request cost (infra only) | Pay-per-use |
| Latency | Very low (local) | Depends on network/API |
| Quality | Moderate | High |
| Control | Full control over system | Limited |
| Privacy | Fully local | Data leaves your system |
| Reliability | Depends on your setup | Managed by provider |
The real decision isn’t which one is better; it’s which trade-offs you’re willing to make for your use case.
Clone the Project (Start Here)
Before we go deep into Kubernetes, LLMs, or observability, let’s get the system running locally. Everything in this guide is based on this repo, so clone it first:
git clone https://github.com/Ayushmore1214/llm-k8s-deployment.git
cd llm-k8s-deployment
Starting Clean
A lot of Kubernetes frustration comes from leftover state. Old configs, broken kubeconfigs, or previously installed clusters can silently interfere with your setup. So instead of debugging weird issues later, we start clean.
sudo rm -rf /etc/rancher
unset KUBECONFIG
Now we create a cluster using Kind (install it first if it isn’t already on your machine). Kind is perfect here because it runs Kubernetes inside Docker, making the cluster fast to create and reproducible without any cloud dependency.
kind create cluster --name drdroid-llm
kind export kubeconfig --name drdroid-llm
At this point, you already have something powerful: a working Kubernetes cluster on your machine. Everything after this builds on top of it.
Deploying the LLM
Now we deploy the actual LLM infrastructure using a Kubernetes manifest.
kubectl apply -f llm-stack.yaml
This single command creates a deployment to keep the model running, a service to expose it inside the cluster, and the container configuration that defines how the model behaves.
You can watch Kubernetes doing its job in real time:
kubectl get pods -w
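If you’d rather script that check than watch the terminal, the official kubernetes Python client can poll the pod for you. This is a minimal sketch, assuming pip install kubernetes, the default namespace, and the app=ollama label used by the manifest:

```python
# poll_pod.py - wait for the Ollama pod to become Ready (sketch, assumes default namespace)
import time
from kubernetes import client, config

config.load_kube_config()  # uses the same kubeconfig that kubectl uses
v1 = client.CoreV1Api()

while True:
    pods = v1.list_namespaced_pod("default", label_selector="app=ollama").items
    if pods:
        pod = pods[0]
        ready = any(
            c.type == "Ready" and c.status == "True"
            for c in (pod.status.conditions or [])
        )
        print(f"{pod.metadata.name}: phase={pod.status.phase}, ready={ready}")
        if ready:
            break
    else:
        print("no pod with label app=ollama yet")
    time.sleep(5)
```

kubectl get pods -w gives you the same information; the point is that cluster state is plain data you can query from anywhere.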
Once the pod is running, we still need to load the model itself. Until this step, the system is alive—but empty.
kubectl exec -it $(kubectl get pods -l app=ollama -o name) -- ollama pull tinyllama
Now the model is actually usable. This is the moment where infrastructure turns into intelligence.
Observability Stack
Most guides stop once it works. That’s not enough. If you can’t see what your system is doing, you don’t really control it. That’s why we add observability using Prometheus and Grafana, installed via the kube-prometheus-stack Helm chart. (If you’ve never added the prometheus-community repo, run helm repo add prometheus-community https://prometheus-community.github.io/helm-charts followed by helm repo update first.)
helm install obs prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.resources.requests.memory=300Mi \
--set grafana.adminPassword=admin
This adds a full monitoring layer to your cluster. Prometheus collects metrics, and Grafana lets you visualize them. Instead of guessing what’s happening, you can now see CPU usage, memory consumption, and pod health in real time.
Accessing the System
By default, Kubernetes services are only reachable inside the cluster. To interact with them locally, we forward ports. Each port-forward keeps running in the foreground, so run them in separate terminals (or background them).
kubectl port-forward svc/ollama-service 11434:11434
kubectl port-forward svc/obs-grafana 3000:80
Now your LLM API is available on port 11434, and Grafana is accessible on port 3000. At this point, everything is running, but it’s still not very interactive. That’s where the UI comes in.
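Before building the UI, it’s worth sanity-checking the forwarded API directly. Here is a minimal sketch using Python’s requests library against Ollama’s standard HTTP endpoints (/api/tags to list models, /api/generate for a one-off prompt):

```python
# quick_check.py - talk to the port-forwarded Ollama API (sketch)
import requests

BASE = "http://localhost:11434"

# list the models the server knows about; tinyllama should appear after the pull
tags = requests.get(f"{BASE}/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])

# send a single non-streaming prompt
resp = requests.post(
    f"{BASE}/api/generate",
    json={"model": "tinyllama", "prompt": "Explain Kubernetes in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```

If the second call prints a response from tinyllama, the whole path (Kind, the Deployment, the Service, the port-forward) is working.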
Adding a Chatbot UI
Instead of calling APIs manually, we build a simple chatbot interface using Streamlit.
pip install streamlit requests
python3 -m streamlit run app.py
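The app.py in the repo is the source of truth, but if you want a feel for what a minimal version looks like, a sketch along these lines works, assuming the Ollama port-forward from the previous step is still running:

```python
# app.py (minimal sketch) - Streamlit chat UI in front of the local Ollama API
import requests
import streamlit as st

OLLAMA_URL = "http://localhost:11434/api/generate"

st.title("TinyLlama on Kubernetes")

if "messages" not in st.session_state:
    st.session_state.messages = []

# replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask something..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # non-streaming call to keep the sketch simple
    r = requests.post(
        OLLAMA_URL,
        json={"model": "tinyllama", "prompt": prompt, "stream": False},
        timeout=300,
    )
    answer = r.json().get("response", "(no response)")

    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)
```

The sketch uses non-streaming responses to stay short; a streaming version would display tokens as they are generated.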
Open the UI in your browser and start asking questions.
This is where the system finally feels complete. You’re interacting with your own LLM, running on your own infrastructure.
Watching the System Under Load
Now open Grafana at http://localhost:3000 and log in using admin/admin. Navigate to the Kubernetes dashboards and select the Ollama pod.

Go back to your chatbot and start sending prompts. You’ll notice something interesting immediately. Every request creates visible activity—CPU spikes, memory usage increases, and resource patterns start forming. The model isn’t just answering questions. It’s consuming compute. This is the shift in understanding that matters. LLMs are not just software. They are workloads.
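To make those spikes easier to see, you can generate load programmatically instead of typing prompts by hand. A small sketch that fires a burst of requests at the forwarded Ollama endpoint while you watch the dashboards (the prompts and request count here are arbitrary):

```python
# load_test.py - send a burst of prompts so the Grafana graphs have something to show (sketch)
import time
import requests

BASE = "http://localhost:11434"
PROMPTS = [
    "Summarize what Kubernetes does.",
    "Write a haiku about containers.",
    "Explain what a Deployment is.",
]

for i in range(10):
    prompt = PROMPTS[i % len(PROMPTS)]
    start = time.time()
    r = requests.post(
        f"{BASE}/api/generate",
        json={"model": "tinyllama", "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    print(f"request {i + 1}: {time.time() - start:.1f}s, {len(r.json()['response'])} chars")
```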
What Does This Demo Actually Teach You?
This project might look simple, but it introduces some core ideas that show up in real systems:
- Kubernetes keeps your service alive by restarting it when it crashes
- LLM inference consumes real, measurable resources
- Observability is what turns debugging from guesswork into something practical
Once you see these pieces working together, the abstraction breaks in a good way. You stop treating LLMs as black boxes and start understanding how they behave under the hood.
Conclusion
Running an LLM locally is cool. Running it on Kubernetes with a full observability stack? That’s how you actually learn how AI behaves in the real world. This setup is just the beginning. It’s containerized, orchestrated, and observable, and that’s the exact pattern that powers production AI at scale. Once you see the CPU spikes in Grafana and manage your own cluster, you stop looking at AI as a magic API and start seeing it as workload engineering.
The best way to learn isn’t by reading; it’s by breaking stuff. I built this project to be a playground. Go ahead: delete a pod, change things in llm-stack.yaml, or try to crash the model with a massive prompt. See if your Grafana dashboard catches it.
If this guide helped you make sense of the chaos, give the repo a ⭐, and let’s connect on LinkedIn. I’m always looking for people who want to build, ship, and break things. Thank You for Reading!!!