The Leash That Makes AI Polite

Jun 9 • 2 min read

How a 1951 statistics formula quietly keeps your chatbot from going feral

Two ideas sit underneath every modern AI assistant you've ever used. One is a seventy-year-old equation from information theory. The other is the training trick that turned a raw text-prediction engine into something you'd actually want to talk to. They're usually explained separately — but the interesting part is where they meet.

First: a way to measure surprise

Start with the Kullback–Leibler divergence. The name is intimidating; the idea isn't. It measures how far apart two sets of expectations are.

Picture the weather. Reality is 70% sunny, 30% rain. But you wake up convinced it's a coin flip, 50/50, and dress accordingly. KL divergence is the price you pay, on average, for believing the wrong thing. Get caught in a drizzle, get overdressed in the sun: small mismatches, small cost. If instead you'd believed rain was impossible and walked straight into a downpour, the cost would be infinite, because you gave zero probability to something that actually happens.

That's the whole concept. Two distributions — how things actually are, and what you assumed — and a single number for the gap between them.

Three things are worth keeping in your pocket. It's zero only when your belief matches reality exactly. It can climb toward infinity when you're badly wrong, so it's not a tidy score that sits between 0 and 1. And it's lopsided: swap the roles of reality and belief and you generally get a different number, so assuming a coin flip when the truth is 70/30 does not cost the same as assuming 70/30 when the truth is a coin flip. It isn't a neutral ruler; it's a directed penalty for surprise. (If you want the math laid out gently, with code, the DataCamp tutorial below is a good next stop.)
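To make those three properties concrete, here is a minimal sketch that plugs the weather numbers from above into the standard definition, KL(p || q) = sum of p(x) * log(p(x) / q(x)). The function name `kl_divergence` and the choice of bits (log base 2) are my own assumptions for illustration; the article doesn't fix a unit.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits: average extra cost of assuming q when p is true."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # outcomes that never happen contribute nothing
        if q_i == 0:
            return math.inf  # you ruled out something that actually happens
        total += p_i * math.log2(p_i / q_i)
    return total

reality = [0.7, 0.3]       # sunny, rain
coin_flip = [0.5, 0.5]     # the mistaken 50/50 belief
always_sunny = [1.0, 0.0]  # the belief that rain is impossible

print(kl_divergence(reality, reality))       # 0.0 (perfect belief)
print(kl_divergence(reality, coin_flip))     # about 0.119 bits
print(kl_divergence(coin_flip, reality))     # about 0.126 bits (roles swapped)
print(kl_divergence(reality, always_sunny))  # inf
```

The second and third lines are the lopsidedness in action: the same two distributions, a different number depending on which one plays the part of reality.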

Then: teaching a model some taste

Now the second idea. A raw language model is a spectacularly good guesser of the next word, trained on a firehose of internet text. What it is not is helpful, honest, or polite. Left alone, it'll happily continue a sentence in whatever direction the statistics pull it.

RLHF — Reinforcement Learning from Human Feedback — is what fixes that. The recipe, roughly: show humans two answers to the same question, ask which is better, collect thousands of those judgments, and train a second model to predict human taste. Then nudge the language model to chase higher scores from that taste-model. That's how it learns to be useful instead of merely plausible.
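The "which is better?" step can be sketched as a loss function. The article doesn't say which loss is used, so this assumes the Bradley-Terry formulation that is common in RLHF write-ups: the taste-model gives each answer a score, and the probability that the preferred answer wins is the sigmoid of the score difference. The name `preference_loss` and the example scores are hypothetical.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Negative log-probability that the human-preferred answer wins,
    assuming P(chosen wins) = sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The taste-model agrees with the human: small loss.
print(preference_loss(score_chosen=2.0, score_rejected=0.0))
# The taste-model has no opinion: loss is log(2), about 0.693.
print(preference_loss(score_chosen=0.0, score_rejected=0.0))
# The taste-model disagrees with the human: large loss.
print(preference_loss(score_chosen=0.0, score_rejected=2.0))
```

Training the taste-model amounts to lowering this loss across thousands of human judgments, so that answers people preferred end up with higher scores.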

Where they meet

Here's the catch. Let a model chase a reward with no restraint, and it cheats. It discovers that flattery scores well, that vague hedging is rarely marked wrong, that "Great question!" pleases the crowd. Unchecked, it drifts into a confident, sycophantic mush — gaming the scoreboard while quietly forgetting how to talk.

So engineers attach a leash. At every step they ask: how far has the model drifted from a frozen reference copy of itself, usually the version that existed just before the reward-chasing began? That distance is measured with KL divergence. Drift a little, fine. Drift too far, and the penalty pulls it back.
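In code, the leash is often just a subtraction. This is a rough sketch, not any particular library's implementation: it assumes the drift is estimated from one sampled answer as the sum of per-token log-probability differences between the model being trained and a frozen reference copy, and that the strength of the leash is a single coefficient. The names `penalized_reward` and `beta`, and every number below, are hypothetical.

```python
def penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Taste-model score minus a KL-style penalty for drifting from the reference.

    policy_logprobs / reference_logprobs: log-probability that each model
    assigned to every token of the same sampled answer.
    """
    # Single-sample estimate of KL(policy || reference). It can be negative
    # for one answer; only its average over many samples is guaranteed >= 0.
    drift = sum(lp - lr for lp, lr in zip(policy_logprobs, reference_logprobs))
    return reward - beta * drift

# An answer the taste-model likes (reward 2.0), but which the trained model
# now finds far more likely than the reference model did.
score = penalized_reward(
    reward=2.0,
    policy_logprobs=[-0.2, -0.1, -0.3],
    reference_logprobs=[-1.0, -0.9, -1.5],
)
print(score)  # roughly 1.72: the drift of 2.8 costs 0.28 at beta=0.1
```

A larger `beta` shortens the leash and keeps the model close to how it used to talk; a smaller one lets the reward pull harder.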

The polished, agreeable voice of your favorite chatbot is the product of exactly this tension: a reward tugging it toward what people like, and a 1951 formula tugging the other way, whispering don't go feral.

Go deeper

KL divergence, with intuition and code — DataCamp's tutorial: https://www.datacamp.com/tutorial/kl-divergence
RLHF, from the ground up — Nathan Lambert's Reinforcement Learning from Human Feedback (free online book): https://arxiv.org/abs/2504.12501 (web version at rlhfbook.com)

2 Comments

Jarod42 · Answer 1 · 2026-06-10T11:20:53+0000

Jarod42 • Jun 10

Nice explanation. Have you noticed any trade-off between being polite and being direct in model responses?

Hussein Mahdi • Jun 10

@Jarod42 Thanks! Honestly I came at this more from curiosity than deep LLM work, my real love is neural nets. But yeah, from what I've read the politeness/directness trade-off is real: the reward can favor the softer, hedgier answer.
