Fundamentals of Large Language Models: Understanding LLM Architectures

What is an LLM?

An LLM (Large Language Model) is fundamentally a probabilistic model that predicts distributions over vocabulary tokens. At its core, an LLM operates over a fixed set of tokens called a vocabulary and, given a context, assigns a probability to each token that could appear next.

The "Large" in LLM refers to the number of parameters the model contains. These models can have billions or even trillions of parameters, allowing them to capture complex patterns in language. While there's no universally agreed-upon threshold for what constitutes "large," modern LLMs typically range from hundreds of millions to hundreds of billions of parameters.
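
To make the "distribution over tokens" idea concrete, here is a minimal sketch that asks a small pretrained model for its next-token probabilities. It assumes the Hugging Face transformers library and PyTorch are installed and uses the gpt2 checkpoint purely as an example.

```python
# Minimal sketch: inspect an LLM's next-token probability distribution.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# `gpt2` is used only as a small, convenient example checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # shape: (1, seq_len, vocab_size)

# Softmax over the final position gives a probability for every token in the vocabulary.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>10}  p={prob.item():.3f}")
```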

Understanding Vocabulary Size

Recent research has revealed that vocabulary size plays a crucial role in LLM performance, with optimal vocabulary sizes depending on the compute budget used for training. Most modern LLMs use vocabulary sizes ranging from 30,000 to 100,000 tokens, though this varies significantly:

  • BERT: 30,000 tokens (WordPiece tokenization)
  • GPT-3: ~50,000 tokens
  • Llama 2: 32,000 tokens
  • RoBERTa: 50,265 tokens (Byte-Pair Encoding)
  • T5: 32,128 tokens

Research suggests that many existing LLMs use suboptimal vocabulary sizes; for example, one recent scaling study estimates that Llama 2 70B's optimal vocabulary size is at least 216,000 tokens, roughly seven times larger than its actual 32,000-token vocabulary.
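
You can read these vocabulary sizes directly off the tokenizers themselves. A minimal sketch, assuming the Hugging Face transformers library (only openly downloadable checkpoints are used; t5-small additionally needs the sentencepiece package):

```python
# Minimal sketch: compare vocabulary sizes across a few public tokenizers.
# Assumes the Hugging Face `transformers` library; the checkpoints below are examples.
from transformers import AutoTokenizer

checkpoints = ["bert-base-uncased", "gpt2", "roberta-base", "t5-small"]
for name in checkpoints:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} vocab size: {tok.vocab_size}")

# A larger vocabulary splits text into fewer tokens per sentence, but it also
# enlarges the embedding and output layers, which is the trade-off the
# vocabulary scaling studies analyze.
```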

LLM Architectures: The Three Main Types

Modern transformer models use one of three fundamental architectures: encoder-only, decoder-only, or encoder-decoder (sequence-to-sequence). Each architecture is optimized for different types of tasks.

1. Encoder-Only Models

Encoders are designed to convert sequences of words into vector representations (embeddings) that can be used for various predictive modeling tasks such as classification. These models use bidirectional attention, meaning they can look at context from both directions simultaneously.

Key Characteristics:

  • Use bidirectional attention to access all words in the input sentence
  • Specialized in understanding and analyzing text
  • Pretrained using masked language modeling objectives

Popular Encoder Models:

  • BERT (Bidirectional Encoder Representations from Transformers): Introduced in October 2018 by Google researchers, BERT uses 12 layers in its base variant and 24 in its large variant, and it dramatically improved the state of the art for many NLP tasks. The base model has 110M parameters, while the large model has 340M parameters.

  • RoBERTa (Robustly Optimized BERT Approach): An improved version of BERT that modifies the training procedure by removing the next-sentence prediction task, using larger mini-batches, and training on 160GB of text (10x more than BERT). It has 355M parameters in its large version.

  • DistilBERT: A distilled version that retains 95% of BERT's performance with only 60% of its parameters (66M).

  • DeBERTa: Uses disentangled attention mechanisms for improved performance.

  • ModernBERT: Released in 2024 as a state-of-the-art replacement for BERT, featuring an 8,192 token sequence length and significantly faster processing.

Primary Use Cases:

  • Text classification (sentiment analysis, topic classification)
  • Named entity recognition (NER)
  • Question answering
  • Semantic similarity tasks
  • Embedding generation for retrieval systems
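
As an example of the embedding use case above, the sketch below mean-pools an encoder's token representations into one vector per sentence and compares two sentences with cosine similarity. It assumes transformers and PyTorch, with bert-base-uncased as an illustrative checkpoint; production retrieval systems usually rely on purpose-built embedding models.

```python
# Minimal sketch: sentence embeddings from an encoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch;
# `bert-base-uncased` is used here purely for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The movie was fantastic.", "I really enjoyed this film."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # (batch, seq_len, hidden_dim)

# Mean-pool over real (non-padding) tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```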

2. Decoder-Only Models

Decoders are designed to generate new text by predicting the next word in a sequence. They use masked (causal) self-attention, which only allows tokens to attend to previous tokens in the sequence, ensuring autoregressive generation.
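
A quick way to see what "causal" means in practice: each position may attend only to itself and earlier positions, which is typically enforced with a triangular mask. The sketch below (PyTorch, purely for illustration) builds such a mask.

```python
# Minimal sketch: the causal (triangular) mask used in decoder self-attention.
# Assumes PyTorch. A True entry means "this position may NOT attend there".
import torch

seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# Row i is False (allowed) only for columns <= i, so token i never sees future tokens.
# During attention, the masked scores are set to -inf before the softmax.
```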

Key Characteristics:

  • Use unidirectional (causal) attention
  • Generate text one token at a time
  • Excel at creative text generation and completion

Popular Decoder Models:

  • GPT Series (GPT, GPT-2, GPT-3, GPT-4): The GPT series popularized the decoder-only architecture, with models ranging from 117M parameters in the original GPT (2018) to 175B in GPT-3 and undisclosed but substantially larger scales in GPT-4. GPT models have set the state of the art in natural language generation since 2018.

  • Llama (1, 2, 3): Meta's family of openly released models that has become foundational for many smaller LLM projects, with the Llama 2 family ranging up to 70 billion parameters.

  • Falcon: A series of open models from the Technology Innovation Institute, trained largely on the curated RefinedWeb dataset.

  • BLOOM: A 176B-parameter multilingual decoder-only model produced by the BigScience research collaboration.

  • Mistral: High-performance open-weight models with efficient architectures.

Primary Use Cases:

  • Text generation and completion
  • Conversational AI and chatbots
  • Creative writing assistance
  • Code generation
  • Question answering in chat format
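
To make the token-by-token generation loop concrete, here is a minimal greedy-decoding sketch. It assumes transformers and PyTorch and uses gpt2 as a stand-in for any decoder-only model.

```python
# Minimal sketch: greedy autoregressive decoding with a decoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch; `gpt2` is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids

for _ in range(20):                                # generate 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()               # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# In practice, model.generate() wraps this loop and adds sampling, beam search,
# stopping criteria, and key-value caching for speed.
```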

3. Encoder-Decoder Models

Encoder-decoder models combine both architectures, using an encoder to understand the input and a decoder to generate the output text. This makes them well suited to tasks that transform one sequence into another, such as translation or summarization.

Key Characteristics:

  • Bidirectional understanding through the encoder
  • Autoregressive generation through the decoder
  • Connected via cross-attention mechanism
  • Ideal for sequence-to-sequence tasks

Popular Encoder-Decoder Models:

  • T5 (Text-to-Text Transfer Transformer): Developed by Google, T5 treats every NLP task as a text-to-text problem, ranging from 60M to 11B parameters. It uses task-specific prefixes (e.g., "translate English to German:", "summarize:") to handle different tasks with the same architecture.

  • BART (Bidirectional and Auto-Regressive Transformers): Developed by Facebook (Meta) in 2019, BART combines strengths of BERT and GPT. It's pretrained by corrupting text in various ways (deleting words, shuffling sentences, masking tokens) and learning to reconstruct the original.

  • UL2: A unified framework for language understanding and generation.

  • mT5: Multilingual variant of T5 supporting 100+ languages.

Primary Use Cases:

  • Machine translation
  • Text summarization
  • Question answering with context
  • Text simplification
  • Paraphrasing
  • Data-to-text generation
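
As an example of T5's prefix-based, text-to-text interface described above, the sketch below runs a translation prompt through an encoder-decoder model. It assumes the transformers library (plus sentencepiece) and uses the t5-small checkpoint for illustration.

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes the Hugging Face `transformers` library (and sentencepiece);
# `t5-small` is used here only as an example checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 selects the task through a plain-text prefix on the input.
text = "translate English to German: The house is wonderful."
inputs = tokenizer(text, return_tensors="pt")

# The encoder reads the full input bidirectionally; the decoder generates the output
# token by token, attending to the encoder states via cross-attention.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```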

The Transformer Foundation

All these architectures are built on the transformer architecture, which uses self-attention mechanisms to process input sequences. The original transformer was proposed in the 2017 paper "Attention Is All You Need" by Google researchers.

Key Components of Transformers:

  1. Tokenization: Converting text into discrete tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece.

  2. Embeddings: Converting tokens into dense vector representations.

  3. Self-Attention Layers: Mechanisms that allow each token to attend to other tokens in the sequence, capturing contextual relationships (a minimal sketch follows this list).

  4. Feed-Forward Networks: Processing representations through neural networks.

  5. Layer Normalization: Stabilizing training and improving convergence.

  6. Positional Encodings: Adding position information since attention has no inherent notion of sequence order.
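
To ground component 3, here is a minimal single-head, scaled dot-product self-attention sketch in PyTorch. The dimensions and random weights are arbitrary; real transformers use multiple heads plus the learned projections, residual connections, and normalization listed above.

```python
# Minimal sketch: single-head scaled dot-product self-attention.
# Assumes PyTorch; sizes and random weights are purely illustrative.
import math
import torch

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)                 # token embeddings (+ positional encodings)

# Learned projections map each token to query, key, and value vectors.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / math.sqrt(d_model)             # how strongly each token attends to every other token
weights = torch.softmax(scores, dim=-1)           # each row sums to 1
output = weights @ V                              # contextualized representations, (seq_len, d_model)

print(weights.shape, output.shape)
# Decoder-only models additionally apply a causal mask (see the earlier sketch)
# by setting future positions in `scores` to -inf before the softmax.
```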

Choosing the Right Architecture

When selecting an architecture for a specific task, consider whether you need bidirectional understanding (encoder), text generation (decoder), or sequence transformation (encoder-decoder).

Decision Framework:

  • Need deep understanding of text? → Use encoder-only models
  • Need to generate creative or conversational text? → Use decoder-only models
  • Need to transform one text form to another? → Use encoder-decoder models

Recent Trends and Future Directions

Almost every major LLM since GPT-3 has adopted the decoder-only architecture due to its simplicity and effectiveness at scale. However, encoder models remain critical for tasks like retrieval-augmented generation (RAG), classification, and entity extraction, and they still account for billions of downloads per month on model hubs.

Recent developments include reasoning models like OpenAI's o1 and DeepSeek-R1, which generate step-by-step analysis before producing answers, achieving better performance on complex tasks.

Alternative architectures like Mamba (state space models) are emerging as potential challengers to transformer dominance, offering linear-time processing for long sequences.

Understanding the fundamental differences between encoder, decoder, and encoder-decoder architectures is essential for working effectively with LLMs. Each architecture serves distinct purposes:

  • Encoders excel at understanding and analyzing text
  • Decoders shine at generating creative and coherent text
  • Encoder-decoders bridge both worlds for transformation tasks

As the field continues to evolve rapidly, these foundational concepts remain crucial for anyone working with or building upon large language models.

Are you working with LLMs in your projects? Which architecture have you found most useful for your use case? Share your experiences in the comments below!
