Fundamentals of Large Language Models: Understanding LLM Architectures

What is an LLM?

An LLM (Large Language Model) is fundamentally a probabilistic model that predicts distributions over vocabulary tokens. At its core, an LLM operates over a fixed set of tokens called a vocabulary and, given a context, assigns a probability to each token that could appear next.

The "Large" in LLM refers to the number of parameters the model contains. These models can have billions or even trillions of parameters, allowing them to capture complex patterns in language. While there's no universally agreed-upon threshold for what constitutes "large," modern LLMs typically range from hundreds of millions to hundreds of billions of parameters.
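
To make the "distribution over tokens" idea concrete, here is a minimal sketch that asks a small pretrained model for its next-token probabilities. It assumes the Hugging Face transformers library and PyTorch are installed and uses the gpt2 checkpoint purely as an example.

```python
# Minimal sketch: inspect an LLM's next-token probability distribution.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# `gpt2` is used only as a small, convenient example checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # shape: (1, seq_len, vocab_size)

# Softmax over the final position gives a probability for every token in the vocabulary.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>10}  p={prob.item():.3f}")
```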

Understanding Vocabulary Size

Recent research has revealed that vocabulary size plays a crucial role in LLM performance, with optimal vocabulary sizes depending on the compute budget used for training. Most modern LLMs use vocabulary sizes ranging from 30,000 to 100,000 tokens, though this varies significantly:

  • BERT: 30,000 tokens (WordPiece tokenization)
  • GPT-3: ~50,000 tokens
  • Llama 2: 32,000 tokens
  • RoBERTa: 50,265 tokens (Byte-Pair Encoding)
  • T5: 32,128 tokens

Research suggests that many existing LLMs use suboptimal vocabulary sizes; for example, one recent scaling study estimates that Llama 2 70B's optimal vocabulary size is at least 216,000 tokens, roughly seven times larger than its actual 32,000-token vocabulary.
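
You can read these vocabulary sizes directly off the tokenizers themselves. A minimal sketch, assuming the Hugging Face transformers library (only openly downloadable checkpoints are used; t5-small additionally needs the sentencepiece package):

```python
# Minimal sketch: compare vocabulary sizes across a few public tokenizers.
# Assumes the Hugging Face `transformers` library; the checkpoints below are examples.
from transformers import AutoTokenizer

checkpoints = ["bert-base-uncased", "gpt2", "roberta-base", "t5-small"]
for name in checkpoints:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20s} vocab size: {tok.vocab_size}")

# A larger vocabulary splits text into fewer tokens per sentence, but it also
# enlarges the embedding and output layers, which is the trade-off the
# vocabulary scaling studies analyze.
```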

LLM Architectures: The Three Main Types

Modern transformer models use one of three fundamental architectures: encoder-only, decoder-only, or encoder-decoder (sequence-to-sequence). Each architecture is optimized for different types of tasks.

1. Encoder-Only Models

Encoders are designed to convert sequences of words into vector representations (embeddings) that can be used for various predictive modeling tasks such as classification. These models use bidirectional attention, meaning they can look at context from both directions simultaneously.

Key Characteristics:

  • Use bidirectional attention to access all words in the input sentence
  • Specialized in understanding and analyzing text
  • Pretrained using masked language modeling objectives

Popular Encoder Models:

  • BERT (Bidirectional Encoder Representations from Transformers): Introduced in October 2018 by Google researchers, BERT uses 12 layers in its base variant and 24 in its large variant, and it dramatically improved the state of the art for many NLP tasks. The base model has 110M parameters, while the large model has 340M parameters.

  • RoBERTa (Robustly Optimized BERT Approach): An improved version of BERT that modifies the training procedure by removing the next-sentence prediction task, using larger mini-batches, and training on 160GB of text (10x more than BERT). It has 355M parameters in its large version.

  • DistilBERT: A distilled version that retains 95% of BERT's performance with only 60% of its parameters (66M).

  • DeBERTa: Uses disentangled attention mechanisms for improved performance.

  • ModernBERT: Released in 2024 as a state-of-the-art replacement for BERT, featuring an 8,192 token sequence length and significantly faster processing.

Primary Use Cases:

  • Text classification (sentiment analysis, topic classification)
  • Named entity recognition (NER)
  • Question answering
  • Semantic similarity tasks
  • Embedding generation for retrieval systems
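
As an example of the embedding use case above, the sketch below mean-pools an encoder's token representations into one vector per sentence and compares two sentences with cosine similarity. It assumes transformers and PyTorch, with bert-base-uncased as an illustrative checkpoint; production retrieval systems usually rely on purpose-built embedding models.

```python
# Minimal sketch: sentence embeddings from an encoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch;
# `bert-base-uncased` is used here purely for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The movie was fantastic.", "I really enjoyed this film."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # (batch, seq_len, hidden_dim)

# Mean-pool over real (non-padding) tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```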

2. Decoder-Only Models

Decoders are designed to generate new text by predicting the next word in a sequence. They use masked (causal) self-attention, which only allows tokens to attend to previous tokens in the sequence, ensuring autoregressive generation.
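
A quick way to see what "causal" means in practice: each position may attend only to itself and earlier positions, which is typically enforced with a triangular mask. The sketch below (PyTorch, purely for illustration) builds such a mask.

```python
# Minimal sketch: the causal (triangular) mask used in decoder self-attention.
# Assumes PyTorch. A True entry means "this position may NOT attend there".
import torch

seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# Row i is False (allowed) only for columns <= i, so token i never sees future tokens.
# During attention, the masked scores are set to -inf before the softmax.
```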

Key Characteristics:

  • Use unidirectional (causal) attention
  • Generate text one token at a time
  • Excel at creative text generation and completion

Popular Decoder Models:

  • GPT Series (GPT, GPT-2, GPT-3, GPT-4): The GPT series popularized the decoder-only architecture, with models ranging from 117M parameters in the original GPT (2018) to 175B in GPT-3 and undisclosed but substantially larger scales in GPT-4. GPT models have set the state of the art in natural language generation since 2018.

  • Llama (1, 2, 3): Meta's family of openly released models that has become foundational for many smaller LLM projects, with the Llama 2 family ranging up to 70 billion parameters.

  • Falcon: A series of open models from the Technology Innovation Institute, trained largely on the curated RefinedWeb dataset.

  • BLOOM: A 176B-parameter multilingual decoder-only model produced by the BigScience research collaboration.

  • Mistral: High-performance open-weight models with efficient architectures.

Primary Use Cases:

  • Text generation and completion
  • Conversational AI and chatbots
  • Creative writing assistance
  • Code generation
  • Question answering in chat format
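
To make the token-by-token generation loop concrete, here is a minimal greedy-decoding sketch. It assumes transformers and PyTorch and uses gpt2 as a stand-in for any decoder-only model.

```python
# Minimal sketch: greedy autoregressive decoding with a decoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch; `gpt2` is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids

for _ in range(20):                                # generate 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()               # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# In practice, model.generate() wraps this loop and adds sampling, beam search,
# stopping criteria, and key-value caching for speed.
```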

3. Encoder-Decoder Models

Encoder-decoder models combine both architectures, using an encoder to understand the input and a decoder to generate the output text. This makes them well suited to tasks that transform one sequence into another, such as translation or summarization.

Key Characteristics:

  • Bidirectional understanding through the encoder
  • Autoregressive generation through the decoder
  • Connected via cross-attention mechanism
  • Ideal for sequence-to-sequence tasks

Popular Encoder-Decoder Models:

  • T5 (Text-to-Text Transfer Transformer): Developed by Google, T5 treats every NLP task as a text-to-text problem, ranging from 60M to 11B parameters. It uses task-specific prefixes (e.g., "translate English to German:", "summarize:") to handle different tasks with the same architecture.

  • BART (Bidirectional and Auto-Regressive Transformers): Developed by Facebook (Meta) in 2019, BART combines strengths of BERT and GPT. It's pretrained by corrupting text in various ways (deleting words, shuffling sentences, masking tokens) and learning to reconstruct the original.

  • UL2: A unified framework for language understanding and generation.

  • mT5: Multilingual variant of T5 supporting 100+ languages.

Primary Use Cases:

  • Machine translation
  • Text summarization
  • Question answering with context
  • Text simplification
  • Paraphrasing
  • Data-to-text generation
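
As an example of T5's prefix-based, text-to-text interface described above, the sketch below runs a translation prompt through an encoder-decoder model. It assumes the transformers library (plus sentencepiece) and uses the t5-small checkpoint for illustration.

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes the Hugging Face `transformers` library (and sentencepiece);
# `t5-small` is used here only as an example checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 selects the task through a plain-text prefix on the input.
text = "translate English to German: The house is wonderful."
inputs = tokenizer(text, return_tensors="pt")

# The encoder reads the full input bidirectionally; the decoder generates the output
# token by token, attending to the encoder states via cross-attention.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```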

The Transformer Foundation

All these architectures are built on the transformer architecture, which uses self-attention mechanisms to process input sequences. The original transformer was proposed in the 2017 paper "Attention Is All You Need" by Google researchers.

Key Components of Transformers:

  1. Tokenization: Converting text into discrete tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece.

  2. Embeddings: Converting tokens into dense vector representations.

  3. Self-Attention Layers: Mechanisms that allow each token to attend to other tokens in the sequence, capturing contextual relationships (a minimal sketch follows this list).

  4. Feed-Forward Networks: Processing representations through neural networks.

  5. Layer Normalization: Stabilizing training and improving convergence.

  6. Positional Encodings: Adding position information since attention has no inherent notion of sequence order.
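
To ground component 3, here is a minimal single-head, scaled dot-product self-attention sketch in PyTorch. The dimensions and random weights are arbitrary; real transformers use multiple heads plus the learned projections, residual connections, and normalization listed above.

```python
# Minimal sketch: single-head scaled dot-product self-attention.
# Assumes PyTorch; sizes and random weights are purely illustrative.
import math
import torch

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)                 # token embeddings (+ positional encodings)

# Learned projections map each token to query, key, and value vectors.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / math.sqrt(d_model)             # how strongly each token attends to every other token
weights = torch.softmax(scores, dim=-1)           # each row sums to 1
output = weights @ V                              # contextualized representations, (seq_len, d_model)

print(weights.shape, output.shape)
# Decoder-only models additionally apply a causal mask (see the earlier sketch)
# by setting future positions in `scores` to -inf before the softmax.
```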

Choosing the Right Architecture

When selecting an architecture for a specific task, consider whether you need bidirectional understanding (encoder), text generation (decoder), or sequence transformation (encoder-decoder).

Decision Framework:

  • Need deep understanding of text? → Use encoder-only models
  • Need to generate creative or conversational text? → Use decoder-only models
  • Need to transform one text form to another? → Use encoder-decoder models

Recent Trends and Future Directions

Almost every major LLM since GPT-3 has adopted the decoder-only architecture due to its simplicity and effectiveness at scale. However, encoder models remain critical for tasks like retrieval-augmented generation (RAG), classification, and entity extraction, and they still account for billions of downloads per month on model hubs.

Recent developments include reasoning models like OpenAI's o1 and DeepSeek-R1, which generate step-by-step analysis before producing answers, achieving better performance on complex tasks.

Alternative architectures like Mamba (state space models) are emerging as potential challengers to transformer dominance, offering linear-time processing for long sequences.

Understanding the fundamental differences between encoder, decoder, and encoder-decoder architectures is essential for working effectively with LLMs. Each architecture serves distinct purposes:

  • Encoders excel at understanding and analyzing text
  • Decoders shine at generating creative and coherent text
  • Encoder-decoders bridge both worlds for transformation tasks

As the field continues to evolve rapidly, these foundational concepts remain crucial for anyone working with or building upon large language models.

Are you working with LLMs in your projects? Which architecture have you found most useful for your use case? Share your experiences in the comments below!
