Forget Prompt Engineering: Context Engineering is the New King of AI Development

Originally published at github.com

Introduction

We are witnessing a silent paradigm shift in the world of Artificial Intelligence. For the past two years, developers and enthusiasts have obsessed over "Prompt Engineering"—the art of crafting the perfect sentence to coax a specific answer from a chatbot. But as we transition from simple chatbots to autonomous Agent Systems, prompt engineering is hitting a wall.

The industry is moving toward a more complex, architectural discipline: Context Engineering.

If prompt engineering is about writing a good email, context engineering is about managing the entire filing system, memory, and attention span of an intelligent employee. It addresses the holistic curation of all information that enters the model's limited "attention budget." In this guide, we will explore the principles of Context Engineering, referencing the groundbreaking Agent Skills for Context Engineering framework, and build a practical Python tool to manage it.

The Myth of Infinite Context

Modern Large Language Models (LLMs) boast massive context windows—128k, 1M, or even 2M tokens. It is tempting to think, "I'll just dump my entire database into the prompt." However, empirical research and practical experience show that bigger is not always better.

The "Lost-in-the-Middle" Phenomenon

Models suffer from what researchers call "Attention Scarcity." As you fill the context window, the model's ability to retrieve specific facts degrades. This often manifests in distinct patterns:

  • Primacy Bias: The model remembers the beginning of the prompt well (System Instructions).
  • Recency Bias: The model remembers the end of the prompt well (The latest user question).
  • The Trough: Information buried in the middle is often ignored or hallucinated.

Context Engineering is the discipline of optimizing this limited budget to ensure high-signal tokens are prioritized, preventing your agent from becoming "senile" as the conversation grows.
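
You can observe this effect yourself with a simple "needle-in-a-haystack" probe. The sketch below builds test prompts with a key fact placed at the start, middle, and end of a long filler context; the needle sentence and filler text are illustrative placeholders, and the actual model call is left to you.

NEEDLE = "The access code for the vault is 7291."
FILLER = "The weather report mentioned light rain over the harbor. " * 200

def build_probe(depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return (FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
            + "\n\nQuestion: What is the access code for the vault?")

if __name__ == "__main__":
    for depth in (0.0, 0.5, 1.0):
        prompt = build_probe(depth)
        # Send `prompt` to your model of choice and check whether it returns "7291".
        print(f"Depth {depth:.0%}: prompt length = {len(prompt)} chars")

If the model answers reliably at 0% and 100% depth but stumbles at 50%, you are seeing the trough firsthand.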

Building a Context Manager in Python

To build robust agents, we need to move beyond simple "history appending" and implement active management strategies. We will build a Context Budget Manager. This Python class helps prevent context overflow by enforcing a "Token Budget" and intelligently trimming history.

Prerequisites and Setup

We will use the tiktoken library, which is the industry standard for counting tokens for OpenAI models (and acts as a good proxy for others).

pip install tiktoken
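
A quick sanity check confirms the install and shows the two calls we will rely on (the exact token count depends on the encoding):

import tiktoken

# Load the encoding used by GPT-4-style models and count tokens in a string.
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Context engineering treats attention as a budget.")
print(len(tokens))  # a small integer; exact count varies by encoding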

The ContextManager Class

The following code demonstrates how to strictly manage an agent's memory. It ensures that the System Prompt (the agent's identity) is never lost, while dynamically managing the conversation history.

import tiktoken

class ContextManager:
    def __init__(self, model="gpt-4", max_tokens=4000):
        """
        Initialize the manager with a specific model and token limit.
        Args:
            model (str): The model name to load encoding for.
            max_tokens (int): The hard limit for the context window.
        """
        try:
            self.encoder = tiktoken.encoding_for_model(model)
        except KeyError:
            # Fallback to cl100k_base if model not found
            self.encoder = tiktoken.get_encoding("cl100k_base")
            
        self.max_tokens = max_tokens
        self.system_prompt = ""
        self.history = []

    def set_system_prompt(self, prompt):
        """Sets the immutable system prompt."""
        self.system_prompt = prompt

    def add_message(self, role, content):
        """Adds a message to the history."""
        self.history.append({"role": role, "content": content})

    def count_tokens(self, text):
        """Returns the number of tokens in a string."""
        return len(self.encoder.encode(text))

    def get_optimized_context(self):
        """
        Constructs the context, ensuring it fits within the max_tokens budget.
        Strategy:
        1. Always include System Prompt.
        2. Include as much recent history as possible (Reverse Chronological).
        """
        # Calculate tokens reserved for system prompt
        system_tokens = self.count_tokens(self.system_prompt)
        current_tokens = system_tokens
        
        valid_history = []
        
        # Iterate backwards (newest to oldest)
        for msg in reversed(self.history):
            # Format roughly as the API expects for token counting
            msg_content = f"{msg['role']}: {msg['content']}\n"
            msg_tokens = self.count_tokens(msg_content)
            
            if current_tokens + msg_tokens <= self.max_tokens:
                valid_history.insert(0, msg) # Add to front to restore order
                current_tokens += msg_tokens
            else:
                # If we hit the limit, we stop adding older messages
                break 
        
        # Return the structured list
        return [
            {"role": "system", "content": self.system_prompt}
        ] + valid_history

# Example Usage
if __name__ == "__main__":
    # Initialize manager with a small limit to demonstrate trimming
    manager = ContextManager(max_tokens=100) 

    # 1. Set the Identity
    manager.set_system_prompt("You are a helpful Python coding assistant.")

    # 2. Simulate a long conversation
    manager.add_message("user", "Hi, I am learning Python.")
    manager.add_message("assistant", "That is great! Python is a powerful language.")
    manager.add_message("user", "What is a variable?")
    manager.add_message("assistant", "A variable is like a container for data.")
    manager.add_message("user", "Okay, show me a code example of a loop.")

    # 3. Retrieve Optimized Context
    final_messages = manager.get_optimized_context()

    print(f"Total History Count: {len(manager.history)}")
    print(f"Optimized Context Count: {len(final_messages) - 1}") # Excluding system
    print("\n--- Content Sent to LLM ---")
    for msg in final_messages:
        print(f"[{msg['role'].upper()}]: {msg['content']}")

Code Explanation

  1. tiktoken Integration: We use encoding_for_model to ensure our counts align with how the LLM sees text.
  2. Reverse Iteration Strategy: The get_optimized_context method works backwards. In a conversation, the most recent message is usually the most relevant to the immediate query. We add messages from newest to oldest until our token bucket is full.
  3. Protection of System Instructions: The system_prompt is added outside the loop. This ensures that no matter how long the user talks, the agent never forgets its core instructions (e.g., "You are a coding assistant").

::: Note
Note on Trimming: In a production environment, you wouldn't just silently drop messages. You might use a "Summarization Agent" to compress the dropped messages into a concise summary string and inject that back into the context.
:::
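
A minimal sketch of that pattern, building on the ContextManager above: here compress_dropped is a naive truncation stand-in for a real LLM summarization call, and the summary itself consumes budget, so reserve headroom for it in practice.

def compress_dropped(dropped_messages):
    """Stand-in summarizer: in production, replace this with an LLM call
    that condenses the dropped turns into a few sentences."""
    text = " ".join(m["content"] for m in dropped_messages)
    return text[:200] + "..." if len(text) > 200 else text

def get_context_with_summary(manager):
    """Build the context, injecting a summary of trimmed history
    instead of dropping it silently."""
    kept = manager.get_optimized_context()
    kept_history = kept[1:]  # everything after the system message
    dropped = manager.history[:len(manager.history) - len(kept_history)]
    if not dropped:
        return kept
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {compress_dropped(dropped)}",
    }
    return [kept[0], summary] + kept_history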

Advanced Strategy: The BDI Model

One of the most powerful concepts found in the Agent Skills repository is the implementation of cognitive architectures, specifically the BDI (Beliefs, Desires, Intentions) model.

Standard agents often hallucinate because they lack a "Mental State." They just predict the next word. By explicitly engineering sections of the context window to represent BDI, we ground the agent.

  • Beliefs: What the agent knows to be objectively true about the current environment (derived from RAG or file analysis).
  • Desires: The ultimate goals of the user or the system (e.g., "Refactor this code to be SOLID compliant").
  • Intentions: The immediate, step-by-step plan the agent has committed to executing in this turn.

::: Tip
Tip: Instead of a generic system prompt, structure your context like a state machine. Before the agent answers, force it to output a <thought> block where it updates its current Intentions based on its Beliefs.
:::
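
As a concrete sketch, here is one way to assemble such a context. The section labels and the build_bdi_context helper are illustrative conventions, not part of any standard API.

def build_bdi_context(beliefs, desires, intentions):
    """Assemble a BDI-structured system prompt from lists of short strings."""
    def section(title, items):
        bullets = "\n".join(f"- {item}" for item in items)
        return f"## {title}\n{bullets}"

    return "\n\n".join([
        section("Beliefs (verified facts about the environment)", beliefs),
        section("Desires (the user's goals)", desires),
        section("Intentions (the committed plan for this turn)", intentions),
        "Before answering, emit a <thought> block that updates your "
        "Intentions based on your current Beliefs.",
    ])

# Example usage
print(build_bdi_context(
    beliefs=["The repo uses Python 3.11.", "Tests fail in test_auth.py."],
    desires=["Refactor the auth module to be SOLID compliant."],
    intentions=["1. Read test_auth.py.", "2. Propose a minimal refactor."],
))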

Frequently Asked Questions

Q: Why not just use a Vector Database (RAG)?
RAG (Retrieval Augmented Generation) is a retrieval mechanism, not a context management strategy. RAG finds the documents, but Context Engineering decides how much of that text fits in the window, where to place it (start vs. middle), and how to format it for maximum attention.
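
As a sketch of that placement decision, assuming chunks already scored by a retriever (the score field, budget value, and pack_retrieved_chunks helper are all illustrative), you can exploit primacy and recency bias by packing the strongest chunks at the edges of the block:

def pack_retrieved_chunks(chunks, count_tokens, budget=2000):
    """Fit scored chunks into a token budget, alternating placement so the
    highest-scored text lands at the start and end (where attention is strongest).
    `chunks` is a list of {"text": str, "score": float} dicts."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    front, back, used = [], [], 0
    for i, chunk in enumerate(ranked):
        cost = count_tokens(chunk["text"])
        if used + cost > budget:
            break
        # Alternate: best chunk first, second-best last, weaker ones inward.
        (front if i % 2 == 0 else back).append(chunk["text"])
        used += cost
    return "\n\n".join(front + list(reversed(back)))

# Example: a rough word-count proxy for tokens
# pack_retrieved_chunks(docs, count_tokens=lambda t: len(t.split()))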

Q: Does this matter for GPT-4 or Claude 3?
Yes. While newer models have larger windows, API usage is billed per token. Sending 100k tokens of irrelevant history with every query is expensive and slow. Context Engineering reduces cost and latency while improving accuracy.

Q: Is "Lost in the Middle" solved?
Not entirely. Even the best models (like Gemini 1.5 or GPT-4o) show performance drops when the "needle" (the answer) is buried in massive "haystacks" of irrelevant text. Curating your context is always superior to dumping raw data.

Conclusion

As we move from chatty assistants to agents that perform actual work, the ability to write a clever prompt is becoming secondary to the ability to architect a clean context.

By treating the context window as a scarce, valuable resource—an "Attention Budget"—we can build agents that are cheaper, faster, and significantly smarter. Start by implementing simple token counting and history management (like the Python code above), and then explore advanced patterns like the BDI model.

Context is the new code. Manage it wisely.
