By
April 15, 2026
9 min read
Context Engineering: The Real Skill Behind Production AI Agents



Why Prompt Engineering Isn't Enough Anymore
For a while, "write better prompts" was the answer to most AI quality problems. If the model gave bad output, you tweaked the instructions. If it forgot something, you added it to the system message. If it hallucinated, you told it not to.
That worked when AI was a single-turn tool. When you asked a question and got an answer, prompt quality was basically the whole game.
Agents broke that model. A multi-step agent working through a task over minutes or hours doesn't have a single prompt — it has a constantly evolving stream of information going into an LLM over and over again. The system prompt is one piece. But so is the conversation history, the outputs of tools the agent called, retrieved documents, structured state, error messages, and the instructions it passed to subagents. What the model actually sees at each inference call is the product of a dozen decisions you made at the system level, not just what you wrote in the system message.
This is what context engineering addresses. Andrej Karpathy put it simply in a post that got over 14,000 likes: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task."
The term has since exploded in usage — search interest grew over 6,000% year-over-year, and teams building serious agentic systems are treating it as the primary design discipline. This post breaks down what context engineering actually is, what lives inside a well-designed context, and where teams consistently get it wrong.
What Context Engineering Actually Means
Context engineering is the discipline of designing and managing what information an AI model receives at each inference call — not just the prompt text, but the entire information environment the model operates in.
If prompt engineering is writing the right question, context engineering is building the right library before the question gets asked.
The formal definition from Anthropic's engineering blog frames it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." LangChain's take is slightly broader: "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task."
Both definitions agree on one thing: it's a systems problem, not a text problem. You're not writing a better prompt. You're architecting the information pipeline that feeds every inference call your agent makes.
What Actually Lives in a Context Window
To do context engineering well, you first need to know what you're working with. A production agent's context window is not just a chat transcript. At any given moment, it typically contains some combination of the following.
The System Prompt
The foundational instructions — what the agent is, what it can do, how it should behave, what tools it has access to. Most teams treat this as a static block of text, but in well-designed systems it's templated and dynamically populated. The persona, the available tools, the constraints — all of these can vary depending on the user, the task, or the state of the world.
Conversation History
The previous turns of interaction. This is where most naive implementations go wrong. If you include the full conversation history every time, you burn through tokens fast and introduce noise. If you truncate naively, you lose critical context — the goal set three messages ago, the user constraint mentioned in passing, the error that happened earlier. Good context engineering here means deciding which history to include, in what form, and what to summarize versus preserve verbatim.
Tool Outputs and Observations
When an agent calls a tool — searches the web, queries a database, runs code — the result comes back into the context. These outputs can be large. A web search result, a database query returning fifty rows, a code execution output with a stack trace. How you format and truncate these before they enter the context has significant downstream effects on the model's reasoning quality.
Retrieved Documents
Most production agents use some form of retrieval-augmented generation. Retrieved chunks land in the context alongside everything else. The question isn't just what to retrieve — it's how to present retrieved content, how much of it to include, whether to deduplicate, and how to signal to the model which sources are more authoritative than others.
Structured State
Agentic tasks often have explicit state: a plan being executed, a list of completed steps, an intermediate result being built up over time. Some teams serialize this into a structured format and include it explicitly in every context. Others derive it implicitly from history. The explicit approach is usually more reliable, even if it costs tokens.
Injected Background Knowledge
Domain knowledge that isn't in the model's training data, or isn't reliable enough to trust from training alone: company-specific terminology, user preferences, current policies, live data. This is often the most underinvested piece — teams spend effort on retrieval systems but neglect baseline factual grounding.
Context Engineering vs. Prompt Engineering vs. RAG
These three are related but distinct. Prompt engineering is about the specific instructions and examples you provide at query time. RAG is a retrieval mechanism — a method for finding relevant documents and inserting them into context. Context engineering is the broader discipline that encompasses both, plus everything else that determines what the model sees.
A reasonable mental model: RAG is one tool within context engineering. Prompt engineering addresses one component of the context (the instruction layer). Context engineering is the full architectural question of how you populate, manage, and maintain the model's information environment across an entire task.
The important implication: you can have excellent RAG and poor context engineering. If your retrieved documents are high quality but they're being inserted alongside a bloated conversation history, redundant tool outputs, and a system prompt that hasn't been maintained, the model is still working with bad context. Retrieval quality and context quality are separate problems that require separate attention.
The Core Patterns That Work
Write to External Memory, Read Selectively
One of the most reliable patterns for long-horizon agents: don't try to keep everything in the context window. Have the agent explicitly write important facts, decisions, and intermediate results to an external store — a key-value store, a vector database, or a structured file — and retrieve only what's needed at each step.
This mirrors how experienced professionals work: you don't hold everything in working memory. You write things down and look them up when needed. LlamaIndex's context engineering guide covers this pattern in detail, including the tradeoffs between different memory store types.
Summarize Aggressively
Long conversations should be summarized, not just truncated. When the model needs to know what happened in the first ten messages of a forty-message thread, a four-sentence summary is often more useful than the verbatim text — and costs far fewer tokens. The key is knowing when to trigger summarization and what to preserve verbatim. Direct quotes from users tend to matter more than verbatim filler responses.
Format Tool Outputs for Readability, Not Completeness
Raw tool outputs are not context-ready. A database result returning two hundred rows doesn't help the model unless you've shaped it. Extract the relevant fields, truncate large values, add brief labels explaining what each field means. Models reason better on formatted evidence than on raw data dumps — this is consistently observable in output quality, not just a theoretical preference.
Make State Explicit
If your agent is working through a multi-step task, include an explicit state block in every context refresh. Something like a structured summary of: what the goal is, what has been completed, what the current step is, and what constraints apply. Models lose track of goals when state is only implicit in conversation history. Making it explicit costs tokens but pays off substantially in reliability over long tasks.
Match Context to Task Phase
Not every inference call in a multi-step task needs the same context. During planning, the model needs goals and constraints. During execution of a specific step, it needs the current task and the relevant tools. During reflection, it needs the full trajectory. Context that's appropriate for one phase is often noise in another. Agents that switch context profiles as the task progresses tend to outperform those with a fixed context structure.
Where Teams Get Context Engineering Wrong
Most failures in production agentic systems trace back to context problems. A few patterns come up consistently.
Including Everything by Default
The naive implementation appends everything to the context and hopes the model figures out what matters. This works in demos. In production, over long tasks, it saturates the context window, introduces noise, and degrades output quality. Active context management — deciding what to remove or compress — is not optional for multi-step agents.
Trusting the Model to Ignore Irrelevant Information
A common assumption: "the model is smart enough to filter out the irrelevant parts." This is demonstrably false. LLMs are sensitive to context composition in ways that aren't obvious. Including irrelevant tool outputs, for example, measurably increases error rates on reasoning tasks. The model attends to what's present, not just what you intended it to focus on. This has been documented in research on lost-in-the-middle phenomena, where model performance degrades significantly when relevant information is buried in a large context.
Static System Prompts for Dynamic Situations
A fixed system prompt that says "you have access to tools X, Y, and Z" but doesn't adapt when the available tools change is a reliability hazard. The same applies to policies, user preferences, and available data sources. System prompts that don't reflect current reality create agent behavior that contradicts the actual environment — and the model has no way to know the discrepancy exists.
Ignoring Token Budget as a Design Constraint
Context engineering is partly an optimization problem. Every token in the context costs money at inference time and contributes to latency. Teams building agentic systems need to treat token budgets the same way they'd treat memory constraints in systems programming — as a first-class design constraint, not an afterthought. Anthropic's engineering blog specifically discusses budgeting context space across the different types of information an agent needs.
Conflating Retrieval Quality with Context Quality
Improving your embedding model or retrieval strategy is not the same as improving context engineering. A team might have excellent semantic search and still feed results into a chaotic, unstructured context that undermines the model's reasoning. Retrieval is one input to context. How that input gets composed with everything else is where context engineering lives.
Context Engineering in Agentic Systems
Single-turn LLM use is forgiving. Agentic systems are not. When a model makes ten, twenty, or fifty inference calls to complete a task, each call's context depends on the outputs of previous calls. Errors compound. Context pollution from one step degrades every subsequent step.
This is why Philipp Schmid's widely-shared post on context engineering frames it as the discipline of "designing and building dynamic systems" rather than a prompting technique. The system design matters more than any individual prompt.
Production agentic systems benefit from treating context as a first-class architectural concern. That means designing explicit context schemas that define what information each agent type receives at each task phase. It means building logging and inspection tooling that lets you see exactly what the model received at each inference call, not just what outputs came out. It means running automated context quality checks alongside output quality checks, and versioning system prompts the same way you'd version any other piece of infrastructure that agents depend on.
Most teams we see building agentic systems treat context as a byproduct of their code rather than an input to their design process. The ones getting reliable production performance have flipped that. They design the context first, then build the system that produces it. At Genta, context architecture is central to how we approach agent deployments — it's one of the primary reasons the same underlying models produce dramatically different results across different implementations.
Where to Start
Context engineering doesn't require a complete architectural overhaul. A few concrete starting points:
Audit what's actually in your context at inference time. Log the full context for a sample of agent runs and read it. You'll find things that don't belong there, redundancies, and format problems that wouldn't have been obvious from looking at outputs alone.
Measure token usage by category. How many tokens go to system prompt versus conversation history versus tool outputs versus retrieved documents? The distribution often reveals where optimization will have the most impact.
Test context variations systematically. The same task with different context structures often produces meaningfully different outputs. Treat context design as something you evaluate experimentally, not just something you write once.
Context engineering is not a new concept — good writers, good teachers, and good briefers have always known that what you include matters as much as what you say. What's new is that LLMs make the stakes explicit and measurable. Get the context right, and the model's capabilities come through. Get it wrong, and no amount of instruction tuning will compensate.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
By
April 15, 2026
9 min read
Context Engineering: The Real Skill Behind Production AI Agents



Why Prompt Engineering Isn't Enough Anymore
For a while, "write better prompts" was the answer to most AI quality problems. If the model gave bad output, you tweaked the instructions. If it forgot something, you added it to the system message. If it hallucinated, you told it not to.
That worked when AI was a single-turn tool. When you asked a question and got an answer, prompt quality was basically the whole game.
Agents broke that model. A multi-step agent working through a task over minutes or hours doesn't have a single prompt — it has a constantly evolving stream of information going into an LLM over and over again. The system prompt is one piece. But so is the conversation history, the outputs of tools the agent called, retrieved documents, structured state, error messages, and the instructions it passed to subagents. What the model actually sees at each inference call is the product of a dozen decisions you made at the system level, not just what you wrote in the system message.
This is what context engineering addresses. Andrej Karpathy put it simply in a post that got over 14,000 likes: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task."
The term has since exploded in usage — search interest grew over 6,000% year-over-year, and teams building serious agentic systems are treating it as the primary design discipline. This post breaks down what context engineering actually is, what lives inside a well-designed context, and where teams consistently get it wrong.
What Context Engineering Actually Means
Context engineering is the discipline of designing and managing what information an AI model receives at each inference call — not just the prompt text, but the entire information environment the model operates in.
If prompt engineering is writing the right question, context engineering is building the right library before the question gets asked.
The formal definition from Anthropic's engineering blog frames it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." LangChain's take is slightly broader: "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task."
Both definitions agree on one thing: it's a systems problem, not a text problem. You're not writing a better prompt. You're architecting the information pipeline that feeds every inference call your agent makes.
What Actually Lives in a Context Window
To do context engineering well, you first need to know what you're working with. A production agent's context window is not just a chat transcript. At any given moment, it typically contains some combination of the following.
The System Prompt
The foundational instructions — what the agent is, what it can do, how it should behave, what tools it has access to. Most teams treat this as a static block of text, but in well-designed systems it's templated and dynamically populated. The persona, the available tools, the constraints — all of these can vary depending on the user, the task, or the state of the world.
Conversation History
The previous turns of interaction. This is where most naive implementations go wrong. If you include the full conversation history every time, you burn through tokens fast and introduce noise. If you truncate naively, you lose critical context — the goal set three messages ago, the user constraint mentioned in passing, the error that happened earlier. Good context engineering here means deciding which history to include, in what form, and what to summarize versus preserve verbatim.
Tool Outputs and Observations
When an agent calls a tool — searches the web, queries a database, runs code — the result comes back into the context. These outputs can be large. A web search result, a database query returning fifty rows, a code execution output with a stack trace. How you format and truncate these before they enter the context has significant downstream effects on the model's reasoning quality.
Retrieved Documents
Most production agents use some form of retrieval-augmented generation. Retrieved chunks land in the context alongside everything else. The question isn't just what to retrieve — it's how to present retrieved content, how much of it to include, whether to deduplicate, and how to signal to the model which sources are more authoritative than others.
Structured State
Agentic tasks often have explicit state: a plan being executed, a list of completed steps, an intermediate result being built up over time. Some teams serialize this into a structured format and include it explicitly in every context. Others derive it implicitly from history. The explicit approach is usually more reliable, even if it costs tokens.
Injected Background Knowledge
Domain knowledge that isn't in the model's training data, or isn't reliable enough to trust from training alone: company-specific terminology, user preferences, current policies, live data. This is often the most underinvested piece — teams spend effort on retrieval systems but neglect baseline factual grounding.
Context Engineering vs. Prompt Engineering vs. RAG
These three are related but distinct. Prompt engineering is about the specific instructions and examples you provide at query time. RAG is a retrieval mechanism — a method for finding relevant documents and inserting them into context. Context engineering is the broader discipline that encompasses both, plus everything else that determines what the model sees.
A reasonable mental model: RAG is one tool within context engineering. Prompt engineering addresses one component of the context (the instruction layer). Context engineering is the full architectural question of how you populate, manage, and maintain the model's information environment across an entire task.
The important implication: you can have excellent RAG and poor context engineering. If your retrieved documents are high quality but they're being inserted alongside a bloated conversation history, redundant tool outputs, and a system prompt that hasn't been maintained, the model is still working with bad context. Retrieval quality and context quality are separate problems that require separate attention.
The Core Patterns That Work
Write to External Memory, Read Selectively
One of the most reliable patterns for long-horizon agents: don't try to keep everything in the context window. Have the agent explicitly write important facts, decisions, and intermediate results to an external store — a key-value store, a vector database, or a structured file — and retrieve only what's needed at each step.
This mirrors how experienced professionals work: you don't hold everything in working memory. You write things down and look them up when needed. LlamaIndex's context engineering guide covers this pattern in detail, including the tradeoffs between different memory store types.
Summarize Aggressively
Long conversations should be summarized, not just truncated. When the model needs to know what happened in the first ten messages of a forty-message thread, a four-sentence summary is often more useful than the verbatim text — and costs far fewer tokens. The key is knowing when to trigger summarization and what to preserve verbatim. Direct quotes from users tend to matter more than verbatim filler responses.
Format Tool Outputs for Readability, Not Completeness
Raw tool outputs are not context-ready. A database result returning two hundred rows doesn't help the model unless you've shaped it. Extract the relevant fields, truncate large values, add brief labels explaining what each field means. Models reason better on formatted evidence than on raw data dumps — this is consistently observable in output quality, not just a theoretical preference.
Make State Explicit
If your agent is working through a multi-step task, include an explicit state block in every context refresh. Something like a structured summary of: what the goal is, what has been completed, what the current step is, and what constraints apply. Models lose track of goals when state is only implicit in conversation history. Making it explicit costs tokens but pays off substantially in reliability over long tasks.
Match Context to Task Phase
Not every inference call in a multi-step task needs the same context. During planning, the model needs goals and constraints. During execution of a specific step, it needs the current task and the relevant tools. During reflection, it needs the full trajectory. Context that's appropriate for one phase is often noise in another. Agents that switch context profiles as the task progresses tend to outperform those with a fixed context structure.
Where Teams Get Context Engineering Wrong
Most failures in production agentic systems trace back to context problems. A few patterns come up consistently.
Including Everything by Default
The naive implementation appends everything to the context and hopes the model figures out what matters. This works in demos. In production, over long tasks, it saturates the context window, introduces noise, and degrades output quality. Active context management — deciding what to remove or compress — is not optional for multi-step agents.
Trusting the Model to Ignore Irrelevant Information
A common assumption: "the model is smart enough to filter out the irrelevant parts." This is demonstrably false. LLMs are sensitive to context composition in ways that aren't obvious. Including irrelevant tool outputs, for example, measurably increases error rates on reasoning tasks. The model attends to what's present, not just what you intended it to focus on. This has been documented in research on lost-in-the-middle phenomena, where model performance degrades significantly when relevant information is buried in a large context.
Static System Prompts for Dynamic Situations
A fixed system prompt that says "you have access to tools X, Y, and Z" but doesn't adapt when the available tools change is a reliability hazard. The same applies to policies, user preferences, and available data sources. System prompts that don't reflect current reality create agent behavior that contradicts the actual environment — and the model has no way to know the discrepancy exists.
Ignoring Token Budget as a Design Constraint
Context engineering is partly an optimization problem. Every token in the context costs money at inference time and contributes to latency. Teams building agentic systems need to treat token budgets the same way they'd treat memory constraints in systems programming — as a first-class design constraint, not an afterthought. Anthropic's engineering blog specifically discusses budgeting context space across the different types of information an agent needs.
Conflating Retrieval Quality with Context Quality
Improving your embedding model or retrieval strategy is not the same as improving context engineering. A team might have excellent semantic search and still feed results into a chaotic, unstructured context that undermines the model's reasoning. Retrieval is one input to context. How that input gets composed with everything else is where context engineering lives.
Context Engineering in Agentic Systems
Single-turn LLM use is forgiving. Agentic systems are not. When a model makes ten, twenty, or fifty inference calls to complete a task, each call's context depends on the outputs of previous calls. Errors compound. Context pollution from one step degrades every subsequent step.
This is why Philipp Schmid's widely-shared post on context engineering frames it as the discipline of "designing and building dynamic systems" rather than a prompting technique. The system design matters more than any individual prompt.
Production agentic systems benefit from treating context as a first-class architectural concern. That means designing explicit context schemas that define what information each agent type receives at each task phase. It means building logging and inspection tooling that lets you see exactly what the model received at each inference call, not just what outputs came out. It means running automated context quality checks alongside output quality checks, and versioning system prompts the same way you'd version any other piece of infrastructure that agents depend on.
Most teams we see building agentic systems treat context as a byproduct of their code rather than an input to their design process. The ones getting reliable production performance have flipped that. They design the context first, then build the system that produces it. At Genta, context architecture is central to how we approach agent deployments — it's one of the primary reasons the same underlying models produce dramatically different results across different implementations.
Where to Start
Context engineering doesn't require a complete architectural overhaul. A few concrete starting points:
Audit what's actually in your context at inference time. Log the full context for a sample of agent runs and read it. You'll find things that don't belong there, redundancies, and format problems that wouldn't have been obvious from looking at outputs alone.
Measure token usage by category. How many tokens go to system prompt versus conversation history versus tool outputs versus retrieved documents? The distribution often reveals where optimization will have the most impact.
Test context variations systematically. The same task with different context structures often produces meaningfully different outputs. Treat context design as something you evaluate experimentally, not just something you write once.
Context engineering is not a new concept — good writers, good teachers, and good briefers have always known that what you include matters as much as what you say. What's new is that LLMs make the stakes explicit and measurable. Get the context right, and the model's capabilities come through. Get it wrong, and no amount of instruction tuning will compensate.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
By
April 15, 2026
9 min read
Context Engineering: The Real Skill Behind Production AI Agents



Why Prompt Engineering Isn't Enough Anymore
For a while, "write better prompts" was the answer to most AI quality problems. If the model gave bad output, you tweaked the instructions. If it forgot something, you added it to the system message. If it hallucinated, you told it not to.
That worked when AI was a single-turn tool. When you asked a question and got an answer, prompt quality was basically the whole game.
Agents broke that model. A multi-step agent working through a task over minutes or hours doesn't have a single prompt — it has a constantly evolving stream of information going into an LLM over and over again. The system prompt is one piece. But so is the conversation history, the outputs of tools the agent called, retrieved documents, structured state, error messages, and the instructions it passed to subagents. What the model actually sees at each inference call is the product of a dozen decisions you made at the system level, not just what you wrote in the system message.
This is what context engineering addresses. Andrej Karpathy put it simply in a post that got over 14,000 likes: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task."
The term has since exploded in usage — search interest grew over 6,000% year-over-year, and teams building serious agentic systems are treating it as the primary design discipline. This post breaks down what context engineering actually is, what lives inside a well-designed context, and where teams consistently get it wrong.
What Context Engineering Actually Means
Context engineering is the discipline of designing and managing what information an AI model receives at each inference call — not just the prompt text, but the entire information environment the model operates in.
If prompt engineering is writing the right question, context engineering is building the right library before the question gets asked.
The formal definition from Anthropic's engineering blog frames it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." LangChain's take is slightly broader: "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task."
Both definitions agree on one thing: it's a systems problem, not a text problem. You're not writing a better prompt. You're architecting the information pipeline that feeds every inference call your agent makes.
What Actually Lives in a Context Window
To do context engineering well, you first need to know what you're working with. A production agent's context window is not just a chat transcript. At any given moment, it typically contains some combination of the following.
The System Prompt
The foundational instructions — what the agent is, what it can do, how it should behave, what tools it has access to. Most teams treat this as a static block of text, but in well-designed systems it's templated and dynamically populated. The persona, the available tools, the constraints — all of these can vary depending on the user, the task, or the state of the world.
Conversation History
The previous turns of interaction. This is where most naive implementations go wrong. If you include the full conversation history every time, you burn through tokens fast and introduce noise. If you truncate naively, you lose critical context — the goal set three messages ago, the user constraint mentioned in passing, the error that happened earlier. Good context engineering here means deciding which history to include, in what form, and what to summarize versus preserve verbatim.
Tool Outputs and Observations
When an agent calls a tool — searches the web, queries a database, runs code — the result comes back into the context. These outputs can be large. A web search result, a database query returning fifty rows, a code execution output with a stack trace. How you format and truncate these before they enter the context has significant downstream effects on the model's reasoning quality.
Retrieved Documents
Most production agents use some form of retrieval-augmented generation. Retrieved chunks land in the context alongside everything else. The question isn't just what to retrieve — it's how to present retrieved content, how much of it to include, whether to deduplicate, and how to signal to the model which sources are more authoritative than others.
Structured State
Agentic tasks often have explicit state: a plan being executed, a list of completed steps, an intermediate result being built up over time. Some teams serialize this into a structured format and include it explicitly in every context. Others derive it implicitly from history. The explicit approach is usually more reliable, even if it costs tokens.
Injected Background Knowledge
Domain knowledge that isn't in the model's training data, or isn't reliable enough to trust from training alone: company-specific terminology, user preferences, current policies, live data. This is often the most underinvested piece — teams spend effort on retrieval systems but neglect baseline factual grounding.
Context Engineering vs. Prompt Engineering vs. RAG
These three are related but distinct. Prompt engineering is about the specific instructions and examples you provide at query time. RAG is a retrieval mechanism — a method for finding relevant documents and inserting them into context. Context engineering is the broader discipline that encompasses both, plus everything else that determines what the model sees.
A reasonable mental model: RAG is one tool within context engineering. Prompt engineering addresses one component of the context (the instruction layer). Context engineering is the full architectural question of how you populate, manage, and maintain the model's information environment across an entire task.
The important implication: you can have excellent RAG and poor context engineering. If your retrieved documents are high quality but they're being inserted alongside a bloated conversation history, redundant tool outputs, and a system prompt that hasn't been maintained, the model is still working with bad context. Retrieval quality and context quality are separate problems that require separate attention.
The Core Patterns That Work
Write to External Memory, Read Selectively
One of the most reliable patterns for long-horizon agents: don't try to keep everything in the context window. Have the agent explicitly write important facts, decisions, and intermediate results to an external store — a key-value store, a vector database, or a structured file — and retrieve only what's needed at each step.
This mirrors how experienced professionals work: you don't hold everything in working memory. You write things down and look them up when needed. LlamaIndex's context engineering guide covers this pattern in detail, including the tradeoffs between different memory store types.
Summarize Aggressively
Long conversations should be summarized, not just truncated. When the model needs to know what happened in the first ten messages of a forty-message thread, a four-sentence summary is often more useful than the verbatim text — and costs far fewer tokens. The key is knowing when to trigger summarization and what to preserve verbatim. Direct quotes from users tend to matter more than verbatim filler responses.
Format Tool Outputs for Readability, Not Completeness
Raw tool outputs are not context-ready. A database result returning two hundred rows doesn't help the model unless you've shaped it. Extract the relevant fields, truncate large values, add brief labels explaining what each field means. Models reason better on formatted evidence than on raw data dumps — this is consistently observable in output quality, not just a theoretical preference.
Make State Explicit
If your agent is working through a multi-step task, include an explicit state block in every context refresh. Something like a structured summary of: what the goal is, what has been completed, what the current step is, and what constraints apply. Models lose track of goals when state is only implicit in conversation history. Making it explicit costs tokens but pays off substantially in reliability over long tasks.
Match Context to Task Phase
Not every inference call in a multi-step task needs the same context. During planning, the model needs goals and constraints. During execution of a specific step, it needs the current task and the relevant tools. During reflection, it needs the full trajectory. Context that's appropriate for one phase is often noise in another. Agents that switch context profiles as the task progresses tend to outperform those with a fixed context structure.
Where Teams Get Context Engineering Wrong
Most failures in production agentic systems trace back to context problems. A few patterns come up consistently.
Including Everything by Default
The naive implementation appends everything to the context and hopes the model figures out what matters. This works in demos. In production, over long tasks, it saturates the context window, introduces noise, and degrades output quality. Active context management — deciding what to remove or compress — is not optional for multi-step agents.
Trusting the Model to Ignore Irrelevant Information
A common assumption: "the model is smart enough to filter out the irrelevant parts." This is demonstrably false. LLMs are sensitive to context composition in ways that aren't obvious. Including irrelevant tool outputs, for example, measurably increases error rates on reasoning tasks. The model attends to what's present, not just what you intended it to focus on. This has been documented in research on lost-in-the-middle phenomena, where model performance degrades significantly when relevant information is buried in a large context.
Static System Prompts for Dynamic Situations
A fixed system prompt that says "you have access to tools X, Y, and Z" but doesn't adapt when the available tools change is a reliability hazard. The same applies to policies, user preferences, and available data sources. System prompts that don't reflect current reality create agent behavior that contradicts the actual environment — and the model has no way to know the discrepancy exists.
Ignoring Token Budget as a Design Constraint
Context engineering is partly an optimization problem. Every token in the context costs money at inference time and contributes to latency. Teams building agentic systems need to treat token budgets the same way they'd treat memory constraints in systems programming — as a first-class design constraint, not an afterthought. Anthropic's engineering blog specifically discusses budgeting context space across the different types of information an agent needs.
Conflating Retrieval Quality with Context Quality
Improving your embedding model or retrieval strategy is not the same as improving context engineering. A team might have excellent semantic search and still feed results into a chaotic, unstructured context that undermines the model's reasoning. Retrieval is one input to context. How that input gets composed with everything else is where context engineering lives.
Context Engineering in Agentic Systems
Single-turn LLM use is forgiving. Agentic systems are not. When a model makes ten, twenty, or fifty inference calls to complete a task, each call's context depends on the outputs of previous calls. Errors compound. Context pollution from one step degrades every subsequent step.
This is why Philipp Schmid's widely-shared post on context engineering frames it as the discipline of "designing and building dynamic systems" rather than a prompting technique. The system design matters more than any individual prompt.
Production agentic systems benefit from treating context as a first-class architectural concern. That means designing explicit context schemas that define what information each agent type receives at each task phase. It means building logging and inspection tooling that lets you see exactly what the model received at each inference call, not just what outputs came out. It means running automated context quality checks alongside output quality checks, and versioning system prompts the same way you'd version any other piece of infrastructure that agents depend on.
Most teams we see building agentic systems treat context as a byproduct of their code rather than an input to their design process. The ones getting reliable production performance have flipped that. They design the context first, then build the system that produces it. At Genta, context architecture is central to how we approach agent deployments — it's one of the primary reasons the same underlying models produce dramatically different results across different implementations.
Where to Start
Context engineering doesn't require a complete architectural overhaul. A few concrete starting points:
Audit what's actually in your context at inference time. Log the full context for a sample of agent runs and read it. You'll find things that don't belong there, redundancies, and format problems that wouldn't have been obvious from looking at outputs alone.
Measure token usage by category. How many tokens go to system prompt versus conversation history versus tool outputs versus retrieved documents? The distribution often reveals where optimization will have the most impact.
Test context variations systematically. The same task with different context structures often produces meaningfully different outputs. Treat context design as something you evaluate experimentally, not just something you write once.
Context engineering is not a new concept — good writers, good teachers, and good briefers have always known that what you include matters as much as what you say. What's new is that LLMs make the stakes explicit and measurable. Get the context right, and the model's capabilities come through. Get it wrong, and no amount of instruction tuning will compensate.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.