How to Efficiently Manage AI Agent Context in .NET

Introduction
Modern platforms for building AI agents significantly accelerate the development of such solutions. They allow you to easily select a large language model (LLM), manage prompts, and define tools that the agent can use automatically.
However, the real challenge for developers is ensuring the model receives the right context. Frameworks such as LangChain in Python or Semantic Kernel in .NET expect us to supply, with every request, the conversation history or other data describing the current interaction state.
Intuitively, passing the entire conversation history every time seems reasonable. It should help the agent better understand context and make more accurate decisions. In practice, however, several questions arise:
Does increasing the number of tokens in the prompt always improve responses?
How do we avoid exceeding the context length limits of language models?
What about the growing processing time and cost of subsequent agent calls?
In this article, I’ll show how to manage context passed to large language models using a .NET agent built with Semantic Kernel. I’ll present approaches that help control context size, ensuring stable costs and consistent solution quality.
Context Window – Limitation of LLMs
Large Language Models have a limit on the maximum context size, known as the context window. It includes all tokens used in a single model call — both input and output. Depending on the model, this limit can be, for example, 128k tokens (OpenAI GPT-4o).
In practice, when building AI agents, this limit must include:
system prompt
conversation history
tool outputs
new context (e.g., user query or system event)
Since the context window also includes output tokens, too much input data reduces space available for reasoning and response generation.
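To make the budgeting concrete, here is a minimal sketch with purely illustrative numbers (none of them come from a specific model or from the experiment later in this article):

```csharp
// Illustrative token budgeting — all numbers are hypothetical.
const int contextWindow = 128_000; // model's total context window
const int systemPrompt = 800;      // system prompt tokens
const int history = 90_000;        // accumulated conversation history
const int toolOutputs = 20_000;    // raw tool results kept in history
const int newUserQuery = 200;      // the latest user message

int inputTokens = systemPrompt + history + toolOutputs + newUserQuery;
int outputBudget = contextWindow - inputTokens;

Console.WriteLine($"Input: {inputTokens}, left for the response: {outputBudget}");
// With these numbers, only 17,000 tokens remain for reasoning and the answer.
```

The arithmetic is trivial, but it shows why an unbounded history eventually squeezes the model's room to respond.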
Does More Context Always Help?
Intuitively, more context should improve results — but this is not always true. The paper Lost in the Middle: How Language Models Use Long Contexts shows that increasing the number of documents in a prompt eventually stops improving answer quality. Additionally, models tend to use information at the beginning and end of the context most effectively, while performing worst on information in the middle (the “lost in the middle” effect).
How to Manage AI Agent Context
If more context does not always improve results, the natural question is: how should we manage it?
This topic is explained in more detail in the article Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents, which focuses on effective context management in systems based on LLMs.
In practice, there are two main approaches:
LLM summarization – reducing the conversation history to a shorter version,
Observation masking – replacing verbose tool outputs in the conversation history with short placeholders.
Each approach has its trade-offs. Summarization controls the context size better, but it adds extra cost and may lose some details. Masking is simpler and cheaper, but it does not stop the context from growing over time.
In practice, the best results come from a hybrid approach that combines both methods. In the next sections, I will show how to implement them using Semantic Kernel in .NET.
Context Management in Semantic Kernel (.NET)
AI Product Recommendation Agent
Imagine an online computer store where an AI agent helps users choose a laptop based on preferences such as budget, weight, battery life, and use case. The agent gathers requirements step by step, searches products, analyzes descriptions, and provides recommendations.
First, we create a basic agent, and then we implement context management strategies — observation masking and LLM summarization. We will analyze their impact on the number of input tokens and total tokens using additional analytics built in the project.
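The token analytics can be built on the usage metadata that Semantic Kernel attaches to each chat response. Below is a rough sketch; note that the `"Usage"` metadata key and the `ChatTokenUsage` cast depend on the OpenAI connector version you use, so treat them as assumptions to verify against your package versions:

```csharp
// Sketch: reading token usage from a Semantic Kernel chat response.
// In recent versions of the OpenAI connector, the "Usage" metadata entry
// is an OpenAI.Chat.ChatTokenUsage instance.
var result = await chatCompletionService.GetChatMessageContentAsync(
    chatHistory,
    executionSettings: settings,
    kernel: kernel);

if (result.Metadata is not null &&
    result.Metadata.TryGetValue("Usage", out var usageObj) &&
    usageObj is OpenAI.Chat.ChatTokenUsage usage)
{
    Console.WriteLine($"Input tokens:  {usage.InputTokenCount}");
    Console.WriteLine($"Output tokens: {usage.OutputTokenCount}");
    Console.WriteLine($"Total tokens:  {usage.TotalTokenCount}");
}
```

Logging these values per turn is enough to produce the comparison charts shown later.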
The source code used in this article is available in the GitHub repository SemanticKernelContextManagement.
Basic AI Agent
To build the agent in .NET, we will use the Semantic Kernel framework. A simple “Hello World” example can be easily created based on the official documentation.
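For completeness, the wiring behind such an agent can look roughly like this. The model name and the environment variable are placeholders, not values taken from the repository:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Sketch of the kernel setup; "gpt-4o" and OPENAI_API_KEY are placeholders.
var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
    modelId: "gpt-4o",
    apiKey: Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);

// Register the plugin so the agent can call GetProducts / GetProductDetails.
builder.Plugins.AddFromType<ProductsPlugin>();

var kernel = builder.Build();
var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();
```

From here, the agent logic only needs the kernel, the chat completion service, and a `ChatHistory` instance.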
For this experiment, we will add a ProductsPlugin that returns information about laptops. To keep things simple, the data will be stored in static JSON files inside the project and loaded into memory.
The ProductsPlugin exposes two functions for recommendations:
- A general one, which returns short information about all laptops:
[KernelFunction]
[Description("Get all products from shop. Returns only product names and short summaries.")]
public string GetProducts()
{
    var productSummaries = products.Select(p => new
    {
        p.Name,
        p.ShortSummary,
    });

    return JsonSerializer.Serialize(productSummaries, ProductJsonOptions);
}
- A detailed one, for a single product (laptop):
[KernelFunction]
[Description("Get full detailed description of a product by its name.")]
public string GetProductDetails([Description("Product name, e.g. Laptop Pro 14")] string productName)
{
    var product = products.FirstOrDefault(p => p.Name.Equals(productName, StringComparison.OrdinalIgnoreCase));
    if (product == null)
    {
        return $"Product not found: {productName}";
    }

    return JsonSerializer.Serialize(product, ProductJsonOptions);
}
In the first request, we also include a system prompt that tells the model to act as a shop assistant. One useful advantage of such an agent is that the assistant will always reply in the user’s language 🙂:
private const string SystemPrompt = """
You are a shop assistant that recommends products from our catalog.
Use the Products plugin (GetProducts, GetProductDetails) to read real catalog data before suggesting items.
Do not invent products, prices, or stock. If nothing matches, say so clearly.
Match the user's language in your replies.
""";
In Semantic Kernel, we can decide whether the agent should use the tools defined in the plugin. In this case, we use FunctionChoiceBehavior.Auto, which means the agent can use the functions, but it is not required to:
private readonly OpenAIPromptExecutionSettings openAIPromptExecutionSettings = new()
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
We build the program as a console application. We take the user’s request from the console, for example: “I want to buy a laptop.” In the basic approach, we add the question to the context as a UserMessage and send everything to the LLM. After receiving the response, we also save it in the history and return it to the user:
public async Task<string> GetRecommendationAsync(string userInput)
{
    ChatHistory.AddUserMessage(userInput);

    var result = await chatCompletionService.GetChatMessageContentAsync(
        ChatHistory,
        executionSettings: openAIPromptExecutionSettings,
        kernel: kernel);

    if (string.IsNullOrWhiteSpace(result.Content))
    {
        throw new InvalidOperationException($"{nameof(chatCompletionService)} did not return any content.");
    }

    ChatHistory.Add(result);
    return result.Content;
}
Example conversation with the assistant in the console:
Observation masking
In agents built on the Semantic Kernel platform, raw tool results are appended to the conversation history as messages with the role AuthorRole.Tool. They are not visible to the customer of our store; the user only sees the response processed by the agent, generated based on those results.
In the case of the function GetProductDetails, the result is a JSON containing full information about a specific laptop. Let’s assume that in most use cases, the same product information will be included in the response generated by the assistant. We make a deliberate decision to introduce observation masking. In this way, we reduce costs and response time in exchange for lower quality — a higher risk of hallucinations and a possible increase in the number of repeated tool calls.
We mask tool results by introducing a placeholder in their place:
private const string ObservationMaskedPlaceholder = "[TOOL_OBSERVATION_MASKED]";
We introduce the mechanism just before returning the response to the client, so that it takes effect for the next user query:
public async Task<string> GetRecommendationAsync(string userInput)
{
    ChatHistory.AddUserMessage(userInput);

    var result = await chatCompletionService.GetChatMessageContentAsync(
        ChatHistory,
        executionSettings: openAIPromptExecutionSettings,
        kernel: kernel);

    if (string.IsNullOrWhiteSpace(result.Content))
    {
        throw new InvalidOperationException($"{nameof(chatCompletionService)} did not return any content.");
    }

    ChatHistory.Add(result);

    // observation masking
    if (useObservationMasking)
    {
        MaskObservations();
    }

    return result.Content;
}
private void MaskObservations()
{
    var toolMessages = ChatHistory.Where(m => m.Role == AuthorRole.Tool);
    foreach (var message in toolMessages)
    {
        message.Content = ObservationMaskedPlaceholder;

        if (message.Items.Count == 0)
        {
            message.Items.Add(new TextContent(ObservationMaskedPlaceholder));
            continue;
        }

        var originals = message.Items.ToArray();
        message.Items.Clear();

        foreach (var item in originals)
        {
            if (item is FunctionResultContent functionResult)
            {
                message.Items.Add(new FunctionResultContent(
                    functionResult.FunctionName,
                    functionResult.PluginName,
                    functionResult.CallId,
                    ObservationMaskedPlaceholder));
            }
            else
            {
                message.Items.Add(new TextContent(ObservationMaskedPlaceholder));
            }
        }
    }
}
LLM summarization
Let’s imagine that customers of our store ask many questions about laptops during a single conversation. As the interaction history grows, the number of tokens sent to the model increases. As a result, subsequent responses are generated more and more slowly, the processing cost grows linearly, and the quality of the responses does not significantly improve. In this situation, we can deliberately introduce periodic AI-based summarization of the context. It allows us to control the size of the transmitted data, stabilizing both cost and response time. In the case of an asynchronous approach, the impact on latency is minimal. However, this comes at the cost of increased solution complexity and the risk of losing some information.
Alongside the agent, we add a summarization request that is executed after every 10 user turns. Note that in the hybrid approach, we trigger this mechanism after masking the observations:
public async Task<string> GetRecommendationAsync(string userInput)
{
    ChatHistory.AddUserMessage(userInput);

    var result = await chatCompletionService.GetChatMessageContentAsync(
        ChatHistory,
        executionSettings: openAIPromptExecutionSettings,
        kernel: kernel);

    if (string.IsNullOrWhiteSpace(result.Content))
    {
        throw new InvalidOperationException($"{nameof(chatCompletionService)} did not return any content.");
    }

    ChatHistory.Add(result);

    if (useObservationMasking)
    {
        MaskObservations();
    }

    // summarization
    if (useSummarization)
    {
        var numberOfTurnsWithUser = ChatHistory.Count(c => c.Role == AuthorRole.User);
        if (numberOfTurnsWithUser % 10 == 0)
        {
            await SummarizeChatAsync();
        }
    }

    return result.Content;
}
The summarization step can use a different, cheaper model than the main agent, because it does not require high-quality response generation, only information compression. In our case, for simplicity, we reuse the same chat completion service. However, for this query, we send a separate system prompt:
private const string SummarizationSystemPrompt = """
You compress conversation transcripts for a product-recommendation assistant.
Preserve concrete facts the user asked about, product names, prices, stock, and language the user used.
Do not invent catalog data. Output a concise third-person summary suitable as context for continuing the chat.
""";
we disable the ability to use tools:
private readonly OpenAIPromptExecutionSettings summarizationPromptExecutionSettings = new()
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.None()
};
and we also build a separate chat history as a transcript of the conversation so far between the agent and the user:
private static string FormatConversationForSummary(ChatHistory history)
{
    var blocks = new List<string>();

    foreach (var message in history)
    {
        if (message.Role == AuthorRole.System)
        {
            continue;
        }

        var parts = new List<string>();
        if (!string.IsNullOrWhiteSpace(message.Content))
        {
            parts.Add(message.Content.Trim());
        }

        foreach (var item in message.Items)
        {
            switch (item)
            {
                case FunctionCallContent call:
                    parts.Add($"[called {call.PluginName}.{call.FunctionName} with arguments: {call.Arguments}]");
                    break;
                case FunctionResultContent result:
                    parts.Add($"[result from {result.PluginName}.{result.FunctionName}: {FormatResultObject(result.Result)}]");
                    break;
                case TextContent text when !string.IsNullOrWhiteSpace(text.Text):
                    parts.Add(text.Text.Trim());
                    break;
            }
        }

        var body = parts.Count > 0 ? string.Join(" ", parts) : "(no text)";
        blocks.Add($"{message.Role}: {body}");
    }

    return string.Join(Environment.NewLine + Environment.NewLine, blocks);
}
Finally, we send a request for a summary, and from the response we create a new context for the AI agent, adding the original system prompt once again:
private async Task SummarizeChatAsync()
{
    var systemMessage = ChatHistory.FirstOrDefault(m => m.Role == AuthorRole.System);
    if (systemMessage is null)
    {
        return;
    }

    var transcript = FormatConversationForSummary(ChatHistory);
    if (string.IsNullOrWhiteSpace(transcript))
    {
        return;
    }

    var summarizationChat = new ChatHistory();
    summarizationChat.AddSystemMessage(SummarizationSystemPrompt);
    summarizationChat.AddUserMessage(
        "Summarize the following conversation transcript.\n\n" + transcript);

    var summaryResponse = await chatCompletionService.GetChatMessageContentAsync(
        summarizationChat,
        executionSettings: summarizationPromptExecutionSettings,
        kernel: null);

    if (string.IsNullOrWhiteSpace(summaryResponse.Content))
    {
        throw new InvalidOperationException("Summarization did not return any content.");
    }

    var shopSystemText = systemMessage.Content ?? string.Empty;
    ChatHistory.Clear();
    ChatHistory.AddSystemMessage(shopSystemText);
    ChatHistory.AddUserMessage("Summary of the conversation so far:\n" + summaryResponse.Content);
}
Comparison of context management strategies based on a 21-turn conversation
To verify the effectiveness of the token reduction mechanisms, I prepared a 21-turn conversation and ran it in 4 scenarios:
without context management (Basic),
only observation masking (OMasking),
only LLM summarization (Summarization),
observation masking and LLM summarization (OMask+Summ).
These are all the questions (translated to English):
1. List all products from the catalog with a short description of each.
2. I am looking for a laptop for programming and mobile work — what do you recommend and why?
3. I need something for gaming — which model makes sense and what are its key parameters?
4. Compare Laptop Pro 14 with DevBook 15 for a developer (.NET, VS / VS Code).
5. What is the exact technical description of the Creator Studio 16 model? List the specifications and use cases.
6. And Campus Note 14 — who is it for and how is it different from BalanceBook 14?
7. UltraLite 13 vs Silent Lite 13: which one is lighter or more “mobile” according to the catalog data?
8. Provide product details of Gaming Max 17 — CPU, RAM, display, weight if available in the description.
9. Do you have anything below a 14-inch screen? List models from the catalog that match.
10. Suggest a set: a laptop for studying at university and briefly justify each model in one sentence.
11. Let’s go back to Laptop Pro 14: repeat the most important advantages and limitations from the description.
12. Silent Lite 13 — full description with details; I am interested in the battery and display.
13. DevBook 15 — details from the catalog: what workloads is it suitable for and what is it not suitable for?
14. Does any laptop have a clear focus on silence or mobility in the description? Indicate which one and why.
15. I need something for photo/video (creator workflow) — what do you choose from the catalog and why?
16. BalanceBook 14 — full product details from the catalog.
17. Again, list all product names, but only names in one line, without descriptions.
18. Campus Note 14 — technical details and who it is for according to the full description.
19. If budget is not a concern: which one laptop would you recommend as “universal” and why?
20. Briefly compare three models: Gaming Max 17, Creator Studio 16, Laptop Pro 14 — who are they for.
21. Summarize our entire conversation: which models I considered and what you finally recommend.
I did not notice a significant difference in quality between the model’s responses in each scenario. For each of them, I measured the number of input tokens and total tokens for every interaction with the model and presented them on charts:
Example input tokens in different strategies
Absolute values:
| Experiment | 1 | 10 | 11 | 20 | 21 |
|---|---|---|---|---|---|
| Basic | 782 | 18652 | 19817 | 33657 | 17794 |
| OMask+Summ | 782 | 11902 | 2433 | 3005 | 1525 |
| OMasking | 782 | 11398 | 12045 | 21853 | 11055 |
| Summarization | 782 | 9651 | 1337 | 2357 | 2266 |
% compared to Basic:
| Experiment | 1 | 10 | 11 | 20 | 21 |
|---|---|---|---|---|---|
| Basic | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| OMask+Summ | 100.00% | 63.81% | 12.28% | 8.93% | 8.57% |
| OMasking | 100.00% | 61.11% | 60.78% | 64.93% | 62.13% |
| Summarization | 100.00% | 51.74% | 6.75% | 7.00% | 12.73% |
Reduction vs Basic:
| Experiment | 1 | 10 | 11 | 20 | 21 |
|---|---|---|---|---|---|
| Basic | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| OMask+Summ | 0.00% | 36.19% | 87.72% | 91.07% | 91.43% |
| OMasking | 0.00% | 38.89% | 39.22% | 35.07% | 37.87% |
| Summarization | 0.00% | 48.26% | 93.25% | 93.00% | 87.27% |
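The percentage tables are derived directly from the absolute values: reduction = 1 − (strategy tokens / Basic tokens). A quick sanity check against two cells of the tables above:

```csharp
// Reduction vs Basic, in percent, rounded to two decimals.
double Reduction(int strategy, int basic) =>
    Math.Round(100.0 * (1.0 - (double)strategy / basic), 2);

Console.WriteLine(Reduction(11902, 18652)); // OMask+Summ, turn 10 → 36.19
Console.WriteLine(Reduction(2433, 19817));  // OMask+Summ, turn 11 → 87.72
```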
As expected, applying observation masking led to a stable reduction in the number of input tokens at the level of about 35–40% compared to the baseline approach (except for the first turn).
In turn, LLM summarization caused a clear drop in the number of tokens after the 10th turn, reaching a reduction of around 93% in later interactions.
The hybrid approach combined both effects, providing a reduction in the number of tokens both before and after performing summarization.
Conclusions
The way AI agents manage context is an architectural decision that has a direct impact on cost, response time, quality, and scalability of the solution, and it should depend on the nature of the application.
Passing the full conversation history is not a scalable approach — as the interaction grows, the number of tokens increases, which leads to higher costs and longer response times without proportional improvement in quality.
Observation masking is a simple and effective mechanism for reducing tokens, which helps limit the “noise” coming from tool responses. However, it does not solve the problem of growing context in the long term.
LLM summarization makes it possible to control the context length by compressing it. In the analyzed scenario, it allowed reducing the number of tokens by over 90%, at the cost of an additional model call and potential loss of some information.
The best results are achieved with a hybrid approach, which combines both mechanisms — it reduces the number of tokens in every interaction (except the first one) and controls long-term context growth.
Your turn
How are you handling context growth in your AI agents?
Have you tried masking or summarization in production?
If you’re working with AI systems, feel free to share your approach — I’m always curious how others solve this.
And if this kind of content is useful to you, consider following for more articles on .NET, AI, and technical decisions.



