jguillaumesio

Checking your AI's context window: what fits and what doesn't

How to verify the actual context window of an LLM, understand external vs internal context, and avoid the silent truncation problem.

Context window budget: where your 128K tokens actually go

Every LLM has a context window, the maximum amount of text it can process in a single request. But the advertised number and the effective number are often very different. And if you don’t understand the difference, you’ll hit silent failures that are incredibly hard to debug.

Advertised vs Effective Context

Model specs look clean:

Model                Advertised Context
GPT-4o               128K tokens
Claude 3.5 Sonnet    200K tokens
Gemini 1.5 Pro       1M tokens
Llama 3.1 70B        128K tokens

But “context window” is not a single number. It’s the sum of:

Total Context = System Prompt + Conversation History + Tool Definitions + Tool Results + Current Message + Output Budget

If your system prompt is 2,000 tokens, your tool definitions are 3,000 tokens, and you want 4,000 tokens of output, your effective input budget on a 128K model is:

128,000 - 2,000 - 3,000 - 4,000 = 119,000 tokens for conversation history, tool results, and the current message

That’s before you’ve sent a single user message.
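The subtraction above can be wrapped in a small helper so the number is recomputed whenever a component changes. This is a minimal sketch; the function name and parameters are illustrative and not part of any library.

```python
def effective_input_budget(
    window: int,
    system_prompt: int,
    tool_definitions: int,
    output_reserved: int,
) -> int:
    """Return the tokens left for history, tool results, and the current message."""
    fixed = system_prompt + tool_definitions + output_reserved
    if fixed > window:
        raise ValueError(f"fixed overhead ({fixed}) exceeds the window ({window})")
    return window - fixed


# The example from the text: 128K window, 2K system prompt, 3K tools, 4K output.
print(effective_input_budget(128_000, 2_000, 3_000, 4_000))  # 119000
```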

The Silent Truncation Problem

Here’s what makes this dangerous: many LLM stacks don’t raise an error when you exceed the context window. Some hosted APIs reject the oversized request outright, but local runtimes, agent frameworks, and history-trimming middleware often truncate silently, usually from the beginning of the conversation. Your carefully crafted system prompt? Gone. The project context you loaded at the start? Gone. The agent just silently loses its memory and starts producing garbage.

You won’t get an error. You’ll get confident, wrong answers.
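One way to turn a silent truncation into a loud failure is a pre-flight check that refuses to send an oversized request. The sketch below assumes you supply your own token counter; `rough_count` (about four characters per token) is a crude stand-in heuristic, not a real tokenizer, and `ContextOverflowError` and `check_fits` are hypothetical names.

```python
from typing import Callable, Iterable


class ContextOverflowError(RuntimeError):
    """Raised before sending a request that would not fit the window."""


def rough_count(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return (len(text) + 3) // 4


def check_fits(
    parts: Iterable[str],
    window: int,
    output_reserved: int,
    count: Callable[[str], int] = rough_count,
) -> int:
    """Return the input token count, or raise if input + reserved output overflows."""
    used = sum(count(part) for part in parts)
    if used + output_reserved > window:
        raise ContextOverflowError(
            f"{used} input + {output_reserved} reserved output > {window} window"
        )
    return used
```

Calling `check_fits([system_prompt, history, user_message], window=128_000, output_reserved=4_000)` right before the API call means an overflow surfaces as an exception in your logs rather than as a confused answer.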

How to Verify Context Window

The simplest test: send increasingly long inputs and check when quality degrades or truncation occurs. Start by measuring how many tokens your payload actually uses:

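from pathlib import Path

import tiktoken

# tiktoken only covers OpenAI models; gpt-4o uses a different encoding than gpt-4,
# so count with the model you actually call.
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with the encoding tiktoken maps to `model`."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Test with your actual payload
system_prompt = Path("system_prompt.txt").read_text()
tool_defs = Path("tool_definitions.json").read_text()
user_message = Path("task.txt").read_text()

# Approximate: chat APIs add a few tokens of per-message framing on top of this.
total = count_tokens(system_prompt) + count_tokens(tool_defs) + count_tokens(user_message)
print(f"Total input tokens: {total}")
print(f"Remaining in a 128K window (history + output): {128000 - total}")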

But token counting isn’t enough. You need to verify semantic retention: does the model actually remember what you put in the context?

The Needle-in-a-Haystack Test

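# Place a unique fact at a specific position in a long context.
# `llm` is a placeholder for your own client wrapper; the filler text is arbitrary.
needle = "The secret project codename is BLUEFIN-7."
filler = ["The quarterly report was filed on time."] * 5000

def generate_long_context(position: float, needle: str) -> str:
    """Insert the needle `position` (0.0 to 1.0) of the way through the filler."""
    index = int(len(filler) * position)
    return " ".join(filler[:index] + [needle] + filler[index:])

haystack = generate_long_context(position=0.5, needle=needle)  # 50% through

response = llm.complete(
    system="Answer based only on the context provided.",
    context=haystack,
    question="What is the secret project codename?"
)

assert "BLUEFIN-7" in response  # Does it actually remember?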

Run this at different context lengths and positions. You’ll find that:

  • Beginning and end of context are remembered well
  • Middle of long contexts gets “lost” (the lost-in-the-middle problem)
  • Effective retention is often 20-40% less than the advertised window
  • Different models degrade at different rates
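Running the test at several lengths and positions is easier with a small sweep harness. The sketch below is self-contained so it runs offline: `ask` is whatever function calls your model, and `fake_model` is a stand-in that only "sees" the first and last 2,000 characters. It exists to exercise the harness and says nothing about how any real model behaves.

```python
from typing import Callable

NEEDLE = "The secret project codename is BLUEFIN-7."
FILLER = "The quarterly report was filed on time."  # arbitrary filler sentence


def build_haystack(n_sentences: int, position: float) -> str:
    """Place NEEDLE `position` (0.0 to 1.0) of the way through n filler sentences."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * position), NEEDLE)
    return " ".join(sentences)


def sweep(
    ask: Callable[[str, str], str],
    lengths: list[int],
    positions: list[float],
) -> dict[tuple[int, float], bool]:
    """Return {(length, position): True if the answer contained the needle}."""
    results = {}
    for n in lengths:
        for pos in positions:
            answer = ask(build_haystack(n, pos), "What is the secret project codename?")
            results[(n, pos)] = "BLUEFIN-7" in answer
    return results


def fake_model(context: str, question: str) -> str:
    # Stand-in for a real API call: drops everything except the two ends.
    visible = context[:2000] + context[-2000:]
    return "BLUEFIN-7" if "BLUEFIN-7" in visible else "I don't know."


if __name__ == "__main__":
    for key, found in sweep(fake_model, [50, 500], [0.0, 0.5, 1.0]).items():
        print(key, "retained" if found else "LOST")
```

Replace `fake_model` with a function that sends the context and question to your provider, then plot the resulting grid to see where retention starts to drop.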

External Context: The Hidden Tax

When you use RAG (Retrieval-Augmented Generation), vector search, or tool-based context loading, there’s an additional cost most people miss:

Your RAG pipeline:
  1. User asks question                    → 50 tokens
  2. Embed question for vector search      → API call (not free)
  3. Retrieve top-5 chunks                 → 5 × 500 = 2,500 tokens
  4. Inject chunks into prompt             → 2,500 tokens of context
  5. Model generates answer                → 500 tokens

Total context used: ~3,050 tokens
Total API calls: 2 (embedding + completion)

Every retrieved chunk costs tokens. If your retrieval is imprecise and returns 10 chunks instead of 5, you’ve doubled the retrieval cost for that turn (5,000 tokens instead of 2,500).
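A simple guard is to cap retrieval by token budget instead of by a fixed top-k. The sketch below assumes chunks arrive sorted by relevance with their token counts already measured; `select_chunks` is an illustrative name, not part of any retrieval library.

```python
def select_chunks(chunks: list[tuple[str, int]], budget: int) -> list[str]:
    """Keep the highest-ranked chunks that fit within `budget` tokens.

    `chunks` is a list of (text, token_count) pairs, already sorted by relevance.
    """
    selected, used = [], 0
    for text, tokens in chunks:
        if used + tokens > budget:
            break  # stop at the first chunk that would overflow the budget
        selected.append(text)
        used += tokens
    return selected


# Ten 500-token chunks against the 2,500-token budget from the pipeline above.
retrieved = [(f"chunk-{i}", 500) for i in range(10)]
print(select_chunks(retrieved, budget=2_500))
# ['chunk-0', 'chunk-1', 'chunk-2', 'chunk-3', 'chunk-4']
```

Stopping at the first overflow keeps the ranking intact; skipping ahead to smaller, lower-ranked chunks would fill the budget more tightly but with less relevant text.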

Practical Guidelines

1. Budget your context explicitly

CONTEXT_BUDGET = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "conversation_history": 40000,
    "retrieved_context": 10000,
    "output_reserved": 4000,
    "total": 128000,
}
# Remaining for user input: 69,000 tokens
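A budget written down as a dict can also be checked by code, so the "remaining" figure never drifts from the allocations. In this sketch the two helper names and the sample measurements are hypothetical; the budget values are the ones from the text.

```python
CONTEXT_BUDGET = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "conversation_history": 40000,
    "retrieved_context": 10000,
    "output_reserved": 4000,
    "total": 128000,
}


def remaining_for_user_input(budget: dict[str, int]) -> int:
    """Subtract every allocated component from the total window."""
    allocated = sum(v for name, v in budget.items() if name != "total")
    return budget["total"] - allocated


def over_budget(actual: dict[str, int], budget: dict[str, int]) -> dict[str, int]:
    """Return {component: overshoot} for components that exceed their allocation."""
    return {
        name: used - budget[name]
        for name, used in actual.items()
        if name in budget and used > budget[name]
    }


print(remaining_for_user_input(CONTEXT_BUDGET))  # 69000
# Hypothetical measurements for one request:
print(over_budget({"system_prompt": 1800, "retrieved_context": 12500}, CONTEXT_BUDGET))
# {'retrieved_context': 2500}
```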

2. Monitor actual usage, not estimates

Most APIs return token usage in the response:

{
  "usage": {
    "prompt_tokens": 45230,
    "completion_tokens": 1200,
    "total_tokens": 46430
  }
}

Log this. Alert when usage exceeds 80% of your window.
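The logging and alerting step might look like the sketch below. It assumes the usage block has the field names shown in the example response; other providers may name these fields differently, and `usage_fraction` is an illustrative helper rather than part of any SDK.

```python
import logging

logger = logging.getLogger("context_usage")


def usage_fraction(
    usage: dict[str, int], window: int = 128_000, threshold: float = 0.8
) -> float:
    """Log token usage and warn when it crosses `threshold` of the window."""
    fraction = usage["total_tokens"] / window
    logger.info(
        "context usage: %d/%d tokens (%.1f%%)",
        usage["total_tokens"], window, fraction * 100,
    )
    if fraction > threshold:
        logger.warning("context usage above %.0f%% of the window", threshold * 100)
    return fraction


response = {
    "usage": {"prompt_tokens": 45230, "completion_tokens": 1200, "total_tokens": 46430}
}
print(f"{usage_fraction(response['usage']):.1%}")  # 36.3%
```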

3. Compress aggressively

  • Summarize old conversation turns instead of keeping them verbatim
  • Use structured formats (JSON) instead of prose for tool results
  • Deduplicate retrieved chunks before injecting them
  • Truncate file contents to relevant sections, not entire files
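Two of these steps, deduplicating chunks and collapsing old turns, can be sketched in a few lines. The `summarize` argument is a placeholder for whatever produces the summary (typically another model call); the lambda in the usage lines is a stub so the example runs on its own.

```python
import hashlib
from typing import Callable


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks whose lowercased, whitespace-normalized text was already seen."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha256(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def compact_history(
    turns: list[str], keep_recent: int, summarize: Callable[[list[str]], str]
) -> list[str]:
    """Replace all but the last `keep_recent` turns with a single summary turn."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    return [summarize(turns[:split])] + turns[split:]


print(dedupe_chunks(["Hello  world", "hello world", "other"]))
# ['Hello  world', 'other']
print(compact_history(["a", "b", "c", "d"], 2, lambda old: f"[summary of {len(old)} earlier turns]"))
# ['[summary of 2 earlier turns]', 'c', 'd']
```

Exact-match hashing only catches verbatim repeats; near-duplicate chunks from overlapping windows would need a similarity check instead.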

4. Test with your real payloads

Don’t trust benchmark numbers. Test with your actual system prompts, your actual tool definitions, and your actual conversation patterns. The degradation curve is specific to your use case.

The Bottom Line

Context window is not a spec sheet number. It’s a budget, and every component of your system spends from it. If you’re not tracking actual usage and testing retention, you’re flying blind.

The most expensive token is the one you waste on context the model never actually uses.