AI accuracy drops by up to 85% with bigger prompts. Context rot is real. Learn why it happens and how to fix it with Skills.MD

Here's what nobody tells you about those massive prompts everyone's bragging about: they're making your AI worse. I learned this the hard way while building production RAG systems, and the data is so shocking that I didn't believe it until I saw it destroy my own chatbot's performance.
The "Bigger Is Better" Lie We've All Been Told
For the past year, the AI industry has been locked in an arms race over prompt size. "Feed it your entire codebase!" "Analyze this 500-page document!" Every announcement promised smarter AI, better reasoning, and more accurate responses. We all bought into it—myself included.
I was building a sophisticated RAG chatbot for a client and thought: "I'll give it every single document, every code example, every piece of context. More information must mean better answers, right?" I stuffed my prompts with what amounted to a small novel's worth of documentation. The result? My chatbot started giving wrong answers 40% of the time and made up features that didn't exist anywhere in my documentation.
Then I read the 2025 paper "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," and my jaw hit the floor. The researchers measured this phenomenon, which they call "Context Rot," with brutal clarity:
AI performance drops 13.9%–85% as prompt size increases, even when the AI finds exactly the right information.
The drop still happens when the extra text is just whitespace. It still happens when you tell the AI to ignore the irrelevant parts. This isn't about finding the right info. This is about fundamental limits in how AI models process large amounts of text.
The researchers tested all the major models, and the pattern held everywhere: the model doesn't matter, and the task doesn't matter. More words = worse performance beyond a surprisingly modest threshold.
I spent two weeks trying to engineer around this. Nothing worked, because the problem isn't in the prompt: it's in how AI brains are wired.
When you feed an AI a huge prompt, it doesn't read everything with equal focus like a human would. It compresses, summarizes, loses nuance. The middle sections get "forgotten" as the AI prioritizes recent text and struggles to maintain relevance across massive documents.
Even worse, the AI starts pattern-matching across unrelated sections. I watched my chatbot hallucinate a "new API endpoint" that was actually a mashup of three different code snippets from different parts of my documentation. The AI wasn't retrieving information anymore—it was creatively remixing it based on what words appeared near each other.
This has brutal implications for anyone building with AI:
You think you're helping by stuffing your retriever's top-20 search results into the prompt? You're actually confusing the model. I saw my RAG system create "best practices" that existed nowhere in my documentation: just a blend of three different sources it couldn't distinguish between.
Better approach: Only feed it the top 2-3 most relevant results. Use reranking to improve quality, not quantity.
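Here's a minimal sketch of that flow in TypeScript. `vectorSearch` and `rerank` are placeholders for whatever retriever and reranking model you use, so treat the signatures as assumptions:

```typescript
type Chunk = { id: string; text: string; score: number };

// Hypothetical retriever and reranker; swap in your own implementations.
declare function vectorSearch(query: string, topK: number): Promise<Chunk[]>;
declare function rerank(query: string, chunks: Chunk[]): Promise<Chunk[]>;

async function buildContext(query: string): Promise<string> {
  // Cast a wide net at retrieval time...
  const candidates = await vectorSearch(query, 20);
  // ...but let a reranker judge quality, and keep only the top 3.
  const ranked = await rerank(query, candidates);
  return ranked.slice(0, 3).map((c) => c.text).join("\n---\n");
}
```

The point of the design: retrieval recall stays high, but the model only ever sees a few hundred tokens of context instead of twenty raw chunks.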
When your AI coding assistant has your entire codebase in its prompt, it starts suggesting functions that mix patterns from completely different files. A junior on my team watched Claude suggest a React component that combined state management from our Redux files with hooks from our Next.js app—creating code that ran in neither environment.
Better approach: Load only the relevant files. Use skills to teach patterns, not dump entire directories.
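As a sketch of what "load only the relevant files" can look like, here's a naive keyword heuristic. `selectRelevantFiles` and `taskKeywords` are illustrative names I made up, not part of any real assistant's API:

```typescript
import { readFileSync } from "node:fs";

// A sketch: pick only the files the current task actually touches,
// instead of dumping the whole directory into the prompt.
function selectRelevantFiles(allFiles: string[], taskKeywords: string[]): string[] {
  return allFiles
    .filter((path) => {
      const source = readFileSync(path, "utf8");
      return taskKeywords.some((kw) => path.includes(kw) || source.includes(kw));
    })
    .slice(0, 5); // hard cap: a handful of files, never the whole tree
}
```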
When your agent chains multiple tool calls, each step adds more context. After 5-6 steps, your agent is operating in a fog of its own previous actions, making increasingly erratic decisions. Our workflow automation started calling the wrong tools because it couldn't remember which step it was on.
Better approach: Clear context between steps. Use structured state management, not prompt memory.
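Here's a minimal sketch of what I mean by structured state, with a hypothetical `callModel` client: each step sees a compact, typed summary instead of the full transcript of everything that came before.

```typescript
// Each step reads and writes a small typed record,
// not an ever-growing pile of raw tool outputs.
type AgentState = {
  goal: string;
  stepsCompleted: string[]; // one-line summaries, not full transcripts
  pendingAction: string | null;
};

declare function callModel(prompt: string): Promise<string>;

async function runStep(state: AgentState, instruction: string): Promise<AgentState> {
  // The prompt carries only the compact state, never prior raw context.
  const prompt = `Goal: ${state.goal}\nDone so far: ${state.stepsCompleted.join("; ")}\nNext: ${instruction}`;
  const result = await callModel(prompt);
  return {
    ...state,
    stepsCompleted: [...state.stepsCompleted, `${instruction}: ${result.slice(0, 120)}`],
    pendingAction: null,
  };
}
```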
Feeding an AI a 200-page PDF for analysis? By page 50, it's forgotten what page 10 said. It starts drawing connections between unrelated sections and missing the document's actual thesis. I've seen it summarize legal contracts by mixing clauses from different sections into dangerous new "interpretations."
Better approach: Chunk documents logically. Analyze sections separately, then synthesize conclusions.
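A rough sketch of the chunk-then-synthesize pattern, again with a placeholder `callModel`:

```typescript
declare function callModel(prompt: string): Promise<string>;

// Analyze each section in isolation, then merge the short
// per-section notes in a final, small synthesis prompt.
async function analyzeDocument(sections: string[], question: string): Promise<string> {
  const notes = await Promise.all(
    sections.map((section) =>
      callModel(`Answer only from this section.\nQuestion: ${question}\n\n${section}`)
    )
  );
  // The synthesis step sees summaries, not the full 200 pages.
  return callModel(`Combine these section-level answers into one conclusion:\n${notes.join("\n")}`);
}
```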
The solution isn't smaller prompts—it's smarter prompt building. Here's what actually works:
Instead of loading everything upfront, load only what you need for the current task. This is why I built my Agent Skills repository. When I'm writing vector search logic, the AI loads only my Upstash patterns. When I'm building UI, it loads only React components. Relevance > quantity.
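A minimal sketch of the load-on-demand idea; the skill names, trigger heuristic, and file paths here are illustrative, not the actual layout of my repository:

```typescript
import { readFile } from "node:fs/promises";

// Map skill names to their SKILL.md files (paths are made up for the example).
const skills: Record<string, string> = {
  "vector-search": "skills/upstash-vector/SKILL.md",
  "ui-components": "skills/react-components/SKILL.md",
};

async function loadSkillFor(task: string): Promise<string | null> {
  // Load exactly one skill that matches the task, never the whole set.
  const match = Object.keys(skills).find((name) =>
    task.toLowerCase().includes(name.split("-")[0])
  );
  return match ? readFile(skills[match], "utf8") : null;
}
```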
Keep a lightweight agents.md with only non-negotiable rules. Don't document every pattern—just the boundaries the AI cannot cross. Let Skills.MD handle the deep expertise.
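As an illustration (these specific rules are invented for the example, not copied from my actual file), a lightweight agents.md might look like this:

```md
# agents.md: non-negotiable rules only

- Never commit directly to main.
- All database access goes through the repository layer.
- UI components live in /components; no inline styles.

For domain expertise (vector search, UI patterns), load the matching SKILL.md on demand.
```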
Don't dump 20 documents into the prompt. Retrieve, rank, and present only the top 2-3 most relevant pieces of context, as in the reranking sketch above. Quality, not quantity.
When you must write long prompts, put the critical instructions at the beginning and the end; the middle is where models lose information.
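As a tiny illustration of that sandwich structure (the function and field names are my own):

```typescript
// Repeat the task at both ends so it survives the "lost middle".
function sandwichPrompt(task: string, longContext: string): string {
  return [
    `TASK (read first): ${task}`,
    longContext, // the bulky part goes in the middle
    `TASK (repeated): ${task}. Answer using only the context above.`,
  ].join("\n\n");
}
```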
After implementing these changes in my RAG chatbot, the difference was immediate. The irony? By giving the AI less context but better context, it became dramatically more useful, and our AWS bill thanked us.
My Agent Skills repository is built around one critical rule: each skill loads only when relevant, never everything at once. The Skills.MD format isn't just about organization; it's about respecting fundamental limits in how AI processes text.
Every skill is designed to load only when its domain is relevant, stay small and focused, and leave the rest of the context window free. This architecture is the antidote to Context Rot.
Context Rot is the silent killer of AI performance. The research proves it. My production experience confirms it. The solution isn't more words—it's better structure.
Stop stuffing your prompts with every document you have. Start building focused, load-on-demand expertise. Your AI will thank you with better answers, lower costs, and fewer hallucinations.
The prompt size arms race is over. The winners are the ones who use less of the context window, not more of it.
Here's what I need you to do:
Test Context Rot: Compare accuracy between long vs. focused prompts
Fix it: Use Skills.MD to load only relevant expertise
Share your results: Report your threshold findings
The skills that implement this architecture are live now. Repository with context-optimized skills: github.com/gocallum/nextjs16-agent-skills
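If you want to run that comparison yourself, here's a rough harness; `callModel` and the eval set are placeholders you'd swap for your own client and data:

```typescript
declare function callModel(prompt: string): Promise<string>;

type EvalCase = { question: string; expected: string; relevantChunk: string };

// Rough A/B harness: same questions, answered once with a huge padded
// prompt and once with only the relevant chunk. Compare the hit rates.
async function compareAccuracy(cases: EvalCase[], padding: string): Promise<void> {
  let longHits = 0;
  let focusedHits = 0;
  for (const c of cases) {
    const long = await callModel(`${padding}\n${c.relevantChunk}\n\nQ: ${c.question}`);
    const focused = await callModel(`${c.relevantChunk}\n\nQ: ${c.question}`);
    if (long.includes(c.expected)) longHits++;
    if (focused.includes(c.expected)) focusedHits++;
  }
  console.log(`long: ${longHits}/${cases.length}, focused: ${focusedHits}/${cases.length}`);
}
```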
Have you noticed your AI getting worse with bigger prompts? What's your accuracy threshold? Share your experience in the comments.