Tags: LLMs, prompting, study, review

A short review of the study "How Many Instructions Can LLMs Follow at Once?"

Written by Lucian Ghinda

While trying to determine how many instructions are enough in an AGENTS.md file, I found the following paper:

Jaroslawicz, D., Whiting, B., Shah, P., & Maamari, K. (2025). How Many Instructions Can LLMs Follow at Once? arXiv:2507.11538

Before reading what I extracted from it, I think it is important to understand the limitations defined by the authors:

Our study has several important limitations. We focus exclusively on professional report generation with simple keyword-inclusion instructions, which may not generalize to other task types or domains, or more complex instruction types. Our business vocabulary from SEC 10-K filings limits insights into other instruction formats common in real applications. Results are specific to English-language, business-domain instruction following, with cross-lingual performance and other paradigms requiring future investigation.

I still think the study is relevant for developers when writing guidelines for LLM agents or when working with an agent and writing more complex prompts.

Findings from the study “How Many Instructions Can LLMs Follow at Once?”

There is a bias toward earlier instructions and lower accuracy as the instruction count approaches 500.

We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions. Our analysis reveals model size and reasoning capability to correlate with 3 distinct performance degradation patterns, bias towards earlier instructions, and distinct categories of instruction-following errors.

Be conservative with the total number of instructions. Even under controlled conditions, performance degrades as the number of instructions increases.

Limit the context provided in AGENTS.md or CLAUDE.md files. One way to do this while still providing necessary information is to use progressive disclosure.

Fewer instructions are better.

Threshold decay: Performance remains stable until a threshold, then transitions to a different (steeper) degradation slope and displays increased variance. The top two models (gemini-2.5-pro, o3) demonstrate this clearly, maintaining near-perfect performance through 150 or more instructions before declining. Notably, these are both reasoning models, indicating that deliberative processing architectures provide robust instruction tracking up to critical thresholds, beyond which systematic degradation occurs.

The best reasoning models in this paper, such as gemini-2.5-pro and o3, show this threshold decay pattern. They maintain performance up to roughly 150 instructions before a systematic decline, even for simple, repetitive constraints. In the paper, the instructions were keywords to include in a business report, as tested in the IFScale benchmark.

Models exhibit a primacy effect, meaning they are generally better at satisfying instructions that appear earlier in the list.

Primacy effects refer to the tendency of models to better satisfy instructions appearing earlier versus later in the instruction list. Primacy effects display an interesting pattern across all models: they start low at minimal instruction densities, indicating almost no bias for earlier instructions, peak around 150-200 instructions, then level off or decrease at extreme densities. This mid-range peak suggests that models exhibit the most bias as they begin to struggle under cognitive load at moderate densities.

One takeaway is to place the most critical instructions at the beginning of the prompt. This bias is stronger when the model is under moderate cognitive load, around 150 to 200 instructions. However, relying on ordering is less effective at extreme densities, above 300 instructions, where models are overwhelmed and fail uniformly.

My takeaways

After reading the study, here are some takeaways I noted down, mostly for myself:

1. Be conservative with how many instructions you include in AGENTS.md/CLAUDE.md

Long instruction lists degrade model performance. Aim for under 150 instructions if possible. If you reach 200+, expect accuracy drops.

The same applies to prompting: avoid giant “superprompts.”
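To keep an eye on this, here is a minimal sketch that counts bullet or numbered items in an AGENTS.md file and warns against the 150/200 thresholds above. The file name and the assumption that each instruction sits on its own bullet or numbered line are mine, not the paper's.

```python
from pathlib import Path

# Thresholds from the takeaway above: aim for under 150, expect drops past 200.
SOFT_LIMIT = 150
HARD_LIMIT = 200

def count_instructions(path: str = "AGENTS.md") -> int:
    """Count lines that look like individual instructions (bullets or numbered items)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    count = 0
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(("-", "*", "•")) or stripped.split(".")[0].isdigit():
            count += 1
    return count

if __name__ == "__main__":
    n = count_instructions()
    if n > HARD_LIMIT:
        print(f"{n} instructions: expect accuracy drops, trim or split the file")
    elif n > SOFT_LIMIT:
        print(f"{n} instructions: approaching the degradation threshold")
    else:
        print(f"{n} instructions: within the conservative range")
```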

2. Put the most important rules first

Models consistently show a primacy effect: they are better at following instructions that appear earlier in the list. Always put your most critical requirements first.

Place safety rules and project-critical constraints in the first 10–20 items.
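As a small illustration, you can make that ordering explicit when assembling the prompt. The rules and the "critical" flag below are hypothetical:

```python
# Hypothetical rule set; the wording and the "critical" flag are illustrative only.
rules = [
    {"text": "Prefer descriptive method names.", "critical": False},
    {"text": "Never commit secrets or API keys.", "critical": True},
    {"text": "Keep commits small and focused.", "critical": False},
    {"text": "Run the test suite before opening a PR.", "critical": True},
]

# Sort critical rules to the front to benefit from the primacy effect described above.
ordered = sorted(rules, key=lambda rule: not rule["critical"])
system_prompt = "\n".join(f"- {rule['text']}" for rule in ordered)
print(system_prompt)
```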

3. Use progressive disclosure

Don’t dump everything into one big file; break guidelines into layers or steps instead. Practical ways to do this (a small sketch follows the list):

  • Split instructions by themes (coding, style, architecture).
  • Send only the section relevant to the current task.
  • Use “ask-if-needed” patterns instead of hard-coding everything upfront.
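Here is a minimal sketch of the idea, assuming a hypothetical layout with one themed instruction file per topic. The paths, task names, and theme mapping are made up for illustration:

```python
from pathlib import Path

# Hypothetical layout: one themed guideline file per topic, e.g.
# docs/agent/coding.md, docs/agent/style.md, docs/agent/architecture.md.
GUIDELINE_DIR = Path("docs/agent")

THEME_BY_TASK = {
    "refactor": "coding",
    "write-tests": "coding",
    "review-pr": "style",
    "design": "architecture",
}

def guidelines_for(task: str) -> str:
    """Return only the instruction section relevant to the current task."""
    theme = THEME_BY_TASK.get(task, "coding")
    return (GUIDELINE_DIR / f"{theme}.md").read_text(encoding="utf-8")

# The prompt now carries one themed section instead of every rule in the project.
prompt = guidelines_for("refactor") + "\n\nTask: extract the billing logic into its own class."
```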

4. Prefer fewer, better instructions

Redundant, verbose, or repetitive rules inflate the instruction count and reduce accuracy.

Simple, repetitive instructions hold up better than complex or varied ones, but even these degrade in effectiveness at high counts.

5. Use structured formats instead of long prose

Models track instructions better when the structure is consistent. Examples:

  • Use bullet points instead of paragraphs.
  • Use tables for dos/don’ts.
  • Use “If X, then Y” rules.
  • Use numbered steps.

This improves instruction recall in most cases.
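For example, the same rule set (made up here for illustration) can be turned from prose into numbered “If X, then Y” items:

```python
# Illustrative only: the same made-up rule set written as prose and as structured items.
PROSE_RULE = (
    "When you change code you should usually add tests, and if the change touches "
    "the public API you also need to update the docs, unless it is a trivial rename."
)

STRUCTURED_RULES = """\
1. If a change touches production code, then add or update tests.
2. If a change touches the public API, then update the docs.
3. If a change is only a trivial rename, then skip the docs update.
"""
```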

6. Let the model summarize your instructions for internal use

One way to reduce the number of rules is to ask the model to summarize or restate your instruction set in its own words. Make sure you read the result and check that it keeps the same spirit and intention as the original prompt.
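A minimal sketch of that workflow, using the OpenAI Python client purely as an example; the model name, file name, and prompt wording are placeholders:

```python
from pathlib import Path
from openai import OpenAI  # any chat client works; OpenAI is only an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
rules = Path("AGENTS.md").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Restate the following instructions in your own words, using as few rules "
            "as possible without dropping any constraint:\n\n" + rules
        ),
    }],
)

# Review the condensed version by hand before replacing the original file.
print(response.choices[0].message.content)
```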
