
Proposal for a Robust, Standardized Benchmark for Long-Term AI Memory Systems

Nearly every AI memory system today publishes scores on benchmarks that fail to measure what they claim to evaluate. The existing benchmarks are flawed in ways that make fair comparison between systems close to impossible.

For instance, an audit of the LoCoMo benchmark revealed that 6.4% of its answer key is factually incorrect (99 errors in 1,540 questions). Furthermore, the LLM judge accepted 63% of intentionally wrong answers, and 56% of per-category system comparisons were statistically indistinguishable from noise.

Another example, LongMemEval-S, uses approximately 115,000 tokens per question. That volume fits entirely within the context window of every frontier model today, which makes it a test of context windows rather than a true test of memory.

Moreover, each system ingests data its own way, uses its own answer-generation prompts, and sometimes its own judge configuration, yet publishes scores in the same table as if a common methodology existed. The benchmark dispute between Mem0 and Zep illustrates the problem perfectly: two companies testing the same systems arrived at wildly different numbers.

To address these shortcomings and establish a real benchmark for long-term AI memory systems, we propose a new set of design principles:

1. Corpus Must Exceed Context Windows: The total corpus should run 1 to 2 million tokens. This scale is large enough to force genuine memory retrieval while remaining economically feasible for independent researchers to run.

2. Corpus Must Model Real Agent Usage: Content should feature multi-session conversations between one person and an AI assistant over approximately six months. This should encompass work projects, personal preferences, corrections, and evolving facts, rather than disconnected chit-chat between strangers.

3. Ingestion is the System's Problem, But Must Be Disclosed: Each system is free to ingest data however it prefers, but it must publish the ingestion method, model used, embedding model, total cost, and total time (one possible disclosure format is sketched after this list).

4. Answer Generation: Standardized OR Fully Disclosed: A "Standard Track" should mandate a prescribed model and prompt, using single-shot generation where the only variable is what memory retrieves, ensuring an apples-to-apples comparison (sketched after this list). An "Open Track" allows systems to use any method, provided it is fully disclosed; its results must be reported separately, never mixed with Standard Track scores.

5. Equal Statistical Power Across Categories: Each category should include 400 questions. LoCoMo's smallest category, with only 96 questions, has Wilson score margins of error so wide that any observed score difference is effectively noise (quantified in the sketch after this list).

6. Human-Verified Ground Truth: The target error rate should be less than 1%. This will be achieved through model council pre-screening, crowd-sourced review with bounties, and expert tiebreakers.

7. Adversarially Validated Judge: Prior to launch, a set of intentionally wrong answers must be generated, and the judge must reject over 95% of them. This screens out judges that cannot distinguish vague, topically adjacent answers from correct ones (a validation harness is sketched after this list).

8. Abstention is Scored: If an answer is present in the corpus but the system responds "I don't know," it receives 0.1. A confidently wrong answer receives 0.0. A system that knows the limits of its own memory should outperform one that hallucinates (see the scoring sketch after this list).

9. Multiple Scoring Dimensions: Accuracy alone hides too much. The scorecard should include accuracy (Standard and Open Tracks), retrieval precision (tokens per question), and latency (P50/P90), as in the scorecard sketch after this list.
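
To make several of these principles concrete, the sketches below use Python with hypothetical names and values; none of them is prescribed by the proposal itself. First, one possible shape for the ingestion disclosure required by principle 3 (every field name and value here is illustrative):

```python
# Hypothetical disclosure manifest a system would publish alongside its score.
ingestion_disclosure = {
    "system": "example-memory-system",  # illustrative name
    "ingestion_method": "per-session summarization + fact extraction",
    "ingestion_model": "gpt-4o-mini",   # whatever model the system actually used
    "embedding_model": "text-embedding-3-small",
    "total_cost_usd": 14.20,            # made-up figures
    "total_time_minutes": 38,
}
```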
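
For principle 4, a minimal sketch of what a Standard Track harness could look like, assuming `llm` is a wrapper around the prescribed model; the prompt text is invented for illustration, not the benchmark's actual prompt:

```python
# Fixed prompt and fixed model: across systems, the only variable is
# `memories`, i.e. what each memory system retrieved for the question.
STANDARD_PROMPT = (
    "Answer the question using only the retrieved memories below. "
    "If they do not contain the answer, reply exactly: I don't know.\n\n"
    "Memories:\n{memories}\n\nQuestion: {question}\nAnswer:"
)

def standard_track_answer(llm, memories: str, question: str) -> str:
    # Single-shot generation: no tools, no retries, no system-specific prompting.
    return llm(STANDARD_PROMPT.format(memories=memories, question=question))
```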
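
Principle 5's claim about margins of error is easy to check. A minimal sketch of the Wilson score interval, comparing LoCoMo's 96-question category with the proposed 400 questions for a system scoring around 70%:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

for n in (96, 400):
    lo, hi = wilson_interval(round(0.70 * n), n)
    print(f"n={n}: 95% CI [{lo:.3f}, {hi:.3f}], half-width ~{(hi - lo) / 2:.3f}")
```

At n=96 the interval is roughly ±9 points, so two systems five points apart are statistically indistinguishable; at n=400 it tightens to about ±4.5 points.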
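
For principle 7, a sketch of the pre-launch judge audit. The `judge` callable stands in for any LLM-judge wrapper (a hypothetical interface), and `adversarial_cases` would hold the intentionally wrong answers, e.g. topically adjacent but factually incorrect ones:

```python
def validate_judge(judge, adversarial_cases, min_rejection_rate=0.95):
    """Return (passed, rejection_rate) for a judge tested on known-wrong answers.

    judge(question, gold_answer, candidate) -> bool  # True = accepted as correct
    adversarial_cases: list of (question, gold_answer, wrong_answer) triples
    """
    rejected = sum(
        not judge(question, gold, wrong)
        for question, gold, wrong in adversarial_cases
    )
    rate = rejected / len(adversarial_cases)
    return rate >= min_rejection_rate, rate
```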
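
Finally, principles 8 and 9 combine naturally into a per-question record and a scorecard. The field names and aggregation below are one possible reading of the proposal, not its definitive form:

```python
from dataclasses import dataclass
from statistics import mean, quantiles

@dataclass
class QuestionResult:
    correct: bool          # judged correct against ground truth
    abstained: bool        # the system answered "I don't know"
    retrieved_tokens: int  # tokens the memory system put in context
    latency_s: float       # end-to-end time for this question

def score(r: QuestionResult) -> float:
    # Principle 8: honest abstention (0.1) beats confident hallucination (0.0).
    if r.correct:
        return 1.0
    return 0.1 if r.abstained else 0.0

def scorecard(results: list[QuestionResult]) -> dict:
    # Principle 9: report more than accuracy. Assumes at least two results.
    pct = quantiles((r.latency_s for r in results), n=100)
    return {
        "mean_score": mean(score(r) for r in results),
        "tokens_per_question": mean(r.retrieved_tokens for r in results),
        "latency_p50_s": pct[49],
        "latency_p90_s": pct[89],
    }
```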
