MemPalace Benchmark Analysis: Is the 100% Score Real?

Published April 7, 2026 · Independent analysis · Not affiliated with MemPalace

TL;DR

  • The 96.6% raw score (zero API) is credible and beats all local-only competitors we could find.
  • The 100% hybrid score is technically real but comes with significant caveats that aren't always mentioned in the marketing.
  • Some benchmark methodology choices have legitimate concerns, particularly around LoCoMo and the “lossless” compression claims.
  • The project has real technical merit underneath the aggressive marketing. The verbatim-storage architecture is a genuinely novel approach.

The Claims

MemPalace's official benchmark results, as published in their repository's BENCHMARKS.md:

  • 100% · LongMemEval (hybrid)
  • 96.6% · LongMemEval (raw / zero API)
  • 92.9% · ConvoMem
  • 88.9% · LoCoMo (100% w/ reranking)
  • 80.3% · MemBench

All scores are self-reported by the MemPalace team.

LongMemEval Deep Dive

What is LongMemEval?

LongMemEval is an academic benchmark from UC Santa Barbara (arXiv:2410.10813) designed to evaluate long-term conversational memory. It consists of 500 questions spanning five categories: single-session fact recall, multi-session fact recall, temporal reasoning, multi-hop queries, and knowledge updates. The primary metric is R@5 — whether the correct memory appears in the top 5 retrieved results.
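In code, R@5 is just a hit test over the top five retrieved results, averaged across questions. A minimal sketch with toy data (not the official harness):

```python
def recall_at_k(retrieved_ids, gold_id, k=5):
    """R@k: 1 if the gold memory appears in the top-k retrieved results, else 0."""
    return 1 if gold_id in retrieved_ids[:k] else 0

def benchmark_score(results, k=5):
    """Average R@k over all questions; results is a list of (retrieved_ids, gold_id)."""
    hits = sum(recall_at_k(retrieved, gold, k) for retrieved, gold in results)
    return hits / len(results)

# Toy run: 483 of 500 questions place the gold memory in the top 5.
results = [(["gold", "b", "c"], "gold")] * 483 + [(["x", "y", "z"], "gold")] * 17
print(round(benchmark_score(results) * 100, 1))  # → 96.6
```

Note that R@5 only asks whether the right memory was surfaced, not whether the downstream LLM actually used it correctly.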

The 96.6% Raw Score: Credible

The raw score was achieved with zero API calls — no cloud LLM reranking, no external services. MemPalace's verbatim-storage approach with ChromaDB vector search retrieved correct memories for 483 of 500 questions. This is the highest published local-only result we've found for LongMemEval, and it aligns with the theoretical advantage of storing full conversation text rather than extracted summaries. The evaluation code is published in the repository.
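The retrieval loop behind that number can be sketched in a few lines. This toy uses word-overlap Jaccard similarity in place of ChromaDB's dense embeddings, but the shape is the same idea: store full turns unmodified, rank by similarity, return the top hits.

```python
class VerbatimStore:
    """Minimal sketch of verbatim memory. A real setup would embed each turn with
    a sentence encoder and query ChromaDB; Jaccard word overlap stands in here."""
    def __init__(self):
        self.memories = []  # full original text, never summarized

    def add(self, text):
        self.memories.append(text)

    def search(self, query, top_k=5):
        q = set(query.lower().split())
        def sim(memory):
            w = set(memory.lower().split())
            return len(q & w) / len(q | w)
        return sorted(self.memories, key=sim, reverse=True)[:top_k]

store = VerbatimStore()
store.add("User: my flight to Osaka leaves on March 3rd")
store.add("User: I prefer oat milk in my coffee")
print(store.search("when does the osaka flight leave", top_k=1)[0])
```

Because nothing is extracted or discarded at write time, there is no earlier LLM decision that can throw away the detail a future query will need.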

The 100% Hybrid Score: Engineered

The path from 96.6% to 100% raises valid methodological questions. The hybrid mode uses Claude Haiku to rerank results, which is a legitimate technique — many production systems use multi-stage retrieval. However, the final 3.4% improvement came from analyzing the 17 specific failing questions and implementing targeted patches.

This is the benchmark equivalent of teaching to the test: the team identified which questions the system got wrong, engineered fixes for those specific questions, and then reported a score on the same test. In academic settings, this would typically require disclosure and a separate held-out evaluation.
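The hybrid pipeline itself is conventional two-stage retrieval. In this sketch both scorers are crude stand-ins: the real system uses vector search for stage 1 and Claude Haiku as the stage-2 reranker, and the corpus below is invented.

```python
def vector_retrieve(query, corpus, n=20):
    """Stage 1 (recall-oriented): character-trigram overlap stands in for
    dense vector search. Cast a wide net of n candidates."""
    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    q = trigrams(query)
    return sorted(corpus, key=lambda d: len(q & trigrams(d)), reverse=True)[:n]

def rerank(query, candidates, top_k=5):
    """Stage 2 (precision-oriented): exact word overlap stands in for the
    LLM judgment call that picks the best few candidates."""
    q = set(query.lower().split())
    return sorted(candidates, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:top_k]

corpus = [
    "the cat sat on the mat",
    "flight from tokyo, not osaka",
    "osaka flight confirmed for march 3rd",
]
candidates = vector_retrieve("osaka flight date", corpus, n=2)
print(rerank("osaka flight date", candidates, top_k=1))
```

Multi-stage retrieval like this is standard practice; the methodological question is not the pipeline but how the last few failures were patched.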

Fair Context: The Held-Out Test

To their credit, the MemPalace team did create a held-out test set. They split the 500 questions, trained on 50, and evaluated on the remaining 450. The held-out score: 98.4%. This is arguably the most honest number in the entire analysis — and it's still excellent. They also disclosed the patch process rather than quietly folding it into a release.
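The protocol is simple to reproduce: shuffle the question set, carve off a small dev slice for designing fixes, and score only on the remainder. The seed and shuffle details below are assumptions for illustration, not MemPalace's exact procedure.

```python
import random

def held_out_split(questions, dev_size=50, seed=0):
    """Split a benchmark into a dev set (used to design fixes) and a held-out
    set (used only for the reported score), so patches can't overfit the result."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = questions[:]
    rng.shuffle(shuffled)
    return shuffled[:dev_size], shuffled[dev_size:]

dev, held_out = held_out_split(list(range(500)))
print(len(dev), len(held_out))  # → 50 450
```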

LoCoMo Concerns

The top_k=50 Problem

Penfield Labs, a competitor in the AI memory space, published a critique highlighting that MemPalace's LoCoMo evaluation used top_k=50 while the typical candidate pool contains a maximum of 32 items.

When top_k exceeds the candidate pool, you retrieve everything. At that point, you're not testing the retrieval system at all — you're testing whether Claude can find an answer when given all the text. That's reading comprehension, not memory retrieval.

This matters because LoCoMo is specifically designed to evaluate retrieval quality. If you sidestep retrieval, the score doesn't tell you what it's supposed to tell you.
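The failure mode is easy to demonstrate: once top_k meets or exceeds the candidate pool size, the "retrieval" step is a no-op and every candidate is handed to the LLM.

```python
def retrieve(candidates, scores, top_k):
    """Return the top_k highest-scoring candidates."""
    ranked = sorted(candidates, key=scores.get, reverse=True)
    return ranked[:top_k]

pool = [f"item_{i}" for i in range(32)]       # LoCoMo-style pool of up to 32 items
scores = {c: i for i, c in enumerate(pool)}    # arbitrary relevance scores

selective = retrieve(pool, scores, top_k=5)
everything = retrieve(pool, scores, top_k=50)  # top_k > pool size

print(len(selective), len(everything))  # → 5 32
print(set(everything) == set(pool))     # → True: no filtering happened at all
```

With top_k=50 over a 32-item pool, any retrieval system, good or bad, returns the identical candidate set, so the score cannot distinguish between them.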

Fair Context

The MemPalace team's own BENCHMARKS.md acknowledges the top_k issue and notes it as a known limitation. They also note that the LoCoMo benchmark itself has structural issues that make fair comparison difficult across systems with different architectures. Penfield Labs, as a competitor, has their own incentives in this critique.

AAAK Compression Tradeoff

“Lossless” with an Asterisk

MemPalace claims approximately 30x compression using their custom AAAK dialect. They call it “lossless” because any LLM can read the compressed output and reconstruct the original meaning without a decoder.

In strict data-compression terms, this is accurate — the semantic content is preserved. But in retrieval terms, using AAAK reduces LongMemEval accuracy from 96.6% to 84.2%. That's a 12.4 percentage point drop. The compressed text changes the vector embeddings enough to degrade search quality.

Calling this “lossless” without qualifying the retrieval impact is misleading. It's lossless for human understanding but lossy for machine retrieval.
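The mechanism behind the drop is easy to illustrate: compression rewrites surface tokens, and surface tokens are what embeddings are computed from. In this toy, bag-of-words cosine stands in for a dense encoder, and the shorthand string is invented for illustration, not actual AAAK output.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words vector; real systems use a dense sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

verbatim   = "the user's flight to osaka departs on march 3rd"
compressed = "usr flt→osaka dep mar3"   # hypothetical shorthand, NOT real AAAK
query      = "when does the flight to osaka leave"

print(cosine(embed(query), embed(verbatim)) > cosine(embed(query), embed(compressed)))
```

An LLM reading the compressed line can reconstruct the meaning, but the query no longer shares surface features with it, so similarity search ranks it lower. That is exactly the "lossless for humans, lossy for retrieval" gap.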

Fair Context

84.2% with AAAK compression still matches or beats Mem0 (~85%) and Zep (~82%) at their full accuracy — while using 30x less storage and zero API costs. The compression is optional, and users can choose the verbatim 96.6% mode when accuracy matters more than disk space. This is a legitimate engineering tradeoff, just one that should be communicated more clearly.

README vs Codebase Gaps

GitHub Issue #27: Claims vs Code

Developer Leonard Lin filed GitHub Issue #27 documenting several discrepancies between the README and the actual codebase. Notably:

  • Contradiction detection: The README describes automatic contradiction detection, but a code search found zero occurrences of “contradict” in the source.
  • Some documented features appeared to be partially implemented or not yet present in the published code.
  • Several architectural claims in the README did not match the code's actual structure.
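Checks of this kind are trivial to reproduce. A small sketch of the keyword scan behind the contradiction-detection finding (the path and extension choices are assumptions):

```python
from pathlib import Path

def grep_sources(root, keyword, exts=(".py",)):
    """Count case-insensitive occurrences of a keyword across source files —
    the kind of README-vs-code check documented in Issue #27."""
    hits = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            hits += path.read_text(errors="ignore").lower().count(keyword.lower())
    return hits

# e.g. grep_sources("mempalace/", "contradict") == 0 would reproduce the finding
```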

Fair Context

This is a v3.0 open-source project that grew rapidly. README drift is common in fast-moving repos — documentation often leads or lags implementation. Some features may have been planned, in development, or present in a private branch. The issue was filed constructively, and the project maintainers have been responsive to community feedback. That said, README claims are effectively marketing, and marketing should match reality.

How It Actually Compares

Honest comparison of published benchmark scores with caveats. Scores marked with “~” are estimates based on published information and may not reflect identical test conditions.

  • MemPalace · LongMemEval: 96.6% raw, 100% hybrid · ConvoMem: 92.9% · Local: Yes · Cost: $0 · Caveats: 100% score was post-patch; AAAK compression reduces accuracy
  • Supermemory · LongMemEval: ~70% raw, no hybrid score · ConvoMem: ~40% · Local: No · Cost: $9-29/mo · Caveats: limited public benchmark data; cloud-only
  • Mem0 · LongMemEval: ~85% raw, no hybrid score · ConvoMem: 30-45% · Local: Partial · Cost: $0-249/mo · Caveats: extracts facts, discards original text; cloud features require paid tier
  • Zep · LongMemEval: ~82% raw, no hybrid score · ConvoMem: ~50% · Local: No · Cost: usage-based · Caveats: enterprise-focused; limited open benchmarks
  • Letta · LongMemEval: ~78% raw, no hybrid score · ConvoMem: ~55% · Local: Yes · Cost: $0 · Caveats: agent framework, not pure memory; different architecture goals

Sources: MemPalace BENCHMARKS.md, Mem0 documentation, Zep documentation, Letta evaluations. Scores for competitors are approximations based on available public data and may not reflect latest versions.

Our Bottom Line

MemPalace is a genuinely innovative project with a sound architectural principle. The insight that verbatim storage outperforms LLM-extracted summaries for retrieval tasks is not obvious, and the data backs it up. Storing everything and letting vector search find what's relevant is a simpler, more robust approach than having an LLM decide what to remember.

The 96.6% raw score is the real headline. It is the highest published local-only LongMemEval result we could find, achieved with zero API costs and fully open-source code. If you're evaluating AI memory systems, this is the number that should matter to you — not the 100%.

The 100% hybrid score, while technically measured, was engineered through a process that most benchmark-literate engineers would consider overfitting. The held-out 98.4% score is more credible and still excellent. We'd like to see the project lead with this number instead.

Benchmark engineering is common in this industry. Mem0 doesn't publish full evaluation code. Zep's benchmarks are limited. Most AI memory startups make claims that are difficult to independently verify. MemPalace, for all its marketing aggressiveness, publishes their evaluation code and acknowledges limitations in their documentation. That's more transparent than the industry norm.

Finally, let's acknowledge the elephant in the room: the celebrity angle. When Milla Jovovich's name is on a GitHub repo, it attracts 10x the scrutiny that an equivalent project from an unknown developer would get. Some of that scrutiny is healthy. Some of it is people who decided the project couldn't be legitimate before they read the code. The technical merit exists independently of who made it.

Frequently Asked Questions

Is MemPalace's 100% LongMemEval score legitimate?
The 100% hybrid score is technically real — it was measured on the full 500-question LongMemEval dataset. However, it was achieved after targeted patches for the 17 specific failing questions, then retested on the same set. The 96.6% raw score (zero API) is more representative of real-world performance. A held-out 450-question test scored 98.4%, which is arguably the most honest number.
What is LongMemEval and why does it matter?
LongMemEval is an academic benchmark with 500 questions testing long-term memory across five categories: single-session fact recall, multi-session fact recall, temporal reasoning, multi-hop queries, and knowledge updates. It uses R@5 recall as its primary metric. It's considered the most comprehensive test of AI memory systems, published by UC Santa Barbara researchers.
Why do some people criticize MemPalace's benchmark results?
There are three main criticisms: (1) The 100% LongMemEval score was achieved by fixing specific failing questions and retesting on the same set. (2) The LoCoMo benchmark used top_k=50, which exceeds the candidate pool size, effectively testing the LLM rather than the retrieval system. (3) AAAK compression is marketed as 'lossless' but reduces retrieval accuracy from 96.6% to 84.2%. The MemPalace team has acknowledged most of these concerns in their documentation.
How does MemPalace compare to Mem0 and Zep on benchmarks?
On LongMemEval, MemPalace scores 96.6% (raw) vs Mem0's ~85% and Zep's ~82%. On ConvoMem, MemPalace scores 92.9% vs Mem0's estimated 30-45%. However, direct comparison is complicated because each system was tested under different conditions and configurations.
Should I trust AI memory benchmark scores in general?
Approach all benchmark scores with healthy skepticism. Benchmark engineering is common across the AI industry — teams optimize specifically for test sets. Look for held-out test results, independent reproductions, and whether the benchmark methodology is disclosed. MemPalace is more transparent than most by publishing their full evaluation code and acknowledging limitations.

Sources & Further Reading

Want to verify these numbers yourself? The evaluation code is open source.