MemPalace Benchmark Analysis: Is the 100% Score Real?
Published April 7, 2026 · Independent analysis · Not affiliated with MemPalace
TL;DR
- The 96.6% raw score (zero API) is credible and beats all local-only competitors we could find.
- The 100% hybrid score is technically real but comes with significant caveats that aren't always mentioned in the marketing.
- Some benchmark methodology choices have legitimate concerns, particularly around LoCoMo and the “lossless” compression claims.
- The project has real technical merit underneath the aggressive marketing. The verbatim-storage architecture is a genuinely novel approach.
The Claims
MemPalace's official benchmark results, as published in their repository's BENCHMARKS.md:
- LongMemEval (raw, local-only): 96.6%
- LongMemEval (hybrid): 100%
- ConvoMem: 92.9%
- MemBench: 80.3%

All scores are self-reported by the MemPalace team.
LongMemEval Deep Dive
What is LongMemEval?
LongMemEval is an academic benchmark from UC Santa Barbara (arXiv:2410.10813) designed to evaluate long-term conversational memory. It consists of 500 questions spanning five categories: single-session fact recall, multi-session fact recall, temporal reasoning, multi-hop queries, and knowledge updates. The primary metric is R@5 — whether the correct memory appears in the top 5 retrieved results.
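Concretely, R@5 scoring reduces to a membership check over each question's top five retrieved IDs. A minimal sketch (the data layout here, question IDs mapped to ranked memory IDs plus a gold ID, is our assumption for illustration, not LongMemEval's actual file format):

```python
def recall_at_k(retrieved: list[str], gold: str, k: int = 5) -> bool:
    """True if the gold memory ID appears among the top-k retrieved IDs."""
    return gold in retrieved[:k]

def score(results: dict[str, list[str]], answers: dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose correct memory lands in the top k."""
    hits = sum(recall_at_k(results[q], answers[q], k) for q in answers)
    return hits / len(answers)

# Toy example: 2 of 3 questions have the gold memory in the top 5.
results = {
    "q1": ["m3", "m7", "m1", "m9", "m2"],
    "q2": ["m4", "m5", "m6", "m8", "m0"],
    "q3": ["m2", "m1", "m5", "m3", "m4"],
}
answers = {"q1": "m1", "q2": "m9", "q3": "m2"}
print(score(results, answers))  # 2/3
```

Under this metric, MemPalace's 96.6% corresponds to 483 of 500 questions passing the check.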
The 96.6% Raw Score: Credible
The raw score was achieved with zero API calls — no cloud LLM reranking, no external services. MemPalace's verbatim-storage approach with ChromaDB vector search retrieved correct memories for 483 of 500 questions. This is the highest published local-only result we've found for LongMemEval, and it aligns with the theoretical advantage of storing full conversation text rather than extracted summaries. The evaluation code is published in the repository.
The 100% Hybrid Score: Engineered
The path from 96.6% to 100% raises valid methodological questions. The hybrid mode uses Claude Haiku to rerank results, which is a legitimate technique — many production systems use multi-stage retrieval. However, the final 3.4% improvement came from analyzing the 17 specific failing questions and implementing targeted patches.
This is the benchmark equivalent of teaching to the test: the team identified which questions the system got wrong, engineered fixes for those specific questions, and then reported the score on the same test. In academic settings, this would typically require disclosure and a separate held-out evaluation.
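The hybrid pipeline itself is standard two-stage retrieval. A hedged sketch of the shape, where `toy_rerank` stands in for the Claude Haiku reranking call (all names here are illustrative, not MemPalace's API):

```python
from typing import Callable

def hybrid_retrieve(
    query: str,
    vector_search: Callable[[str, int], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    k: int = 5,
    pool: int = 20,
) -> list[str]:
    """Two-stage retrieval: a wide, cheap vector search produces a candidate
    pool, then a reranker (an LLM in MemPalace's hybrid mode) reorders it."""
    candidates = vector_search(query, pool)   # stage 1: local vector search
    return rerank(query, candidates)[:k]      # stage 2: rerank, keep top k

# Toy stand-ins for both stages.
memories = ["paris trip", "budget meeting", "paris hotel booking", "dentist"]
def toy_search(q: str, n: int) -> list[str]:
    return memories[:n]
def toy_rerank(q: str, cands: list[str]) -> list[str]:
    # Crude relevance proxy: candidates containing the query's first word rank first.
    return sorted(cands, key=lambda m: q.split()[0] not in m)

print(hybrid_retrieve("paris plans", toy_search, toy_rerank, k=2))
# ['paris trip', 'paris hotel booking']
```

The technique is legitimate; the methodological concern is only with patching the pipeline against the same questions it is then scored on.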
Fair Context: The Held-Out Test
To their credit, the MemPalace team did create a held-out test set. They split the 500 questions, trained on 50, and evaluated on the remaining 450. The held-out score: 98.4%. This is arguably the most honest number in the entire analysis — and it's still excellent. They also disclosed the patch process rather than quietly folding it into a release.
LoCoMo Concerns
The top_k=50 Problem
Penfield Labs, a competitor in the AI memory space, published a critique highlighting that MemPalace's LoCoMo evaluation used top_k=50 while the typical candidate pool contains a maximum of 32 items.
When top_k exceeds the candidate pool, you retrieve everything. At that point, you're not testing the retrieval system at all — you're testing whether Claude can find an answer when given all the text. That's reading comprehension, not memory retrieval.
This matters because LoCoMo is specifically designed to evaluate retrieval quality. If you sidestep retrieval, the score doesn't tell you what it's supposed to tell you.
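The saturation effect is mechanical: once k covers the whole pool, recall@k is trivially perfect whenever the gold item is in the pool at all, regardless of ranking quality. A small demonstration with the numbers from the critique:

```python
def recall_at_k(ranked: list[str], gold: str, k: int) -> bool:
    return gold in ranked[:k]

pool = [f"mem{i}" for i in range(32)]  # LoCoMo-style pool of at most 32 items
gold = "mem31"

# A deliberately terrible retriever that ranks the gold item dead last
# still "succeeds" once k covers the entire pool.
worst_ranking = [m for m in pool if m != gold] + [gold]

print(recall_at_k(worst_ranking, gold, k=5))   # False: ranking is being tested
print(recall_at_k(worst_ranking, gold, k=50))  # True: everything was returned
```

With top_k=50 over a 32-item pool, even the worst possible ranking scores a hit, which is why the resulting number measures reading comprehension rather than retrieval.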
Fair Context
The MemPalace team's own BENCHMARKS.md acknowledges the top_k issue and notes it as a known limitation. They also note that the LoCoMo benchmark itself has structural issues that make fair comparison difficult across systems with different architectures. Penfield Labs, as a competitor, has their own incentives in this critique.
AAAK Compression Tradeoff
“Lossless” with an Asterisk
MemPalace claims approximately 30x compression using their custom AAAK dialect. They call it “lossless” because any LLM can read the compressed output and reconstruct the original meaning without a decoder.
In strict data-compression terms, this is accurate — the semantic content is preserved. But in retrieval terms, using AAAK reduces LongMemEval accuracy from 96.6% to 84.2%. That's a 12.4 percentage point drop. The compressed text changes the vector embeddings enough to degrade search quality.
Calling this “lossless” without qualifying the retrieval impact is misleading. It's lossless for human understanding but lossy for machine retrieval.
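The mechanism behind the drop is easy to illustrate: a query embedding can sit close to the verbatim text but far from its compressed form. A toy sketch using bag-of-words overlap as a stand-in for a real embedding model (the "compressed" string below is hypothetical, not actual AAAK output):

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words vector.
    A real system would use a learned sentence embedding instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

verbatim = "we agreed to move the launch date to early march next year"
compressed = "launch->early-mar"  # hypothetical compressed form, not real AAAK

query = toy_embed("when is the launch date")
print(cosine(query, toy_embed(verbatim)))    # nonzero: shares query terms
print(cosine(query, toy_embed(compressed)))  # zero overlap with the query
```

An LLM reading `compressed` can still reconstruct the meaning, which is the sense in which AAAK is "lossless", but the vector-space neighborhood has shifted, which is why retrieval accuracy drops.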
Fair Context
84.2% with AAAK compression still matches or beats Mem0 (~85%) and Zep (~82%) at their full accuracy — while using 30x less storage and zero API costs. The compression is optional, and users can choose the verbatim 96.6% mode when accuracy matters more than disk space. This is a legitimate engineering tradeoff, just one that should be communicated more clearly.
README vs Codebase Gaps
GitHub Issue #27: Claims vs Code
Developer Leonard Lin filed GitHub Issue #27 documenting several discrepancies between the README and the actual codebase. Notably:
- Contradiction detection: The README describes automatic contradiction detection, but a code search found zero occurrences of “contradict” in the source.
- Some documented features appeared to be partially implemented or not yet present in the published code.
- Several architectural claims in the README did not match the code's actual structure.
Fair Context
This is a v3.0 open-source project that grew rapidly. README drift is common in fast-moving repos — documentation often leads or lags implementation. Some features may have been planned, in development, or present in a private branch. The issue was filed constructively, and the project maintainers have been responsive to community feedback. That said, README claims are effectively marketing, and marketing should match reality.
How It Actually Compares
Honest comparison of published benchmark scores with caveats. Scores marked with “~” are estimates based on published information and may not reflect identical test conditions.
| System | LongMemEval (raw) | LongMemEval (hybrid) | ConvoMem | Local | Cost | Caveats |
|---|---|---|---|---|---|---|
| MemPalace | 96.6% | 100% | 92.9% | Yes | $0 | 100% score was post-patch; AAAK compression reduces accuracy |
| Supermemory | ~70% | N/A | ~40% | No | $9-29/mo | Limited public benchmark data; cloud-only |
| Mem0 | ~85% | N/A | 30-45% | Partial | $0-249/mo | Extracts facts, discards original text; cloud features require paid tier |
| Zep | ~82% | N/A | ~50% | No | Usage-based | Enterprise-focused; limited open benchmarks |
| Letta | ~78% | N/A | ~55% | Yes | $0 | Agent framework, not pure memory; different architecture goals |
Sources: MemPalace BENCHMARKS.md, Mem0 documentation, Zep documentation, Letta evaluations. Scores for competitors are approximations based on available public data and may not reflect latest versions.
Our Bottom Line
MemPalace is a genuinely innovative project with a sound architectural principle. The insight that verbatim storage outperforms LLM-extracted summaries for retrieval tasks is not obvious, and the data backs it up. Storing everything and letting vector search find what's relevant is a simpler, more robust approach than having an LLM decide what to remember.
The 96.6% raw score is the real headline. It is the highest published local-only LongMemEval result we could find, achieved with zero API costs and fully open-source code. If you're evaluating AI memory systems, this is the number that should matter to you — not the 100%.
The 100% hybrid score, while technically measured, was engineered through a process that most benchmark-literate engineers would consider overfitting. The held-out 98.4% score is more credible and still excellent. We'd like to see the project lead with this number instead.
Benchmark engineering is common in this industry. Mem0 doesn't publish full evaluation code. Zep's benchmarks are limited. Most AI memory startups make claims that are difficult to independently verify. MemPalace, for all its marketing aggressiveness, publishes their evaluation code and acknowledges limitations in their documentation. That's more transparent than the industry norm.
Finally, let's acknowledge the elephant in the room: the celebrity angle. When Milla Jovovich's name is on a GitHub repo, it attracts 10x the scrutiny that an equivalent project from an unknown developer would get. Some of that scrutiny is healthy. Some of it is people who decided the project couldn't be legitimate before they read the code. The technical merit exists independently of who made it.
Frequently Asked Questions
- Is MemPalace's 100% LongMemEval score legitimate?
- What is LongMemEval and why does it matter?
- Why do some people criticize MemPalace's benchmark results?
- How does MemPalace compare to Mem0 and Zep on benchmarks?
- Should I trust AI memory benchmark scores in general?
Sources & Further Reading
- LongMemEval paper (arXiv:2410.10813) — UC Santa Barbara
- MemPalace BENCHMARKS.md — Official benchmark documentation
- GitHub Issue #27 — Leonard Lin's README vs codebase analysis
- Penfield Labs critique — LoCoMo methodology concerns
Want to verify these numbers yourself? The evaluation code is open source.