MLP Memory: A Retriever-Pretrained Memory for Large Language Models

Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin · Aug 3, 2025 · Citations: 0

Abstract

Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a kNN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, yielding 17.5% and 24.1% scaling gains on WikiText-103 and Web datasets, respectively. It further achieves 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5× faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.
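The abstract says only that the MLP Memory is combined with the Transformer decoder "through simple probability interpolation". A minimal sketch of what that could look like follows, assuming a single fixed mixing weight applied to two next-token distributions over the same vocabulary. The names `interpolate` and `lam`, and the choice of a scalar weight, are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(p_lm, p_mem, lam):
    """Mix two next-token distributions: lam * p_mem + (1 - lam) * p_lm.

    p_lm  : distribution from the base decoder (assumed input)
    p_mem : distribution from the memory module (assumed input)
    lam   : hypothetical mixing weight in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(p_lm) != len(p_mem):
        raise ValueError("both distributions must cover the same vocabulary")
    return [lam * m + (1.0 - lam) * l for l, m in zip(p_lm, p_mem)]


if __name__ == "__main__":
    # Toy three-token vocabulary; the logits are made up for illustration.
    p_lm = softmax([2.0, 1.0, 0.1])
    p_mem = softmax([0.5, 3.0, 0.2])
    mixed = interpolate(p_lm, p_mem, lam=0.3)
    print(mixed, sum(mixed))
```

Because both inputs are valid distributions and the weights sum to one, the result is also a valid distribution, so no renormalization step is needed.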

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human-feedback/eval), but does not show strong direct protocol evidence in metadata/abstract.

Eval-Fit Score

5/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

If you are doing eval pipeline work, start here:

Human Eval Hub · LLM-as-Judge Hub · Pairwise Preference Hub · Tool-Use Eval Hub

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General
Extraction source: Runtime deterministic fallback

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.45
Flags: low_signal, possible_false_positive, runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

HaluEval

Reported Metrics

accuracy, latency

Research Brief

Deterministic synthesis

Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers… HFEPX signals include Automatic Metrics with confidence 0.45. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 10:37 PM · Grounded in abstract + metadata only

Key Takeaways

Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG)…
In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access.
It further achieves 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by…

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Cross-check benchmark overlap: HaluEval.
Validate metric comparability (accuracy, latency).
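One practical point when validating comparability: the abstract reports its results in three different units, a relative improvement (12.3%), absolute point gains (5.2 points, up to 10 points), and a speedup ratio (2.5×). The sketch below shows how each is computed so they are not mixed up. The baseline values in the example are hypothetical placeholders, not numbers from the paper.

```python
def relative_improvement(baseline, improved):
    """Fractional change relative to the baseline score, e.g. 0.123 for 12.3%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


def absolute_gain(baseline, improved):
    """Difference in raw score points."""
    return improved - baseline


def latency_after_speedup(baseline_latency, speedup):
    """Latency implied by an 'N times faster' claim."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_latency / speedup


if __name__ == "__main__":
    # Hypothetical baselines chosen only to illustrate the arithmetic.
    qa_baseline = 40.0
    qa_improved = qa_baseline * (1 + 0.123)  # a 12.3% relative improvement
    print("QA absolute gain implied:", absolute_gain(qa_baseline, qa_improved))

    rag_latency_ms = 100.0
    print("Latency at 2.5x speedup:", latency_after_speedup(rag_latency_ms, 2.5))
```

The same relative improvement corresponds to a different absolute gain depending on the baseline, which is why the two kinds of figures should not be compared directly across benchmarks.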

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Low-signal flag detected: protocol relevance may be indirect.

Recommended Queries

human-eval protocol design
pairwise preference data quality
inter-rater agreement adjudication

Research Summary

Contribution Summary

Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers…
In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access.
It further achieves 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval.
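To make "internalize retrieval patterns" concrete, here is a rough sketch of how a kNN retriever's output can be turned into a next-token distribution that a parametric module could be trained to match. It follows the general style of kNN language models (a softmax over negative neighbor distances, summed per token). The abstract does not give the paper's actual training objective, so the distance weighting, the `temperature` parameter, and the KL loss below are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def knn_target_distribution(neighbors, temperature=1.0):
    """Build a next-token distribution from retrieved neighbors.

    neighbors   : list of (distance, next_token) pairs from a kNN search
    temperature : hypothetical softmax temperature over negative distances
    """
    if not neighbors:
        raise ValueError("need at least one neighbor")
    scores = [-d / temperature for d, _ in neighbors]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dist = defaultdict(float)
    for (_, token), w in zip(neighbors, weights):
        dist[token] += w / total  # neighbors sharing a token pool their mass
    return dict(dist)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); one possible imitation loss (an assumption)."""
    return sum(
        p * math.log(p / max(predicted.get(token, 0.0), eps))
        for token, p in target.items()
        if p > 0
    )


if __name__ == "__main__":
    # Made-up retrieval result: (distance, next token).
    retrieved = [(0.2, "Paris"), (0.5, "Paris"), (1.1, "Lyon")]
    target = knn_target_distribution(retrieved)
    # Stand-in for the memory module's prediction on the same context.
    predicted = {"Paris": 0.7, "Lyon": 0.2, "Rome": 0.1}
    print(target, kl_divergence(target, predicted))
```

Once a module reproduces such targets from the context representation alone, the datastore lookup is no longer needed at inference time, which is consistent with the "without explicit document access" claim.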

Why It Matters For Eval

It further achieves 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval.

Researcher Checklist

Gap: Human feedback protocol is explicit
No explicit human feedback protocol detected.

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Pass: Benchmark or dataset anchors are present
Detected: HaluEval

Pass: Metric reporting is present
Detected: accuracy, latency

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models (Category Neighbor)

Citations: 0 · Relevance: 4.55
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, latency, memory)

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation (Category Neighbor)

Citations: 0 · Relevance: 4.10
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, latency)

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Confusion-Aware Rubric Optimization for LLM-based Automated Grading (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (latency)

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (Category Neighbor)

Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)
