# Locomo Benchmark for Various Memory Backends

>  This project is originally forked from [mem0-evaluation](https://github.com/mem0ai/mem0/tree/main/evaluation) in commit `393a4fd5a6cfeb754857a2229726f567a9fadf36` 

This project contains the code of running benchmark results on [Locomo dataset](https://github.com/snap-research/locomo/tree/main) with different memory methods:

- langmem
- mem0
- zep
- basic rag
- naive LLM
- Memobase

## Result

- We ran Memobase results and pasted the other methods' result from [Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory](https://arxiv.org/pdf/2504.19413). 

- We mainly report the LLM Judge Sorce (higher is better).

| Method                 | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
| ---------------------- | ------------- | ------------ | -------------- | ----------- | ---------- |
| Mem0                   | **67.13**     | 51.15        | 72.93          | 55.51       | 66.88      |
| Mem0-Graph             | 65.71         | 47.19        | 75.71          | 58.13       | 68.44      |
| LangMem                | 62.23         | 47.92        | 71.12          | 23.43       | 58.10      |
| Zep                    | 61.70         | 41.35        | 76.60      | 49.31       | 65.99      |
| OpenAI                 | 63.79         | 42.92        | 62.29          | 21.71       | 52.90      |
| Memobase(*v0.0.32*) | 63.83         | **52.08**    | 71.82          | 80.37   | 70.91  |
| Memobase(*v0.0.37*) | **70.92** | 46.88 | **77.17**   | **85.05** | **75.78** |

> **What is LLM Judge Score?**
>
> Basically, Locomo benchmark offers some long conversations and prepare some questions. LLM Judge Score is to use LLM(*e.g.* OpenAI `gpt-4o`) to judge if the answer generated from memory method is the same as the ground truth, score is 1 if it is, else 0.

We attached the artifacts of Memobase under `fixture/memobase/`:

- v0.0.32
  - `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of predicted answers

- v0.0.37
  - `fixture/memobase/results_0710_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0710_3000.json`: LLM Judge results of predicted answers

To generate the latest scorings, run:

```bash
python generate_scores.py --input_path="fixture/memobase/memobase_eval_0710_3000.json"
```

Output:

```
Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count         type
category
1             0.3516    0.4629     0.7092    282   single_hop
2             0.4758    0.6423     0.8505    321     temporal
3             0.1758    0.2293     0.4688     96    multi_hop
4             0.4089    0.5155     0.7717    841  open_domain

Overall Mean Scores:
bleu_score    0.3978
f1_score      0.5145
llm_score     0.7578
dtype: float64
```


> ❕ We update the results from Zep team (Zep*). See this [issue](https://github.com/memodb-io/memobase/issues/101) for detail reports and artifacts.
> | Method     | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
> | ---------- | ------------- | ------------ | -------------- | ----------- | ---------- |
> | Zep*       | 74.11         | 66.04        | 67.71          | 79.79       | 75.14      |


## 🔍 Dataset

[Download](https://github.com/snap-research/locomo/tree/main/data) the `locomo10.json` file and place it under `dataset/`


## 🚀 Getting Started

### Prerequisites

Create a `.env` file with your API keys and configurations. You must have beflow envs:

```bash
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
```

Below is the detailed requirements

### Memobase

**Deps**

```bash
pip install memobase
```

**Env**

You can find free API key in [Memobase Cloud](https://memobase.io), or [deploy](../../../readme.md) one in your local

```bash
MEMOBASE_API_KEY=XXXXX
MEMOBASE_PROJECT_URL=http://localhost:8019 # OPTIONAL
```

**Command**

```bash
# memorize the data
make run-memobase-add 
# answer the benchmark
make run-memobase-search 
# evaluate the results
py evals.py --input_file results.json --output_file evals.json 
# print the final scores
py generate_scores.py --input_path="evals.json"
```


### Run Mem0

**Deps**

```bash
pip install mem0
```

**Env**

```bash
# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"
```

**Command**

> Just like the commands of Memobase, but replace `memobase` with `mem0`. See [all commands](#Memory Techniques)


### Run Zep

**Deps**

```bash
pip install zep_cloud
```

**Env**

```bash
ZEP_API_KEY="api-key-from-zep"
```

**Command**

> Just like the commands of Memobase, but replace `memobase` with `zep`. See [all commands](#Memory Techniques)


### Run langmem

**Deps**

```bash
pip install langgraph langmem
```

**Env**

```bash
EMBEDDING_MODEL="text-embedding-3-small"  # or your preferred embedding model
```

**Command**
> Just like the commands of Memobase, but replace `memobase` with `zep`. See [all commands](#Memory Techniques)


### Other methods

The rest methods don't require extra deps/envs.


## Memory Techniques

```bash
# Run Mem0 experiments
make run-memobase-add         # Add memories using Memobase
make run-memobase-search      # Search memories using Memobase

# Run Mem0 experiments
make run-mem0-add         # Add memories using Mem0
make run-mem0-search      # Search memories using Mem0

# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add    # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+

# Run RAG experiments
make run-rag              # Run RAG with chunk size 500
make run-full-context     # Run RAG with full context

# Run LangMem experiments
make run-langmem          # Run LangMem

# Run Zep experiments
make run-zep-add          # Add memories using Zep
make run-zep-search       # Search memories using Zep

# Run OpenAI experiments
make run-openai           # Run OpenAI experiments
```


### 📊 Evaluation

To evaluate results, run:

```bash
python evals.py --input_file [path_to_results] --output_file [output_path]
```

This script:
1. Processes each question-answer pair
2. Calculates BLEU and F1 scores automatically
3. Uses an LLM judge to evaluate answer correctness
4. Saves the combined results to the output file

### 📈 Generating Scores

Generate final scores with:

```bash
python generate_scores.py
```

This script:
1. Loads the evaluation metrics data
2. Calculates mean scores for each category (BLEU, F1, LLM)
3. Reports the number of questions per category
4. Calculates overall mean scores across all categories

Example output:
```
Mean Scores Per Category:
         bleu_score  f1_score  llm_score  count
category                                       
1           0.xxxx    0.xxxx     0.xxxx     xx
2           0.xxxx    0.xxxx     0.xxxx     xx
3           0.xxxx    0.xxxx     0.xxxx     xx

Overall Mean Scores:
bleu_score    0.xxxx
f1_score      0.xxxx
llm_score     0.xxxx
```


## 📁 Project Structure

```
.
├── src/                  # Source code for different memory techniques
│   ├── memobase_client/  # Implementation of the Memobase
│   ├── memzero/          # Implementation of the Mem0 technique
│   ├── openai/           # Implementation of the OpenAI memory
│   ├── zep/              # Implementation of the Zep memory
│   ├── rag.py            # Implementation of the RAG technique
│   └── langmem.py        # Implementation of the Language-based memory
├── metrics/              # Code for evaluation metrics
├── results/              # Results of experiments
├── dataset/              # Dataset files
├── evals.py              # Evaluation script
├── run_experiments.py    # Script to run experiments
├── generate_scores.py    # Script to generate scores from results
└── prompts.py            # Prompts used for the models
```