Commit a2a9256

ntudy and Yue Deng authored
feat(doc): add dataset download instruction (#38)
* add dataset download instruction
* update download script and remove unused doc

---------

Co-authored-by: Yue Deng <[email protected]>
1 parent e943271 commit a2a9256

5 files changed: +82 -52 lines changed

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Dataset Download Instructions

## Prerequisites

Before downloading the datasets, you need to:

1. **Request access to Hugging Face datasets**:
   - **GAIA Dataset**: https://huggingface.co/datasets/gaia-benchmark/GAIA
   - **HLE Dataset**: https://huggingface.co/datasets/cais/hle

   Please visit these links and request access to the datasets.

2. **Configure environment variables**:

   Copy the template file and create your environment configuration:
   ```bash
   cp .env.template .env
   ```

   Then edit the `.env` file and configure these two essential variables:

   ```env
   # Required: Your Hugging Face token for dataset access
   HF_TOKEN="your-actual-huggingface-token-here"

   # Data directory path
   DATA_DIR="data/"
   ```

   To get your Hugging Face token:
   - Go to https://huggingface.co/settings/tokens
   - Create a new token with "Read" permissions
   - Replace the placeholder value of `HF_TOKEN` in the `.env` file with your actual token

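Before running any download, it can help to confirm that the `.env` file defines both variables. The following is a minimal sketch, not part of the repository: `check_env` is a hypothetical helper, and it assumes the `.env` file uses plain `KEY="value"` lines that a shell can source, as in the fragment above. Note that sourcing executes the file as shell code, so only use it on a file you trust.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): load a .env file and
# check that HF_TOKEN and DATA_DIR are set. HF_TOKEN must also differ from
# the placeholder string used in the documentation.

check_env() {
    local env_file="${1:-./.env}"

    if [ ! -f "$env_file" ]; then
        echo "missing: $env_file" >&2
        return 1
    fi

    # Export every variable assigned while the file is being sourced.
    set -a
    . "$env_file"
    set +a

    if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your-actual-huggingface-token-here" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
    if [ -z "${DATA_DIR:-}" ]; then
        echo "DATA_DIR is not set" >&2
        return 1
    fi
    echo "ok"
}
```

Calling `check_env` with no argument reads `./.env`; a non-zero exit status means the file is missing or one of the variables still needs a value. The check only looks at the variables themselves, so it cannot tell whether the token is valid or whether dataset access has been granted.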
## Download and Prepare Datasets

Once you have been granted access to the required datasets, run `bash scripts/run_prepare_benchmark.sh` to download and prepare all benchmark datasets. The script is shown below; comment out any datasets you do not need:

```bash
#!/bin/bash
echo "Please grant access to these datasets:"
echo "- https://huggingface.co/datasets/gaia-benchmark/GAIA"
echo "- https://huggingface.co/datasets/cais/hle"
echo

read -p "Have you granted access? [Y/n]: " answer
answer=${answer:-Y}
if [[ ! $answer =~ ^[Yy] ]]; then
    echo "Please grant access to the datasets first"
    exit 1
fi
echo "Access confirmed"

# Comment out any unwanted datasets by adding # at the start of the line
uv run main.py prepare-benchmark get gaia-val
uv run main.py prepare-benchmark get gaia-val-text-only
uv run main.py prepare-benchmark get frames-test
uv run main.py prepare-benchmark get webwalkerqa
uv run main.py prepare-benchmark get browsecomp-test
uv run main.py prepare-benchmark get browsecomp-zh-test
uv run main.py prepare-benchmark get hle
```
This script will:
1. Ask you to confirm that you have been granted access to the required datasets
2. Download and prepare the following benchmark datasets:
   - gaia-val
   - gaia-val-text-only
   - frames-test
   - webwalkerqa
   - browsecomp-test
   - browsecomp-zh-test
   - hle
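
As an alternative to commenting lines out, the per-dataset command from the script can be wrapped in a small loop that takes dataset names as arguments. This is only a sketch under stated assumptions: `prepare_datasets` and the `DRY_RUN` switch are hypothetical names introduced here, and the only command taken from the repository is `uv run main.py prepare-benchmark get <name>`. With `DRY_RUN=1` the commands are printed rather than executed, which allows a selection to be previewed first.

```bash
#!/usr/bin/env bash
# Sketch: prepare only the datasets passed as arguments.
# prepare_datasets and DRY_RUN are hypothetical names for this example.

prepare_datasets() {
    local dataset
    for dataset in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Preview mode: print the command instead of running it.
            echo "uv run main.py prepare-benchmark get $dataset"
        else
            # Stop at the first dataset that fails to download or prepare.
            uv run main.py prepare-benchmark get "$dataset" || return 1
        fi
    done
}

# Preview the commands for two of the datasets listed above.
DRY_RUN=1 prepare_datasets gaia-val hle
```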

---
**Last Updated:** Sep 2025
**Doc Contributor:** Index @ MiroMind AI

docs/mkdocs/docs/gaia_validation.md

Lines changed: 3 additions & 1 deletion
@@ -24,7 +24,9 @@ This section provides step-by-step instructions to reproduce our GAIA validation
 
 ### Step 1: Prepare the GAIA Validation Dataset
 
-First, download and prepare the GAIA validation dataset:
+Please follow the Dataset Download Instructions from the previous section.
+
+Alternatively, you can manually download and set up the dataset as follows:
 ```bash
 cd data
 wget https://huggingface.co/datasets/miromind-ai/MiroFlow-Benchmarks/resolve/main/gaia-val.zip

docs/mkdocs/docs/prepare_benchmark_data_from_original_source.md

Lines changed: 0 additions & 49 deletions
This file was deleted.

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -31,10 +31,10 @@ nav:
 
 - Evaluation:
 - Overview: evaluation_overview.md
+- Download Datasets: download_datasets.md
 - Benchmarks:
 - GAIA-Validation: gaia_validation.md
 - GAIA-Test: gaia_test.md
-- Prepare Benchmark Data from Original Source: prepare_benchmark_data_from_original_source.md
 - Add New Benchmarks: contribute_benchmarks.md
 
 - Tools:

scripts/run_prepare_benchmark.sh

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ if [[ ! $answer =~ ^[Yy] ]]; then
 fi
 echo "Access confirmed"
 
-
+# Comment out any unwanted datasets by adding # at the start of the line
 uv run main.py prepare-benchmark get gaia-val
 uv run main.py prepare-benchmark get gaia-val-text-only
 uv run main.py prepare-benchmark get frames-test
