feat(doc): add dataset download instruction (#38)

ntudy · Yue Deng · web-flow · commit a2a92560e16c · 2025-09-16T18:52:18.000+08:00
* add dataset download instruction

* update download script and remove unused doc

---------

Co-authored-by: Yue Deng <yue.deng@miromind.ai>
diff --git a/docs/mkdocs/docs/download_datasets.md b/docs/mkdocs/docs/download_datasets.md
@@ -0,0 +1,77 @@
+# Dataset Download Instructions
+
+## Prerequisites
+
+Before downloading the datasets, you need to:
+
+1. **Request access to Hugging Face datasets**:
+   - **GAIA Dataset**: https://huggingface.co/datasets/gaia-benchmark/GAIA
+   - **HLE Dataset**: https://huggingface.co/datasets/cais/hle
+
+   Please visit these links and request access to the datasets.
+
+2. **Configure environment variables**:
+
+   Copy the template file and create your environment configuration:
+   ```bash
+   cp .env.template .env
+   ```
+
+   Then edit the `.env` file and configure these two essential variables:
+
+   ```env
+   # Required: Your Hugging Face token for dataset access
+   HF_TOKEN="your-actual-huggingface-token-here"
+   
+   # Data directory path 
+   DATA_DIR="data/"
+   ```
+
+   To get your Hugging Face token:
+   - Go to https://huggingface.co/settings/tokens
+   - Create a new token with "Read" permissions
+   - Replace the placeholder value of `HF_TOKEN` in the `.env` file with your actual token
+
+## Download and Prepare Datasets
+
+Once you have been granted access to the required datasets, run `bash scripts/run_prepare_benchmark.sh` to download and prepare all benchmark datasets. The script's contents are shown below; you may comment out any unwanted datasets:
+
+```bash
+#!/bin/bash
+echo "Please grant access to these datasets:"
+echo "- https://huggingface.co/datasets/gaia-benchmark/GAIA"
+echo "- https://huggingface.co/datasets/cais/hle"
+echo
+
+read -p "Have you granted access? [Y/n]: " answer
+answer=${answer:-Y}
+if [[ ! $answer =~ ^[Yy] ]]; then
+    echo "Please grant access to the datasets first"
+    exit 1
+fi
+echo "Access confirmed"
+
+# Comment out any unwanted datasets by adding # at the start of the line
+uv run main.py prepare-benchmark get gaia-val
+uv run main.py prepare-benchmark get gaia-val-text-only
+uv run main.py prepare-benchmark get frames-test
+uv run main.py prepare-benchmark get webwalkerqa
+uv run main.py prepare-benchmark get browsecomp-test
+uv run main.py prepare-benchmark get browsecomp-zh-test
+uv run main.py prepare-benchmark get hle
+```
+This script will:
+
+1. Prompt you to confirm that you have been granted access to the required datasets
+2. Download and prepare the following benchmark datasets:
+   - gaia-val
+   - gaia-val-text-only
+   - frames-test
+   - webwalkerqa
+   - browsecomp-test
+   - browsecomp-zh-test
+   - hle
+
+---
+**Last Updated:** Sep 2025  
+**Doc Contributor:** Index @ MiroMind AI
diff --git a/docs/mkdocs/docs/gaia_validation.md b/docs/mkdocs/docs/gaia_validation.md
@@ -24,7 +24,9 @@ This section provides step-by-step instructions to reproduce our GAIA validation
 
 ### Step 1: Prepare the GAIA Validation Dataset
 
-First, download and prepare the GAIA validation dataset:
+Follow the [Dataset Download Instructions](download_datasets.md) to download and prepare the dataset.
+
+Alternatively, you can manually download and set up the dataset as follows:
 ```bash
 cd data
 wget https://huggingface.co/datasets/miromind-ai/MiroFlow-Benchmarks/resolve/main/gaia-val.zip
diff --git a/docs/mkdocs/docs/prepare_benchmark_data_from_original_source.md b/docs/mkdocs/docs/prepare_benchmark_data_from_original_source.md
diff --git a/docs/mkdocs/mkdocs.yml b/docs/mkdocs/mkdocs.yml
@@ -31,10 +31,10 @@ nav:
 
   - Evaluation:
     - Overview: evaluation_overview.md
+    - Download Datasets: download_datasets.md
     - Benchmarks: 
       - GAIA-Validation: gaia_validation.md
       - GAIA-Test: gaia_test.md
-      - Prepare Benchmark Data from Original Source: prepare_benchmark_data_from_original_source.md
     - Add New Benchmarks: contribute_benchmarks.md
 
   - Tools: 
diff --git a/scripts/run_prepare_benchmark.sh b/scripts/run_prepare_benchmark.sh
@@ -12,7 +12,7 @@ if [[ ! $answer =~ ^[Yy] ]]; then
 fi
 echo "Access confirmed"
 
-
+# Comment out any unwanted datasets by adding # at the start of the line
 uv run main.py prepare-benchmark get gaia-val
 uv run main.py prepare-benchmark get gaia-val-text-only
 uv run main.py prepare-benchmark get frames-test