This is the official implementation of CLARIFY (ICML 2025 poster; see the arXiv and OpenReview pages).
Experiments require MuJoCo. Please follow the instructions in the mujoco-py repository to install it.
Then, dependencies can be installed with the following command:
conda env create -f conda_env.yml
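After the environment is created, activate it before running any of the commands below. The environment name is defined by the name: field in conda_env.yml; the name clarify used here is only an assumption:
# activate the conda environment created from conda_env.yml
# ("clarify" is an assumed name -- substitute the name: field from conda_env.yml if it differs)
conda activate clarify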
Please download the datasets from LiRE and place them under ./data/metaworld_data/<task_name> and ./data/dmcontrol_data/<task_name>.
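For example, for the hammer (Meta-World) and walker-walk (DMControl) tasks used in the commands below, the layout is roughly the following sketch; the exact file names inside each task folder depend on the LiRE release:
data/
├── metaworld_data/
│   └── hammer/          # preference dataset for the Meta-World hammer task
└── dmcontrol_data/
    └── walker-walk/     # preference dataset for the DMControl walker-walk task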
Train the reward model using CLARIFY, for the hammer task, with:
python train_contrastive_reward.py --env hammer --gpu <GPU number> --seed <seed> \
--max_feedback 1000 --teacher_eps_skip 0.5 --feed_type "c"
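As a concrete usage example, the same command can be swept over several seeds; the GPU id 0 and seeds 0-2 below are arbitrary placeholders:
# sweep CLARIFY reward learning on the hammer task over three seeds on GPU 0 (placeholder values)
for seed in 0 1 2; do
  python train_contrastive_reward.py --env hammer --gpu 0 --seed $seed \
    --max_feedback 1000 --teacher_eps_skip 0.5 --feed_type "c"
done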
Train the reward model using OPRL, for the walker-walk task, with:
python train_contrastive_reward.py --env walker-walk --gpu <GPU number> --seed <seed> \
--max_feedback 200 --teacher_eps_skip 0.7 --feed_type "d"
Train the offline policy based on CLARIFY's reward model, for the hammer task, with:
python scripts/reward_model_mapping.py
python policy_learning/oprl_policy.py --env hammer --gpu <GPU number> --seed <seed> \
--teacher_eps_skip 0.5 --feed_type "c" \
--reward_model_name_mapping "scripts/reward_model_map_q50.json"
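For instance, with GPU 0 and seed 0 as placeholder values, the two steps above run as:
# step 1: build the reward-model name mapping (assumed to produce scripts/reward_model_map_q50.json)
python scripts/reward_model_mapping.py
# step 2: train the offline policy on hammer with the mapped CLARIFY reward model
python policy_learning/oprl_policy.py --env hammer --gpu 0 --seed 0 \
  --teacher_eps_skip 0.5 --feed_type "c" \
  --reward_model_name_mapping "scripts/reward_model_map_q50.json"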
Train the offline policy based on OPRL's reward model, for the walker-walk task, with:
python scripts/reward_model_mapping.py
python policy_learning/oprl_policy.py --env walker-walk --gpu <GPU number> --seed <seed> \
--teacher_eps_skip 0.7 --feed_type "d" \
--reward_model_name_mapping "scripts/reward_model_map_q50.json"
This repo benefits from LiRE, HIM, and BPref. Thanks for their wonderful work.
If you find this project helpful, please consider citing the following paper:
@inproceedings{mu2025clarify,
title={CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries},
author={Mu, Ni and Hu, Hao and Hu, Xiao and Yang, Yiqin and XU, Bo and Jia, Qing-Shan},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}