We initially explored GRPO on synthetic CoT graph-extraction data generated by a reasoning model, with an LLM involved in the reward function.
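Below is a minimal sketch of what an LLM-involved reward function can look like, assuming TRL's `GRPOTrainer` convention (a reward function receives the sampled completions plus extra dataset columns as keyword arguments and returns one float per completion). The judge model, prompt, scoring scale, and the `reference_graphs` dataset column are illustrative assumptions, not the notebook's exact setup:

```python
# Sketch of an LLM-judged reward function for GRPO (TRL-style signature).
# Assumes plain-text completions and a "reference_graphs" column in the
# training dataset; judge model and prompt are placeholders.
import json
import re

from openai import OpenAI  # any OpenAI-compatible endpoint works

judge = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Compare the extracted graph against the reference graph. "
    "Reply with a single number from 0 (no overlap) to 1 (exact match).\n"
    "Reference: {reference}\nExtracted: {extracted}"
)

def graph_match_reward(completions, reference_graphs, **kwargs):
    """Score each completion by asking a judge LLM to grade the
    extracted graph against the ground-truth graph."""
    rewards = []
    for completion, reference in zip(completions, reference_graphs):
        response = judge.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    reference=json.dumps(reference), extracted=completion
                ),
            }],
        )
        # Pull the first number out of the judge's reply; fall back to 0.
        match = re.search(r"\d*\.?\d+", response.choices[0].message.content)
        rewards.append(float(match.group()) if match else 0.0)
    return rewards
```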
```
.
├── LICENSE
├── ground_truth_gen                                # data gen via DeepSeek R1
│   ├── polished_rl_training_data.csv
│   └── r1_distill_reasoning_graph_extraction.ipynb
└── train
    └── Qwen_GRPO_Graph_Extraction.ipynb            # training process
```

Update: the training notebook doesn't render properly on GitHub; open it from Colab instead:
| Data Gen | Training |
|---|---|
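As a rough illustration of the ground-truth generation step, the sketch below distills reasoning traces from DeepSeek R1 through its OpenAI-compatible API. The prompt, output schema, and file name are illustrative assumptions; only the `deepseek-reasoner` model name and the `reasoning_content` field come from DeepSeek's API docs:

```python
# Hypothetical sketch of data gen via DeepSeek R1: ask the model to
# extract a graph with step-by-step reasoning, keep both the CoT and
# the final answer, and dump rows to CSV.
import csv
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

PROMPT_PREFIX = (
    "Extract a knowledge graph from the text below. Think step by step, "
    'then output the final graph as JSON with "nodes" and "edges" lists.\n\n'
)

def distill_example(text: str) -> dict:
    """Ask DeepSeek R1 for a reasoning trace plus the final graph."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": PROMPT_PREFIX + text}],
    )
    message = response.choices[0].message
    return {
        "text": text,
        # R1 returns its chain of thought in a separate field.
        "reasoning": getattr(message, "reasoning_content", ""),
        "graph": message.content,
    }

if __name__ == "__main__":
    rows = [distill_example(t) for t in ["Alice works at Acme Corp in Berlin."]]
    with open("rl_training_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "reasoning", "graph"])
        writer.writeheader()
        writer.writerows(rows)
```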
Acknowledgements:

- The DeepSeek-Math and DeepSeek R1 work, and the DeepSeek R1 model
- Qwen's great base model
- Will's work, this (cannot find the author), and Daniel Han Chen's work at Unsloth