LLaDA Pretraining

Text Pretraining Framework

Based on MMaDA | MIT License

🌟 Introduction

Under testing. (This repository is currently for internal team use; the uploaded version was partially modified from our internal copy and may still contain minor bugs that are being tested.) This is a text pretraining framework for LLaDA models, modified from the MMaDA codebase.

Features:

  • Text-only training pipeline
  • Distributed training support with DeepSpeed and Accelerate
  • YAML-based configuration
  • Memory-efficient training options

🚀 Quick Start

Environment (requirements sourced from MMaDA)

pip install -r requirements.txt

Basic Training

# Update paths in configs/llada_pretraining.yaml
bash scripts/train.sh

⚙️ Configuration

Edit configs/llada_pretraining.yaml:

model:
  pretrained_model_path: ".../LLaDA-8B-Base/"
  # LLaDA-specific configuration
  llada_config:
    gradient_checkpointing: false  # disable gradient checkpointing
    new_vocab_size: 126464
    # Add other LLaDA-specific configs here if needed

dataset:
  params:
    train_shards_path_or_url: "path/to/data"

training:
  batch_size: 16
  max_train_steps: 100000
  mixed_precision: "bf16"
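
The training script receives this file via config=<path>.yaml on the command line (see Run Training below). As a minimal sketch of how such a config could be loaded and merged with command-line overrides, the snippet below uses OmegaConf; this is an assumption about the CLI style and is not guaranteed to match what training/train_llada.py does internally.

# Hedged sketch (assumption): load the YAML config and merge CLI key=value overrides
# with OmegaConf, matching the "config=configs/llada_pretraining.yaml" launch style.
from omegaconf import OmegaConf

cli_conf = OmegaConf.from_cli()              # picks up config=configs/llada_pretraining.yaml
yaml_conf = OmegaConf.load(cli_conf.config)  # load the YAML file pointed to by "config"
config = OmegaConf.merge(yaml_conf, cli_conf)

print(config.model.pretrained_model_path)
print(config.training.batch_size, config.training.mixed_precision)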

🔧 Training

Setup Accelerate

accelerate config

You can also use the provided configuration files in accelerate_configs/ for different hardware and distributed setups:

  • 1_gpu.yaml - Single GPU
  • 1_node_only.yaml - Single node, single process (CPU or GPU)
  • 1_node_8_gpus_deepspeed_zero1.yaml - 8 GPUs with DeepSpeed ZeRO-1
  • 1_node_8_gpus_deepspeed_zero2.yaml - 8 GPUs with DeepSpeed ZeRO-2
  • 1_node_8_gpus_deepspeed_zero3.yaml - 8 GPUs with DeepSpeed ZeRO-3
  • 8_node_8_gpus_deepspeed_zero2.yaml - 8 nodes, each with 8 GPUs, DeepSpeed ZeRO-2

Run Training

accelerate launch \
    --config_file accelerate_configs/1_node_8_gpus_deepspeed_zero1.yaml \
    --main_process_port=8888 \
    training/train_llada.py \
    config=configs/llada_pretraining.yaml

📁 Project Structure

LLaDA_pretraining/
├── accelerate_configs/    # Accelerate configurations
├── configs/               # Training configurations
├── models/                # Model implementations
├── parquet/               # Data loading utilities
├── training/              # Training scripts
└── scripts/               # Shell scripts

🛠️ Data Format

The files under the dataset path (train_shards_path_or_url) must be in JSONL format, one JSON object per line. It is recommended to split the dataset evenly into multiple files, with more files than the total number of GPUs; see the sharding sketch after the example below.

{"text": "Training text content"}

📄 License

MIT License - see LICENSE file.

🙏 Acknowledgments

Based on MMaDA by Yang et al.
