🔥🔥 Surveys of MLLMs | 💬 WeChat (MLLM微信交流群)

- 🌟 A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
- 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
  arXiv 2025, Paper, Project
- A Survey on Multimodal Large Language Models
  NSR 2024, Paper, Project

🔥🔥 VITA Series Omni MLLMs | 💬 WeChat (VITA微信交流群)

- 🌟 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
  NeurIPS 2025 Highlight, Paper, Project
- 🌟 VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
  arXiv 2025, Paper, Project
- 🌟 VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
  arXiv 2025, Paper, Project
- VITA: Towards Open-Source Interactive Omni Multimodal LLM
  arXiv 2024, Paper, Project
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
  arXiv 2025, Paper, Project
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
  NeurIPS 2025, Paper, Project

🔥🔥 MME Series MLLM Benchmarks

- 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
  arXiv 2025, Paper, Project
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  NeurIPS 2025 DB Highlight, Paper, Dataset, Eval Tool, ✒️ Citation
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
  CVPR 2025, Paper, Project, Dataset
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
  ICLR 2025, Paper, Project, Dataset

Table of Contents

| Title | Venue | Date | Code | Demo |
|---|---|---|---|---|
| R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning | arXiv | 2025-05-09 | Github | - |
| Aligning Multimodal LLM with Human Preference: A Survey | arXiv | 2025-03-23 | Github | - |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02-14 | Github | - |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10-09 | - | - | 
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data | arXiv | 2024-08-22 | Github | - |

| Name | Paper | Link | Notes |
|---|---|---|---|
| Inst-IT Dataset | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Link | An instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images | 
| E.T. Instruct 164K | E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Link | An instruction-tuning dataset for time-sensitive video understanding | 
| MSQA | Multi-modal Situated Reasoning in 3D Scenes | Link | A large scale dataset for multi-modal situated reasoning in 3D scenes | 
| MM-Evol | MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | Link | An instruction dataset with rich diversity | 
| UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions | 
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information | 
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V | 
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination | 
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset | 
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset | 
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data | 
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V | 
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning | 
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset with word-level interleaved multi-image and text interactions, built to strengthen the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data | 
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention | 
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs | 
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs | 
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding | 
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. | 
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation | 
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding | 
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue | 
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue | 
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset | 
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset | 
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning | 
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset | 
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset | 
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets | 
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks | 
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs | 
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset | 
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset | 
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset | 
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset | 
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | Multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT | 
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset | 

| Name | Paper | Link | Notes |
|---|---|---|---|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. | 
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset | 

| Name | Paper | Link | Notes |
|---|---|---|---|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset | 
| VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT | 
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains | 

| Name | Paper | Link | Notes |
|---|---|---|---|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI | 

| Name | Paper | Link | Notes |
|---|---|---|---|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset | 
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models | 
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions | 
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions | 
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions | 
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset focused on recognizing Wikipedia visual entities from images in the wild |

