35 commits
7e0496c
initial commit
zucchini-nlp Feb 25, 2025
b32b2fa
model can run forward
zucchini-nlp Feb 26, 2025
2faa28e
add conversion script
zucchini-nlp Feb 27, 2025
daeca68
fix conversion
zucchini-nlp Feb 27, 2025
518df71
add cross attention
zucchini-nlp Feb 27, 2025
11f9ab3
add processor
zucchini-nlp Feb 27, 2025
697d6cd
add 5b conversion
zucchini-nlp Feb 28, 2025
f1b9457
at least video-only generation works
zucchini-nlp Mar 6, 2025
2e5e3d2
video2world works yay
zucchini-nlp Mar 6, 2025
16ae997
fix t5 so it's loaded in fp32. Otherwise weights are downcasted to bf1…
zucchini-nlp Mar 14, 2025
15a9deb
make rope similar to llama
zucchini-nlp Mar 14, 2025
a67568f
video-only works, verified
zucchini-nlp Mar 18, 2025
323b0b0
wait, it works? Oke, lemme checkpoint this so I dont go crazy later
zucchini-nlp Mar 19, 2025
36644eb
NO WAY, was using wrong ckpt all this time? T_T
zucchini-nlp Mar 19, 2025
5773cbc
save while it still works
zucchini-nlp Apr 1, 2025
4add6c9
update
zucchini-nlp Apr 1, 2025
f724bb2
make it encoder-decoder
zucchini-nlp Apr 1, 2025
056edc3
fixup
zucchini-nlp Apr 1, 2025
1f6f070
rename
zucchini-nlp Apr 1, 2025
2bebf11
fix some tests
zucchini-nlp Apr 2, 2025
54c6036
fixup
zucchini-nlp Apr 3, 2025
545911c
fix some tests
zucchini-nlp Apr 3, 2025
1604811
new attn API for VQ module
zucchini-nlp Apr 3, 2025
d50b2cf
docs
zucchini-nlp Apr 3, 2025
ad9f940
merge main
zucchini-nlp Apr 3, 2025
7117853
fixup
zucchini-nlp Apr 3, 2025
bd9bba5
fix up
zucchini-nlp Apr 3, 2025
c790ab4
fixup
zucchini-nlp Apr 4, 2025
43bf276
fix red ci
zucchini-nlp Apr 4, 2025
26c90d4
no compile for Cosmos
zucchini-nlp Apr 7, 2025
59d0055
Merge remote-tracking branch 'upstream/main' into cosmos
zucchini-nlp Apr 24, 2025
894331d
fix tests
zucchini-nlp Apr 24, 2025
97a267e
remove unused config attributes
zucchini-nlp Apr 24, 2025
b5cedf0
fix
zucchini-nlp Apr 24, 2025
7c7855e
Merge branch 'main' into cosmos
zucchini-nlp May 23, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -937,6 +937,8 @@
title: CLVP
- local: model_doc/colpali
title: ColPali
- local: model_doc/cosmos
title: Cosmos
- local: model_doc/data2vec
title: Data2Vec
- local: model_doc/deplot
149 changes: 149 additions & 0 deletions docs/source/en/model_doc/cosmos.md
@@ -0,0 +1,149 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Cosmos

## Overview

The Cosmos model was proposed in [Cosmos World Foundation Model Platform for Physical AI](https://arxiv.org/abs/2501.03575) by Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski.


The abstract from the paper is the following:

*Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via this https URL.*

This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/NVIDIA/Cosmos/tree/main).


## Usage examples

Cosmos can generate videos conditioned either on a video/image alone or on text together with a video/image. The conditioning video must be exactly 9 frames long, while an image is treated as a single-frame video.
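Because the conditioning clip must be exactly 9 frames, it can be handy to normalize an arbitrary clip first. A minimal sketch (the helper name `to_nine_frames` is ours, not part of the API; it works on any sequence of frames, such as the array returned by `load_video`):

```python
def to_nine_frames(frames, num_frames=9):
    """Return exactly `num_frames` frames: keep the most recent ones if the
    clip is longer, repeat the first frame if it is shorter (an image is
    just a single-frame clip)."""
    frames = list(frames)
    if len(frames) >= num_frames:
        return frames[-num_frames:]
    return [frames[0]] * (num_frames - len(frames)) + frames
```

This mirrors the `video[-9:]` slicing used in the examples below while also covering the single-image case.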

Below is an example of generating by conditioning on video only.

```python
import torch
import imageio
from transformers.image_utils import load_video
from transformers import CosmosProcessor, CosmosForConditionalGeneration

model_id = "NVIDIA/Cosmos-4B-hf"
processor = CosmosProcessor.from_pretrained(model_id)

model = CosmosForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Condition on the last 9 frames of the video
video, _ = load_video("cosmos1/models/autoregressive/assets/v1p0/input.mp4", backend="decord")
video = video[-9:]
inputs = processor(videos=video, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=7680)

# Decode the generated video tokens and save the result.
video_decoded = model.model.decode_video_tokens(out)
video_decoded = video_decoded.permute(0, 2, 1, 3, 4).float()
video_processed = processor.postprocess([video_decoded[0]], return_tensors="np")
imageio.mimsave("generated_video.mp4", video_processed["pixel_values"].squeeze(0), fps=25)
```

To condition on text input as well, simply pass the prompt to the processor; the rest is the same as video-only conditioning.

```python
import torch
import imageio
from transformers.image_utils import load_video
from transformers import CosmosProcessor, CosmosForConditionalGeneration

model_id = "NVIDIA/Cosmos-5B-hf"
processor = CosmosProcessor.from_pretrained(model_id)

model = CosmosForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Condition on the last 9 frames of the video plus a text prompt
video, _ = load_video("cosmos1/models/autoregressive/assets/v1p0/input.mp4", backend="decord")
video = video[-9:]
text = "A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions."
inputs = processor(videos=video, text=text, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=7680)

# Remove the first token, which is `BOS`, then decode the video tokens and save.
video_decoded = model.model.decode_video_tokens(out[:, 1:])
video_decoded = video_decoded.permute(0, 2, 1, 3, 4).float()
video_processed = processor.postprocess([video_decoded[0]], return_tensors="np")
imageio.mimsave("generated_video.mp4", video_processed["pixel_values"].squeeze(0), fps=25)
```

## CosmosVideoProcessor

[[autodoc]] CosmosVideoProcessor

## CosmosProcessor

[[autodoc]] CosmosProcessor

## CosmosConfig

[[autodoc]] CosmosConfig

## CosmosVQVAEConfig

[[autodoc]] CosmosVQVAEConfig

## CosmosTextConfig

[[autodoc]] CosmosTextConfig

## CosmosVQVAE

[[autodoc]] CosmosVQVAE
- forward

## CosmosTextModel

[[autodoc]] CosmosTextModel
- forward

## CosmosTextPreTrainedModel

[[autodoc]] CosmosTextPreTrainedModel
- forward

## CosmosPreTrainedModel

[[autodoc]] CosmosPreTrainedModel
- forward

## CosmosModel

[[autodoc]] CosmosModel
- forward

## CosmosForConditionalGeneration

[[autodoc]] CosmosForConditionalGeneration
- forward
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -79,6 +79,7 @@
("convbert", "ConvBertConfig"),
("convnext", "ConvNextConfig"),
("convnextv2", "ConvNextV2Config"),
("cosmos", "CosmosConfig"),
("cpmant", "CpmAntConfig"),
("csm", "CsmConfig"),
("ctrl", "CTRLConfig"),
@@ -437,6 +438,7 @@
("convbert", "ConvBERT"),
("convnext", "ConvNeXT"),
("convnextv2", "ConvNeXTV2"),
("cosmos", "Cosmos"),
("cpm", "CPM"),
("cpmant", "CPM-Ant"),
("csm", "CSM"),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -78,6 +78,7 @@
("convbert", "ConvBertModel"),
("convnext", "ConvNextModel"),
("convnextv2", "ConvNextV2Model"),
("cosmos", "CosmosModel"),
("cpmant", "CpmAntModel"),
("csm", "CsmForConditionalGeneration"),
("ctrl", "CTRLModel"),
@@ -883,6 +884,7 @@
("blip", "BlipForConditionalGeneration"),
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("cosmos", "CosmosForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
("gemma3", "Gemma3ForConditionalGeneration"),
@@ -1546,6 +1548,7 @@
("bert", "BertModel"),
("big_bird", "BigBirdModel"),
("clip_text_model", "CLIPTextModel"),
("cosmos", "CosmosTextModel"),
("data2vec-text", "Data2VecTextModel"),
("deberta", "DebertaModel"),
("deberta-v2", "DebertaV2Model"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -67,6 +67,7 @@
("clipseg", "CLIPSegProcessor"),
("clvp", "ClvpProcessor"),
("colpali", "ColPaliProcessor"),
("cosmos", "CosmosProcessor"),
("emu3", "Emu3Processor"),
("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -152,6 +152,7 @@
("cohere2", (None, "CohereTokenizerFast" if is_tokenizers_available() else None)),
("colpali", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)),
("cosmos", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
(
"cpm",
(
27 changes: 27 additions & 0 deletions src/transformers/models/cosmos/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_cosmos import *
from .modeling_cosmos import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
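The `__init__.py` above defers importing `configuration_cosmos` and `modeling_cosmos` until an attribute is actually requested. The mechanism can be sketched with a toy module class (illustrative only; the real `_LazyModule` in transformers also resolves the import structure and `__spec__`):

```python
import types

class LazyModuleSketch(types.ModuleType):
    """Toy illustration of lazy imports: each attribute has a loader callable
    that runs only on first access (simplified; not the actual _LazyModule)."""

    def __init__(self, name, attr_loaders):
        super().__init__(name)
        self._attr_loaders = attr_loaders
        self._loaded = {}

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails, i.e. the first
        # time a lazily declared attribute is requested.
        if name not in self._attr_loaders:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        if name not in self._loaded:
            self._loaded[name] = self._attr_loaders[name]()
        return self._loaded[name]

demo = LazyModuleSketch("cosmos_demo", {"CosmosConfig": lambda: type("CosmosConfig", (), {})})
assert not demo._loaded                 # nothing loaded yet
config_cls = demo.CosmosConfig          # loader runs here, on first access
assert demo.CosmosConfig is config_cls  # cached on subsequent accesses
```

Swapping the module object into `sys.modules[__name__]`, as the real code does, makes `from .cosmos import CosmosConfig` trigger this lazy path transparently.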