The specific implementation code for text-to-image conversion

Hello, thanks for the great work on VILA-U.
I want to feed the first frame of a video into the autoregressive image generation process so that the generated frames stay consistent and continuous, but I can't find where the call below is actually implemented. Could you point me to it? Or do you have a better suggestion?
```python
outputs = self.llm.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    vision_tower=self.vision_tower,
    mm_projector=self.mm_projector,
    image_ids=image_ids,
    cfg=cfg,
    **generation_kwargs
)
```
> vila_u_arch.py, lines 580-588
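
To make the idea concrete, here is a minimal sketch of what I have in mind. It is not VILA-U code: `generate_with_prefix`, `next_token`, and `dummy_next_token` are hypothetical names, the token ids are made up, and the dummy function only stands in for the real model. The point is just that the first frame's discrete tokens sit in the context before any new frame is sampled, so every later token is conditioned on them.

```python
from typing import Callable, List, Sequence


def generate_with_prefix(
    next_token: Callable[[Sequence[int]], int],
    text_ids: Sequence[int],
    first_frame_ids: Sequence[int],
    tokens_per_frame: int,
    num_new_frames: int,
) -> List[List[int]]:
    """Toy autoregressive loop conditioned on text plus the first frame's tokens."""
    # The context starts with the prompt followed by the given first frame.
    context = list(text_ids) + list(first_frame_ids)
    frames: List[List[int]] = []
    for _ in range(num_new_frames):
        frame: List[int] = []
        for _ in range(tokens_per_frame):
            tok = next_token(context)  # predict one token from everything so far
            context.append(tok)        # feed it back so later tokens depend on it
            frame.append(tok)
        frames.append(frame)
    return frames


def dummy_next_token(context: Sequence[int]) -> int:
    # Hypothetical stand-in for the model: a deterministic function of the context.
    return (sum(context) + len(context)) % 16


frames = generate_with_prefix(
    dummy_next_token,
    text_ids=[1, 2, 3],
    first_frame_ids=[4, 5, 6, 7],
    tokens_per_frame=4,
    num_new_frames=2,
)
```

What I can't tell is whether `image_ids` in the call above can already play the role of `first_frame_ids`, or whether the generation loop would need to be modified.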

The specific implementation code for text-to-image conversion #26

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The specific implementation code for text-to-image conversion #26

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions