Review: OneLLM: One Framework to Align All Modalities with Language
Link to paper: https://arxiv.org/abs/2312.03700
Demonstration
Image 1 demonstrates how OneLLM works: it answers questions about each input, regardless of modality. The limitation is that it aligns in only one direction, such as image-to-text or audio-to-text, but not the reverse, and not across modalities (e.g., generating an image from audio). When I ran their demo app from the GitHub repository, the responses were very poor: the Gradio chat app (using LLaMA-2 7B) barely understood anything about the given inputs.
Architecture
This paper introduces OneLLM, a unified framework that aligns eight modalities to a language model.
This paper contributes the following:
1. Lightweight Modality Tokenizers
Transform the input into a 2D or 1D sequence, tokenized using a 2D/1D convolution layer.
2. Universal Encoder
Aligns all eight modalities with a single encoder: a pretrained CLIP-ViT [67] used as a universal computation engine, with its parameters kept frozen during training.
3. Universal Projection Module (UPM)
The UPM consists of three projection experts, where each expert is a stack of transformer layers pretrained on image-text data. A modality router (a multi-layer perceptron) controls each expert's contribution and increases model capacity; see the sketch after this list.
4. Multimodal instruction dataset
5. Evaluated on 25 diverse benchmarks
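To make the UPM design concrete, here is a minimal PyTorch sketch of the router-plus-experts idea. The class name, dimensions, layer counts, and soft token-wise routing are illustrative assumptions on my part, not the paper's exact implementation:

import torch
import torch.nn as nn

class UniversalProjectionModule(nn.Module):
    # Minimal sketch: several projection experts mixed by a soft modality router.
    def __init__(self, dim=768, num_experts=3, layers_per_expert=2, num_heads=8):
        super().__init__()
        # Each expert is a small stack of transformer layers.
        self.experts = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
                num_layers=layers_per_expert,
            )
            for _ in range(num_experts)
        ])
        # Modality router: an MLP that assigns per-token expert weights.
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_experts)
        )

    def forward(self, tokens):  # tokens: (batch, seq, dim)
        weights = self.router(tokens).softmax(dim=-1)  # (batch, seq, num_experts)
        expert_out = torch.stack(
            [expert(tokens) for expert in self.experts], dim=-1
        )  # (batch, seq, dim, num_experts)
        # Weighted sum of expert outputs, token by token.
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)

The routing is learned end to end, so each modality can lean on whichever experts suit it best; the soft weights above mimic that behavior at token granularity.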
Limitations
- For video signals, all frames are fed into the encoder in parallel and token-wise averaging is performed across frames to speed up training. Token concatenation might enhance the model's video understanding capability (see the sketch after this list).
- It uses CLIP-ViT, an image-text model, as the universal encoder for all eight modalities.
- Results: OneLLM wins against the MLLMs the authors chose for comparison. However, there are likely unimodal models that surpass OneLLM, such as LLaVA in the following example.
- When I tested the demo app with random samples, the results were not good at all.
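To illustrate the averaging-versus-concatenation trade-off from the first limitation, here is a tiny PyTorch sketch with hypothetical tensor shapes:

import torch

# Hypothetical per-frame tokens from the encoder: (frames, tokens, dim)
frame_tokens = torch.randn(8, 256, 768)

# The paper's approach: token-wise averaging across frames -> (256, 768)
averaged = frame_tokens.mean(dim=0)

# Alternative: concatenating frame tokens preserves temporal information
# -> (8 * 256, 768), at the cost of a much longer sequence for the LLM
concatenated = frame_tokens.flatten(0, 1)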
How to Run the Repo
Important notes:
1. Between conda activate and pip install, I added a CUDA 11.7 installation to make sure it is compatible with the torch requirement. (Ref: Link)
conda install cuda -c nvidia/label/cuda-11.7.0
Now, nvcc --version reports version 11.7.
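As an extra sanity check (my own addition, assuming torch is already installed in the environment), you can confirm from Python that the torch build matches the CUDA 11.7 toolkit:

import torch

print(torch.__version__)         # installed torch build
print(torch.version.cuda)        # CUDA version torch was compiled against, e.g. 11.7
print(torch.cuda.is_available()) # True if the GPU is usable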
2. If you encounter RuntimeError: FlashAttention only supports Ampere GPUs or newer, apply one of the following fixes:
pip install --upgrade transformers==4.33.1
# or
pip install flash-attn==1.0.9
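Before picking a fix, it may help to confirm the GPU generation. This check is my own addition, based on the fact that FlashAttention 2 requires Ampere GPUs (compute capability 8.0) or newer, which is what triggers the error above on older cards:

import torch

# Compute capability as (major, minor); Ampere GPUs report major >= 8
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if major < 8:
    print("Pre-Ampere GPU: pin transformers or downgrade to flash-attn 1.x")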