Review: OneLLM: One Framework to Align All Modalities with Language

Fathinah Asma Izzati
3 min read · Aug 18, 2024


Link to paper: https://arxiv.org/abs/2312.03700

Demonstration

Image 1

Image 1 demonstrates how OneLLM works: it answers questions about each input, regardless of the modality. The limitation is that it only aligns in one direction, such as image-to-text or audio-to-text, but not the reverse, nor across modalities (such as generating an image from audio). When I ran their demo app from the GitHub repository, the responses were very poor. The Gradio chat app (using LLaMA-2 7B) barely understood anything about the given inputs.

Architecture

This paper introduces OneLLM, a unified framework that aligns eight modalities with a language model.

Image 2: Predecessor Multimodal Architecture

This paper contributes the following:

Image 3: OneLLM Architecture
  1. Lightweight Modality Tokenizers

Each input is transformed into a 2D or 1D sequence and tokenized with a 2D/1D convolution layer.

2. Universal Encoder

Aligns all eight modalities through a single encoder, using CLIP-ViT [67] as a universal computation engine whose parameters are kept frozen during training.

3. Universal Projection Module

The UPM consists of three projection experts, where each expert is a stack of transformer layers pretrained on image-text data.

A modality router (a multi-layer perceptron) controls each expert's contribution and increases the model's capacity (a minimal sketch of components 1 to 3 follows this list).

4. Multimodal instruction dataset

5. Evaluated on 25 diverse benchmarks
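
To make the pipeline concrete, below is a minimal PyTorch-style sketch of the tokenizer → frozen universal encoder → UPM path. This is my own illustration, not the authors' code: the class names, dimensions, two-layer experts, and the small transformer standing in for CLIP-ViT are all assumptions chosen for readability.

import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    # Lightweight tokenizer: a single 2D conv turns an image-like input into
    # a sequence of patch tokens (a 1D conv would play the same role for audio/IMU).
    def __init__(self, in_ch=3, dim=768, patch=14):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                 # x: [B, C, H, W]
        return self.proj(x).flatten(2).transpose(1, 2)    # [B, N, dim]

class UniversalProjectionModule(nn.Module):
    # K projection experts (stacks of transformer layers) mixed by a modality
    # router: an MLP that outputs soft weights over the experts.
    def __init__(self, dim=768, llm_dim=4096, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
                num_layers=2)
            for _ in range(num_experts))
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_experts))
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, tokens):                            # tokens: [B, N, dim]
        weights = self.router(tokens.mean(dim=1)).softmax(dim=-1)          # [B, K]
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1) # [B, K, N, dim]
        mixed = (weights[:, :, None, None] * expert_out).sum(dim=1)        # [B, N, dim]
        return self.to_llm(mixed)                         # tokens handed to the LLM

# Frozen universal encoder (CLIP-ViT in the paper; a tiny transformer stands in here)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), num_layers=4)
for p in encoder.parameters():
    p.requires_grad = False                               # keep the encoder frozen

tokenizer, upm = ModalityTokenizer(), UniversalProjectionModule()
image = torch.randn(1, 3, 224, 224)
llm_tokens = upm(encoder(tokenizer(image)))
print(llm_tokens.shape)                                   # torch.Size([1, 256, 4096])

In this sketch only the tokenizer and the UPM are trainable; the universal encoder stays frozen, matching the paper's setup.
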

Limitations

  1. For video signals, OneLLM feeds all video frames into the encoder in parallel and performs token-wise averaging across frames to speed up training. Token concatenation might enhance the model's video understanding capability (see the sketch after this list).
  2. The universal encoder is CLIP-ViT, which was pretrained only on image-text data, so it may not be equally well suited to non-visual modalities.
  3. Results: compared to the other MLLMs the authors selected, OneLLM wins. However, a model specialized in a single modality, such as LLaVA for images, can still surpass OneLLM, as in the example below.
Example of results on an image-text benchmark

4. When I tested the demo app with random samples, the responses were poor.
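
To illustrate limitation 1, here is a small sketch (the tensor shapes are my own assumptions) contrasting the paper's token-wise averaging across frames with concatenation along the token axis:

import torch

# Per-frame encoder outputs: [num_frames, num_tokens, dim] (illustrative shapes)
frame_tokens = torch.randn(8, 256, 768)

# Token-wise averaging across frames, as described in the paper:
# fixed-length output, cheap to train, but temporal detail is blended away
avg_tokens = frame_tokens.mean(dim=0)     # [256, 768]

# Alternative: concatenate frame tokens along the sequence axis;
# per-frame information is preserved at the cost of a much longer sequence
cat_tokens = frame_tokens.flatten(0, 1)   # [2048, 768]
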

How to Run the Repo

https://github.com/csuhan/OneLLM/tree/main

Important note:

  1. Between conda activate and pip install, I added a CUDA 11.7 installation to make sure it's compatible with the torch requirement. (Ref: Link)
conda install cuda -c nvidia/label/cuda-11.7.0

Now, nvcc --version reports version 11.7.

2. If you encounter RuntimeError: FlashAttention only supports Ampere GPUs or newer, try one of the following:

pip install transformers==4.33.1 --upgrade
# or
pip install flash-attn==1.0.9
