Visual Instruction Tuning (LLaVA)

🌃 VLM

Visual Instruction Tuning (LLaVA)

MINAIR 2025. 8. 19. 23:12

Paper

Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use l

arxiv.org

Introduction

배경

언어가 이미지를 설명하는 것에서 더 나아가, user instruction을 따를 수 있는 inter-activity와 adaptability가 있어야 함
NLP 분야에서 LLM instruction으로 튜닝하는 것이 모델의 zero-shot capability 향상에 기여 -> 동일한 concept을 VLM에도 적용하고자 함

contribution

multimodal instruction-following data: vision-language instruction data를 수집
large multimodal models: 수집한 데이터를 이용해 CLIP (vision encoder) + Vicuna (text decoder)를 fine-tuning해 새로운 LVLM을 만듦
multimodal instruction-following benchmark: LLaVA-Bench를 구축
open-source ⭐️

GPT-assisted Visual Instruction Data Generation

처음 시도

widely existing image-caption pair 데이터를 이용해 ChatGPT/GPT-4로부터 multimodal instruction-following data 수집
image Xv, caption Xc에 대해 GPT-4에게 Xv에 대한 Xc의 생성을 지시하는 instruction Xq를 생성하도록 함
instruction-following data format: Human: Xq Xv <STOP> Assistant: Xc <STOP>
그러나, 이렇게 생성된 data는 간단하지만 이렇게 되면 단순히 caption을 생성하는 instruction으로 튜닝되는 것이기 때문에, 데이터의 diveristy & in-depth reasoning이 부족함

이 문제를 해결하기 위해 text-only GPT를 이용함

1) caption Xc, 2) image에서 object의 bounding boxes 좌표값을 입력으로 주어줌 -> image를 LLM-recognizable sequence로 encoding할 수 있음 (GPT를 prompt할 때, image는 주어지지 않고 오직 caption과 boxes만 주어짐)

COCO images를 이용해 3가지 타입의 instruction-following data를 구축 (몇 개는 사람이 미리 만들어서 seed example로 사용해 GPT의 in-context 능력을 이용해 data를 생성)
- conversation
- detailed description
- complex reasoning

Visual Instruction Tuning

CLIP (vision encoder) + Vicuna (language model which has best instruction following capabilities among other models)

image를 text와 같은 차원으로 투영시킴 (trainable W 이용)

auto-regressive training objective를 이용해 모델이 튜닝됨. 아래는 튜닝될 때 모델의 입력 시퀀스.

훈련 단계

step 1) pre-training for feature alignment: 먼저, visual information이 text space로 align될 수 있도록 하기 위해서 훈련 과정이 필요함. Xv에 대해 Xc를 생성할 수 있는 question Xq를 생성한 뒤, 처음 시도처럼 <Xq, Xv>가 입력으로 주어지면 Xc를 gt 삼아 모델이 응답을 생성할 수 있도록 훈련됨. 이때, LLM과 vision encoder는 frozen하고 projection parameters W만 훈련시켜 image features를 pre-trained LLM word embedding space로 잘 투영될 수 있도록 함. (e.g., Xq: "Describe the image precisely")
step 2) fine-tuning end-to-end: LLM과 projection parameters W만 훈련시킴.
- multimodal chatbot: 위에서 구축한 3가지 타입의 데이터를 이용해 튜닝. conversation은 multi-turn, detailed description과 complex reasoning은 single-turn으로 튜닝함
- scienceQA: context (caption, bounding boxes), Xq (generated by GPT)를 입력으로 했을 때 Xa (generated by GPT)를 gt로 삼아 모델이 응답을 생성할 수 있도록 튜닝함 -> SoTA 달성!!

Experiments

multimodal chatabot

llava는 다른 모델 (GPT-4, BLIP-2, OpenFlamingo)와 비교했을 때, 단순히 image를 묘사하는 것뿐 아니라 chatbot으로써 우수한 성능을 보임
또한, llava는 out-of-trained domain인 image에 대해서도 scene을 잘 이해하고 있는 모습을 보임

LLaVA-Bench (COCO)

30개의 각 COCO image에 대해 3가지 타입의 question-answer (conversation, detailed description, complex reasoning)을 생성해 총 90개의 데이터를 생성함
question, image를 모델의 prompt로 주고, answer를 생성하도록 해서 평가함
이 벤치마크를 통해 모델의 instruction-following behavior를 평가할 수 있음

LLaVA-Bench (In-the-wild)

LLaVA의 일반화 능력을 평가하기 위해 여러 도메인에서 24개의 image를 수집하고 60개의 데이터를 만듦. instruction-tuning의 효과 덕분에 벤치마크에 대해 LLaVA는 BLIP-2, OpenFlamingo보다 훨씬 좋은 성능을 보임.
그러나, 문제가 너무 어렵다는 한계가 있음 (wide general knowledge coverage 요구, fine-grained semantic understanding 능력 요구)

'🌃 VLM' 카테고리의 다른 글

LONGHALQA: LONG-CONTEXT HALLUCINATIONEVALUATION FOR MULTIMODAL LARGE LANGUAGEMODELS (Preprint) (0)	2025.08.22
Unified Hallucination Detection for Multimodal Large Language Models (ACL 2024 main) (1)	2025.08.21
Evaluating Object Hallucination in LVLMs (POPE) (0)	2025.08.19
A Survey of State of the Art LVLMs: Alignment, Benchmark, Evaluations and Challenges (1)	2025.08.12
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in LVLMs (1)	2025.08.12

현재글Visual Instruction Tuning (LLaVA)

민공기

Today :
Yesterday :

일	월	화	수	목	금	토
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30

민공기

Visual Instruction Tuning (LLaVA)

Paper

Introduction

GPT-assisted Visual Instruction Data Generation

Visual Instruction Tuning

Experiments

'🌃 VLM' 카테고리의 다른 글

'🌃 VLM'의 다른글

티스토리툴바

Visual Instruction Tuning (LLaVA)

Paper

Introduction

GPT-assisted Visual Instruction Data Generation

Visual Instruction Tuning

Experiments

'🌃 VLM' 카테고리의 다른 글

'🌃 VLM'의 다른글

관련글

티스토리툴바