Commit

Update docs
haotian-liu committed Jul 29, 2023
1 parent 3337088 commit eae9369
Showing 7 changed files with 207 additions and 324 deletions.
301 changes: 20 additions & 281 deletions README.md

Large diffs are not rendered by default.

129 changes: 129 additions & 0 deletions docs/ScienceQA.md
@@ -0,0 +1,129 @@
### ScienceQA

#### Prepare Data
1. Please see ScienceQA [repo](https://github.com/lupantech/ScienceQA) for setting up the dataset.
2. Generate ScienceQA dataset for LLaVA conversation-style format.

```Shell
python scripts/convert_sqa_to_llava.py \
convert_to_llava \
--base-dir /path/to/ScienceQA/data/scienceqa \
--split {train,val,minival,test,minitest}
```

#### Training
**NOTE**: Because the ScienceQA experiments were done earlier, the current checkpoints are trained *without* the `<im_start>` and `<im_end>` tokens. We provide the training scripts for these checkpoints below.

<details>
<summary>1. Pretraining</summary>

```Shell
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path ./checkpoints/llama-vicuna-13b \
--data_path /path/to/cc3m_595k.json \
--image_folder /path/to/cc3m_595k \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--bf16 True \
--output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
```
</details>

<details>
<summary>2. Finetuning</summary>

You may download our pretrained `llava-13b-v0-pretrain-no_im_start_end_token.bin` [here](https://huggingface.co/liuhaotian/LLaVA-13b-pretrain-projector-v0/blob/main/LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin).

```Shell
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /path/to/llama-vicuna-13b \
--data_path /path/to/scienceqa/llava_train_QCM-LEPA.json \
--image_folder /path/to/scienceqa/images/train \
--vision_tower openai/clip-vit-large-patch14 \
--pretrain_mm_mlp_adapter ./checkpoints/llava-13b-pretrain-no_im_start_end_token/mm_projector.bin \
--mm_vision_select_layer -2 \
--bf16 True \
--output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token-finetune_scienceqa \
--num_train_epochs 12 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
```
</details>

#### Evaluation

1. Download our pretrained LLaVA-13B (delta) weights for the ScienceQA dataset [here](https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0-science_qa), then convert the delta weights to actual weights.

```Shell
python -m llava.model.apply_delta \
--base /path/to/llama-13b \
--target /path/to/LLaVA-13b-v0-science_qa \
--delta liuhaotian/LLaVA-13b-delta-v0-science_qa
```

2. [Option 1] Multiple-GPU inference
You may run evaluation on multiple GPUs and then concatenate the generated jsonl files. Please refer to our scripts for [batch evaluation](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_batch.sh) and [results gathering](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_gather.sh); a minimal sketch of the same workflow is shown below.
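
The following is only a sketch of that workflow, not a verbatim copy of the linked scripts; it assumes `llava.eval.model_vqa_science` accepts the `--num-chunks`/`--chunk-idx` sharding arguments used by the batch script, and all paths are placeholders:

```Shell
CHUNKS=8  # number of GPUs / shards
for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=$IDX python -m llava.eval.model_vqa_science \
        --model-path /path/to/LLaVA-13b-v0-science_qa \
        --question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
        --image-folder /path/to/ScienceQA/data/scienceqa/images/test \
        --answers-file vqa/results/ScienceQA/test_llava-13b-chunk${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --answer-prompter \
        --conv-mode llava_v0 &
done
wait

# Concatenate the per-chunk predictions into a single file for scoring.
cat vqa/results/ScienceQA/test_llava-13b-chunk*.jsonl \
    > vqa/results/ScienceQA/test_llava-13b.jsonl
```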

3. [Option 2] Single-GPU inference

(a) Generate LLaVA responses on ScienceQA dataset

```Shell
python -m llava.eval.model_vqa_science \
--model-path /path/to/LLaVA-13b-v0-science_qa \
--question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
--image-folder /path/to/ScienceQA/data/scienceqa/images/test \
--answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \
--answer-prompter \
--conv-mode llava_v0
```

(b) Evaluate the generated responses

```Shell
python llava/eval/eval_science_qa.py \
--base-dir /path/to/ScienceQA/data/scienceqa \
--result-file vqa/results/ScienceQA/test_llava-13b.jsonl \
--output-file vqa/results/ScienceQA/test_llava-13b_output.json \
--output-result vqa/results/ScienceQA/test_llava-13b_result.json
```

For reference, we attach our prediction file [`test_llava-13b_result.json`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/table/results/test_sqa_llava_13b_v0.json) for comparison when reproducing our results, as well as for further detailed analysis; a quick way to compare files is sketched below.
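
As a quick, hypothetical comparison of your run against the reference predictions (assuming `jq` is installed and the reference file has been downloaded locally; both paths are placeholders):

```Shell
# Key-sorted diff between your result file and the downloaded reference file.
diff <(jq -S . vqa/results/ScienceQA/test_llava-13b_result.json) \
     <(jq -S . /path/to/test_sqa_llava_13b_v0.json)
```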
Binary file added images/demo_cli.gif
6 changes: 6 additions & 0 deletions llava/model/builder.py
@@ -29,6 +29,12 @@ def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, l
         kwargs['load_in_8bit'] = True
     elif load_4bit:
         kwargs['load_in_4bit'] = True
+        kwargs['quantization_config'] = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_compute_dtype=torch.float16,
+            bnb_4bit_use_double_quant=True,
+            bnb_4bit_quant_type='nf4'
+        )
     else:
         kwargs['torch_dtype'] = torch.float16

9 changes: 6 additions & 3 deletions llava/serve/cli.py
@@ -29,9 +29,11 @@ def main(args):
     disable_torch_init()

     model_name = get_model_name_from_path(args.model_path)
-    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name)
+    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit)

-    if "v1" in model_name.lower():
+    if 'llama-2' in model_name.lower():
+        conv_mode = "llava_llama_2"
+    elif "v1" in model_name.lower():
         conv_mode = "llava_v1"
     elif "mpt" in model_name.lower():
         conv_mode = "mpt"
@@ -107,10 +109,11 @@ def main(args):
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image-file", type=str, required=True)
parser.add_argument("--num-gpus", type=int, default=1)
parser.add_argument("--device", type=str, choices=["cuda", "cpu"], default="cuda")
parser.add_argument("--conv-mode", type=str, default=None)
parser.add_argument("--temperature", type=float, default=0.2)
parser.add_argument("--max-new-tokens", type=int, default=512)
parser.add_argument("--load-8bit", action="store_true")
parser.add_argument("--load-4bit", action="store_true")
parser.add_argument("--debug", action="store_true")
args = parser.parse_args()
main(args)
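
Taken together, the `builder.py` and `cli.py` changes above enable quantized inference from the command line. A minimal usage sketch (the checkpoint path and image file are placeholders):

```Shell
# Load the model in 4-bit NF4 for chat; use --load-8bit for 8-bit loading instead.
python -m llava.serve.cli \
    --model-path /path/to/llava-checkpoint \
    --image-file /path/to/example.jpg \
    --load-4bit
```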
40 changes: 0 additions & 40 deletions scripts/extract_mm_projector.py

This file was deleted.

46 changes: 46 additions & 0 deletions scripts/finetune_full_schedule.sh
@@ -0,0 +1,46 @@
#!/bin/bash

# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################
# PROMPT_VERSION=v1
# MODEL_VERSION="vicuna-v1-3-7b"
################## VICUNA ##################

################## LLaMA-2 ##################
# PROMPT_VERSION="llava_llama_2"
# MODEL_VERSION="llama-2-7b-chat"
################## LLaMA-2 ##################

deepspeed llava/train/train_mem.py \
--deepspeed /path/to/deepspeed.json \
--model_name_or_path ./checkpoints/$MODEL_VERSION \
--version $PROMPT_VERSION \
--data_path ./playground/data/llava_instruct_158k.json \
--image_folder /path/to/coco/train2017 \
--vision_tower openai/clip-vit-large-patch14 \
--pretrain_mm_mlp_adapter ./checkpoints/llava-$MODEL_VERSION-pretrain/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/llava-$MODEL_VERSION-finetune \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
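
The script above points `--deepspeed` at `/path/to/deepspeed.json`, which is not part of this commit. A hypothetical minimal ZeRO-2 config that defers batch size and precision settings to the HuggingFace Trainer is sketched below; if the repository ships its own DeepSpeed configs, prefer those:

```Shell
# Write a minimal ZeRO-2 DeepSpeed config ("auto" values are filled in by the HF Trainer).
cat > deepspeed_zero2.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
```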
