# LLaVA - Large Multimodal Model

Large Language Models (LLMs) allow us to generate text, but they only take text as an input. Large Multimodal Models (LMMs) can take both text and images as input and generate text based on both. So, you can chat with your model about an image.

OpenAI has released their GPT-4V(ision)[1] model, which integrates nicely with the ChatGPT interface. However, open-source models are on the way, and LLaVA is one of them.

> In this part, we will be using a Jupyter Notebook to run the code. If you prefer to follow along, you can find the notebook on GitHub: GitHub Repository

## What is LLaVA?

LLaVA[2], a Large Multimodal Model (LMM), allows you to have image-based conversations. Similar to GPT-4V but without the price tag, LLaVA is free and open source.

> LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking spirits of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.

So, LLaVA combines a vision encoder and an open-source LLM (Vicuna in this case).

## LLaVA 1.5

The LLaVA-1.5[3] model offers a solid improvement on all benchmarks compared to the original model. It is trained on 1.2M data points, adds an academic-task-oriented VQA dataset, and trains in ~1 day on a single 8-A100 node.

We're going to use the 13B model checkpoint and load it with the `llava-torch` library in 4-bit format. How good is it? Let's find out.

## Setup

Setting up the LLaVA library requires installing the following dependencies:

```bash
pip install -Uqqq pip --progress-bar off
pip install -qqq torch==2.1 --progress-bar off
pip install -qqq transformers==4.34.1 --progress-bar off
pip install -qqq accelerate==0.23.0 --progress-bar off
pip install -qqq bitsandbytes==0.41.1 --progress-bar off
pip install -qqq llava-torch==1.1.1 --progress-bar off
```

The last package, `llava-torch`, is the LLaVA library.
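The 4-bit checkpoint we'll load below relies on `bitsandbytes`, which requires a CUDA GPU. This check isn't part of the original notebook; it's just a minimal sanity-check sketch before loading the model:

```python
import torch

# The 4-bit checkpoint is loaded via bitsandbytes and needs a CUDA GPU,
# so confirm one is visible and check how much memory it has.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for 4-bit loading"
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB of VRAM")
```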
Let's add the necessary imports:

```python
import textwrap
from io import BytesIO

import requests
import torch
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import SeparatorStyle, conv_templates
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    process_images,
    tokenizer_image_token,
)
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from PIL import Image

disable_torch_init()
```

## Data

To reproduce the results, we need to download the following images:

```bash
!gdown AnpSrAod-apd1@D305XXQhjMa2ja7FEH
!gdown 1Qnutc8S7F6jMNERKIZBgiAePynDC}3Ti
!gdown 1XH7QgiuN}7Kjapaetjy#x"VWSdQaqsaH
!gdown 1n9v8EVZ16sYcULCGUHBPFULxFxam190U
!gdown 1x7XtPRG-IbSxyCO-ZT0_P7JirwRFY-3N
```

## Download Model

We'll use the 13B model checkpoint and load it with the `llava-torch` library in 4-bit format. Let's start by taking its name:

```python
MODEL = "4bit/llava-v1.5-13b-3GB"
model_name = get_model_name_from_path(MODEL)
model_name
```

```
'llava-v1.5-13b-3GB'
```

To load the model, tokenizer, and image processor, we can use the `load_pretrained_model` helper function:

```python
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=MODEL, model_base=None, model_name=model_name, load_4bit=True
)
```

## Image Preprocessing and Prompt

We need a way to load the image and process it for the model. Let's create a helper function for loading the image using PIL:

```python
def load_image(image_file):
    # Download the image if given a URL, otherwise read it from disk
    if image_file.startswith("http://") or image_file.startswith("https://"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image
```

The function will load a local file or download it from a URL (via the `requests` library). Next, we'll create a function that will process the image for the model:

```python
def process_image(image):
    # Pad the image to a square aspect ratio, then preprocess it for the vision encoder
    args = {"image_aspect_ratio": "pad"}
    image_tensor = process_images([image], image_processor, args)
    return image_tensor.to(model.device, dtype=torch.float16)
```

Let's try it out:

```python
image = load_image("bike-girl.jpeg")
processed_image = process_image(image)
type(processed_image), processed_image.shape
```

```
(torch.Tensor, torch.Size([1, 3, 336, 336]))
```

The functions load the image and process it for the model by converting it into a tensor. Next, we'll create a function that builds a prompt using the correct template:

```python
CONV_MODE = "llava_v0"


def create_prompt(prompt: str):
    conv = conv_templates[CONV_MODE].copy()
    roles = conv.roles
    # Prepend the image placeholder token to the user prompt
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt
    conv.append_message(roles[0], prompt)
    conv.append_message(roles[1], None)
    return conv.get_prompt(), conv


prompt, _ = create_prompt("Describe the image")
print(prompt)
```

The function takes care of any special tokens and adds roles to the prompt. Here's the final template:

```
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>
Describe the image###Assistant:
```

We have a prompt and a way to process the image.
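Before wrapping everything in a single helper, it's worth seeing what `tokenizer_image_token` does with the prompt we just built: it tokenizes the text and replaces the `<image>` placeholder with the special `IMAGE_TOKEN_INDEX` id, which the model later swaps for the encoded image features. A minimal sketch (not in the original notebook), reusing the tokenizer and `create_prompt` from above:

```python
prompt, _ = create_prompt("Describe the image")

# Tokenize the prompt; the <image> placeholder becomes IMAGE_TOKEN_INDEX (-200)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
)

print(input_ids.shape)                                 # 1-D tensor of token ids
print((input_ids == IMAGE_TOKEN_INDEX).sum().item())   # exactly one image placeholder
```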
Let's create a function that will ask the model a question about the image:

```python
def ask_image(image: Image, prompt: str):
    image_tensor = process_image(image)
    prompt, conv = create_prompt(prompt)
    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .to(model.device)
    )

    # Stop generating once the conversation separator is produced
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria(
        keywords=[stop_str], tokenizer=tokenizer, input_ids=input_ids
    )
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.01,
            max_new_tokens=512,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )
    # Decode only the newly generated tokens (skip the prompt)
    return tokenizer.decode(
        output_ids[0, input_ids.shape[1] :], skip_special_tokens=True
    ).strip()
```

The function takes care of the following: creating the prompt, tokenizing it, generating the output, and decoding it. The interface is very similar to other generative models from the HuggingFace developers.

## Q&A Over Image

Let's load our first image:

*Girl on a bike*

We can start with a simple question:

```python
result = ask_image(image, "Describe the image")
print(textwrap.fill(result, width=110))
```

```
The image features a woman sitting on a motorcycle, which is parked on a brick driveway in front of a house.
She is wearing a black leather outfit, which includes a leather jacket and leggings. The motorcycle is
positioned prominently in the scene, with the woman sitting comfortably on it. The house in the background
adds a sense of context to the scene, suggesting that the woman may be preparing to ride the motorcycle or
has just arrived at her destination.
```

The description is quite detailed and good overall. Let's ask something more specific:

```python
result = ask_image(image, "Does the woman wear a helmet?")
print(textwrap.fill(result, width=110))
```

```
Yes, the woman is wearing a helmet while sitting on the motorcycle.
```

The model has failed to answer the question correctly. Let's ask something similar, but try to make the model reason about the image:

```python
result = ask_image(
    image,
    "Take a look at the woman's head. What is the color of her skin? Does she wear a helmet?",
)
print(textwrap.fill(result, width=110))
```

```
The woman's skin color is white, and she is not wearing a helmet.
```

This time around the model has answered correctly. Asking it to focus on the woman's head and the color of her skin helped us get a correct response.
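Since every experiment below follows the same pattern — ask a question, then pretty-print the wrapped answer — a tiny convenience wrapper (not part of the original notebook) keeps the cells shorter. The remaining examples stick to calling `ask_image` directly, as the post does:

```python
def ask_and_print(image: Image, prompt: str, width: int = 110):
    # Ask the model about the image and pretty-print the wrapped answer
    result = ask_image(image, prompt)
    print(textwrap.fill(result, width=width))
    return result
```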
## OCR & Document Understanding

Let's try something more challenging. Can the model read and understand documents? We'll use the following image from the Bitcoin whitepaper:

*First page of the Bitcoin paper*

```python
%%time
result = ask_image(image, "What is the title of the paper?")
print(textwrap.fill(result, width=110))
```

```
Bitcoin: A Peer-to-Peer Electronic Cash System
```

Great, the model has correctly extracted the title of the paper. Let's see if it can extract the abstract:

```python
%%time
result = ask_image(image, "Extract the text from the abstract")
print(textwrap.fill(result, width=110))
```

```
Bitcoin: A Peer-to-Peer Electronic Cash System
```

It got that wrong. It extracted the title again, but nothing from the abstract. Again, we can try to make the model reason about the image by asking for a summary of the abstract:

```python
%%time
result = ask_image(image, "Summarize the abstract of the paper in 2 sentences.")
print(textwrap.fill(result, width=110))
```

```
The paper discusses the concept of a peer-to-peer electronic cash system, focusing on the Bitcoin system. It
highlights the advantages of this system, such as its decentralized nature, security, and potential for
financial inclusion. The paper also addresses some of the challenges and limitations of the Bitcoin system,
such as scalability and regulatory issues.
```

Much better! LLaVA has correctly extracted the abstract and summarized it in 2 sentences.
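When free-form extraction fails, as it did for the abstract above, one thing worth trying (not covered in the original post) is to crop the region of interest before asking, so the 336x336 model input isn't dominated by the rest of the page. The crop coordinates below are hypothetical placeholders; you'd adjust them to the abstract's actual bounding box:

```python
# Hypothetical crop of the abstract region (left, upper, right, lower in pixels).
# PIL's Image.crop returns a new image that we can pass straight to ask_image.
abstract_region = image.crop((150, 380, 1100, 700))
result = ask_image(abstract_region, "Extract the text from this paragraph")
print(textwrap.fill(result, width=110))
```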
## Price Chart

We can also ask the model to reason about charts. Let's try with the following Bitcoin price chart:

*Bitcoin price chart*

```python
result = ask_image(
    image,
    "This is a chart of Bitcoin price. What is the current price according to the chart?",
)
print(textwrap.fill(result, width=110))
```

```
The current price of Bitcoin according to the chart is $23,000.
```

It got that wrong. It wasn't able to read the correct value from the chart ($28.9k).

## Captcha

Another interesting use case is to ask the model to solve a captcha. Let's try with something simple:

*Captcha*

```python
%%time
result = ask_image(image, "Extract the text from the image")
print(textwrap.fill(result, width=110))
```

Total failure, it didn't even get the number of characters right.

## Meme

Our final experiment will be to ask the model to reason about a meme. Let's try with the following one:

*Meme*

```python
%%time
result = ask_image(image, "Is this funny and why?")
print(textwrap.fill(result, width=110))
```

```
Yes, this image is funny because it humorously represents the process of learning by showing a person's
brain going through different stages of learning. The image features a series of four pictures of a brain,
each representing a different stage of learning, such as from university, online courses, YouTube, and
articles. The visual representation of the brain's journey through these stages is exaggerated and comical,
making it a light-hearted and entertaining image.
```

The model has correctly identified the meme as funny but has provided a very generic answer. It didn't note the different sources of education and the funny side of their ranking. Let's specifically ask for the ranking:

```python
%%time
result = ask_image(
    image,
    "Order all learning resources sorted by usefulness in a list, according to the image",
)
print(textwrap.fill(result, width=110))
```

```
1. Online Courses
2. YouTube
3. University
4. Articles
5. Memes
```

This one is interesting. I would say that the model didn't get the ranking right. It has put memes at the bottom, but according to the image, they are the best. The model has correctly identified the different sources of education (the OCR did work), but it didn't get the ranking right. Keep in mind that this particular meme might've been included in the training set.

## Conclusion

While the LLaVA model can be used to understand images, it is not perfect. It can be used to extract text from images, summarize, and describe, but it struggles with more complex reasoning. However, it is a great start, and I'm looking forward to seeing more open-source LMMs, possibly beating GPT-4V (and other commercial) models.

## References

1. GPT-4V(ision) system card
2. Visual Instruction Tuning
3. Improved Baselines with Visual Instruction Tuning

© 2020-2024 MLExpert™ by Venelin Valkov. All Rights Reserved.
