MiniGPT-4: The Future of Language Understanding With Vision AI
GPT-4 is the latest Large Language Model released by OpenAI, and its
multimodal nature sets it apart from previously introduced LLMs.
GPT-4 has shown tremendous performance on tasks such as
producing detailed and precise image descriptions, explaining unusual
visual phenomena, and building websites from handwritten text
instructions. The reason behind GPT-4’s exceptional performance is not
fully understood, but experts believe its advanced abilities come from
pairing vision with a more advanced Large Language Model, a capability
mostly absent from smaller models. That is where MiniGPT-4 comes into
the picture. MiniGPT-4 was developed by a team of Ph.D. students from
King Abdullah University of Science and Technology (KAUST), Saudi Arabia.
Training Process
The training process was done in two stages. In the first stage, the team
used roughly 5 million image-text pairs and trained the model for about 10
hours on four A100 GPUs. In the second stage, they used only 3,500
high-quality image-text pairs, generated by the model itself and then
refined with ChatGPT. This fine-tuning stage took only about seven
minutes on a single A100 GPU.
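Concretely, MiniGPT-4 keeps both the pretrained vision encoder and the language model frozen and trains only a small projection layer that maps visual features into the language model’s embedding space, which is why the second stage is so cheap. The PyTorch sketch below illustrates that idea with dummy tensors standing in for real features; the dimensions and the MSE loss are placeholders, since the actual training optimizes the frozen LLM’s language-modeling loss.

    import torch
    import torch.nn as nn

    VISION_DIM, LLM_DIM, BATCH = 1408, 4096, 8  # illustrative sizes, not the paper's exact values

    # The only trainable piece: a linear projection from visual features
    # into the (frozen) language model's embedding space.
    projection = nn.Linear(VISION_DIM, LLM_DIM)
    optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

    for step in range(10):  # stage 1 ran for hours on ~5M pairs; stage 2 for minutes on 3,500
        image_feats = torch.randn(BATCH, VISION_DIM)   # stand-in for frozen vision-encoder output
        target_embeds = torch.randn(BATCH, LLM_DIM)    # stand-in for the embeddings the LLM expects

        loss = nn.functional.mse_loss(projection(image_feats), target_embeds)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because only the projection layer receives gradients, fine-tuning can finish in minutes on a single GPU even though the underlying frozen models are very large.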
Model Abilities
Given a human prompt or question, the model can generate very detailed
descriptions of an image. The output is simple but impressive; for
example, it can describe logos or designs in detail. The dataset used for
fine-tuning the model has also been released.
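As a usage illustration, the snippet below shows how such a model might be queried with an image and a question. The loader and the generate call are placeholders, not the actual API of the MiniGPT-4 repository.

    from PIL import Image

    def describe_image(model, image_path: str, prompt: str) -> str:
        """Send an image plus a human prompt to the model and return its text answer."""
        image = Image.open(image_path).convert("RGB")
        return model.generate(image=image, prompt=prompt)  # 'generate' is a hypothetical method

    # model = load_minigpt4("minigpt4_checkpoint.pth")  # hypothetical loader
    # print(describe_image(model, "logo.png", "Describe this logo in detail."))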
Data Collection
During the initial stage of training, an enormous number of image-text
pairs, approximately 5 million, were collected. For the next phase, a
selection of 3,500 top-quality image-text pairs, generated by the model
and refined with ChatGPT, was used for fine-tuning. The image
descriptions for these pairs are stored in a JSON file, while the images
themselves are kept in a separate folder.
Image Description and Analysis (examples)