Large language models (LLMs), like OpenAI's GPT-4, are the extremely capable, state-of-the-art AI models that have been generating countless headlines for the past couple of years. The best of these LLMs are capable of parsing, understanding, interpreting, and generating text as well as most humans—and are able to ace many standardized tests.
But there are still plenty of things LLMs can't do by themselves, like understand different forms of inputs. For example, LLMs can't natively respond to spoken or handwritten instructions, video footage, or anything else that isn't just text. Of course, the world isn't just made up of neatly formatted text, so some AI researchers think that training large AI models to be able to understand different "modalities"—like images, videos, and audio—is going to be a big deal in AI research.
We're already seeing the first of these new large multimodal models or LMMs. Google, OpenAI, and Anthropic, the makers of Claude, are all talking about how powerful their latest AI models are across different modalities—even if all the features aren't widely available to the public just yet.
So, if large multimodal models are the next frontier of AI, let's have a look at what they are, how they work, and what they can do.
What is multimodal AI?
Large multimodal models are AI models that are capable across multiple "modalities."
In machine learning and artificial intelligence research, a modality is a given kind of data. So text is a modality, as are images, videos, audio, computer code, mathematical equations, and so on. Most current AI models can only work with a single modality or convert information from one modality to another.
For example, large language models, like GPT-4, typically just work with one modality: text. They take a text prompt as an input, do some black box AI stuff, and then return text as an output.
AI image recognition and text-to-image models both work with two modalities: text and images. AI image recognition models take an image as an input and output a text description, while text-to-image models take a text prompt and generate a corresponding image.
When an LLM appears to work with multiple modalities, it's most likely using an additional AI model to convert the other input into text. For example, before the launch of GPT-4o (a multimodal model), ChatGPT used GPT-3.5 and GPT-4 to power its text features, but it relied on Whisper to parse audio inputs and DALL·E 3 to generate images.
But that's starting to change.
Multimodal AI models go mainstream: Gemini, GPT-4o, GPT-4o mini, and Claude 3
When Google announced its Gemini series of AI models, it made a big deal about how they were "natively multimodal." Instead of having different modules tacked on to give the appearance of multimodality, they were apparently trained from the start to be able to handle text, images, audio, video, and more.
OpenAI recently released GPT-4o, a multimodal model that offers GPT-4 level performance (or better) at much faster speeds and lower costs. It's available to ChatGPT Plus and Enterprise users, and its multimodality means you can quickly create and analyze images, interpret data, and have flowing voice conversations with the AI—among other tasks.
Shortly after the release of GPT-4o, OpenAI launched GPT-4o mini—a smaller language model that's faster and cheaper than GPT-4o. As of this writing, GPT-4o mini doesn't support all the same inputs and outputs as GPT-4o—for example, video and audio—but OpenAI says it plans to roll that out in the future.
Similarly, Anthropic claims that Claude 3 has "sophisticated vision capabilities on par with other leading models." So, while large multimodal model is a fancy new term, it's basically describing the direction the major LLMs have been going.
How do large multimodal models work?
Large multimodal models are very similar to large language models in training, design, and operation. They rely on the same training and reinforcement strategies, and have the same underlying transformer architecture. This article on how ChatGPT works is a good place to start if you want a breakdown of some of these concepts.
We're in the era of rapid AI commercialization, so a lot of interesting and important information about the various AI models is no longer released publicly. The broad strokes have to be pieced together from the technical announcements, product specs, and general direction of the research. As a result, this is more of an overarching view about how these models work as a whole, rather than a detailed breakdown of how a specific LMM was developed.
In addition to an unimaginable quantity of text, LMMs are also trained on millions or billions of images (with accompanying text descriptions), video clips, audio snippets, and examples of any other modality that the AI model is designed to understand (e.g., code). Crucially, all this training happens at the same time. The underlying neural network—the algorithm that powers the whole AI model—not only learns the word "dog," but it also learns the concept of what a dog is, as well as what a dog looks and sounds like. In theory, it should be just as capable of recognizing a photo of a dog or identifying a woof in an audio clip as it is at processing the word "dog."
Of course, this pre-training is just the first step in creating a functional AI model. It's likely to have incorporated some pretty unhealthy stereotypes and toxic ideas—mainlining the entire internet isn't good for human brains, let alone artificial networks based on them. To get a large multimodal model that behaves as expected and, importantly, is actually useful, the results are fine-tuned using techniques like reinforcement learning with human feedback (RLHF), supervisory AI models, and "red teaming" (to try to break it).
Once all that is done, you should have a working large multimodal model that's similar to a large language model, but capable of handling other modalities, too.
What can a large multimodal model do?
If you want to see a pretty phenomenal demo of where LMMs are headed, check out OpenAI's GPT-4o demo. Google also shared an early vision of what future multimodal AI agents would be like with Google Astra.
To summarize, with some of the LMMs available now, you can do things like:
Upload an image and get a description of what's going on, as well as use it as part of a prompt to generate text or images.
Upload an image and ask questions about it, as well as follow-up questions about specific elements of the image.
Translate the text in an image of, say, a menu, to a different language, and then use it as part of a text prompt.
Upload charts and graphs and ask complicated follow-up questions about what they show.
Upload a design mockup and get the HTML and CSS code necessary to create it.
Voice chat with the AI in a vaguely natural cadence.
And as LMMs get more widely available, what they're capable of will likely expand. A multimodal medical chatbot, for example, would be able to better diagnose different rashes and skin discolorations.
Multimodal AI models available now
The easiest way to see what a multimodal model feels like is with GPT-4o, using ChatGPT. But Claude and Gemini are also good demos, just with fewer flashy science-fiction-vibe features.
Either way, over the next year or two, we're likely to see significantly more multimodal AI tools capable of working with text, images, video footage, audio, code, and other modalities we probably haven't even considered.
Automate your multimodal AI models
In addition to interacting with some multimodal models via chatbot right now, you can also use them in your everyday workflows. With Zapier's Google Vertex AI and Google AI Studio integrations, you can access Gemini from all the apps you use at work. And with the ChatGPT integration, you can access GPT-4o and send its outputs wherever you need them.
Learn more about how to automate Gemini and how to automate ChatGPT, or get started with one of these pre-built workflows.
Send prompts to Google Vertex AI from Google Sheets and save the responses
Label incoming emails automatically with Google AI Studio (Gemini)
Send prompts in Google AI Studio (Gemini) for new or updated rows in Google Sheets
Create email copy with ChatGPT from new Gmail emails and save as drafts in Gmail
Start a conversation with ChatGPT when a prompt is posted in a particular Slack channel
Zapier is the leader in workflow automation—integrating with thousands of apps from partners like Google, Salesforce, and Microsoft. Use interfaces, data tables, and logic to build secure, automated systems for your business-critical workflows across your organization's technology stack. Learn more.
Related reading:
This article was originally published in March 2024. The most recent update was in July 2024 with contributions from Jessica Lau.