ChatGPT has awed the world once again. With its latest upgrade last month, ChatGPT extended its capabilities beyond text to images and voice. With these new capabilities, you can get your kids to do their homework without losing your temper, or help them turn their idea of a “super-duper sunflower hedgehog” into an expressive graphic in a flash. ChatGPT’s new features not only mark a leap forward for industrial applications of multimodal AI systems, but also fuel a new wave of discussion about the future of multimodal models.

From Multimodal Systems to the Birth of Large Multimodal Models (LMMs)

In generative AI, a multimodal system, or multimodality, refers to a model’s capability to produce a variety of outputs, including text, images, audio, video and even other modalities, based on the input it receives. Trained on data in these formats, such models learn the underlying patterns and can generate similar new data, thereby enriching AI applications.

Not all multimodal systems are Large Multimodal Models (LMMs). For example, text-to-image models like Midjourney and Stable Diffusion are multimodal but not LMMs because they don’t have a large language model component.

In other words, LMMs are built by incorporating additional modalities into Large Language Models (LLMs), as OpenAI did with its newly launched DALL·E 3. While an LMM’s performance depends heavily on the performance of its base LLM, each added modality also strengthens the base LLM in turn.
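To make that idea concrete, here is a minimal, hypothetical sketch of one common way to graft a visual modality onto an LLM: features from a separate vision encoder are linearly projected into the LLM’s embedding space and prepended to the text tokens. The class name, dimensions, and approach below are illustrative assumptions, not the architecture of any particular product.

```python
import torch
import torch.nn as nn

class ImageToLLMAdapter(nn.Module):
    """Projects vision-encoder features into an LLM's embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A learnable projection maps image features to the LLM's token width.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features:  (batch, num_image_tokens, vision_dim)
        # text_embeddings: (batch, seq_len, llm_dim)
        image_tokens = self.proj(image_features)
        # Prepend the projected image "tokens" to the text sequence,
        # so the LLM attends over both modalities jointly.
        return torch.cat([image_tokens, text_embeddings], dim=1)

# Shapes only; a real system would pair a pretrained vision encoder
# (e.g., a CLIP-style model) with a pretrained LLM.
adapter = ImageToLLMAdapter()
img = torch.randn(1, 32, 1024)   # 32 image patch features
txt = torch.randn(1, 16, 4096)   # 16 embedded text tokens
fused = adapter(img, txt)        # (1, 48, 4096), fed into the LLM
```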

Compared with text-only LLMs, LMMs are closer to humans’ natural intelligence. We perceive the world through a variety of modalities, especially vision. Using an image as a prompt lets users query the model directly instead of crafting the perfect prompt in text.
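For instance, querying a vision-capable model can be as simple as attaching an image to a short question. The snippet below is a rough sketch assuming OpenAI’s Python client and a vision-capable chat model; the model name and image URL are placeholders, and the exact interface may differ by client version.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask a question about an image instead of describing the scene in words.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the child drawing in this picture?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sunflower-hedgehog.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```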

In fact, integrating multiple modalities expands and enriches an LLM’s understanding of the world. Fusing different information formats could enable AI systems to mimic human cognition, understanding the world through multiple senses rather than language alone, which promises fewer hallucinations, more sophisticated reasoning abilities and continuous learning capabilities.

Tech Giants Spearheading Multimodal AI Advancements

Large Multimodal Models will take center stage in Generative AI’s evolution. Companies that can combine modalities, whether tech giants or startups, will be among the most sought-after in AI going forward. Large Multimodal Models open up new possibilities, bringing language models to more interactive interfaces, creating fresh experiences for users and solving new kinds of tasks.

For OpenAI, upgrading to image and voice capabilities is just a natural start, since these two are the most common forms of data users are inclined to use. In the future, models will likely be trained on any form of data, whether a photograph, a 3D model or even smell data.

A multimodal ChatGPT brings OpenAI closer to the age of artificial general intelligence (AGI), the ultimate vision stated on the company’s website and the Holy Grail of the AI community for decades. As OpenAI notes in its GPT-4V system card, “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.”

OpenAI is not the only one vying for the lead in multimodal AI. After OpenAI launched the GPT-4V system, Google came under pressure to release Gemini, which is claimed to be a multimodal system built from the ground up. Reportedly trained on twice as many tokens as GPT-4, Gemini is said to have a distinct edge in the sophistication of the insights and inferences it draws from Google’s vast proprietary data. Similarly, Meta’s recent string of releases, SeamlessM4T, AudioCraft and CM3leon, signals its determination to rival OpenAI and Google in multimodal AI advancements.

From Multimodal Foundation Models to Specialist Models

The landscape of the Large Multimodal Model race mirrors the current competitive scenario in the Large Language Model domain, with the victors being those who possess the resources to train their models on extensive, diverse datasets. Although the competition is fierce, the potential rewards are immense. Researchers foresee the generative AI market reaching an astounding $1.3 trillion in value by 2032.

While big tech behemoths may dominate foundation models across modalities, there remains room for specialized models to outshine even the mightiest giants. This opens up a window of opportunity for startups. Emad Mostaque, CEO of Stability AI, envisions the future as a process of “de-constructing” technology into its ideal roles. He predicts, “We’re going to see a plethora of specialist models across various modalities,” along with a select few multimodal models capable of handling diverse tasks at the right moment.

The applications of multimodal AI are poised to revolutionize a multitude of fields, and pilot tests and discussions are already underway in industries like healthcare, robotics, and autonomous driving. The fusion of disparate modalities promises to enhance perception, interaction, and overall system performance, ushering in transformative changes.

  • Healthcare: Multimodal applications will enable comprehensive medical analysis, facilitate communication between healthcare providers and patients who speak different languages, and serve as a central hub for various unimodal AI applications within hospitals.
  • Robotics: Robotics pioneers have incorporated multimodal learning systems into human-machine interfaces and automated movements. In the industrial sector, expect greater collaboration between robots and human workers, while in the consumer realm, robots will perform intricate tasks assigned by humans with ease.
  • Autonomous Driving: Multimodal models have already found their way into Advanced Driver Assistance Systems (ADAS) and In-Vehicle Human Machine Interfaces (HMI) assistants. In the future, automobiles will possess the same sensory perception and decision-making capabilities as human drivers.

In conclusion, while the development of multimodal models demands significant resources and expertise, it presents startups with a golden opportunity to craft innovative solutions that address real-world challenges across diverse industries. Armed with finely tuned Large Multimodal Models, a focused industry niche, and a well-defined audience, startups specializing in these models are poised to deliver surprises on par with tech giants.