In the rapidly evolving landscape of artificial intelligence, chatbots have taken a significant leap forward. Beyond generating text and code, these AI-driven conversational agents can now interpret images and even sound, marking a milestone in the world of multimodal AI.
ChatGPT Vision and Its Multimodal Evolution
Just over ten months ago, OpenAI introduced ChatGPT to the public, igniting a wave of excitement about artificial intelligence. Since then, tech giants like Google and Meta have raced to develop their own large language models (LLMs), each striving to expand the boundaries of AI capabilities.
OpenAI has recently unveiled a multimodal version of ChatGPT, known as ChatGPT Vision and powered by its advanced LLM GPT-4, which can process images as well as text. Google introduced similar features, such as image and audio recognition, into its chatbot, Bard, earlier this year. The world of multimodal AI is taking shape, and its potential applications are vast.
The Versatility of Multimodal AI
So, what can these multimodal AI chatbots do? Scientific American conducted tests with ChatGPT (GPT-4V) and Bard, revealing their impressive capabilities:
- Splitting Complicated Bills: Given only a photograph of a receipt and a simple prompt, ChatGPT Vision accurately divided a complex bar tab, including tip and tax, in less than 30 seconds.
- Describing Scenes: Both chatbots can describe scenes within images, decipher lines of text in a picture, and offer detailed descriptions of objects.
- Insect Identification: ChatGPT Vision outperformed Bard in identifying insects from photographs, showcasing its potential in image recognition.
- Aid for the Visually Impaired: OpenAI tested its multimodal GPT-4 version through Be My Eyes, an app for the blind and visually impaired. Users reported markedly more accurate image descriptions, enhancing their independence.
How Multimodal AI Works
Multimodal AI chatbots combine text, images, and sound through two primary approaches:
- Layered Models: In this method, separate AI models handle different modalities, such as text and images. A user’s input, like an image, is first processed by an image-captioning AI, whose output is then folded into the chatbot’s response (see the sketch after this list).
- Tight Integration: The second approach involves a tighter coupling of different AI algorithms, creating a more interconnected system. This approach requires extensive training with multimedia datasets to establish strong associations between visual representations and words.
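To make the layered approach concrete, here is a minimal sketch of that pipeline: one model turns a picture into a caption, and a separate text-only language model answers a question using that caption. The specific models and the image file named below are illustrative placeholders, not the systems behind ChatGPT Vision or Bard.

```python
# A minimal sketch of the "layered" approach: a separate captioning model
# describes the image, and a text-only language model answers using that
# description. Model names and the image file are illustrative stand-ins.
from transformers import pipeline

# Step 1: an image-captioning model converts the picture into text.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("receipt.jpg")[0]["generated_text"]  # hypothetical local image

# Step 2: the caption is folded into a plain-text prompt for a language model.
generator = pipeline("text-generation", model="gpt2")
prompt = (
    f"Image description: {caption}\n"
    "User question: What items appear in this image?\n"
    "Answer:"
)
reply = generator(prompt, max_new_tokens=60)[0]["generated_text"]
print(reply)
```

The appeal of this design is modularity: the captioning model and the language model can be swapped independently, though the language model only ever "sees" the image secondhand, through the caption.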
Both approaches rely on the underlying principle of transforming inputs into vector data, enabling chatbots to understand and respond effectively.
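The "vector data" idea can be illustrated with a joint text-image embedding model such as CLIP, which maps both pictures and words into the same vector space so that related inputs land close together. This is a simplified illustration of the principle, not the internal machinery of any particular chatbot; the image file is a placeholder.

```python
# A sketch of the shared principle: text and images are both turned into
# vectors, and similarity between those vectors indicates how well they
# match. CLIP is used here purely as an illustration.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo_of_insect.jpg")  # placeholder image file
captions = ["a photograph of a beetle", "a photograph of a butterfly"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The image and each caption now live in the same vector space; the logits
# reflect how similar the image vector is to each text vector.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```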
The Future of Multimodal AI
Multimodal AI represents a promising step toward artificial general intelligence (AGI), where machines approach human-level comprehension. While rapid advancements are anticipated, challenges persist. These include addressing AI hallucinations and privacy concerns, and ensuring the accuracy and trustworthiness of AI-generated information.
Despite these challenges, experts predict that in the next five to ten years, personal AI assistants capable of handling a wide range of tasks will become a reality. These AI systems could handle everything from customer service calls to complex research inquiries with just a brief prompt.
While multimodal AI is not AGI itself, it is a significant milestone on the path to a more versatile and capable generation of AI systems. As these technologies continue to evolve, users are advised to explore them cautiously, recognizing their potential while remaining mindful of privacy and accuracy.
In the world of AI, the future is becoming increasingly multimodal, with exciting possibilities and challenges ahead.