Google DeepMind has recently revealed its latest artificial intelligence (AI) model, Gemini, designed to rival OpenAI’s ChatGPT. Both models fall under the category of “generative AI,” learning patterns from vast amounts of training data to generate various types of content, with ChatGPT focusing primarily on text generation.
While ChatGPT operates as a web app for text-based conversations, Google introduced Bard, its conversational web app, based on the LaMDA model trained on dialogues. Now, Google is enhancing Bard using the newly developed Gemini model.
Gemini stands out from previous generative AI models like LaMDA by being a “multi-modal model.” This means it can handle various types of input and output, including text, images, audio, and video. The term “LMM” (large multimodal model) is emerging to describe models like Gemini, distinguishing them from large language models (LLMs) like ChatGPT.
OpenAI had previously introduced GPT-4V (GPT-4 with vision), which lets ChatGPT work with images, audio, and text. However, it does not operate as a single, fully multimodal model in the way Gemini does; Gemini directly supports multiple input and output modes within one model.
A crucial difference lies in how the two systems handle different types of content. ChatGPT, when running GPT-4V, processes audio inputs by first converting them to text using a separate speech-recognition model, Whisper, and generates speech output through a different text-to-speech model. Similarly, it generates images by passing text prompts to yet another model, DALL-E 3.
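To make that pipeline concrete, here is a minimal sketch of the chain-of-separate-models pattern using the OpenAI Python SDK (v1.x). The file name and prompt are illustrative, and the exact way ChatGPT wires these components together internally is not public; this is a sketch under those assumptions, not OpenAI's actual implementation.

```python
# Sketch: the "pipeline of separate models" approach described above.
# Assumes the openai package (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text via a dedicated transcription model (Whisper).
with open("question.mp3", "rb") as audio_file:  # illustrative file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text -> text via the language model itself.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text -> image via a separate image-generation model (DALL-E).
image = client.images.generate(
    model="dall-e-3",
    prompt=reply.choices[0].message.content,
)
print(image.data[0].url)
```

Each hand-off in this chain converts everything to text, so a transcription error in step 1 propagates into steps 2 and 3.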
In contrast, Gemini is designed to be “natively multimodal,” allowing the core model to handle various input types (audio, images, video, and text) directly.
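By contrast, a natively multimodal model can accept mixed content in a single request. Below is a minimal sketch using Google's google-generativeai Python SDK with the launch-era gemini-pro-vision model name; the image file and question are hypothetical.

```python
# Sketch: one multimodal request to a single model, with no caller-side
# transcription or vision pipeline. Assumes the google-generativeai and
# Pillow packages and a valid API key.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro-vision")
response = model.generate_content(
    [Image.open("cup_shuffle.jpg"), "Which cup is the ball under?"]  # hypothetical inputs
)
print(response.text)
```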
The Verdict

Initial assessments suggest that Gemini 1.0 Pro, the publicly available version of Gemini, is currently not as advanced as GPT-4 and is closer to GPT-3.5 in capability. Google has also teased a more powerful version, Gemini 1.0 Ultra, claiming superior capabilities compared to GPT-4. However, independent validation is currently impossible, as the Ultra version has yet to be released.
There’s a level of skepticism surrounding Google’s claims, fueled by a demonstration video that, as Bloomberg reported, was not conducted in real time. Certain tasks, such as the cup-and-ball trick, had been taught to Gemini in advance through a sequence of still images rather than live video. Despite these missteps, the emergence of large multimodal models like Gemini marks a significant stride in generative AI.
A Promising Future

Despite the current issues and uncertainties, the potential of Gemini and similar multimodal models to advance generative AI is noteworthy. Incorporating images, audio, and video opens up vast new reservoirs of training data. GPT-4, for instance, was trained on an extensive 500 billion words of publicly available text. Multimodal models like Gemini have access to an even more substantial volume of diverse training data.
The performance of deep learning models tends to improve with model size and the amount of training data, so multimodal models, by tapping into varied forms of data, are poised to achieve heightened capabilities. Especially exciting is the prospect of models trained on video developing intricate internal representations of “naïve physics”: the fundamental understanding of causality, movement, gravity, and other physical phenomena.
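One way to see why more data matters: Kaplan et al. (2020) reported that language-model test loss falls roughly as a power law in dataset size. The snippet below plugs their published constants into that formula purely as an illustration of the trend; it is not a prediction for Gemini or any specific model.

```python
# Illustrative scaling-law curve, L(D) = (D_c / D) ** alpha_D, using the
# constants reported by Kaplan et al. (2020) for text-only language models.
def loss(dataset_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Approximate test loss as a function of training-set size (in tokens)."""
    return (d_c / dataset_tokens) ** alpha_d

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> loss ~{loss(tokens):.2f}")
```

The curve flattens but keeps improving as data grows, which is why new modalities, as fresh data sources, are so attractive.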
Competitive Landscape

For the past year, OpenAI’s GPT models have dominated the generative AI landscape. Google’s entry with Gemini signals the rise of a major competitor that is likely to drive innovation in the field. While OpenAI may be working on GPT-5, the competition ensures continuous advancements and remarkable new capabilities.
Looking forward, the hope is to witness the emergence of very large multimodal models that are open-source and non-commercial. This shift could democratize access to advanced AI capabilities and foster collaborative development within the research community.
Gemini’s Implementation

Gemini introduces some notable features, including a lightweight version called Gemini Nano, designed to run directly on mobile phones. This move towards more lightweight models addresses environmental concerns related to AI computing and offers privacy benefits. The development of such models is likely to encourage competitors to follow suit, promoting sustainability and accessibility in the AI landscape.
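A rough back-of-envelope calculation shows why such lightweight variants can fit on a phone. The parameter counts below (1.8 billion for Nano-1, 3.25 billion for Nano-2) come from Google's Gemini technical report; the precision levels are illustrative assumptions.

```python
# Approximate memory needed just to store model weights at a given precision.
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, params in [("Nano-1", 1.8e9), ("Nano-2", 3.25e9)]:
    for bits in (16, 4):  # illustrative: 16-bit vs. 4-bit quantized weights
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

At 4-bit precision, even the larger Nano variant needs under 2 GB for its weights, comfortably within a modern phone's memory.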
In conclusion, the unveiling of Gemini represents a pivotal moment in the ongoing narrative of generative AI. As the competition between Google and OpenAI intensifies, the real winners are likely to be the broader AI research community and, ultimately, society at large. The quest for increasingly sophisticated and versatile AI models promises a future where AI seamlessly integrates into various aspects of our lives.