Language and other models
July 29, 2024
After looking at the different types of AI in the first part of our series and taking a closer look at generative AI and its specifics in the second, this part examines the models these systems are built on.
Language models: the foundation of text-based AI
A prominent example is the language model, which forms the basis of applications such as ChatGPT and similar systems. These models are trained on large amounts of text to learn the meaning of words and their relationships to each other. Various machine learning techniques are used to enable the model to string words together into meaningful and grammatically correct sentences.
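To make the idea of "learning relationships between words" a little more concrete, here is a deliberately simplified sketch in Python. It is not how ChatGPT works internally (modern systems use large neural networks), but it shows the same basic principle: count which words follow which in training text, then string words together by repeatedly predicting a plausible next word.

```python
import random
from collections import Counter, defaultdict

# A toy "language model": count which word tends to follow which word in a
# tiny training corpus, then generate text from those counts. Real systems
# such as ChatGPT use large neural networks instead of simple counts, but the
# underlying idea of predicting the next word from previous text is the same.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def generate(start: str, length: int = 8) -> str:
    """Build a word sequence by repeatedly choosing a plausible next word."""
    words = [start]
    for _ in range(length):
        candidates = follow_counts.get(words[-1])
        if not candidates:
            break
        next_word = random.choices(
            list(candidates), weights=list(candidates.values())
        )[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))
```

Scaling this idea up, with neural networks instead of simple counts and billions of sentences instead of three, is roughly what turns next-word prediction into a usable language model.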
Image, audio, and video models: specialization in different modalities
In addition to language models, there are specialized models for other modalities, such as image, audio, and video generation. These models must also be able to understand textual instructions in order to translate them into the target modality. Image models, for example, analyze a textual description, break it down into its essential components, and assign visual elements to them. These elements are iteratively assembled to create a complete image.
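As a rough illustration of that iterative assembly, here is a toy sketch. The "image" is just a list of numbers, and the prompt is turned into a target pattern by hashing, purely to show the structure of the process; real image generators (for example diffusion models) implement every step with large neural networks.

```python
import hashlib
import random

# Toy illustration of the iterative idea behind image generation. The "image"
# is just a list of numbers, and the prompt is mapped to a target pattern by
# hashing. This only shows the structure of the process, not a real method.

def encode_prompt(prompt: str, size: int = 16) -> list[float]:
    """Map the textual description to a fixed-size numeric target."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:size]]

def generate_image(prompt: str, steps: int = 50) -> list[float]:
    target = encode_prompt(prompt)
    image = [random.random() for _ in target]          # start from pure noise
    for _ in range(steps):                             # refine step by step
        image = [pixel + 0.1 * (goal - pixel)
                 for pixel, goal in zip(image, target)]
    return image

print(generate_image("a cat on a red sofa")[:4])
```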
Unimodal vs. multimodal models
Language and image models are considered unimodal models because they focus on only one modality - either text or images. In contrast, Large Multimodal Models (LMMs) can process and output multiple modalities simultaneously. These models combine different types of data, such as text, images, and audio, and use specialized training methods to handle these different modalities. The great advantage of multimodal models is that they can switch seamlessly between text-based and visual or auditory communication and accept a wide variety of input types.
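The following sketch illustrates what "accepting a variety of input types" can look like from the outside. The Part class and the commented-out model call are invented for this example; real multimodal APIs differ in the details, but a single request mixing text, image, and audio is the common pattern.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative data structure only: a single request to a large multimodal
# model can combine several input types. The field names and the commented-out
# model call below are made up for this sketch.

@dataclass
class Part:
    kind: Literal["text", "image", "audio"]
    data: str | bytes

request = [
    Part(kind="text",
         data="What is happening in this picture, and how does the sound fit in?"),
    Part(kind="image", data=b"<JPEG bytes of the picture>"),
    Part(kind="audio", data=b"<WAV bytes of the recording>"),
]

# A multimodal model receives all parts together and can also answer in more
# than one modality, e.g. text plus generated speech:
# reply = multimodal_model.generate(request)
```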
Is this really multimodal?
However, this does not mean that every chatbot that can process audio is based on a multimodal model. Vendors often combine several specialized models to offer their users different communication channels. In principle, there is nothing wrong with this, but the difference becomes apparent when it comes to interactivity, and we noticed it in audio chats in particular. The kind of conversation we saw in OpenAI's presentation of GPT-4o is only possible with a multimodal model. This ability to react flexibly to different inputs is a typical feature of multimodal models and cannot be simulated by combining several unimodal models.
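To see where the chained setup falls short, consider this toy sketch of the "several specialized models" approach that many voice chatbots use. All three functions are trivial stand-ins invented for this example; the point is what information flows between the stages.

```python
# Toy stand-ins for three separate unimodal models. In a real product these
# would each be a large neural network; here they only show what information
# is passed from one stage to the next.

def speech_to_text(audio: bytes) -> str:
    # Tone of voice, pauses, and background sounds are lost at this point:
    # only the transcribed words are handed on.
    return "what's the weather like"

def language_model(text: str) -> str:
    return f"You asked: '{text}'. I can only work with these words."

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for synthesized audio

def voice_chat_pipeline(audio_in: bytes) -> bytes:
    """Chain of specialized models, as used by many voice chatbots."""
    transcript = speech_to_text(audio_in)     # audio -> text
    answer_text = language_model(transcript)  # text  -> text
    return text_to_speech(answer_text)        # text  -> audio

print(voice_chat_pipeline(b"<recorded audio>"))
```

A genuinely multimodal model works on the audio itself, so it can, at least in principle, pick up on tone of voice or react while the user is still speaking - exactly the kind of interactivity the chained setup above cannot provide.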
From theory to practice
With this theoretical background, we will return to practical matters in the next part of the series and look at hallucinations. You will find that you have already acquired much of the knowledge needed to avoid them as far as possible.