Multimodal Learning: Combining Text, Images, Audio, and Video for Richer AI Models

Introduction
I asked ChatGPT today to “produce a vibrant image of a cool panda sipping a coconut from a straw while riding a bicycle”.
I got this as an answer:

Pretty cool, isn’t it? How do GenAI models make this happen? There are multiple components to it; one of the most important is a concept called Multimodal Learning, where during the training phase the models are given training data of multiple modalities, i.e. multiple types of inputs. The most common modalities are text, visual (images, videos), and audio (speech, music), but other modalities can be present as well: for example, a wearable device capturing heart rate along with ECG, accelerometer, or EEG data, or a self-driving car requiring inputs from stereo video along with LiDAR sensors.
Evolution of Multimodal Deep Learning
Early 2010s - start of Multimodal Deep Learning
One of the more important early works in Multimodal Deep Learning came in 2011 from Andrew Ng’s lab, with a paper titled, not so surprisingly, “Multimodal Deep Learning” [1]! The paper discusses various ways to formulate a multimodal problem. The key idea was to train a bimodal deep autoencoder with a shared/joint representation of multiple modalities. A joint representation here means a single representation that combines different types of information (such as text and images) into one shared understanding.

The work required one third of the training data from modality 1 (audio only), another third from modality 2 (video only), and the remaining third as paired data, i.e. data in which both modalities are present. While the results were promising, this requirement of paired data was a major limitation of the approach.
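To make the shared-representation idea concrete, here is a minimal PyTorch sketch of a bimodal autoencoder in the spirit of [1]; the layer sizes, encoder names, and training details are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Toy bimodal autoencoder: two modality-specific encoders feed a shared
    (joint) representation, which is decoded back into both modalities."""
    def __init__(self, audio_dim=100, video_dim=300, joint_dim=64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
        self.joint = nn.Linear(128 + 128, joint_dim)        # shared representation
        self.audio_dec = nn.Linear(joint_dim, audio_dim)    # reconstruct audio
        self.video_dec = nn.Linear(joint_dim, video_dim)    # reconstruct video

    def forward(self, audio, video):
        h = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        z = torch.relu(self.joint(h))                       # joint embedding
        return self.audio_dec(z), self.video_dec(z)

# Reconstruction loss on paired data; unimodal examples can be handled by
# zeroing out the missing modality, one of the tricks discussed in [1].
model = BimodalAutoencoder()
audio, video = torch.randn(8, 100), torch.randn(8, 300)
a_hat, v_hat = model(audio, video)
loss = nn.functional.mse_loss(a_hat, audio) + nn.functional.mse_loss(v_hat, video)
loss.backward()
```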
2014 - Multimodal representation with semantic context
A group at the University of Toronto (Kiros et al. [2]) published a paper that pushed the envelope of multimodal learning further. They applied their methods to image captioning, which is an image -> text problem. The key ideas of the process were as follows:
- The first step is to generate two separate unimodal vector embeddings: one trained on text alone, the other trained on images alone. Note that these embeddings are themselves similarity preserving; for example, the related words “king” and “queen” are closer to each other in the unimodal embedding space than the unrelated words “king” and “utensil”.
- Next, take some paired data and learn a joint embedding space, building on the unimodal embedding spaces from step 1. In the 2014 paper, the two modalities were scored against each other with a simple dot product between their embeddings. This step in the multimodal deep learning framework is now called “fusing”. Instead of the multiplicative approach used in the paper, the embeddings can also be added or concatenated with each other; in practice, multiplicative fusion has seen more success (see the sketch after this list).
- The next step is “alignment”, which makes the fused embedding space similarity preserving. This was done with a pairwise ranking loss on the encoder, and a log-bilinear language model was used for caption generation.
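To make the fusion options concrete, here is a small PyTorch sketch; the embedding dimensions and projection layers are assumptions for illustration, not the setup from the paper.

```python
import torch
import torch.nn as nn

d = 64                                    # shared embedding dimension (assumed)
text_proj = nn.Linear(300, d)             # project a 300-d text embedding
image_proj = nn.Linear(2048, d)           # project a 2048-d image feature

t = text_proj(torch.randn(8, 300))        # batch of text embeddings
v = image_proj(torch.randn(8, 2048))      # batch of image embeddings

fused_multiplicative = t * v              # element-wise product (multiplicative fusion)
fused_additive = t + v                    # element-wise sum (additive fusion)
fused_concat = torch.cat([t, v], dim=-1)  # concatenation (dimension 2d)
score = (t * v).sum(dim=-1)               # dot-product compatibility score per pair
```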
Once the fused embedding space is learned, vector arithmetic on that space still holds even across modalities, which allows for cross-modal retrieval and generation (i.e. given an image, the model can generate text, and vice versa). That’s the power of the method. Here are some images from the paper illustrating the idea:

Here, you can see that in the multimodal embedding space, the embedding of the word “blue” is subtracted from the embedding of an image of a blue car, and the embedding of the word “red” is added. A nearest-neighbor search on the resulting embedding retrieves red cars from the database. Isn’t that mind-boggling?
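In code, this retrieval boils down to vector arithmetic plus a nearest-neighbor search. The sketch below uses random stand-ins for the real encoders (embed_image and embed_word are hypothetical helpers), so it only illustrates the mechanics:

```python
import torch
import torch.nn.functional as F

d, N = 64, 1000
database_embs = F.normalize(torch.randn(N, d), dim=-1)   # embeddings of database images

# Stand-ins for real encoders that map each modality into the shared space.
def embed_image(path): return torch.randn(d)              # hypothetical image encoder
def embed_word(word):  return torch.randn(d)              # hypothetical word encoder

# "image of a blue car" - "blue" + "red", then nearest-neighbor search
query = embed_image("blue_car.jpg") - embed_word("blue") + embed_word("red")
query = F.normalize(query, dim=-1)
top5 = (database_embs @ query).topk(k=5).indices          # indices of the closest images
print(top5)
```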

2021 - CLIP
CLIP [3] from OpenAI harnessed the idea of contrastive learning. Contrastive learning is a nifty technique that increases the dot product between the embeddings of matched input pairs and decreases it for unmatched pairs. This already happens naturally within unimodal embeddings, but not across modalities in the fused embedding space. By applying a contrastive loss to paired multimodal data, CLIP was able to generate astoundingly good results.
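Here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss, in the spirit of the pseudocode in [3]; the fixed temperature and random embeddings are simplifications (CLIP learns the temperature and uses real encoder outputs):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # [B, B] similarity matrix
    targets = torch.arange(len(logits))                # matched pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)      # text -> image direction
    return (loss_i + loss_t) / 2

# Random embeddings standing in for encoder outputs
loss = clip_style_loss(torch.randn(32, 512), torch.randn(32, 512))
```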
The other improvement was the scale and quality of the training data. They curated 400 million image-text pairs gathered from the web (the dataset is referred to as WIT, for WebImageText). Moreover, in the seven years in between, text encoders and vision encoders had become even more powerful and expressive.
This is how the model architecture looks (source [3]):

2021 - ALIGN
Google published its own counterpart, a method called ALIGN [4]. Here, a BERT-style masked language model was used as the text encoder, paired with an image encoder, and the two were trained into a shared multimodal embedding space. The work used more than one billion noisy image-text pairs.
2023 - ImageBind
Meta came up with ImageBind [5]. The name is a bit of a misnomer, as they present an approach to unify six different modalities (images, text, audio, depth, thermal, IMU) into one shared representation. The model is trained on image-paired data (image <-> text, image <-> audio, image <-> depth, and so on) separately, but all of these pairs share a common multimodal embedding space, so alignment between modalities that were never directly paired (e.g. audio <-> text) emerges as well. The multimodal arithmetic that we saw in the Kiros paper [2] still holds. The results of this paper are truly astounding: for example, the embedding of an image of fruits plus the embedding of the sound of birds can retrieve images of birds surrounded by fruits. Maybe that’s how the human mind’s associativity works too!
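To show what “binding through images” can look like in code, here is a conceptual sketch (my own simplification, not the authors’ implementation): each extra modality gets its own projection head, but every pair is trained contrastively against the same image encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive(a, b, t=0.07):
    # symmetric InfoNCE-style loss over a batch of matched pairs
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / t
    y = torch.arange(len(logits))
    return (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)) / 2

# Hypothetical per-modality projection heads, all mapping into one 512-d space.
encoders = nn.ModuleDict({
    "text":  nn.Linear(300, 512),
    "audio": nn.Linear(128, 512),
})
image_encoder = nn.Linear(2048, 512)   # the image encoder is the shared "anchor"

def training_step(batches):
    """batches: modality name -> (image_features, modality_features) paired batch."""
    loss = 0.0
    for name, (img_feats, mod_feats) in batches.items():
        loss = loss + contrastive(image_encoder(img_feats), encoders[name](mod_feats))
    return loss

# Dummy features standing in for real backbone outputs
batches = {
    "text":  (torch.randn(16, 2048), torch.randn(16, 300)),
    "audio": (torch.randn(16, 2048), torch.randn(16, 128)),
}
training_step(batches).backward()
```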
Applications of Multimodal Learning
Multimodal learning has found very promising uses over the past few years. Here are some of the main applications:
Vision + Language tasks
- Image captioning : Given an image, generate a descriptive text for it.
- VQA (Visual Question Answering) : Given an image, answer questions about the image. The question and the answer(s) are in text form. For example, given an image of a chess-board, it may ask “how many pawns are there on the board?”
- Text-to-Image generation : Generate images on-the-fly based on user descriptions, as offered by products like ChatGPT (GPT-4 and GPT-5), DALL-E, Gemini 2.5 Flash, and a few others.
- Image retrieval from text : “Find an image of the cherry blossom festival where I wore a pink sweater” - a question a multimodal agent can answer today when given access to your photo library (see the sketch after this list).
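As a hedged sketch of that last use case, text-to-image retrieval over a personal photo library can be prototyped with the open-source CLIP weights on Hugging Face; the file paths and query text below are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder photo library and query
paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]
images = [Image.open(p) for p in paths]
query = "cherry blossom festival, person wearing a pink sweater"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
best = (image_emb @ text_emb.t()).squeeze(-1).argmax()     # most similar photo
print("Best match:", paths[best])
```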
Audio and Speech tasks
- Automatic Speech Recognition (ASR) : Transcribing speech (or speech + video) input into text (a minimal sketch follows after this list).
- Voice Assistants : Siri- or Alexa-style assistants that take voice input and convert it into commands.
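Here is a minimal ASR sketch using open-source Whisper weights via the Hugging Face pipeline API; the model checkpoint and audio path are placeholders:

```python
from transformers import pipeline

# Load an ASR pipeline backed by Whisper (checkpoint choice is an assumption)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")    # path to a local audio file (placeholder)
print(result["text"])                    # transcribed text
```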
Scene Understanding
- Surveillance systems that recognize a person’s activities
- Video captioning
- Video retrieval from text
Healthcare systems
- Patient diagnosis combining medical images, reports, and other Electronic Health Records (EHRs)
- Bionic arms and limbs
- Multimodal Brain Modeling
Here is a summary table of commercial and open-source model offerings across various modality combinations:
| Modality Combo | Application Domain | Example |
|---|---|---|
| Text + Image | Search, captioning, retrieval | CLIP, BLIP |
| Image + Audio | Audio–image matching | ImageBind |
| Video + Text | Surveillance, video Q&A | VideoBERT, Flamingo |
| Audio + Text | Speech recognition, emotion | Whisper, voice bots |
| Text + Image + Video | General-purpose reasoning agents | GPT-4V, Kosmos |
| Vision + Language + Action | Robotics, Assistants | RT-2, SayCan |
Conclusion
Multimodal learning has come a long way in the last 10-15 years. Today’s commercial models are really powerful; they process and correlate data of different modalities fairly well, enabling serious applications that involve complex, real-time decision making. They have an edge over single-modality models because of the way they harness multiple modalities. Maybe that’s how our brain perceives and acts too?
References
[1] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML) (pp. 689–696). Bellevue, Washington, USA
[2] Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying Visual‑Semantic Embeddings with Multimodal Neural Language Models. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 595–603
[3] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Presented at the 38th International Conference on Machine Learning (ICML 2021)
[4] Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling Up Visual and Vision‑Language Representation Learning With Noisy Text Supervision (ALIGN). Presented at ICML 2021
[5] Girdhar, R., El‑Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). ImageBind: One Embedding Space To Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)