‘The Dark Knight’ — one of my favourite movies. TMDB classifies this movie into four genres — Drama, Action, Crime, Thriller. A multimodal deep learning model I have trained classifies this movie as — Action, Drama, Thriller. Let’s take a look at how this model manages to do this.
What is multimodal deep learning?
Modality is a particular way of doing or experiencing something. We live our daily lives in a multimodal environment. We see things, hear sounds, smell odours and feel textures. Analogous to this, multimodal deep learning involves multiple modalities used together to predict some output. In this project, I concatenated the features extracted from images and text sequences using a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, respectively. These features were used to try and predict movie genres.
For this project, I have used Kaggle’s The Movies Dataset. It consists of over 40,000 movies with overviews, the poster URL link on TMDB and genres taken from the TMDB website. After splitting the dataset, the training set consists of 26864 examples; the test set includes of 7463 samples, and the validation set consists of 2986 cases.