M-RNN
Multi-Modal Recurrent Neural Network is a research done by The University of California and the Baidu Research Team which generates captions for images.
In this research,Deep Recurrent Neural Network is used for sentences, and Deep Convolutional Neural Network is used for images.
Model Network
The network has 5 layers. It first receives each word from a sentence and applies word embeddings in the first 2 layers. In these 2 layers, both the “syntactic” and “semantic” meaning of the words are being encoded. Then, the embedding layer, recurrent layer, and convolution layer are inputted to the multi-modal layer to link the context between sentence and image. Finally, the softmax layer tries to predict the probability of the next word.