A single ViLBERT Multi-Task model can perform 8 different vision-and-language tasks learnt from 12 datasets!
More details about the ViLBERT Multi-Task paper can be found here.
Browsers currently supported by the demo: Google Chrome, Mozilla Firefox.
How it works
You upload an image.
Our servers run the deep-learning-based algorithm.
Results and updates are shown in real time.