Multimodal Neural Script Knowledge Models
merlot
MERLOT is a model for learning what we call “neural script knowledge”: representations of what is going on in videos, spanning multiple video frames and their associated captions.
What’s here
We are releasing the following:
- Code for the MERLOT model (in model/, with data processing in data/)
- Code for running MERLOT over visual story ordering.
We plan to release:
- Information about the videos used in this work
- Code for adapting the model to other tasks (not strictly needed, but just to make things easier)
This is ongoing work. We hope to make it easier to adapt MERLOT to other tasks, so please follow this repository if you are interested!
Environment and setup
There are currently two different ways of running MERLOT:
- Pretraining