Multimodal Neural Script Knowledge Models

merlot

MERLOT is a model for learning what we call “neural script knowledge”: representations of what is going on in videos, spanning multiple video frames and their associated captions.

What’s here

We are releasing the following:

  • Code for the MERLOT model (in model/, with data processing in data/).
  • Code for running MERLOT over visual story ordering.

We plan to release:

  • Information about the videos used in this work
  • Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is ongoing work: we hope to make it easier to adapt MERLOT to other tasks, so please follow along if interested!

Environment and setup

There are currently two ways of running MERLOT: