Teaching Video Comprehension to AI, One Million Moments at a Time

The coloring indicates the areas of the video frames that the neural network focuses on to recognize the event in the video, visualized using the CAM method developed by Zhou et al. (Video courtesy of IBM/MIT.)
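If you're curious how those heatmaps are computed: under the standard assumptions of Zhou et al.'s CAM technique (a final convolutional layer, global average pooling, and a single linear classifier), the map for a class is just a weighted sum of the last layer's feature maps. Here is a minimal sketch with illustrative names and shapes, not IBM/MIT's actual code:

```python
# Minimal sketch of Class Activation Mapping (CAM), after Zhou et al.
# Assumes the network's last conv features are globally average-pooled and
# fed to a single linear classifier; names and shapes are illustrative only.
import numpy as np

def class_activation_map(conv_features, fc_weights, class_idx):
    """conv_features: (C, H, W) activations from the last conv layer for one frame.
    fc_weights: (num_classes, C) weights of the linear classifier.
    class_idx: index of the predicted action class.
    Returns an (H, W) heatmap of the regions that drove the prediction."""
    cam = np.tensordot(fc_weights[class_idx], conv_features, axes=1)  # weighted sum over channels
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()  # scale to [0, 1] so it can be overlaid on the frame
    return cam
```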

It’s taken a lot of coaching, but my one-year-old is just starting to grasp the concept of a high five. Now, when you hold your hand up in front of him, he smacks it with his hand over and over again—close, but not quite. Humans are designed by nature to be learning machines, and even we struggle to come to grips with basic actions like waving and clapping.

Now, imagine how difficult it is to teach an AI not only to recognize a high five, but to distinguish it from a wave, a salute, a handshake, etc. That example might seem frivolous, but video comprehension in general and the ability to parse actions specifically are key competencies for the future of AI technology, whether it’s being used to pilot autonomous vehicles or pick and place parts on an assembly line.

As with toddlers, attaining these competencies takes training—though of a very different kind. Deep learning algorithms need to be fed massive, labeled datasets in order to build up a model of understanding. In many cases, these datasets consist of static images, but what if you wanted a computer to be able to describe actions and events as well as objects?

Enter the Moments in Time Dataset, developed by the MIT-IBM Watson AI Lab. Moments in Time is a human-annotated collection of one million 3-second videos depicting basic actions and events. As one of the largest annotated video classification datasets in existence, Moments in Time is impressive in scale; but even more impressive is its diversity, with videos depicting both human and non-human actors in varying environments and at different scales.
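To make the supervised-learning picture concrete, here is a rough sketch of what training on such a collection could look like: short clips paired with verb labels and fed to a classifier. Everything below, the labels, the clip loader, the file names and the deliberately tiny model, is a hypothetical placeholder rather than the dataset's actual format or the lab's training code.

```python
# Hypothetical sketch of supervised training on short, labeled video clips.
# The labels, clip loader, file names, and model are placeholders.
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

CLASSES = ["clapping", "waving", "opening"]   # a handful of the 339 verb labels

def load_clip(path, T=16, H=112, W=112):
    # Stand-in for real video decoding (e.g. with torchvision or OpenCV):
    # returns T frames of a 3-second clip as a (T, 3, H, W) tensor.
    return torch.rand(T, 3, H, W)

class LabeledClips(Dataset):
    """Pairs each clip with a single human-annotated action label."""
    def __init__(self, annotations):              # [(clip_path, label), ...]
        self.items = annotations
        self.class_to_idx = {c: i for i, c in enumerate(CLASSES)}

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        path, label = self.items[i]
        return load_clip(path), self.class_to_idx[label]

class TinyVideoNet(nn.Module):
    """Deliberately tiny classifier: average frames over time, then a linear head."""
    def __init__(self, num_classes):
        super().__init__()
        self.head = nn.Linear(3 * 112 * 112, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        pooled = clips.mean(dim=1).flatten(1)      # average over time, flatten
        return self.head(pooled)

annotations = [("clip_000001.mp4", "clapping"), ("clip_000002.mp4", "waving")]
loader = DataLoader(LabeledClips(annotations), batch_size=2, shuffle=True)
model = TinyVideoNet(num_classes=len(CLASSES))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for clips, labels in loader:                       # one pass over the toy data
    loss = loss_fn(model(clips), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, the placeholder loader would be replaced by real video decoding and the linear head by a spatio-temporal network, but the shape of the problem stays the same: a million (clip, verb) pairs and a classification loss.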

ENGINEERING.com had the opportunity to discuss this project with Dan Gutfreund, Video Analytics Scientist at IBM Research.

The labels in Moments in Time are based on the most common verbs in English. How difficult would it be to create versions for other languages?

One can simply translate the verbs that we used into other languages and use the same videos. So, it should be straightforward.

In your paper, you mention that events can be interpreted differently depending on their social context and the type of place in which they occur. Could Moments in Time be used to train an algorithm to pick up on these variables, or is there something about them that would require a different dataset?

The Moments in Time Dataset contains atomic actions, which typically have a clear meaning, although they can be performed by different agents on different objects. The comment about different meanings was with respect to events that combine several atomic actions, such as picking something up and carrying it away, which can be interpreted as stealing or as delivering. One would need to develop a different dataset to capture such events.

You started with an initial list of 4,500 verbs and reduced it to 339. Can you explain how that process worked?

First, we clustered the verbs into semantically related groups. We also had a ranking of the verbs according to how frequently they are used in English. We then iteratively picked the most frequent verb in the most frequent cluster and kept going. We stopped once the frequency of use of the verbs we had not yet picked was very low.
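The procedure Gutfreund describes amounts to a greedy selection loop. Below is a sketch under a couple of assumptions of my own: that a cluster's "frequency" is the total frequency of its verbs, and that the stopping rule is a simple threshold. The clusters and numbers are made up for illustration.

```python
# Hypothetical sketch of the greedy verb-selection loop described above:
# repeatedly take the most frequent unpicked verb from the most frequent
# cluster, until every remaining verb is used only rarely.

def select_verbs(clusters, frequency, stop_threshold):
    """clusters: list of lists of semantically related verbs.
    frequency: dict mapping each verb to its frequency of use in English.
    stop_threshold: stop once no unpicked verb is used more often than this."""
    picked = []
    remaining = [list(c) for c in clusters]
    while True:
        remaining = [c for c in remaining if c]    # drop emptied clusters
        if not remaining or max(frequency[v] for c in remaining for v in c) <= stop_threshold:
            break
        # Rank clusters by the total frequency of their verbs, then take the
        # most frequent verb from the top-ranked cluster.
        top_cluster = max(remaining, key=lambda c: sum(frequency[v] for v in c))
        top_verb = max(top_cluster, key=frequency.get)
        top_cluster.remove(top_verb)
        picked.append(top_verb)
    return picked

# Toy example: two small clusters with made-up usage frequencies.
clusters = [["open", "unlock", "unseal"], ["clap", "applaud"]]
frequency = {"open": 900, "unlock": 40, "unseal": 2, "clap": 60, "applaud": 15}
print(select_verbs(clusters, frequency, stop_threshold=10))
# -> ['open', 'clap', 'unlock', 'applaud']  (anything rarer than the threshold is skipped)
```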

Some of the clips in the dataset are sound-dependent, so a clip with the sound of clapping in the background would be labeled as such. Did that pose any additional challenges?

That did not pose any additional challenges. Naturally, some of the actions are easily recognized by sound, e.g. clapping.

You describe this project as “a stepping-stone toward the development of learning algorithms able to build analogies between things, imagine and synthesize novel events, and interpret compositional scenarios.” I know it’s a big question, but can you summarize the steps between this dataset and algorithms that are capable of analogy and synthesis?

This is a hard question indeed, and in some sense an open question. One direction that I would like to try is to inject external semantic knowledge into deep learning models to help them model spatio-temporal transformations associated with certain actions, and then use them to understand, for example, that a person opening a door and a dog opening its mouth are in fact the same action just performed in a very different setting and for very different purposes.

Amazon Mechanical Turk clearly played an important role here, as often seems to be the case in this kind of research. I’ve gotten the impression that the extent to which machine learning depends on services like Mechanical Turk is something of a shameful secret in the AI world. Do you believe AI will always depend on this kind of brute force work, or will we be able to kick away the HIT [Human Intelligence Task] ladder once we’ve climbed high enough?

This is indeed the case, as long as we heavily depend on highly supervised models such as convolutional neural networks. I hope that one day we will not need to depend so heavily on this approach. One direction, which is a holy grail on its own, is to get a better understanding of how humans learn with very little supervision and apply that to machines.

Another direction is to develop highly photorealistic simulations and use them to generate huge amounts of labeled data without the need for human labelers. We are exploring both directions at the MIT-IBM Watson AI Lab.

For more research from IBM, check out Forecasting Waves with Deep Learning.