Summary: In a new paper, researchers have developed a system that allows viewers to change what happens in a video.
Original author and publication date: Kyle Wiggers – January 29, 2021
Futurizonte Editor’s Note: Clearly, the holodeck is closer than we think.
From the article:
Humans at an early age can identify objects and how each object can interact with its environment. For example, when watching videos of sports like tennis and football, spectators and sportscasters can understand and anticipate plays despite never being given a list of possible actions. We as humans develop this skill as we watch events unfold live and on the screen. Furthermore, we can reason about what happens if a player took a different action and how this might change the video.
In an effort to create an AI system that can develop some of these same reasoning skills, researchers at the University of Trento, the Institut Polytechnique de Paris, and Snap, Inc. propose in a new paper the task of playable video generation, where the goal is to learn a set of actions from real-world video clips and offer users the ability to generate new videos.
The idea is that users provide an “action label” at every time step and can see its impact on the generated video, like a video game. The researchers believe this framework might pave the way for methods that can simulate real-world environments and provide a gaming-like experience.
In an experiment, the researchers architected a framework called Clustering for Action Decomposition and DiscoverY (CADDY) that discovers a set of actions after watching multiple videos and outputs “playable” videos. (Here’s a live demo.) CADDY uses the aforementioned action labels to encode the semantics of a given action, as well as a continuous component to capture how the action is performed.
“Our experiments show that we can learn a rich set of actions that offer the user a gaming-like experience to control the generated video. As future work, we plan to extend our method to multi-agent environments,” the researchers wrote.
“CADDY automatically discovers the most significant actions to condition video generation and can produce playable video generation models in a variety of settings, from video games to real videos.”