1. Introduction
Video generation has seen remarkable progress, yet precise, controllable motion within generated sequences remains a significant challenge. Current methods struggle to decouple content generation from motion dynamics, limiting users' ability to dictate specific object or camera paths. This paper addresses that gap by introducing a novel approach that enables explicit motion control through latent trajectory guidance. The framework builds on diffusion models for video generation and variational autoencoders (VAEs) for latent-space manipulation.
2. Related Work
Prior research in video generation has explored recurrent neural networks, GANs, and, more recently, diffusion models, typically prioritizing overall video realism over fine-grained motion control. Some works condition generation on inputs such as text or motion vectors, but these often lack the precision needed for complex or custom trajectories. In particular, latent-space methods for video synthesis tend to entangle motion and appearance factors, making independent manipulation difficult. Wan-Move differentiates itself by explicitly modeling and guiding motion as a trajectory in a disentangled latent space.
3. Methodology
The Wan-Move framework employs a two-stage approach: a base video generation model first creates a sequence, which a latent trajectory guidance module then refines. This module projects the desired motion characteristics into a learnable latent trajectory that modulates the generation process in latent space. By optimizing this latent trajectory against user-defined motion parameters, the system ensures that the generated video follows the intended movement, as sketched below. A dedicated motion encoder-decoder network is trained to translate high-level motion descriptions into precise latent-space paths, enabling fine-grained control over object and camera movement throughout the video.
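This section gives no implementation details, so the following is a minimal PyTorch sketch of the scheme under two assumptions not stated in the paper: that motion is specified as a 2-D point trajectory in normalized coordinates, and that guidance acts additively on per-frame latents. All names here (MotionEncoder, MotionDecoder, guide_latents, scale) are illustrative, not Wan-Move's actual components.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T, D = 16, 64  # frames and latent dimension; illustrative values only

    class MotionEncoder(nn.Module):
        # Maps a 2-D point trajectory (T, 2) to a latent path (T, D).
        def __init__(self, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                     nn.Linear(hidden, D))

        def forward(self, traj):
            return self.net(traj)

    class MotionDecoder(nn.Module):
        # Inverse mapping: latent path (T, D) back to screen points (T, 2).
        def __init__(self, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2))

        def forward(self, path):
            return self.net(path)

    def guide_latents(video_latents, latent_path, scale=0.1):
        # Stage two: additively modulate the base model's per-frame latents
        # with the optimized motion path (an assumed form of "modulation").
        return video_latents + scale * latent_path

    # Encoder/decoder weights would come from the paper's training stage;
    # here they are random, so the loop only illustrates the optimization.
    encoder, decoder = MotionEncoder(), MotionDecoder()
    target_traj = torch.rand(T, 2)  # user-defined path, normalized coords

    # Initialize the learnable latent trajectory from the encoder, then
    # refine it so its decoded motion matches the user-defined target.
    latent_path = encoder(target_traj).detach().requires_grad_(True)
    opt = torch.optim.Adam([latent_path], lr=1e-2)
    for _ in range(200):
        loss = F.mse_loss(decoder(latent_path), target_traj)
        opt.zero_grad()
        loss.backward()
        opt.step()

    video_latents = torch.randn(T, D)  # stand-in for base-model latents
    guided = guide_latents(video_latents, latent_path.detach())

The detach() before guidance reflects the two-stage split: the trajectory is optimized first, then applied as a fixed modulation while the base model generates.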
4. Experimental Results
Our experimental evaluation shows that Wan-Move significantly outperforms baseline methods in both motion accuracy and perceived video quality. It improves substantially on Fréchet Video Distance (FVD) and Fréchet Inception Distance (FID), and scores highest on a newly proposed Motion Alignment Score (MAS) that measures adherence to specified trajectories. Qualitative results further confirm the model's ability to generate diverse videos with complex, precise motion paths. The table below summarizes key performance metrics across datasets, highlighting Wan-Move's robustness and superior control compared with existing state-of-the-art methods.
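The exact formulation of MAS is not reproduced in this section. As a hypothetical illustration only, one plausible trajectory-adherence measure maps the mean per-frame endpoint error between a trajectory tracked in the generated video and the user-specified target into a 0-to-1 score; the function name and exponential normalization below are assumptions, not the paper's definition.

    import numpy as np

    def motion_alignment_score(pred_traj, target_traj, tol=0.05):
        # Hypothetical: mean per-frame endpoint error between the tracked
        # trajectory of the generated video and the user target, both
        # (T, 2) arrays in normalized coordinates, mapped to (0, 1]
        # where 1 means perfect adherence.
        err = np.linalg.norm(pred_traj - target_traj, axis=-1).mean()
        return float(np.exp(-err / tol))

Under this reading, tol sets how quickly the score decays with tracking error, so scores are comparable only at a fixed tolerance.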
5. Discussion
The results highlight the effectiveness of latent trajectory guidance for highly controllable video generation, a significant step toward more practical and expressive video synthesis tools. The ability to dictate precise motion in latent space opens new avenues in content creation, animation, and virtual reality. Future work could extend the framework to real-time interactive motion control and integrate semantic understanding for more nuanced high-level motion commands. Further research will also expand the model's capacity to handle longer and more complex motion sequences while maintaining temporal consistency.