MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

We propose MotionCharacter, a human video generation framework specifically
designed for identity preservation and fine-grained motion control.


Teaser examples: given a reference ID image, MotionCharacter animates the subject according to a specified action phrase and motion-intensity coefficient (intensity 10 vs. 20 shown for each example).
• Action phrase "open mouth"; prompt: "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath."
• Action phrase "smile"; prompt: "The girl at the flower stall carefully arranges a bright bouquet of roses."

Abstract

Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in identity-preserving, high-quality video generation.

Overview

Proposed Model Architecture
Our proposed framework comprises three core components: the ID-Preserving Module, the Motion Control Module, and a composite loss function. The ID-Preserving Module ensures identity consistency by focusing on identity-specific regions within cross-attention layers, guided by an identity embedding \(C_{id}\) from a reference ID image. The Motion Control Module adjusts motion intensity \(\mathcal{M}\) based on user-specified prompts and action phrases. The loss function incorporates a Region-Aware Loss to ensure high motion fidelity and an ID-Consistency Loss to maintain alignment with the reference ID image. During training, motion intensity is derived from optical flow. At inference, human animations are generated based on user-defined motion intensity and specified action phrases, enabling fine-grained and controllable video synthesis.
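As a rough illustration of how the composite objective might be assembled, the following PyTorch-style sketch combines a standard denoising loss, a region-aware term weighted by a spatial mask, and an ID-consistency term between face embeddings. The tensor names and weights (`region_mask`, `lambda_region`, `lambda_id`) are assumptions for illustration, not the exact formulation.

```python
import torch.nn.functional as F

def composite_loss(noise_pred, noise_gt, region_mask, id_embed_pred, id_embed_ref,
                   lambda_region=1.0, lambda_id=0.1):
    """Illustrative composite objective; names and weights are assumptions."""
    # Standard diffusion noise-prediction loss over the whole frame.
    base = F.mse_loss(noise_pred, noise_gt)
    # Region-aware term: up-weight errors inside identity/motion regions (mask in [0, 1]).
    region = (region_mask * (noise_pred - noise_gt) ** 2).mean()
    # ID-consistency term: keep the predicted face embedding close to the reference one.
    id_term = 1.0 - F.cosine_similarity(id_embed_pred, id_embed_ref, dim=-1).mean()
    return base + lambda_region * region + lambda_id * id_term
```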

Qualitative Comparisons

To further illustrate the effectiveness of our model, we provide additional qualitative comparisons with other methods below.


Qualitative comparison videos: each example shows the reference ID image alongside results from IPA-PlusFace, IPA-FaceID-Portrait, IPA-FaceID-PlusV2, ID-Animator, and our method for the following prompts:
• A seasoned chef dressed in a white apron expertly seasons a delicious dish.
• A nostalgic grandmother gazes at old family photos, her face lighting up with memories.
• A young man grins confidently, his eyes gleaming with excitement as he nods and gives a thumbs-up.
• The man stretches his arms above his head, eyes shut from the stretch.
• An intellectual professor giving a lecture adjusts their glasses for a clearer view.
• Sitting in the sunlight, the elderly woman smiles, savoring the warmth around her.

Action Phrase Control

To further explore the impact of action phrases on generated motion, we conducted experiments where we fixed the reference ID image, text prompt, and motion intensity, varying only the action phrase. This setup demonstrates how different action phrases influence specific motion details, enabling fine-grained control over generated actions.


Each example fixes the reference ID image, text prompt, and motion intensity while varying only the action phrase:
• Action phrases: null / "smile" / "turn head"; prompt: "A nostalgic grandmother gazes at old family photos, her face lighting up with memories."
• Action phrases: null / "open mouth" / "wink"; prompt: "Iron Man soars through the clouds."
• Action phrases: null / "hold a microphone" / "wave hand"; prompt: "The woman's face beams with pure happiness."
• Action phrases: "talk" / "talk, close eyes" / "talk, hold a microphone"; prompt: "A woman blows a kiss, her lips puckered."

Motion Intensity Control

To analyze the impact of motion intensity adjustments, we conducted controlled experiments where we kept the reference ID image, text prompt, and action phrase fixed while varying only the motion intensity. This approach highlights how different intensity levels affect output quality and clarity, demonstrating the model's responsiveness to motion control parameters.


Each example fixes the reference ID image, text prompt, and action phrase while varying only the motion intensity (5, 10, and 20):
• Action phrase "open mouth"; prompt: "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath."
• Action phrase "smile"; prompt: "The girl at the flower stall carefully arranges a bright bouquet of roses, a gentle smile playing on her lips."
• Action phrase "open mouth"; prompt: "At the bustling market, the young man in a denim jacket carefully inspects a piece of fruit, looking for the perfect one."

The Human-Motion Dataset

To build the Human-Motion Dataset, a multi-step pipeline was developed to ensure the collection of high-quality video clips:

1. Video Sources: Our data sources comprise video clips from diverse origins, including VFHQ, CelebV-Text, CelebV-HQ, AAHQ, and a private dataset, Sing Videos.

2. Filtering Process: To maintain data quality, a multi-step filtering process was applied:

  • Visual Quality Check: We used CLIP Image Quality Assessment (CLIP-IQA) to evaluate visual quality by sampling one frame per clip, discarding videos with frames of low quality.
  • Resolution Filter: Videos with resolutions below 512 pixels were removed to uphold visual standards.
  • Text Overlay Detection: EasyOCR was used to detect excessive subtitles or text overlays, filtering out obstructed frames.
  • Face Detection: Videos containing multiple faces or low face detection confidence were discarded to ensure each video contains a single, clearly detectable person.
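A rough sketch of how these per-clip checks could be chained is shown below; the libraries (torchmetrics CLIP-IQA, EasyOCR, an OpenCV Haar-cascade face detector) and the thresholds are illustrative assumptions, not the exact filtering setup.

```python
# Illustrative per-frame filtering checks; thresholds and the Haar-cascade face
# detector are assumptions standing in for the actual filtering configuration.
import cv2
import easyocr
import torch
from torchmetrics.multimodal import CLIPImageQualityAssessment

clip_iqa = CLIPImageQualityAssessment()   # visual-quality score in [0, 1]
ocr = easyocr.Reader(["en"])              # detects subtitles / text overlays
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_frame(frame_bgr, min_side=512, iqa_thresh=0.5, max_text_boxes=2):
    h, w = frame_bgr.shape[:2]
    if min(h, w) < min_side:                          # resolution filter
        return False
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    img = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    if clip_iqa(img).item() < iqa_thresh:             # CLIP-IQA quality check
        return False
    if len(ocr.readtext(rgb)) > max_text_boxes:       # excessive text overlays
        return False
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return len(face_detector.detectMultiScale(gray)) == 1   # exactly one face
```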

3. Captioning: To enrich motion-related data, we utilized MiniGPT to automatically generate two types of captions for each video:

  • Overall Descriptions (𝒫): General summaries of the video content.
  • Action Phrases (𝒜): Detailed annotations of facial and body movements, serving as action phrases 𝒜 in our framework.

4. Optical Flow Estimation: We applied the RAFT model to video frames to compute optical flow, from which we derived the motion intensity used during training and in optimizing the loss function.
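A minimal sketch of this step is given below; it uses torchvision's RAFT weights for illustration, and the mapping from mean flow magnitude to the 0-20 intensity scale is an assumption rather than the exact formulation.

```python
# Minimal sketch: per-clip motion intensity from RAFT optical flow.
# torchvision's RAFT is used for illustration; the magnitude-to-intensity mapping
# (`scale`, clamping to 20) is an assumption.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def motion_intensity(frames, max_intensity=20.0, scale=1.0):
    """frames: float tensor (T, 3, H, W) in [0, 1], with H and W divisible by 8."""
    prev, nxt = preprocess(frames[:-1], frames[1:])
    flow = raft(prev, nxt)[-1]              # final refinement, shape (T-1, 2, H, W)
    magnitude = flow.norm(dim=1).mean()     # mean per-pixel displacement across the clip
    return min(max_intensity, scale * magnitude.item())
```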

5. Motion Intensity Resampling: We resampled videos to balance the dataset, ensuring an even distribution of motion intensity values within the range of 0 to 20.
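One simple way to realize this resampling, sketched below, is to bin clips by intensity and subsample every bin to a common size; the unit-width bins and uniform cap are assumptions for illustration.

```python
# Illustrative resampling: bin clips by motion intensity and subsample each bin
# to the size of the smallest occupied bin so the 0-20 range is evenly covered.
import random
from collections import defaultdict

def resample_by_intensity(clips, num_bins=20, max_intensity=20.0, seed=0):
    """clips: iterable of (clip_id, intensity) with intensity in [0, max_intensity]."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for clip_id, intensity in clips:
        b = min(num_bins - 1, int(intensity / max_intensity * num_bins))
        bins[b].append(clip_id)
    cap = min(len(ids) for ids in bins.values())   # size of the smallest occupied bin
    balanced = []
    for b in sorted(bins):
        balanced.extend(rng.sample(bins[b], cap))
    return balanced
```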

6. Dataset Summary: The Human-Motion dataset consists of 106,292 video clips. Each clip was rigorously filtered and re-annotated to ensure high-quality identity and motion information across diverse formats, resolutions, and styles.

Dataset pipeline overview: (1) video sources; (2) filtering out clips with text watermarks, blur, multiple people, or low resolution, keeping only high-quality clips; (3) captioning with overall descriptions and action phrases; (4) optical flow estimation, producing optical flow maps and flow masks; (5) resampling to a balanced motion-intensity range of [0, 20]; (6) the final Human-Motion dataset of 106,292 video clips.