MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

We propose MotionCharacter, a human video generation framework specifically
designed for identity preservation and fine-grained motion control.


[Teaser figure: given a reference ID image and an action phrase, MotionCharacter animates the subject at user-chosen motion intensities. Row 1: action phrase "open mouth" at intensities 10 and 20, prompt "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath." Row 2: action phrase "smile" at intensities 10 and 20, prompt "The girl at the flower stall carefully arranges a bright bouquet of roses."]

Abstract

Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, integrating quantifiable aspects of human actions, such as motion intensity, into personalized T2V models remains challenging. In this paper, we propose MotionCharacter, a high-fidelity human video generation framework designed for fine-grained motion control (i.e., precise control of both action type and intensity). The limitations of current models stem from several key challenges: (1) Lacking fine-grained motion modeling, current subject-driven T2V models rely on coarse action descriptions that cannot precisely control the continuous nuances of motion intensity. To overcome this, we introduce a motion control module that decouples character motion into action type and motion intensity, enabling precise, text-driven manipulation of both. To further enhance motion dynamics, we also incorporate a region-aware loss that improves the quality of motion transitions. (2) Current methods struggle to preserve identity details when altering other attributes. We therefore introduce an ID (identity) content insertion module to maintain identity fidelity while enabling flexible attribute modifications, together with an ID-consistency loss that enhances identity consistency and detail fidelity. (3) Existing datasets do not sufficiently capture the nuances of human motion and facial features. Accordingly, we curate a new dataset, Human-Motion, which uses LLMs to generate detailed descriptions of motion and facial features for training. Extensive experiments demonstrate the effectiveness of MotionCharacter, with significant improvements in fine-grained motion control and high-fidelity human video generation.

Overview

Proposed Model Architecture
Our framework comprises three core components: the ID-Preserving Module, the Motion Control Module, and a composite loss function. The ID-Preserving Module ensures identity consistency by focusing on identity-specific regions within cross-attention layers, guided by an identity embedding \(C_{id}\) extracted from a reference ID image. The Motion Control Module conditions the generated motion on a motion intensity \(\mathcal{M}\) and a user-specified action phrase. The composite loss incorporates a Region-Aware Loss for high motion fidelity and an ID-Consistency Loss that keeps the output aligned with the reference ID image. During training, motion intensity is derived from optical flow; at inference, animations are generated from a user-defined motion intensity and action phrase, enabling fine-grained, controllable video synthesis.
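To make the wiring concrete, below is a minimal PyTorch sketch of how the two modules could plug into one denoiser block: text cross-attention attends to the prompt tokens with a fused motion condition appended, and a residual ID-preserving branch attends to \(C_{id}\). All module names, shapes, and the token-concatenation fusion are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class IDPreservingCrossAttention(nn.Module):
        """Residual cross-attention branch that lets denoiser tokens
        attend to the identity embedding C_id (illustrative sketch)."""
        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, hidden_states, c_id):
            # hidden_states: (B, N, dim) latent tokens; c_id: (B, T, dim)
            out, _ = self.attn(hidden_states, c_id, c_id)
            return hidden_states + out  # residual keeps the base generation path

    class DenoiserBlock(nn.Module):
        """One hypothetical block: text cross-attention with the fused
        motion condition appended, followed by the ID-preserving branch."""
        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.id_attn = IDPreservingCrossAttention(dim, num_heads)

        def forward(self, h, text_emb, motion_cond, c_id):
            # Append the motion condition (action phrase + intensity M) to the
            # prompt tokens so a single attention sees both control signals.
            ctx = torch.cat([text_emb, motion_cond], dim=1)
            h = h + self.text_attn(h, ctx, ctx)[0]
            return self.id_attn(h, c_id)

    # Shape check with dummy tensors:
    block = DenoiserBlock()
    h = torch.randn(1, 256, 768)       # latent tokens of one frame
    text = torch.randn(1, 77, 768)     # prompt embedding
    motion = torch.randn(1, 1, 768)    # fused action-phrase/intensity token
    cid = torch.randn(1, 4, 768)       # identity tokens from the ID image
    print(block(h, text, motion, cid).shape)  # torch.Size([1, 256, 768])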

Qualitative Comparisons

To further illustrate the effectiveness of our model, we provide additional qualitative comparisons against several competing methods below.


[Comparison figure: for each reference ID image, videos generated by IPA-PlusFace, IPA-FaceID-Portrait, IPA-FaceID-PlusV2, ID-Animator, and our method, driven by the prompts:
  • "A seasoned chef dressed in a white apron expertly seasons a delicious dish."
  • "A nostalgic grandmother gazes at old family photos, her face lighting up with memories."
  • "A young man grins confidently, his eyes gleaming with excitement as he nods and gives a thumbs-up."
  • "The man stretches his arms above his head, eyes shut from the stretch."
  • "An intellectual professor giving a lecture adjusts their glasses for a clearer view."
  • "Sitting in the sunlight, the elderly woman smiles, savoring the warmth around her."]

Action Phrase Control

To further explore the impact of action phrases on generated motion, we conducted experiments where we fixed the reference ID image, text prompt, and motion intensity, varying only the action phrase. This setup demonstrates how different action phrases influence specific motion details, enabling fine-grained control over generated actions.
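How the action phrase interacts with the main prompt is not spelled out here; as one hypothetical illustration, the sketch below encodes the prompt and each action phrase separately with a CLIP text encoder and concatenates their token embeddings so a denoiser could attend to the two streams independently. The encoder choice and the concatenation fusion are our assumptions, not the paper's confirmed mechanism.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "A nostalgic grandmother gazes at old family photos."
    for phrase in ["", "smile", "turn head"]:  # "" stands in for the null phrase
        tokens = tokenizer([prompt, phrase], padding="max_length",
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            emb = text_encoder(**tokens).last_hidden_state  # (2, 77, 768)
        # One simple fusion: concatenate along the token axis so the denoiser
        # can attend to prompt tokens and action-phrase tokens separately.
        cond = torch.cat([emb[0], emb[1]], dim=0)           # (154, 768)
        print(phrase or "null", tuple(cond.shape))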


[Action-phrase sweeps, each with the ID image, prompt, and intensity fixed:
  • "A nostalgic grandmother gazes at old family photos, her face lighting up with memories." with action phrase null / "smile" / "turn head".
  • "Iron Man soars through the clouds." with action phrase null / "open mouth" / "wink".
  • "The woman's face beams with pure happiness." with action phrase null / "hold a microphone" / "wave hand".
  • "A woman blows a kiss, her lips puckered." with action phrase "talk" / "talk, close eyes" / "talk, hold a microphone".]

Motion Intensity Control

To analyze the impact of motion intensity adjustments, we conducted controlled experiments where we kept the reference ID image, text prompt, and action phrase fixed while varying only the motion intensity. This setup highlights how different intensity levels scale the magnitude of the generated motion, demonstrating the model's responsiveness to the intensity parameter.
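A scalar intensity can be turned into a conditioning signal in many ways; the sketch below quantizes it to an integer in [0, 20] and looks up a learned embedding, in the spirit of diffusion timestep embeddings. This is an illustrative assumption, not the paper's confirmed mechanism.

    import torch
    import torch.nn as nn

    class IntensityEmbedding(nn.Module):
        """Maps a scalar motion intensity to a conditioning vector by
        quantizing to [0, 20] and looking up a learned embedding."""
        def __init__(self, max_intensity: int = 20, dim: int = 768):
            super().__init__()
            self.table = nn.Embedding(max_intensity + 1, dim)
            self.max_intensity = max_intensity

        def forward(self, intensity: torch.Tensor) -> torch.Tensor:
            # Round and clamp so fractional or out-of-range user inputs are safe.
            idx = intensity.round().clamp(0, self.max_intensity).long()
            return self.table(idx)

    emb = IntensityEmbedding()
    for level in (5.0, 10.0, 20.0):        # the levels swept in the figure below
        vec = emb(torch.tensor([level]))
        print(level, tuple(vec.shape))     # (1, 768) conditioning vector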


[Intensity sweeps, each with the ID image, prompt, and action phrase fixed, at intensities 5, 10, and 20:
  • Action phrase "open mouth", prompt "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath."
  • Action phrase "smile", prompt "The girl at the flower stall carefully arranges a bright bouquet of roses, a gentle smile playing on her lips."
  • Action phrase "open mouth", prompt "At the bustling market, the young man in a denim jacket carefully inspects a piece of fruit, looking for the perfect one."]

The Human-Motion Dataset

To build the Human-Motion Dataset, a multi-step pipeline was developed to ensure the collection of high-quality video clips:

1. Video Sources: Our data sources comprise video clips from diverse origins, including VFHQ, CelebV-Text, CelebV-HQ, AAHQ, and a private dataset, Sing Videos.

2. Filtering Process: To maintain data quality, a multi-step filtering process was applied (a combined sketch of these filters follows the list):

  • Visual Quality Check: We used CLIP Image Quality Assessment (CLIP-IQA) to evaluate visual quality by sampling one frame per clip, discarding videos with frames of low quality.
  • Resolution Filter: Videos with resolutions below 512 pixels were removed to uphold visual standards.
  • Text Overlay Detection: EasyOCR was used to detect excessive subtitles or text overlays, filtering out obstructed frames.
  • Face Detection: Videos containing multiple faces or low face detection confidence were discarded to ensure each video contains a single, clearly detectable person.

3. Captioning: To enrich motion-related data, we utilized MiniGPT to automatically generate two types of captions for each video:

  • Overall Descriptions (\(\mathcal{P}\)): general summaries of the video content.
  • Action Phrases (\(\mathcal{A}\)): detailed annotations of facial and body movements, used as the action phrases in our framework.

4. Optical Flow Estimation: We run the RAFT model over consecutive video frames to compute optical flow, which determines the motion intensity used during training and in the loss function (a sketch covering this step and the resampling of step 5 follows the list).

5. Motion Intensity Resampling: We resampled videos to balance the dataset, ensuring an even distribution of motion intensity values within the range of 0 to 20.

6. Dataset Summary: The Human-Motion dataset consists of 106,292 video clips. Each clip was rigorously filtered and re-annotated to ensure high-quality identity and motion information across diverse formats, resolutions, and styles.
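The sketch below strings the four filters of step 2 together on one sampled frame per clip, using CLIP-IQA from torchmetrics, EasyOCR, and MTCNN from facenet_pytorch. All thresholds, the shorter-side reading of the 512-pixel rule, and the choice of MTCNN as the face detector are our assumptions.

    import easyocr
    import numpy as np
    import torch
    from facenet_pytorch import MTCNN
    from torchmetrics.multimodal import CLIPImageQualityAssessment

    clip_iqa = CLIPImageQualityAssessment()    # defaults to the "quality" prompt
    ocr = easyocr.Reader(["en"])
    face_detector = MTCNN()

    def keep_clip(frame: np.ndarray) -> bool:
        """frame: one HxWx3 uint8 RGB frame sampled from the clip."""
        h, w = frame.shape[:2]
        if min(h, w) < 512:                    # resolution filter (shorter side assumed)
            return False
        img = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255
        if clip_iqa(img).item() < 0.5:         # visual quality check (threshold assumed)
            return False
        if len(ocr.readtext(frame)) > 2:       # text-overlay check (tolerance assumed)
            return False
        boxes, probs = face_detector.detect(frame)
        if boxes is None or len(boxes) != 1 or probs[0] < 0.9:
            return False                       # require a single, confident face
        return True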
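And a sketch of steps 4 and 5: estimating optical flow with torchvision's RAFT, reducing it to a scalar motion intensity per clip, and computing inverse-frequency weights that balance the intensity buckets. The reduction to a [0, 20] scale via mean flow magnitude is our assumption.

    import numpy as np
    import torch
    from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

    weights = Raft_Large_Weights.DEFAULT
    raft = raft_large(weights=weights).eval()
    preprocess = weights.transforms()

    @torch.no_grad()
    def motion_intensity(frames: torch.Tensor) -> int:
        """frames: (T, 3, H, W) float tensor in [0, 1], H and W divisible by 8
        (a RAFT requirement)."""
        prev, nxt = preprocess(frames[:-1], frames[1:])
        flow = raft(prev, nxt)[-1]                  # keep the last RAFT refinement
        magnitude = flow.norm(dim=1).mean()         # mean per-pixel displacement
        return int(magnitude.clamp(0, 20).round())  # assumed mapping to [0, 20]

    def balancing_weights(intensities: np.ndarray) -> np.ndarray:
        """Inverse-frequency sampling weights so each intensity bucket in
        [0, 20] is drawn about equally often (step 5)."""
        counts = np.bincount(intensities, minlength=21).clip(min=1)
        return 1.0 / counts[intensities]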

[Dataset pipeline figure: (1) video sources; (2) filtering out clips with text watermarks, blur, multiple people, or low resolution, keeping only high-quality clips; (3) captioning with overall descriptions and action phrases; (4) optical flow and optical-flow-mask estimation; (5) resampling to a balanced motion intensity range of [0, 20]; (6) the final Human-Motion dataset of 106,292 video clips.]