MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

We propose MotionCharacter, a human video generation framework specifically
designed for identity preservation and fine-grained motion control.


Teaser examples: given a reference ID image, MotionCharacter animates the subject according to a specified action phrase and motion-intensity coefficient (intensity 10 vs. 20 shown for each example).
• Action phrase "open mouth"; prompt: "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath."
• Action phrase "smile"; prompt: "The girl at the flower stall carefully arranges a bright bouquet of roses."

Abstract

Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in identity-preserving, high-quality video generation.

Overview

Proposed Model Architecture
Our proposed framework comprises three core components: the ID-Preserving Module, the Motion Control Module, and a composite loss function. The ID-Preserving Module ensures identity consistency by focusing on identity-specific regions within cross-attention layers, guided by an identity embedding \(C_{id}\) from a reference ID image. The Motion Control Module adjusts motion intensity \(\mathcal{M}\) based on user-specified prompts and action phrases. The loss function incorporates a Region-Aware Loss to ensure high motion fidelity and an ID-Consistency Loss to maintain alignment with the reference ID image. During training, motion intensity is derived from optical flow. At inference, human animations are generated based on user-defined motion intensity and specified action phrases, enabling fine-grained and controllable video synthesis.
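As a rough illustration of how the composite objective might be assembled, the following PyTorch-style sketch combines a standard denoising loss, a region-aware term weighted by a spatial mask, and an ID-consistency term between face embeddings. The tensor names and weights (`region_mask`, `lambda_region`, `lambda_id`) are assumptions for illustration, not the exact formulation.

```python
import torch.nn.functional as F

def composite_loss(noise_pred, noise_gt, region_mask, id_embed_pred, id_embed_ref,
                   lambda_region=1.0, lambda_id=0.1):
    """Illustrative composite objective; names and weights are assumptions."""
    # Standard diffusion noise-prediction loss over the whole frame.
    base = F.mse_loss(noise_pred, noise_gt)
    # Region-aware term: up-weight errors inside identity/motion regions (mask in [0, 1]).
    region = (region_mask * (noise_pred - noise_gt) ** 2).mean()
    # ID-consistency term: keep the predicted face embedding close to the reference one.
    id_term = 1.0 - F.cosine_similarity(id_embed_pred, id_embed_ref, dim=-1).mean()
    return base + lambda_region * region + lambda_id * id_term
```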

Qualitative Comparisons

To further illustrate the effectiveness of our model, we provide additional qualitative comparisons with other methods below.


Qualitative comparison videos: each example shows the reference ID image alongside results from IPA-PlusFace, IPA-FaceID-Portrait, IPA-FaceID-PlusV2, ID-Animator, and our method for the following prompts:
• A seasoned chef dressed in a white apron expertly seasons a delicious dish.
• A nostalgic grandmother gazes at old family photos, her face lighting up with memories.
• A young man grins confidently, his eyes gleaming with excitement as he nods and gives a thumbs-up.
• The man stretches his arms above his head, eyes shut from the stretch.
• An intellectual professor giving a lecture adjusts their glasses for a clearer view.
• Sitting in the sunlight, the elderly woman smiles, savoring the warmth around her.

Action Phrase Control

To further explore the impact of action phrases on generated motion, we conducted experiments where we fixed the reference ID image, text prompt, and motion intensity, varying only the action phrase. This setup demonstrates how different action phrases influence specific motion details, enabling fine-grained control over generated actions.


Each example fixes the reference ID image, text prompt, and motion intensity while varying only the action phrase:
• Action phrases: null / "smile" / "turn head"; prompt: "A nostalgic grandmother gazes at old family photos, her face lighting up with memories."
• Action phrases: null / "open mouth" / "wink"; prompt: "Iron Man soars through the clouds."
• Action phrases: null / "hold a microphone" / "wave hand"; prompt: "The woman's face beams with pure happiness."
• Action phrases: "talk" / "talk, close eyes" / "talk, hold a microphone"; prompt: "A woman blows a kiss, her lips puckered."

Motion Intensity Control

To analyze the impact of motion intensity adjustments, we conducted controlled experiments where we kept the reference ID image, text prompt, and action phrase fixed while varying only the motion intensity. This approach highlights how different intensity levels affect output quality and clarity, demonstrating the model's responsiveness to motion control parameters.


Each example fixes the reference ID image, text prompt, and action phrase while varying only the motion intensity (5, 10, and 20):
• Action phrase "open mouth"; prompt: "A nature-loving hiker with a majestic mountain in the background takes a deep, refreshing breath."
• Action phrase "smile"; prompt: "The girl at the flower stall carefully arranges a bright bouquet of roses, a gentle smile playing on her lips."
• Action phrase "open mouth"; prompt: "At the bustling market, the young man in a denim jacket carefully inspects a piece of fruit, looking for the perfect one."

The Human-Motion Dataset

To build the Human-Motion Dataset, a multi-step pipeline was developed to ensure the collection of high-quality video clips:

1. Video Sources: Our data sources comprise video clips from diverse origins, including VFHQ, CelebV-Text, CelebV-HQ, AAHQ, and a private dataset, Sing Videos.

2. Filtering Process: To maintain data quality, a multi-step filtering process was applied:

  • Visual Quality Check: We used CLIP Image Quality Assessment (CLIP-IQA) to evaluate visual quality by sampling one frame per clip, discarding videos with frames of low quality.
  • Resolution Filter: Videos with resolutions below 512 pixels were removed to uphold visual standards.
  • Text Overlay Detection: EasyOCR was used to detect excessive subtitles or text overlays, filtering out obstructed frames.
  • Face Detection: Videos containing multiple faces or low face detection confidence were discarded to ensure each video contains a single, clearly detectable person.
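A rough sketch of how these per-clip checks could be chained is shown below; the libraries (torchmetrics CLIP-IQA, EasyOCR, an OpenCV Haar-cascade face detector) and the thresholds are illustrative assumptions, not the exact filtering setup.

```python
# Illustrative per-frame filtering checks; thresholds and the Haar-cascade face
# detector are assumptions standing in for the actual filtering configuration.
import cv2
import easyocr
import torch
from torchmetrics.multimodal import CLIPImageQualityAssessment

clip_iqa = CLIPImageQualityAssessment()   # visual-quality score in [0, 1]
ocr = easyocr.Reader(["en"])              # detects subtitles / text overlays
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_frame(frame_bgr, min_side=512, iqa_thresh=0.5, max_text_boxes=2):
    h, w = frame_bgr.shape[:2]
    if min(h, w) < min_side:                          # resolution filter
        return False
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    img = torch.from_numpy(rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    if clip_iqa(img).item() < iqa_thresh:             # CLIP-IQA quality check
        return False
    if len(ocr.readtext(rgb)) > max_text_boxes:       # excessive text overlays
        return False
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return len(face_detector.detectMultiScale(gray)) == 1   # exactly one face
```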

3. Captioning: To enrich motion-related data, we utilized MiniGPT to automatically generate two types of captions for each video:

  • Overall Descriptions (𝒫): General summaries of the video content.
  • Action Phrases (𝒜): Detailed annotations of facial and body movements, serving as action phrases 𝒜 in our framework.

4. Optical Flow Estimation: We applied the RAFT model to video frames to compute optical flow, from which we derived the motion intensity used during training and in optimizing the loss function.
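A minimal sketch of this step is given below; it uses torchvision's RAFT weights for illustration, and the mapping from mean flow magnitude to the 0-20 intensity scale is an assumption rather than the exact formulation.

```python
# Minimal sketch: per-clip motion intensity from RAFT optical flow.
# torchvision's RAFT is used for illustration; the magnitude-to-intensity mapping
# (`scale`, clamping to 20) is an assumption.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def motion_intensity(frames, max_intensity=20.0, scale=1.0):
    """frames: float tensor (T, 3, H, W) in [0, 1], with H and W divisible by 8."""
    prev, nxt = preprocess(frames[:-1], frames[1:])
    flow = raft(prev, nxt)[-1]              # final refinement, shape (T-1, 2, H, W)
    magnitude = flow.norm(dim=1).mean()     # mean per-pixel displacement across the clip
    return min(max_intensity, scale * magnitude.item())
```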

5. Motion Intensity Resampling: We resampled videos to balance the dataset, ensuring an even distribution of motion intensity values within the range of 0 to 20.
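One simple way to realize this resampling, sketched below, is to bin clips by intensity and subsample every bin to a common size; the unit-width bins and uniform cap are assumptions for illustration.

```python
# Illustrative resampling: bin clips by motion intensity and subsample each bin
# to the size of the smallest occupied bin so the 0-20 range is evenly covered.
import random
from collections import defaultdict

def resample_by_intensity(clips, num_bins=20, max_intensity=20.0, seed=0):
    """clips: iterable of (clip_id, intensity) with intensity in [0, max_intensity]."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for clip_id, intensity in clips:
        b = min(num_bins - 1, int(intensity / max_intensity * num_bins))
        bins[b].append(clip_id)
    cap = min(len(ids) for ids in bins.values())   # size of the smallest occupied bin
    balanced = []
    for b in sorted(bins):
        balanced.extend(rng.sample(bins[b], cap))
    return balanced
```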

6. Dataset Summary: The Human-Motion dataset consists of 106,292 video clips. Each clip was rigorously filtered and re-annotated to ensure high-quality identity and motion information across diverse formats, resolutions, and styles.

Dataset pipeline overview: (1) video sources; (2) filtering out clips with text watermarks, blur, multiple people, or low resolution, keeping only high-quality clips; (3) captioning with overall descriptions and action phrases; (4) optical flow estimation, producing optical flow maps and flow masks; (5) resampling to a balanced motion-intensity range of [0, 20]; (6) the final Human-Motion dataset of 106,292 video clips.