PHOSA

Photorealistic 3D Sign Avatar Modeling and Benchmark

PHOSA introduces MVSign, a multi-view Chinese Sign Language benchmark co-designed with Deaf experts, and a decoupled photorealistic avatar representation for expressive hands and faces.

ECCV 2026
Dataset
16 synchronized RGB cameras
115K high-resolution frames
5 native CSL signers
2048P capture resolution
SMPL-X annotations for hands, face, and body

In this work, we focus on photorealistic sign avatar modeling, which is crucial for effective communication with the Deaf community and is characterized by complex hand gestures and nuanced facial expressions. To this end, we introduce MVSign, the first multi-view Chinese sign language dataset co-designed with Deaf experts, featuring diverse gestures and rich annotations. For precise SMPL-X annotation, we develop a hybrid fitting pipeline that produces accurate body, hand, and facial parameters and can also be applied to the monocular setting. Building on MVSign, we propose a decoupled sign avatar representation that isolates body, head, and hand components to capture complex articulations, together with a motion-aware sampling strategy to handle motion blur and balance gesture diversity. Extensive experiments demonstrate that our method achieves high-fidelity visual results on MVSign, particularly in detailed hand and facial regions, and generalizes well to in-the-wild monocular sign language videos. The dataset and code will be publicly released.

01 Designed with Deaf experts

The capture protocol follows sign-language components such as hand shapes, movement, orientation, location, and facial expression.

02 16 calibrated views

One frontal camera, ten lateral cameras, and five dedicated head views cover body motion, hand articulation, facial expression, and mouthing.

03 Rich avatar-ready labels

Each sequence includes matting, body-part segmentation, 3D keypoints, and expressive SMPL-X parameters for body, hands, and face.

PHOSA combines complementary estimators instead of relying on a single whole-body detector. DWPose provides dense 2D keypoints, HaMeR strengthens MANO hand estimates, multi-view triangulation recovers 3D supervision, and INFERNO refines facial expressions. Temporal initialization and SmoothNet reduce high-frequency artifacts across frames.

PHOSA hybrid SMPL-X fitting pipeline fusing multi-view pose, hand mesh, triangulation, and facial expression estimation.
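
The multi-view triangulation step in the fitting pipeline can be illustrated with a standard linear (DLT) triangulation. This is a minimal sketch, not the PHOSA implementation: the function names `project` and `triangulate_point`, the toy intrinsics, and the two-camera setup are assumptions made for illustration, and a real pipeline would typically also weight each view by keypoint confidence and reject outliers.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (3,) to pixels with a 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from N >= 2 calibrated views.

    projections: (N, 3, 4) projection matrices P = K [R | t].
    points_2d:   (N, 2) pixel observations of the same keypoint.
    """
    projections = np.asarray(projections, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


# Toy setup (hypothetical values): two cameras one unit apart along x.
K = np.array([[1000.0, 0.0, 512.0],
              [0.0, 1000.0, 512.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 3.0])
observations = np.stack([project(P1, X_true), project(P2, X_true)])
X_est = triangulate_point([P1, P2], observations)
```

With noise-free observations, `X_est` matches `X_true` up to numerical precision. With 16 views, the same function simply stacks 32 rows, which is what makes the estimate robust to noise in any single view.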

SMPL-X Annotation

Body-Part Segmentation

Existing avatar models often learn a single whole-body representation, forcing fine hand and facial motion to compete with coarse body movement. PHOSA separates Gaussian maps into body, head, and hands, then applies partial hand kinematic decoupling so hand pose maps depend on hand gestures rather than global body pose.

PHOSA avatar pipeline with body, head, and hand StyleUNet branches predicting Gaussian maps for photorealistic sign avatar rendering.
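
One way to read the hand decoupling idea is that the hand branch should see the hand geometry in a wrist-local frame, so that the same gesture produces the same input wherever the arm places the hand. The sketch below illustrates only that invariance. It is an assumption about the general idea, not the PHOSA code, and `rot_z` and `to_wrist_frame` are hypothetical helper names.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def to_wrist_frame(hand_points, wrist_rotation, wrist_position):
    """Express world-space hand points in the wrist's local frame.

    hand_points:    (N, 3) posed hand vertices or joints in world space.
    wrist_rotation: (3, 3) world-space rotation of the wrist joint.
    wrist_position: (3,)   world-space position of the wrist joint.
    """
    hand_points = np.asarray(hand_points, dtype=float)
    # Row-vector form of R^T (p - t).
    return (hand_points - wrist_position) @ wrist_rotation


# The same gesture (points in wrist-local coordinates, hypothetical values) ...
gesture = np.array([[0.00, 0.00, 0.00],
                    [0.03, 0.01, 0.00],
                    [0.05, 0.02, 0.01]])

# ... placed in the world by two different body poses.
R_a, t_a = rot_z(0.3), np.array([0.1, 1.2, 0.0])
R_b, t_b = rot_z(-1.1), np.array([-0.4, 0.9, 0.3])
world_a = gesture @ R_a.T + t_a
world_b = gesture @ R_b.T + t_b

local_a = to_wrist_frame(world_a, R_a, t_a)
local_b = to_wrist_frame(world_b, R_b, t_b)
```

Here `local_a` and `local_b` are identical even though `world_a` and `world_b` differ, so a network conditioned on the local points depends on the gesture alone.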
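| Method | Full PSNR↑ | Full SSIM↑ | Full LPIPS↓ | Hand PSNR↑ | Hand SSIM↑ | Hand LPIPS↓ | Face PSNR↑ | Face SSIM↑ | Face LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|
| SplattingAvatar | 23.71 | 0.9625 | 0.0494 | 15.64 | 0.6762 | 0.3884 | 16.90 | 0.7633 | 0.2423 |
| GaussianAvatar | 24.23 | 0.9633 | 0.0428 | 16.30 | 0.6833 | 0.3082 | 17.78 | 0.7737 | 0.2086 |
| AnimatableGS | 25.09 | 0.9647 | 0.0465 | 16.95 | 0.7002 | 0.3135 | 18.63 | 0.7842 | 0.2230 |
| EVA | 25.56 | 0.9667 | 0.0436 | 17.43 | 0.7154 | 0.2869 | 19.21 | 0.8000 | 0.1964 |
| MmlpHuman | 25.91 | 0.9652 | 0.0460 | 17.41 | 0.7084 | 0.2675 | 19.38 | 0.8025 | 0.1869 |
| Ours | 27.03 | 0.9689 | 0.0393 | 18.55 | 0.7325 | 0.2568 | 20.60 | 0.8189 | 0.1757 |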
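
For readers unfamiliar with region-wise metrics, the sketch below shows how PSNR can be computed over the full image and over a sub-region. It assumes the hand and face regions are supplied as binary masks. The page does not state whether PHOSA uses masks or cropped boxes, so the function `masked_psnr` and the toy data are illustrative only.

```python
import numpy as np


def masked_psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB between two images, optionally restricted to a region.

    pred, target: (H, W, C) arrays with values in [0, max_val].
    mask:         optional (H, W) boolean array selecting the region.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq_err = (pred - target) ** 2
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():
            raise ValueError("mask selects no pixels")
        sq_err = sq_err[mask]  # keeps all channels of the selected pixels
    mse = sq_err.mean()
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy example: the error is concentrated in a small "hand" region.
target = np.zeros((4, 4, 3))
pred = np.zeros((4, 4, 3))
hand_mask = np.zeros((4, 4), dtype=bool)
hand_mask[:2, :2] = True
pred[hand_mask] = 0.1

full_psnr = masked_psnr(pred, target)             # about 26.02 dB
hand_psnr = masked_psnr(pred, target, hand_mask)  # about 20 dB
```

The toy example mirrors the pattern in the table: a small, hard region scores well below the full image because its errors are no longer averaged with the many easy pixels around it.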
@inproceedings{wang2026phosa,
title = {PHOSA: Photorealistic 3D Sign Avatar Modeling and Benchmark},
author = {Wang, Haodong and Hu, Hezhen and Zhou, Wengang and Li, Houqiang},
booktitle = {European Conference on Computer Vision},
year = {2026}
}