AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer


1 Southern University of Science and Technology
2 Tsinghua University

*Equal Contribution, Corresponding Author


Abstract

In the era of foundation models, achieving a unified understanding of all dynamic objects with a single network has the potential to enable stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this problem remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (Mammalia) and birds (Aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for Mammalia and Aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline yields two large-scale synthetic datasets: CtrlAni3D for quadrupeds (about 10k images with pixel-aligned SMAL labels) and CtrlAVES3D (about 7k images with pixel-aligned AVES labels). CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method outperforms existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm that both our network architecture and the generated synthetic datasets improve real-world performance.
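To make the family-aware MoE idea concrete, here is a minimal PyTorch sketch of one such block: a taxa-shared expert that is always applied, plus one taxa-specific expert per family, routed by a per-sample taxon label. This is an illustrative assumption of the layer partition described above, not the released implementation; the module names, hidden width, and additive mixing scheme are all hypothetical.

    import torch
    import torch.nn as nn

    class FamilyAwareMoEBlock(nn.Module):
        """Sketch of a feed-forward block split into taxa-shared and
        taxa-specific experts (0: mammalia, 1: aves)."""

        def __init__(self, dim: int = 1280, hidden: int = 5120, num_families: int = 2):
            super().__init__()
            # Taxa-shared expert: learns anatomy common to all animals.
            self.shared = nn.Sequential(
                nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
            )
            # One taxa-specific expert per family.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_families)
            )

        def forward(self, x: torch.Tensor, family: torch.Tensor) -> torch.Tensor:
            # x: (B, N, dim) token features; family: (B,) integer taxon labels.
            routed = torch.zeros_like(x)
            for fam_id, expert in enumerate(self.experts):
                mask = family == fam_id
                if mask.any():
                    # Route each sample to the expert of its taxon.
                    routed[mask] = expert(x[mask])
            return self.shared(x) + routed

Routing here is hard (by ground-truth family label) rather than learned; the point is only that distinct and common anatomical features flow through separate and shared parameters within a single model.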

AniMer+

Different from AniMer, AniMer+ estimates shape and pose for both mammals and birds. Given an image $I \in \mathbb{R}^{H \times W \times 3}$, we first use a ViT-MoE encoder to extract image feature tokens $\mathbf{F} \in \mathbb{R}^{192 \times 1280}$, while the class token attends to the image to capture information about the animal family. We then feed the feature tokens $\mathbf{F}$ into a Transformer decoder to obtain a feature vector $\boldsymbol{f} \in \mathbb{R}^{1 \times 1280}$. Finally, a regression head predicts the parameters of the parametric model, and these parameters are fed into the corresponding template (SMAL or AVES) to generate the 3D mesh. In parallel, the class token is fed into a predictor head for animal-family supervised contrastive learning. By combining AniMer+ with our dataset generation pipeline, our method effectively accommodates both mammals and birds, with the potential to generalize to any animal group representable by a parametric model. For more details, please refer to our paper.
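The sketch below traces this forward pass end to end, under stated assumptions. The encoder interface (`ViT-MoE` returning a class token plus 192 feature tokens of width 1280), the single learned decoder query, the decoder depth, and the 128-dimensional contrastive projection are all hypothetical stand-ins; only the tensor shapes of $\mathbf{F}$ and $\boldsymbol{f}$ follow the text above.

    import torch
    import torch.nn as nn

    class AniMerPlusSketch(nn.Module):
        """Illustrative inference flow: encoder tokens -> decoder -> f ->
        parametric-model parameters; class token -> contrastive embedding."""

        def __init__(self, encoder: nn.Module, num_params: int):
            super().__init__()
            self.encoder = encoder  # assumed to return (class_token, tokens)
            dec_layer = nn.TransformerDecoderLayer(
                d_model=1280, nhead=8, batch_first=True
            )
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
            # Single learned query so the decoder pools tokens into one vector f.
            self.query = nn.Parameter(torch.zeros(1, 1, 1280))
            # Regression head for SMAL / AVES pose, shape, and camera parameters.
            self.regressor = nn.Linear(1280, num_params)
            # Predictor head projecting the class token for supervised
            # contrastive learning over animal families.
            self.contrast_head = nn.Linear(1280, 128)

        def forward(self, image: torch.Tensor, family: torch.Tensor):
            # image: (B, 3, H, W); family: (B,) taxon labels used for MoE routing.
            cls_tok, tokens = self.encoder(image, family)  # tokens: (B, 192, 1280)
            q = self.query.expand(tokens.size(0), -1, -1)
            f = self.decoder(q, tokens)                    # f: (B, 1, 1280)
            params = self.regressor(f.squeeze(1))          # parametric-model params
            embed = self.contrast_head(cls_tok)            # contrastive embedding
            return params, embed

At inference, `params` would be passed to the corresponding template (SMAL for mammals, AVES for birds) to produce the 3D mesh, while `embed` is used only during training.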

BibTeX citation

    @misc{lyu2025animerunifiedposeshape,
      title={AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer},
      author={Jin Lyu and Liang An and Li Lin and Pujin Cheng and Yebin Liu and Xiaoying Tang},
      year={2025},
      eprint={2508.00298},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.00298},
    }

  

Credits

Thanks to RomanHauksson for the website template.