Given a real-world image (a, d, g) as the input condition, our framework generates articulated 3D objects with realistic geometry, articulation, and appearance. For each example, we first generate an articulation-aware voxel structure (b, e, h) and then decode it into 3D Gaussian splats that support physically plausible part-level motion (c, f, i). The resulting models exhibit high visual fidelity and motion consistency across diverse object types.
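To make "physically plausible part-level motion" concrete, the sketch below shows how a part's points (e.g., Gaussian centers) can be posed by a single revolute or prismatic joint. This is an illustrative example only, not the ArtiLatent implementation; the joint parameterization (axis, origin, state) follows common URDF conventions, and the function names are our own.

```python
# Illustrative sketch (not the ArtiLatent code): posing a part's points with
# one revolute or prismatic joint, parameterized URDF-style.
import numpy as np

def rotation_about_axis(axis: np.ndarray, angle: float) -> np.ndarray:
    """Rodrigues' formula: 3x3 rotation by `angle` radians about unit `axis`."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def articulate_part(points: np.ndarray, joint_type: str,
                    axis: np.ndarray, origin: np.ndarray,
                    state: float) -> np.ndarray:
    """Move a part's points (N, 3) to the pose given by the joint `state`.

    - 'revolute':  `state` is an angle (radians) about `axis` through `origin`.
    - 'prismatic': `state` is a translation distance along `axis`.
    """
    if joint_type == "revolute":
        R = rotation_about_axis(axis, state)
        return (points - origin) @ R.T + origin
    if joint_type == "prismatic":
        return points + state * (axis / np.linalg.norm(axis))
    raise ValueError(f"unknown joint type: {joint_type}")

# Example: slide a drawer open by 0.3 units along its (assumed) +y axis.
drawer_pts = np.random.rand(1000, 3)
opened = articulate_part(drawer_pts, "prismatic",
                         axis=np.array([0.0, 1.0, 0.0]),
                         origin=np.zeros(3), state=0.3)
```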
We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties—including joint type, axis, origin, range, and part category—into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from the PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.
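As a minimal sketch of the kind of per-voxel input the abstract describes, the snippet below packs occupancy together with the articulation attributes (joint type, axis, origin, range, part category) into a channel-wise feature grid that a VAE encoder could consume. The shapes, channel layout, and class counts here are assumptions for illustration, not the paper's actual specification.

```python
# Hypothetical packing of articulation-aware voxel attributes into a feature
# grid; channel layout and vocabulary sizes are assumed, not from the paper.
import torch
import torch.nn.functional as F

NUM_JOINT_TYPES = 4    # assumed: e.g., fixed, revolute, prismatic, continuous
NUM_PART_CLASSES = 16  # assumed part-category vocabulary size

def pack_voxel_features(occupancy, joint_type, axis, origin, joint_range, part_cls):
    """Concatenate articulation attributes into per-voxel channels.

    occupancy:   (D, D, D)    binary grid
    joint_type:  (D, D, D)    int (long) joint-type id per voxel
    axis:        (D, D, D, 3) unit joint axis
    origin:      (D, D, D, 3) joint origin
    joint_range: (D, D, D, 2) [min, max] motion limits
    part_cls:    (D, D, D)    int (long) part-category id
    returns:     (C, D, D, D) feature grid, C = 1 + 4 + 3 + 3 + 2 + 16 = 29
    """
    feats = torch.cat([
        occupancy.unsqueeze(-1).float(),
        F.one_hot(joint_type, NUM_JOINT_TYPES).float(),
        axis,
        origin,
        joint_range,
        F.one_hot(part_cls, NUM_PART_CLASSES).float(),
    ], dim=-1)
    return feats.permute(3, 0, 1, 2)  # channels-first for a 3D conv encoder
```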
The overall architecture of ArtiLatent. Given voxel-level articulation-aware inputs (occupancy, semantics, joint types, bounding boxes, joint parameters, and motion ranges), we encode them into a latent representation using an articulation-aware VAE. A conditional diffusion model samples articulation-aware latent codes under user-specified conditions (e.g., an image), which are then decoded into an animatable voxel structure. The final appearance is generated using an articulation-aware Gaussian decoder, producing high-fidelity 3D Gaussian splats with consistent geometry and appearance across motion states.
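The snippet below sketches the idea of conditioning Gaussian appearance on articulation state: per-voxel latent features are fused with an embedding of a normalized joint state before predicting raw 3D Gaussian parameters, so occluded regions (e.g., a drawer interior) can receive plausible appearance once opened. The module name, layer sizes, and Gaussian parameterization are our assumptions, not the paper's exact decoder.

```python
# Hypothetical articulation-aware Gaussian decoder head; sizes and the
# per-Gaussian parameterization are illustrative assumptions.
import torch
import torch.nn as nn

class ArticulationAwareGaussianHead(nn.Module):
    def __init__(self, feat_dim=64, state_dim=16, gaussians_per_voxel=4):
        super().__init__()
        self.state_embed = nn.Sequential(nn.Linear(1, state_dim), nn.SiLU())
        # Per Gaussian: 3 offset + 3 scale + 4 rotation quat + 3 color + 1 opacity = 14
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 128), nn.SiLU(),
            nn.Linear(128, gaussians_per_voxel * 14),
        )
        self.k = gaussians_per_voxel

    def forward(self, voxel_feats, joint_state):
        """voxel_feats: (N, feat_dim) features of occupied voxels.
        joint_state:   (N, 1) normalized articulation state of each voxel's part.
        returns:       (N, k, 14) raw Gaussian parameters per voxel."""
        s = self.state_embed(joint_state)
        params = self.mlp(torch.cat([voxel_feats, s], dim=-1))
        return params.view(-1, self.k, 14)

# Usage with dummy data:
head = ArticulationAwareGaussianHead()
feats = torch.randn(1024, 64)   # occupied-voxel latent features
state = torch.rand(1024, 1)     # e.g., drawer opened fraction in [0, 1]
gauss = head(feats, state)      # (1024, 4, 14)
```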
Our work is inspired by the following works:
TRELLIS is a large 3D asset generation model. It takes in text or image prompts and generates high-quality 3D assets in various formats, such as Radiance Fields, 3D Gaussians, and meshes.
CAGE addresses the challenge of generating 3D articulated objects in a controllable fashion.
SINGAPO is a generative method that reconstructs a 3D articulated object from a single image of the object in its resting state, captured from a random view.
@article{chen2025artilatent,
  title={ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents},
  author={Chen, Honghua and Lan, Yushi and Chen, Yongwei and Pan, Xingang},
  year={2025}
}