Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

1University of Trento, 2University of Pisa, 3University of Modena and Reggio Emilia
*Indicates Equal Contribution

TEMU-VTOFF is a novel virtual try-off (VTOFF) framework that generates standardized product images from real-world photos of clothed individuals. It uses a dual DiT-based architecture with multimodal inputs to handle multiple garment categories and refine visual details.

Abstract

While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format — typically a flat, lay-down-style representation of the garment — making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.

Method

Overall Framework of TEMU-VTOFF. The feature extractor \( F_E \) processes spatial inputs (noise, masked image, binary mask) and global inputs (the model image, via AdaLN). Its intermediate keys and values \( \mathbf{K}^l_{\text{extractor}} \), \( \mathbf{V}^l_{\text{extractor}} \) are injected into the corresponding hybrid blocks of the garment generator \( F_D \), and the main DiT model then generates the final garment through the proposed MHA module. The model is trained with a diffusion loss on the noise estimate and an alignment loss against clean DINOv2 features of the target garment.
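To make the key/value injection and the two training signals concrete, the following is a minimal PyTorch sketch of one hybrid attention block and a combined objective \( \mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \, \mathcal{L}_{\text{align}} \). The concatenation-based fusion, the cosine form of the alignment loss, and the weight \( \lambda \) are illustrative assumptions rather than the paper's exact implementation.

# Minimal sketch of key/value injection into a hybrid attention block.
# Assumptions: the extractor's intermediate keys/values are concatenated
# with the generator's own along the token axis; `lam` is a hypothetical
# loss weight. The paper's exact blocks may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, k_ext=None, v_ext=None):
        # x: (B, N, C) generator tokens; k_ext, v_ext: (B, M, C) intermediate
        # keys/values taken from the matching block of the extractor F_E.
        B, N, C = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if k_ext is not None and v_ext is not None:
            k = torch.cat([k, k_ext], dim=1)  # generator attends to garment
            v = torch.cat([v, v_ext], dim=1)  # features injected from F_E
        split = lambda t: t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

def training_loss(noise_pred, noise, gen_feats, dino_feats, lam=0.5):
    # Diffusion loss on the noise estimate plus an alignment loss against
    # clean DINOv2 features of the target garment (cosine form assumed).
    diff_loss = F.mse_loss(noise_pred, noise)
    align_loss = 1.0 - F.cosine_similarity(gen_feats, dino_feats, dim=-1).mean()
    return diff_loss + lam * align_loss

# Usage: the generator's tokens attend over both themselves and F_E's tokens.
attn = HybridAttention(dim=64)
x = torch.randn(2, 16, 64)                                    # generator tokens
k_ext, v_ext = torch.randn(2, 8, 64), torch.randn(2, 8, 64)   # from F_E
print(attn(x, k_ext, v_ext).shape)                            # torch.Size([2, 16, 64])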

Quantitative Results

Table 1: Results on Dress Code

Quantitative results on the Dress Code dataset, considering both the entire test set and the three category-specific subsets.

Table 2: Ablation Study on Dress Code

Ablation study of the proposed components (text/mask conditioning and the garment aligner) on the Dress Code dataset, across the 'All', 'Upper-Body', 'Lower-Body', and 'Dresses' categories.

Table 3: Results on VITON-HD

Quantitative results on the VITON-HD dataset, comparing methods in terms of SSIM, LPIPS, DISTS, FID, and KID.
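The paired and distributional metrics above can be reproduced with off-the-shelf implementations. Below is a minimal evaluation sketch using torchmetrics (requires `pip install torchmetrics[image]`); the dummy data, resolution, and the tiny KID subset size are placeholders for this demo, and DISTS is omitted here since torchmetrics is not used for it (it is available in third-party packages such as piq). The paper's exact preprocessing and metric settings may differ.

# Minimal metric-reproduction sketch (illustrative settings only).
import torch
from torchmetrics.image import (
    FrechetInceptionDistance,
    KernelInceptionDistance,
    LearnedPerceptualImagePatchSimilarity,
    StructuralSimilarityIndexMeasure,
)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)
fid = FrechetInceptionDistance(feature=2048, normalize=True)
kid = KernelInceptionDistance(subset_size=4, normalize=True)  # tiny subset for the demo

# Stand-in for (generated, ground-truth) batches of float images in [0, 1].
loader = [(torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128)) for _ in range(2)]

for pred, target in loader:
    ssim.update(pred, target)
    lpips.update(pred, target)
    fid.update(target, real=True)
    fid.update(pred, real=False)
    kid.update(target, real=True)
    kid.update(pred, real=False)

print("SSIM :", ssim.compute().item())
print("LPIPS:", lpips.compute().item())
print("FID  :", fid.compute().item())
kid_mean, _ = kid.compute()
print("KID  :", kid_mean.item())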

Qualitative Results

BibTeX

@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}