DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On

Xianbing Sun1, Yan Hong2, Jiahui Zhan1, Jun Lan2, Huijia Zhu2, Weiqiang Wang2, Liqing Zhang1, Jianfu Zhang1,*

1Shanghai Jiao Tong University    2Ant Group   

* Corresponding author   



DS-VTON adopts an enhanced dual-scale coarse-to-fine framework combined with a mask-free strategy.

Abstract

Despite recent progress, most existing virtual try-on methods still struggle to address two core challenges simultaneously: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, in which the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages. The first stage generates a low-resolution try-on result that captures the semantic correspondence between garment and body; the reduced detail at this scale facilitates robust structural alignment. The second stage uses a blend-refine diffusion process to reconstruct high-resolution outputs by refining the residual between scales through noise–image blending, which emphasizes texture fidelity and corrects fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON achieves state-of-the-art performance, consistently and significantly surpassing prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.

Method

Image 1

Upper panel: Two-scale generation pipeline. A low-resolution stage produces a coarse try-on result, which is then refined by a high-resolution stage; both stages share the same network architecture. Lower panel: Results under different settings; ours uses \(\sigma = 2\) and \(\alpha = \beta = \tfrac{1}{2}\). With proper two-stage settings, the second stage leverages the reliable coarse structure from the first stage to correct fine-detail errors and generate high-quality try-on results.
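The noise–image blending mentioned in the caption can be illustrated with a small sketch. This page does not define \(\sigma\), \(\alpha\), or \(\beta\), so the roles given to them here are assumptions made only for illustration: \(\sigma\) is treated as the resolution ratio between the two stages, and \(\alpha\) and \(\beta\) as the weights on the upsampled coarse result and on Gaussian noise when forming the starting point of the second stage. The function names `upsample_nearest` and `blend_init` are hypothetical, and nearest-neighbour upsampling is a stand-in rather than the interpolation the authors necessarily use.

```python
import numpy as np


def upsample_nearest(image, scale):
    """Nearest-neighbour upsampling of an (H, W, C) array by an integer factor."""
    return np.repeat(np.repeat(image, scale, axis=0), scale, axis=1)


def blend_init(coarse, scale=2, alpha=0.5, beta=0.5, rng=None):
    """Blend an upsampled coarse image with Gaussian noise.

    Assumed reading of the caption's symbols (not confirmed by the page):
    scale ~ sigma, alpha weights the image, beta weights the noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    upsampled = upsample_nearest(coarse, scale)
    noise = rng.standard_normal(upsampled.shape)
    return alpha * upsampled + beta * noise


# Toy usage: a 4x3 RGB "coarse result" becomes an 8x6 starting point.
coarse = np.ones((4, 3, 3))
start = blend_init(coarse, scale=2, alpha=0.5, beta=0.5,
                   rng=np.random.default_rng(0))
```

Under this reading, setting `beta=0` would hand the second stage a purely upsampled image, while `alpha=0` would start it from pure noise; the caption's setting weights the two equally.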

Multiple People on Same Garment

One garment with four try-on images per view.

Experiment

Quantitative Comparison

Image 1

Qualitative Comparison on VITON-HD

Image 1

More Qualitative Results

Qualitative Comparison on VITON-HD

Image 1

Qualitative Comparison on DressCode

Image 1