
# SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Official implementation of "SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding"

License: MIT · HuggingFace: Models · Paper: arXiv

## 🔍 Overview

SDAR-VL is the first large-scale block-wise discrete diffusion framework for vision-language understanding (VLU).
It provides a stable and efficient alternative to autoregressive (AR) decoders by introducing an integrated training framework that improves convergence, stability, and efficiency.

### Key Features

- 🧩 **Block-wise Diffusion Backbone** — Parallel intra-block denoising with causal inter-block dependencies
- 🔄 **Asynchronous Block-wise Noise Scheduling** — Diversifies supervision and smooths optimization
- ⚖️ **Effective Mask Ratio Scaling** — Unbiased loss normalization under stochastic masking
- 📈 **Progressive Beta Noise Curriculum** — Improves convergence and coverage over training
- 📊 **SOTA Performance** — Matches or surpasses AR models such as LLaVA-OneVision under matched setups, and achieves state-of-the-art results among diffusion-based multimodal models
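
The scheduling and loss-normalization ideas above can be illustrated with a toy sketch. This is **not** the repository's implementation: the function names, the uniform mask-ratio prior, and the per-block structure are assumptions made for illustration only.

```python
import random

def sample_block_masks(num_blocks, block_size, rng=random):
    """Asynchronous block-wise noise scheduling (toy sketch): draw an
    independent mask ratio per block, then mask tokens i.i.d. within
    each block at that ratio, so different blocks see different noise
    levels in the same training step."""
    ratios, masks = [], []
    for _ in range(num_blocks):
        t = rng.uniform(0.05, 1.0)  # assumed mask-ratio prior
        ratios.append(t)
        masks.append([rng.random() < t for _ in range(block_size)])
    return ratios, masks

def block_loss(token_losses, mask):
    """Effective mask ratio scaling (toy sketch): normalize by the
    realized (effective) fraction of masked tokens rather than the
    sampled ratio, so the estimate stays unbiased when the stochastic
    mask count deviates from its expectation."""
    n = len(mask)
    num_masked = sum(mask)
    if num_masked == 0:
        return 0.0
    eff_ratio = num_masked / n
    masked_sum = sum(l for l, m in zip(token_losses, mask) if m)
    return masked_sum / (eff_ratio * n)
```

Normalizing by the effective ratio instead of the sampled one removes the variance that the random mask count would otherwise inject into the per-block loss scale.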

## 🤗 Model Zoo

| Model | Type | Link |
| --- | --- | --- |
| SDAR-VL-Instruct-4B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-4B |
| SDAR-VL-Instruct-8B | Instruct | https://huggingface.co/JetLM/SDAR-VL-Instruct-8B |
| SDAR-VL-Think-4B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-4B |
| SDAR-VL-Think-8B | Think | https://huggingface.co/JetLM/SDAR-VL-Think-8B |

## ⚙️ Usage

### Inference

```bash
python generate.py
```
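
Conceptually, generation proceeds block by block: each block is denoised over a few parallel steps, conditioned on all previously committed blocks. The toy sketch below illustrates this control flow only; `denoise_fn`, the left-to-right commit order, and the linear unmask schedule are placeholder assumptions, not the behavior of `generate.py` (real decoders typically commit tokens by confidence).

```python
MASK = "<mask>"

def generate_blockwise(denoise_fn, num_blocks, block_size, steps=4):
    """Toy block-wise diffusion decoding: blocks are produced left to
    right (causal inter-block dependency); within a block, masked
    positions are re-predicted in parallel and a growing prefix is
    committed at each denoising step."""
    context = []
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for step in range(1, steps + 1):
            # Parallel intra-block prediction, conditioned on prior blocks.
            preds = denoise_fn(context, block)
            keep = round(block_size * step / steps)  # linear unmask schedule
            for i in range(keep):
                if block[i] == MASK:
                    block[i] = preds[i]
        context.extend(block)  # commit the finished block
    return context
```

Because only the intra-block positions are filled in parallel while blocks remain strictly ordered, the scheme keeps the left-to-right conditioning of an AR decoder at the block level.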

### Training

For detailed instructions on fine-tuning the model on your own dataset, see the guide in the training directory: training/README.md.

## 📬 Contact

For issues or inquiries:

## 🔬 Citation

```bibtex
@misc{cheng2025sdarvl,
  title        = {SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding},
  author       = {Cheng, Shuang and Jiang, Yuhua and Zhou, Zineng and Liu, Dawei and Wang, Tao and Zhang, Linfeng and Qi, Biqing and Zhou, Bowen},
  year         = {2025},
  note         = {Zhejiang University, Shanghai AI Laboratory, Tsinghua University, Shanghai Jiao Tong University, ByteDance}
}
```
