Beyond Language Modeling: An Exploration of Multimodal Pretraining • Paper • 2603.03276 • Published 3 days ago • 66
Mode Seeking meets Mean Seeking for Fast Long Video Generation • Paper • 2602.24289 • Published 7 days ago • 36
veScale-FSDP: Flexible and High-Performance FSDP at Scale • Paper • 2602.22437 • Published 8 days ago • 7
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation • Paper • 2602.19163 • Published 12 days ago • 14
SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning • Paper • 2602.13515 • Published 20 days ago • 43
Autoregressive Image Generation with Masked Bit Modeling • Paper • 2602.09024 • Published 25 days ago • 6
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss • Paper • 2602.02493 • Published Feb 2 • 44
One-step Latent-free Image Generation with Pixel Mean Flows • Paper • 2601.22158 • Published Jan 29 • 18
Revisiting Diffusion Model Predictions Through Dimensionality • Paper • 2601.21419 • Published Jan 29 • 4
Towards Pixel-Level VLM Perception via Simple Points Prediction • Paper • 2601.19228 • Published Jan 27 • 18
OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation • Paper • 2601.15369 • Published Jan 21 • 21
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders • Paper • 2601.16208 • Published Jan 22 • 53
Bidirectional Normalizing Flow: From Data to Noise and Back • Paper • 2512.10953 • Published Dec 11, 2025 • 7
Towards Scalable Pre-training of Visual Tokenizers for Generation • Paper • 2512.13687 • Published Dec 15, 2025 • 106