Multimodal
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Paper
• 2306.17107
• Published
• 12
On the Hidden Mystery of OCR in Large Multimodal Models
Paper
• 2305.07895
• Published
• 1
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper
• 2308.12966
• Published
• 11
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper
• 2401.15947
• Published
• 53
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Paper
• 2311.11810
• Published
• 1
OCR-free Document Understanding Transformer
Paper
• 2111.15664
• Published
• 6
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Paper
• 2210.03347
• Published
• 3
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction
Paper
• 2310.11016
• Published
Nougat: Neural Optical Understanding for Academic Documents
Paper
• 2308.13418
• Published
• 42
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper
• 2403.00522
• Published
• 46
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper
• 2402.04615
• Published
• 44
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Paper
• 2404.05225
• Published
• 2
LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding
Paper
• 2403.14252
• Published
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper
• 2405.15738
• Published
• 46
CogVLM: Visual Expert for Pretrained Language Models
Paper
• 2311.03079
• Published
• 27
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
• 2405.20204
• Published
• 37
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published
• 43
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper
• 2407.04172
• Published
• 25
LLaVA-OneVision: Easy Visual Task Transfer
Paper
• 2408.03326
• Published
• 61
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57