Nurmukhamed's Collections: llm-performance
QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
Training Transformers with 4-bit Integers (arXiv:2306.11987)
FasterViT: Fast Vision Transformers with Hierarchical Attention (arXiv:2306.06189)
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (arXiv:2309.14509)
VeRA: Vector-based Random Matrix Adaptation (arXiv:2310.11454)
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (arXiv:2310.17157)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (arXiv:2312.16862)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514)
LLM Augmented LLMs: Expanding Capabilities through Composition (arXiv:2401.02412)
SliceGPT: Compress Large Language Models by Deleting Rows and Columns (arXiv:2401.15024)
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding (arXiv:2507.19427)