Collections
Discover the best community collections!

Collections including paper arxiv:2402.07033

- Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
  Paper • 2601.14249 • Published • 13
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
  Paper • 2402.07033 • Published • 19
- MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences
  Paper • 2601.07251 • Published • 11
- GameTalk: Training LLMs for Strategic Conversation
  Paper • 2601.16276 • Published • 13

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 153
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 32
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 22
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69

- Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers
  Paper • 2312.03694 • Published • 2
- FaceStudio: Put Your Face Everywhere in Seconds
  Paper • 2312.02663 • Published • 32
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
  Paper • 2402.07033 • Published • 19
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
  Paper • 2403.03507 • Published • 190

- Efficient LLM Inference on CPUs
  Paper • 2311.00502 • Published • 7
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 119
- Cached Transformers: Improving Transformers with Differentiable Memory Cache
  Paper • 2312.12742 • Published • 13
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
  Paper • 2312.11514 • Published • 264

- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  Paper • 2310.16795 • Published • 27
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
  Paper • 2308.12066 • Published • 4
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1
- EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate
  Paper • 2112.14397 • Published • 1

- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
  Paper • 2408.08152 • Published • 62
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 20
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 57
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 20

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 191
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
  Paper • 2401.04658 • Published • 27
- Weaver: Foundation Models for Creative Writing
  Paper • 2401.17268 • Published • 45
- Efficient Tool Use with Chain-of-Abstraction Reasoning
  Paper • 2401.17464 • Published • 21

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 38
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 8
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 32
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 119

- S³: Increasing GPU Utilization during Generative Inference for Higher Throughput
  Paper • 2306.06000 • Published • 1
- Fast Distributed Inference Serving for Large Language Models
  Paper • 2305.05920 • Published • 1
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
  Paper • 2305.13144 • Published • 1
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1