Full Paper List
Ultra-Sparse Memory Network
• arXiv:2411.12364 • Published • 23 upvotes
Hyper-Connections
• arXiv:2409.19606 • Published • 26 upvotes
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
• arXiv:2411.03884 • Published • 28 upvotes
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
• arXiv:2501.16975 • Published • 32 upvotes
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
• arXiv:2502.15499 • Published • 15 upvotes
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
• arXiv:2503.04598 • Published • 21 upvotes
Frac-Connections: Fractional Extension of Hyper-Connections
• arXiv:2503.14125 • Published • 22 upvotes
Efficient Pretraining Length Scaling
• arXiv:2504.14992 • Published • 20 upvotes
Scaling Law for Quantization-Aware Training
• arXiv:2505.14302 • Published • 76 upvotes
Stepsize Anything: A Unified Learning Rate Schedule for Budgeted-Iteration Training
• arXiv:2505.24452 • Published • 5 upvotes
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
• arXiv:2508.18756 • Published • 36 upvotes