Dropout Decay Streaming Experiments

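This project tests dropout decay only after establishing a model/data regime in which static dropout has a genuine nonzero validation optimum, that is, a regime where some dropout rate above 0 gives lower validation loss than no dropout at all.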

The implementation is derived from Andrej Karpathy's nanochat repository: https://github.com/karpathy/nanochat. Only the core tokenizer ideas and foundational causal Transformer architecture are retained. Chat interfaces, deployment scripts, distributed training code, and inference services are not included. The original nanochat MIT copyright and permission notice are retained in derived source files and in LICENSE.

Compliance

All Torch experiment runs are MPS-only. The runner exits before model creation if MPS is unavailable, if PyTorch was not built with MPS, or if PYTORCH_ENABLE_MPS_FALLBACK=1 is set.
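
As an illustration of the three exit conditions above, a minimal sketch follows. The function names mps_violations and require_mps are hypothetical and are not taken from the project's runner; only the torch.backends.mps calls and the PYTORCH_ENABLE_MPS_FALLBACK variable are real.

```python
import os


def mps_violations(is_built: bool, is_available: bool, env: dict) -> list:
    """Return the reasons a run must be refused; an empty list means the MPS-only rule holds."""
    problems = []
    if not is_built:
        problems.append("PyTorch was not built with MPS support")
    if not is_available:
        problems.append("MPS device is unavailable")
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        problems.append("PYTORCH_ENABLE_MPS_FALLBACK=1 is set")
    return problems


def require_mps() -> None:
    """Hypothetical guard: exit before any model is created if the rule is violated."""
    import torch  # imported lazily so the pure check above can be tested without PyTorch

    problems = mps_violations(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
        dict(os.environ),
    )
    if problems:
        raise SystemExit("Refusing to run: " + "; ".join(problems))
```

Keeping the decision in a pure function makes the policy easy to unit test on machines without MPS; require_mps() would be called once at the top of a runner, before model construction.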

Local Data and Environment

The project should not depend on another checkout of nanochat at runtime. Use the project-local package and either:

  • --use-cached-data --cache-dir .cache/dropout_decay to reuse the local tokenizer and encoded token array; or
  • --corpus / --corpus-glob to build a fresh local cache from a source corpus.

The existing local cache is:

  • .cache/dropout_decay/tokenizer-v4096.json
  • .cache/dropout_decay/tokens-v4096-uint16.npy

Use a project-local Python environment with MPS-capable PyTorch, for example .venv/bin/python. Attribution to nanochat remains in the source and docs, but experiment commands should not point into a separate nanochat repository.

Workflow

  1. Screen candidate model sizes with cheap static dropout sweeps.
  2. Select candidate models whose validation curve has an interior nonzero dropout optimum.
  3. Confirm the winner with a 3-seed static sweep.
  4. Lock the model and run static-vs-decay streaming comparisons from scratch.
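
The selection criterion in step 2 can be sketched as follows. This is one possible reading of "interior nonzero dropout optimum", not the project's actual selection code: the best rate must be nonzero and must not sit at either end of the swept range. The function name and the input shape (a mapping from dropout rate to mean validation loss) are assumptions.

```python
def interior_nonzero_optimum(val_loss_by_rate: dict):
    """Return the best dropout rate if it is nonzero and strictly inside the sweep, else None."""
    rates = sorted(val_loss_by_rate)
    # On ties, min() keeps the first (lowest) rate in sorted order.
    best = min(rates, key=lambda r: val_loss_by_rate[r])
    if best == 0.0 or best in (rates[0], rates[-1]):
        return None  # optimum at zero or on the boundary: the sweep does not bracket it
    return best
```

A boundary optimum is rejected because the true minimum may lie outside the swept rates, in which case the sweep says nothing about where the curve bottoms out.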

Every run writes:

  • config.json: command, model specs, data paths, environment, attribution.
  • metrics.jsonl: one row per seed/model/dropout/stage.
  • trace.jsonl: optional training and intermediate evaluation trace.
  • summary.csv / summary.json: mean/std train loss, validation loss, and gap.
  • model_selection.csv / model_selection.json: static-sweep optimum and plateau diagnostics for screen and confirm runs.
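
To make the relationship between metrics.jsonl and the summary files concrete, here is a rough sketch of the aggregation. The field names model, dropout, train_loss, and val_loss are assumptions for illustration, as is the definition of the gap as validation loss minus train loss; check config.json and the actual rows before relying on them.

```python
import json
import statistics
from collections import defaultdict


def summarize(jsonl_lines):
    """Group rows by (model, dropout) and compute mean/std validation loss and mean gap."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():  # skip blank lines
            row = json.loads(line)
            groups[(row["model"], row["dropout"])].append(row)  # assumed field names

    summary = {}
    for key, rows in groups.items():
        val = [r["val_loss"] for r in rows]
        gap = [r["val_loss"] - r["train_loss"] for r in rows]  # assumed gap definition
        summary[key] = {
            "n": len(rows),
            "val_mean": statistics.mean(val),
            # Sample standard deviation needs at least two seeds.
            "val_std": statistics.stdev(val) if len(val) > 1 else 0.0,
            "gap_mean": statistics.mean(gap),
        }
    return summary
```

With a real run this could be fed an open file handle, for example summarize(open(path)) where path points at that run's metrics.jsonl.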

Old exploratory outputs are archived under archive/.

Related documents:

  • REPRODUCING.md: exact headline reproduction.
  • docs/dropout_decay_hypothesis_summary.md: a first-reader explanation of the hypothesis, math, evidence, and current limits.
  • docs/dropout_decay_research_report_v2.md: the denser run-by-run research summary.
  • docs/corpus_difficulty_probe_20260529.md: the corpus difficulty follow-up that motivated the next formula refinement.
  • docs/probe_calibrated_stream_20260529.md: the first single-seed probe-calibrated streaming test.

Step 1: Cheap Static Screen

Use one or two seeds. The output tells us, for each model, where the static dropout curve bottoms out and which dropout range is within the configured plateau delta.
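
The plateau diagnostic can be pictured with the sketch below, which assumes the plateau is simply the set of swept rates whose mean validation loss is within delta of the best one. The runner's actual definition (for example, whether the plateau must be contiguous) may differ, and the function name is hypothetical.

```python
def plateau_rates(val_loss_by_rate: dict, delta: float) -> list:
    """Return the swept dropout rates whose loss is within `delta` of the minimum loss."""
    best_loss = min(val_loss_by_rate.values())
    return sorted(
        rate for rate, loss in val_loss_by_rate.items() if loss - best_loss <= delta
    )
```

A wide plateau suggests the model is insensitive to the exact rate, which matters later when judging whether a decay schedule differs meaningfully from a static one.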

PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode screen_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models 8x8x256 12x8x384 16x8x384 \
  --seeds 1 2 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64

Step 2: Confirm Winner

After selecting a promising model, rerun the static dropout curve with exactly three seeds.

PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode confirm_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64

Step 3: Locked Streaming Comparison

Only after the model is locked, compare static dropout and decay schedules from fresh initialization.

PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --stream-token-caps 5000000 10000000 20000000 40000000 \
  --dropout-rates 0.0 0.10 0.14 0.20 \
  --decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
  --stage-steps 1000 \
  --eval-batches 64
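
The --decays values above follow a name:start:end:shape pattern. The sketch below shows one plausible interpretation: parse the spec and anneal the dropout rate from start to end along a half cosine. The helper names are hypothetical, and both the exact formula and whether the schedule spans a single stage or the whole stream are assumptions; the runner source is the authority.

```python
import math
from typing import NamedTuple


class DecaySpec(NamedTuple):
    name: str
    start: float
    end: float
    shape: str


def parse_decay(spec: str) -> DecaySpec:
    """Parse a spec such as 'decay_030_to_014:0.30:0.14:cosine'."""
    name, start, end, shape = spec.split(":")
    return DecaySpec(name, float(start), float(end), shape)


def dropout_at(spec: DecaySpec, step: int, total_steps: int) -> float:
    """Assumed cosine schedule: spec.start at step 0, spec.end at total_steps (> 0)."""
    if spec.shape != "cosine":
        raise ValueError(f"unsupported shape: {spec.shape}")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return spec.end + 0.5 * (spec.start - spec.end) * (1.0 + math.cos(math.pi * progress))
```

Under this reading, decay_030_to_014 starts at a dropout rate of 0.30, passes through the midpoint 0.22 halfway through, and finishes at 0.14.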