Dropout Decay Streaming Experiments
This project tests dropout decay only after finding a model/data regime in which static dropout has a genuine nonzero validation optimum.
The implementation is derived from Andrej Karpathy's nanochat repository:
https://github.com/karpathy/nanochat. Only the core tokenizer ideas and
foundational causal Transformer architecture are retained. Chat interfaces,
deployment scripts, distributed training code, and inference services are not
included. The original nanochat MIT copyright and permission notice are retained
in derived source files and in LICENSE.
Compliance
All PyTorch experiment runs are MPS-only. The runner exits before model creation if
MPS is unavailable, if PyTorch was not built with MPS, or if
PYTORCH_ENABLE_MPS_FALLBACK=1 is set.
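The three exit conditions above can be sketched as a small guard. This is a minimal illustration, not the runner's actual code: the names `mps_violations` and `require_mps` are hypothetical, and only `torch.backends.mps.is_built()`, `torch.backends.mps.is_available()`, and the `PYTORCH_ENABLE_MPS_FALLBACK` variable come from PyTorch itself.

```python
import os
from typing import List, Mapping


def mps_violations(is_built: bool, is_available: bool, env: Mapping[str, str]) -> List[str]:
    """Return the reasons a run must be refused; an empty list means the run may proceed."""
    reasons = []
    if not is_built:
        reasons.append("PyTorch was not built with MPS")
    if not is_available:
        reasons.append("MPS is unavailable")
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        reasons.append("PYTORCH_ENABLE_MPS_FALLBACK=1 is set")
    return reasons


def require_mps() -> None:
    """Hypothetical guard, meant to be called before any model is created."""
    import torch  # imported lazily so the pure check above has no torch dependency

    reasons = mps_violations(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
        os.environ,
    )
    if reasons:
        raise SystemExit("Refusing to run: " + "; ".join(reasons))
```

Keeping the decision in a pure function makes the policy easy to check without a GPU; only `require_mps()` touches PyTorch.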
Local Data and Environment
The project should not depend on another checkout of nanochat at runtime. Use
the project-local package and either:
- --use-cached-data --cache-dir .cache/dropout_decay to reuse the local tokenizer and encoded token array; or
- --corpus/--corpus-glob to build a fresh local cache from a source corpus.
The existing local cache is:
- .cache/dropout_decay/tokenizer-v4096.json
- .cache/dropout_decay/tokens-v4096-uint16.npy
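Before passing --use-cached-data, it can help to confirm that both cache files are present. The sketch below is a hypothetical standalone helper (`missing_cache_files` is not part of the project); it assumes only the two file names listed above.

```python
from pathlib import Path
from typing import List

# File names as listed above; the helper itself is a hypothetical convenience.
CACHE_FILES = ("tokenizer-v4096.json", "tokens-v4096-uint16.npy")


def missing_cache_files(cache_dir: str) -> List[str]:
    """Return the expected cache files that are absent from cache_dir."""
    root = Path(cache_dir)
    return [name for name in CACHE_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_cache_files(".cache/dropout_decay")
    if missing:
        print("Missing cache files:", ", ".join(missing))
    else:
        print("Local cache is complete.")
```

If anything is reported missing, rebuilding the cache from a source corpus is the alternative described above.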
Use a project-local Python environment with MPS-capable PyTorch, for example
.venv/bin/python. Attribution to nanochat remains in the source and docs, but
experiment commands should not point into a separate nanochat repository.
Workflow
- Screen candidate model sizes with cheap static dropout sweeps.
- Select candidate models whose validation curve has an interior nonzero dropout optimum.
- Confirm the winner with a 3-seed static sweep.
- Lock the model and run static-vs-decay streaming comparisons from scratch.
Every run writes:
- config.json: command, model specs, data paths, environment, attribution.
- metrics.jsonl: one row per seed/model/dropout/stage.
- trace.jsonl: optional training and intermediate evaluation trace.
- summary.csv/summary.json: mean/std train loss, validation loss, and gap.
- model_selection.csv/model_selection.json: static-sweep optimum and plateau diagnostics for screen and confirm runs.
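As a rough illustration of how the per-seed rows in metrics.jsonl could roll up into the summary statistics, here is a minimal sketch. The field names (`model`, `dropout`, `train_loss`, `val_loss`) are assumptions, as is the definition of the gap as validation loss minus train loss; the real files may differ. Grouping by stage is omitted for brevity.

```python
import json
from collections import defaultdict
from statistics import mean, stdev


def load_rows(path):
    """Read one JSON object per non-empty line."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(rows):
    """Aggregate rows across seeds for each (model, dropout) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["dropout"])].append(row)

    summary = []
    for (model, dropout), items in sorted(groups.items()):
        columns = {
            "train_loss": [r["train_loss"] for r in items],
            "val_loss": [r["val_loss"] for r in items],
            # Assumed definition: generalization gap = val_loss - train_loss.
            "gap": [r["val_loss"] - r["train_loss"] for r in items],
        }
        entry = {"model": model, "dropout": dropout, "n_seeds": len(items)}
        for key, values in columns.items():
            entry[key + "_mean"] = mean(values)
            # Sample std needs at least two seeds; report 0.0 for a single seed.
            entry[key + "_std"] = stdev(values) if len(values) > 1 else 0.0
        summary.append(entry)
    return summary
```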
Old exploratory outputs are archived under archive/.
For exact headline reproduction, see REPRODUCING.md. For a first-reader
explanation of the hypothesis, math, evidence, and current limits, see
docs/dropout_decay_hypothesis_summary.md. For the denser run-by-run research
summary, see docs/dropout_decay_research_report_v2.md. For the corpus
difficulty follow-up that motivated the next formula refinement, see
docs/corpus_difficulty_probe_20260529.md. For the first single-seed
probe-calibrated streaming test, see
docs/probe_calibrated_stream_20260529.md.
Step 1: Cheap Static Screen
Use one or two seeds. For each model, the output reports where the static dropout curve reaches its minimum validation loss and which dropout rates fall within the configured plateau delta of that minimum.
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode screen_static \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--models 8x8x256 12x8x384 16x8x384 \
--seeds 1 2 \
--token-limits 5000000 \
--dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
--steps 2000 \
--eval-batches 64
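The selection criterion behind this screen (an interior, nonzero optimum plus a plateau of near-optimal rates) can be sketched as follows. This is an assumption-laden illustration, not the project's selection code: `select_dropout` and its `plateau_delta` default are hypothetical, and the plateau is taken to mean every rate whose mean validation loss is within `plateau_delta` of the minimum.

```python
def select_dropout(curve, plateau_delta=0.01):
    """Locate the optimum of a static dropout sweep.

    curve maps each swept dropout rate to its mean validation loss.
    """
    rates = sorted(curve)
    best = min(rates, key=lambda rate: curve[rate])
    # Plateau: all rates whose loss is within plateau_delta of the best loss.
    plateau = [rate for rate in rates if curve[rate] - curve[best] <= plateau_delta]
    # Interior and nonzero: the optimum is neither 0.0 nor an endpoint of the sweep.
    interior_nonzero = best > 0.0 and best not in (rates[0], rates[-1])
    return {"best": best, "plateau": plateau, "interior_nonzero": interior_nonzero}
```

A model whose curve decreases or increases monotonically across the sweep yields `interior_nonzero == False` and would not be a candidate under this reading.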
Step 2: Confirm Winner
After selecting a promising model, rerun the static dropout curve with exactly three seeds.
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode confirm_static \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--models winner=12x8x384 \
--seeds 1 2 3 \
--token-limits 5000000 \
--dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
--steps 2000 \
--eval-batches 64
Step 3: Locked Streaming Comparison
Only after the model is locked, compare static dropout and decay schedules from fresh initialization.
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--models winner=12x8x384 \
--seeds 1 2 3 \
--stream-token-caps 5000000 10000000 20000000 40000000 \
--dropout-rates 0.0 0.10 0.14 0.20 \
--decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
--stage-steps 1000 \
--eval-batches 64
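The --decays values above appear to follow a name:start:end:shape pattern. Under that assumption, the sketch below parses a spec and evaluates a half-cosine interpolation from the start rate to the end rate. `DecaySpec`, `parse_decay`, and `dropout_at` are hypothetical names, and the runner's actual schedule (for example, whether it updates per step or per stage) may differ.

```python
import math
from typing import NamedTuple


class DecaySpec(NamedTuple):
    name: str
    start: float
    end: float
    shape: str


def parse_decay(spec: str) -> DecaySpec:
    """Parse a spec assumed to have the form name:start:end:shape."""
    name, start, end, shape = spec.split(":")
    return DecaySpec(name, float(start), float(end), shape)


def dropout_at(spec: DecaySpec, step: int, total_steps: int) -> float:
    """Dropout rate at a given step under an assumed half-cosine decay."""
    if spec.shape != "cosine":
        raise ValueError(f"unsupported shape: {spec.shape}")
    # Clamp progress to [0, 1] so out-of-range steps hold the endpoint values.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return spec.end + 0.5 * (spec.start - spec.end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    spec = parse_decay("decay_030_to_014:0.30:0.14:cosine")
    for step in (0, 250, 500, 750, 1000):
        print(step, round(dropout_at(spec, step, 1000), 4))
```

With this form the rate starts at the first value, ends at the second, and changes most quickly in the middle of the run.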