scGPT fine-tuned on Norman 2019

Produced as part of the sc-interp single-cell model comparison repo.

Provenance

Base model

Initialised from the scGPT whole-human checkpoint (pretrained on ~33M cells from CellxGene Census): 12 transformer layers, hidden dimension 512, 8 attention heads. The checkpoint was downloaded from the official Google Drive folder linked in the scGPT README; it is not currently hosted on the HuggingFace Hub.

Training

  • Task: perturb-GEP, control cells as input, matched perturbed cells as target
  • Runner: the sc-interp dispatcher, invoked as python -m scripts.run scgpt --dataset norman
  • Split: GEARS simulation split with seed 42 (152 train / 33 val / 99 test perturbations), materialised once by scripts/data/gears.py into data/norman/splits/simulation_42_0.75.json and consumed by runners via scripts/data/splits.py
  • Recipe adapted from scGPT Tutorial_Perturbation.ipynb
  • Loss: masked MSE on all gene positions
  • Optimiser: Adam, lr 1e-4, StepLR gamma 0.9 per epoch
  • AMP: enabled
  • Attention: standard torch.nn.MultiheadAttention; flash-attn was not installed, so Wqkv weights are renamed to in_proj during load
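
The weight renaming mentioned in the last bullet can be sketched as a plain key rewrite over the checkpoint's state dict. This is a minimal illustration, not the runner's actual code: the function name is hypothetical, the toy keys are made up, and the exact substring replacement ("Wqkv." to "in_proj_") is an assumption based on torch.nn.MultiheadAttention naming its fused projection parameters in_proj_weight and in_proj_bias.

```python
def rename_flash_attn_keys(state_dict):
    """Rewrite flash-attn style 'Wqkv.' parameter names to the 'in_proj_'
    names used by torch.nn.MultiheadAttention. Values pass through unchanged.

    Hypothetical helper: the substring mapping is an assumption.
    """
    return {
        key.replace("Wqkv.", "in_proj_"): value
        for key, value in state_dict.items()
    }


# Toy state dict with made-up keys; a real checkpoint holds tensors, not strings.
toy_state = {
    "transformer_encoder.layers.0.self_attn.Wqkv.weight": "W",
    "transformer_encoder.layers.0.self_attn.Wqkv.bias": "b",
    "transformer_encoder.layers.0.self_attn.out_proj.weight": "O",
}

renamed = rename_flash_attn_keys(toy_state)
for key in renamed:
    print(key)
```

Keys that do not contain "Wqkv." (such as out_proj above) are left untouched, so the rewrite is safe to apply to the whole state dict.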

Budget and stopping

epochs trained                 15 / 15
cells seen                     794,880
gradient steps                 12,420
wall clock                     2.0 hours (H100 PCIe)
best val pearson (all-gene)    0.9879
best val epoch                 7
stopping reason                max_epochs
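
As a sanity check, the budget figures above are mutually consistent. The short calculation below derives per-epoch and per-step quantities from them. The implied 64 cells per gradient step is an inference from these totals, not a batch size stated in this card, and the throughput assumes the 2.0 hours covers training only.

```python
# Figures copied from the budget table above.
epochs = 15
cells_seen = 794_880
gradient_steps = 12_420
wall_clock_s = 2.0 * 3600  # 2.0 hours in seconds

# Derived quantities (inferences, not reported values).
steps_per_epoch = gradient_steps // epochs    # gradient steps in one epoch
cells_per_step = cells_seen // gradient_steps  # implied effective batch size
cells_per_epoch = cells_seen // epochs         # cells processed per epoch
cells_per_second = cells_seen / wall_clock_s   # average training throughput

print(f"steps per epoch:  {steps_per_epoch}")
print(f"cells per step:   {cells_per_step}")
print(f"cells per epoch:  {cells_per_epoch}")
print(f"cells per second: {cells_per_second:.1f}")
```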

Test set metrics (cell-eval)

metric                       mean       median     max
pearson_delta                0.5067     0.5503     0.9132
mse                          0.0038     0.0033     0.0183
mae                          0.0209     0.0204     0.0449
mse_delta                    0.0038     0.0033     0.0183
mae_delta                    0.0209     0.0204     0.0449
de_direction_match           0.7159     0.7126     0.9434
de_sig_genes_recall          0.9076     0.9089     0.9906
de_spearman_sig              0.2633     0.2633     0.2633
de_spearman_lfc_sig          0.8006     0.8217     0.9571
pr_auc                       0.0782     0.0768     0.1994
roc_auc                      0.3839     0.3802     0.5288
de_nsig_counts_real          487.3535   501.0000   1122.0000
de_nsig_counts_pred          4915.4646  4924.0000  4989.0000
overlap_at_N                 0.0242     0.0218     0.0978
overlap_at_50                0.0265     0.0200     0.1400
overlap_at_100               0.0233     0.0200     0.1000
overlap_at_200               0.0244     0.0200     0.1000
overlap_at_500               0.0240     0.0220     0.1040
precision_at_N               0.0899     0.0906     0.2128
precision_at_50              0.0265     0.0200     0.1400
precision_at_100             0.0231     0.0200     0.1000
precision_at_200             0.0248     0.0200     0.1000
precision_at_500             0.0246     0.0220     0.1040
discrimination_score_l1      0.5911     0.5758     1.0000
discrimination_score_l2      0.6160     0.6263     1.0000
discrimination_score_cosine  0.6502     0.7172     1.0000
pearson_edistance            0.6486     0.6486     0.6486
clustering_agreement         0.2460     0.2460     0.2460

For reference, the scGPT paper Table 1 reports pearson_delta 0.459 (ALL) and 0.546 (DE) on Norman. Our all-gene mean (0.5067) sits between the paper's ALL and DE columns. de_nsig_counts_real vs de_nsig_counts_pred (~487 vs ~4915 significant genes per perturbation, out of 5045 total) quantifies the scGPT-typical over-prediction of DE: the model calls roughly ten times as many genes significant as the real data support. That is why de_sig_genes_recall (0.91) is high while precision_at_N (0.09), roc_auc (0.38), and pr_auc (0.08) on DE classification are low.
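
The link between over-prediction, high recall, and low precision can be checked with back-of-envelope arithmetic on the table means. This sketch reads the de_nsig_counts_* rows as counts of genes called significant, which is an assumption about the metric's meaning, and it combines means across perturbations, so the result is only approximate (a ratio of means is not a mean of ratios).

```python
# Mean values copied from the test-set metrics table.
n_sig_real = 487.3535   # de_nsig_counts_real (mean)
n_sig_pred = 4915.4646  # de_nsig_counts_pred (mean)
recall = 0.9076         # de_sig_genes_recall (mean)
n_genes = 5045          # total genes, as stated in the text

# Genes that are significant in both real and predicted data (approximate).
true_positives = recall * n_sig_real

# Precision implied by calling n_sig_pred genes significant.
implied_precision = true_positives / n_sig_pred

# Fraction of all genes that the prediction calls significant.
fraction_called = n_sig_pred / n_genes

print(f"approx. true positives: {true_positives:.0f}")
print(f"implied precision:      {implied_precision:.3f}")
print(f"fraction called DE:     {fraction_called:.3f}")
```

The implied precision of about 0.09 is close to the reported precision_at_N mean (0.0899): calling nearly every gene significant recovers most true DE genes while leaving almost no discriminative signal.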

Known limitations

  • Trained with dropout=0.2 and pert_pad_id=2 inherited from the pretrained args.json. The scGPT tutorial hardcodes dropout=0 and pert_pad_id=0 for fine-tuning; switching to those values is expected to improve metrics.
  • Early stopping used all-gene val pearson, which saturates near 0.99 and never fired; training ran the full 15 epochs. pearson_delta or pearson_de_delta would be a stricter stop criterion.
  • Low overlap_at_50 (0.03) and overlap_at_N (0.024) are consistent with scGPT's known weakness at identifying the specific top-k DE genes driving a perturbation, rather than a training flaw. See the GEARS and CellFlow papers for the same observation.
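
The early-stopping point above comes down to what each correlation measures. The toy example below uses invented numbers and the usual definition of a delta correlation (Pearson between predicted and true changes relative to control); it is not cell-eval's implementation. It shows how all-gene Pearson can sit near 1 purely because baseline expression dominates, while the delta correlation stays much lower.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: six genes with very different baseline expression.
control = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
true_delta = [0.2, -0.1, 0.0, 0.1, -0.2, 0.0]   # real perturbation effect
pred_delta = [0.1, 0.1, -0.1, 0.0, -0.1, 0.0]   # model's predicted effect

true_expr = [c + d for c, d in zip(control, true_delta)]
pred_expr = [c + d for c, d in zip(control, pred_delta)]

# All-gene Pearson: dominated by the shared control baseline.
all_gene = pearson(pred_expr, true_expr)
# Delta Pearson: baseline removed, only the perturbation effect is compared.
delta = pearson(pred_delta, true_delta)

print(f"all-gene pearson: {all_gene:.3f}")
print(f"delta pearson:    {delta:.3f}")
```

Because the shared baseline cancels out in the delta version, it keeps moving as the model's perturbation-specific predictions change, which is what makes it a more sensitive stopping signal.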

Files

  • best_model.pt: fine-tuned state dict, loads into TransformerGenerator built with use_fast_transformer=False
  • training_stats.json: unified sc-interp TrainStats schema with top-level keys wall_clock_s, wandb_run_url, reason, details (model-specific training metadata is nested in details)
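
A consumer of training_stats.json might check the top-level keys before using the file. The sketch below is a hypothetical helper, not part of sc-interp. The example payload is illustrative only: its values are taken from the budget table in this card, the wandb URL is left empty, and the nested key inside details is a made-up placeholder.

```python
import json

# Top-level keys named in the file description above.
REQUIRED_KEYS = {"wall_clock_s", "wandb_run_url", "reason", "details"}


def load_train_stats(text):
    """Parse a training_stats.json payload and check its top-level keys.

    Hypothetical helper for illustration; raises KeyError if a key is missing.
    """
    stats = json.loads(text)
    missing = REQUIRED_KEYS - stats.keys()
    if missing:
        raise KeyError(f"training_stats.json is missing keys: {sorted(missing)}")
    return stats


# Illustrative payload, NOT the real file contents.
example = json.dumps({
    "wall_clock_s": 7200.0,             # 2.0 hours from the budget table
    "wandb_run_url": None,              # placeholder, real URL not shown here
    "reason": "max_epochs",             # stopping reason from the budget table
    "details": {"epochs_trained": 15},  # hypothetical nested key
})

stats = load_train_stats(example)
print(stats["reason"], stats["wall_clock_s"] / 3600, "hours")
```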

Usage

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="matthewshu/scGPT-norman-ft",
    filename="best_model.pt",
)

# Or reproduce from source (runs in the scgpt venv):
#   python -m scripts.run scgpt --dataset norman --hf-repo matthewshu/scGPT-norman-ft

Citation

Dataset: Norman et al. 2019 (Science). Base foundation model: Cui et al. 2024 (Nat Methods). See the scGPT and GEARS repos for BibTeX.
