Pylox Neuro: Brain Engagement Encoder (TRIBE v2)
A trained multimodal brain-engagement model that predicts fMRI BOLD signal responses in the visual cortex from naturalistic video + speech stimuli. Built on Meta's TRIBE v2 architecture (Temporally Recurrent Image-text Brain Encoder), fine-tuned across 10 public fMRI datasets covering film, narrative speech, music, and naturalistic video viewing.
This is the Pylox Labs brain-encoding research artifact. For the LLM fine-tuning consultancy, see Pylox Forge.
What it does
Given a video (with audio) and a subject identifier, the model predicts per-voxel BOLD-signal activation across the Visual Cortex ROI, i.e. the brain regions active while watching the stimulus. Outputs are time-aligned voxel maps at ~1.5 s TR resolution.
Use cases:
- Ad effectiveness testing: predict audience engagement per scene without human panels (see the scoring sketch after this list)
- Film/TV pre-screening: identify high-arousal segments, predict viewer retention
- UX research on video content: A/B variants scored by predicted cortex activation
- Academic neuroscience: out-of-distribution BOLD-prediction baseline
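As a rough illustration of the per-scene scoring mentioned in the first use case, here is a minimal sketch that averages a predicted voxel time series over scene windows. The ~1.5 s TR comes from the description above; the array shapes, scene boundaries, and the `scene_engagement` helper are illustrative assumptions, not part of the released code.

```python
# Minimal sketch: per-scene engagement from a predicted BOLD time series,
# assuming predictions shaped [n_timepoints, n_voxels] sampled at a ~1.5 s TR.
import numpy as np

TR_SECONDS = 1.5  # sampling interval of the predicted BOLD time series

def scene_engagement(preds: np.ndarray, scenes: list[tuple[float, float]]) -> list[float]:
    """Average predicted activation over each (start_s, end_s) scene window."""
    scores = []
    for start_s, end_s in scenes:
        t0 = int(start_s // TR_SECONDS)
        t1 = max(t0 + 1, int(np.ceil(end_s / TR_SECONDS)))
        scores.append(float(preds[t0:t1].mean()))  # mean over time and voxels
    return scores

# Made-up example values: 120 TRs (~3 min) over 8000 visual-cortex voxels.
preds = np.random.randn(120, 8000)
scenes = [(0.0, 30.0), (30.0, 95.0), (95.0, 180.0)]
print(scene_engagement(preds, scenes))
```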
Architecture
- Backbone: TRIBE v2 (Meta FAIR, 2025), a transformer over per-frame multimodal embeddings (see the sketch after this list)
- Vision encoder: facebook/dinov2-giant (frozen, ~1.1B params)
- Text encoder: Qwen/Qwen3-Embedding-8B (frozen)
- Audio features: Wav2Vec 2.0 transcription + speech embeddings
- Output head: Linear regressor over VC-ROI voxels (subject-specific)
- Loss: MSE per voxel, 15-20 epochs with early stopping
- Framework: PyTorch Lightning
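The pieces above fit together roughly as follows. This is a minimal sketch under stated assumptions, not the released training code: the feature dimensions, fusion by summation, transformer depth, and module names are illustrative, and the frozen encoders are assumed to run offline so the model only sees precomputed per-timestep features. The per-subject linear heads correspond to the subject-specific output-head bullet; everything upstream of the heads is shared.

```python
# Sketch: frozen per-frame multimodal features -> temporal transformer ->
# subject-specific linear head over VC-ROI voxels, per-voxel MSE loss.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class BrainEncoderSketch(pl.LightningModule):
    def __init__(self, d_vision=1536, d_text=4096, d_audio=768,
                 d_model=1024, n_voxels=8000, n_subjects=10):
        super().__init__()
        # Project each precomputed (frozen-encoder) feature stream to a shared width.
        self.proj = nn.ModuleDict({
            "vision": nn.Linear(d_vision, d_model),
            "text": nn.Linear(d_text, d_model),
            "audio": nn.Linear(d_audio, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # One linear regression head per subject over the VC-ROI voxels.
        self.heads = nn.ModuleList([nn.Linear(d_model, n_voxels) for _ in range(n_subjects)])

    def forward(self, vision, text, audio, subject_id):
        # Each input: [batch, n_timepoints, d_*] features aligned to the TR grid.
        x = self.proj["vision"](vision) + self.proj["text"](text) + self.proj["audio"](audio)
        x = self.temporal(x)
        return self.heads[subject_id](x)  # [batch, n_timepoints, n_voxels]

    def training_step(self, batch, batch_idx):
        preds = self(batch["vision"], batch["text"], batch["audio"], batch["subject_id"])
        return nn.functional.mse_loss(preds, batch["bold"])  # per-voxel MSE

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```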
Training data
Trained across 10 public fMRI datasets:
- Algonauts 2025 BOLD: standardized fMRI prediction challenge (held-out test)
- BOLD5000: image-viewing fMRI (Chang et al.)
- Lahner 2024 BOLD: video-viewing fMRI
- FudanVideo CC2017/FCVID/WebVid: naturalistic video viewing
- GOD (Generic Object Decoding): image-viewing fMRI, perceptual experiments
- Narratives: story-listening fMRI (Nastase et al., Princeton)
- FilmFestival: cinematic fMRI
- MusicEmotion: music-listening fMRI
- PixarKids: animation-viewing fMRI
- PublicSpeaking: speech-listening fMRI
All source datasets are public research releases. No proprietary or identifiable subject data.
Files in this repo
- best.ckpt: best validation-loss checkpoint (~24 GB, PyTorch Lightning format)
- last.ckpt: final-epoch checkpoint (available on request due to HF storage limits)
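To fetch the checkpoint programmatically, a minimal sketch with huggingface_hub is below; the repo id is taken from the citation URL further down and is an assumption if you are working from a mirror.

```python
# Download best.ckpt from the Hub into the local cache and print its path.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="emiliogirard/pylox-neuro", filename="best.ckpt")
print(ckpt_path)  # local path to the ~24 GB Lightning checkpoint
```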
Usage
```python
from tribev2.main import TribeExperiment
import torch

# Load the best validation-loss checkpoint (PyTorch Lightning format)
model = TribeExperiment.load_from_checkpoint("best.ckpt", map_location="cuda")
model.eval()

with torch.no_grad():
    # video_tensor / audio_tensor: preprocessed stimulus inputs for one clip
    preds = model(video_tensor, audio_tensor, subject_id=42)
    # preds shape: [n_subjects, n_timepoints, n_voxels]
```
Alternatively, use the FastAPI backend in the pylox-neuro repo: upload an MP4 and get per-scene engagement heatmaps back.
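A hypothetical client call might look like the following. The host, route, form fields, and response format are illustrative assumptions rather than a documented API; check the pylox-neuro repo for the actual endpoints.

```python
# Hypothetical upload of an MP4 to an assumed local deployment of the backend.
import requests

with open("ad_cut_v2.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/predict",  # assumed host and route
        files={"file": ("ad_cut_v2.mp4", f, "video/mp4")},
        data={"subject_id": 42},
    )
resp.raise_for_status()
print(resp.json())  # e.g. per-scene engagement scores / heatmap references
```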
Hardware
Trained on an NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB UMA, sm_121). All training was performed on Pylox Labs on-prem hardware: zero cloud GPU usage, zero third-party API exposure to the raw fMRI data.
License
CC-BY-NC-4.0 (non-commercial research use). For commercial licensing (ad-testing platforms, media research firms), contact inquiries@pyloxforge.com.
Citation
If you use this model in research:
```bibtex
@misc{pylox-neuro-tribev2,
  title  = {Pylox Neuro: Brain Engagement Encoder on TRIBE v2},
  author = {Pylox Labs},
  year   = {2026},
  url    = {https://huggingface.co/emiliogirard/pylox-neuro}
}
```
Underlying TRIBE v2 architecture: Meta FAIR, 2025.