Pylox Neuro: Brain Engagement Encoder (TRIBE v2)
A trained multimodal brain-engagement model that predicts fMRI BOLD signal responses in the visual cortex from naturalistic video + speech stimuli. Built on Meta's TRIBE v2 architecture (Temporally Recurrent Image-text Brain Encoder), fine-tuned across 10 public fMRI datasets covering film, narrative speech, music, and naturalistic video viewing.
This is the Pylox Labs brain-encoding research artifact. For the LLM fine-tuning consultancy, see Pylox Forge.
What it does
Given a video (with audio) and a subject identifier, the model predicts per-voxel BOLD-signal activation across the Visual Cortex ROI, i.e. the brain regions active while watching the stimulus. Outputs are time-aligned voxel maps at ~1.5 s TR resolution.
Use cases:
- Ad effectiveness testing: predict audience engagement per scene without human panels (see the scoring sketch after this list)
- Film/TV pre-screening: identify high-arousal segments, predict viewer retention
- UX research on video content: A/B variants scored by predicted cortex activation
- Academic neuroscience: out-of-distribution BOLD-prediction baseline
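As a rough illustration of the per-scene scoring mentioned in the first use case, here is a minimal sketch that averages a predicted voxel time series over scene windows. The ~1.5 s TR comes from the description above; the array shapes, scene boundaries, and the `scene_engagement` helper are illustrative assumptions, not part of the released code.

```python
# Minimal sketch: per-scene engagement from a predicted BOLD time series,
# assuming predictions shaped [n_timepoints, n_voxels] sampled at a ~1.5 s TR.
import numpy as np

TR_SECONDS = 1.5  # sampling interval of the predicted BOLD time series

def scene_engagement(preds: np.ndarray, scenes: list[tuple[float, float]]) -> list[float]:
    """Average predicted activation over each (start_s, end_s) scene window."""
    scores = []
    for start_s, end_s in scenes:
        t0 = int(start_s // TR_SECONDS)
        t1 = max(t0 + 1, int(np.ceil(end_s / TR_SECONDS)))
        scores.append(float(preds[t0:t1].mean()))  # mean over time and voxels
    return scores

# Made-up example values: 120 TRs (~3 min) over 8000 visual-cortex voxels.
preds = np.random.randn(120, 8000)
scenes = [(0.0, 30.0), (30.0, 95.0), (95.0, 180.0)]
print(scene_engagement(preds, scenes))
```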
Architecture
- Backbone: TRIBE v2 (Meta FAIR, 2025), a transformer over per-frame multimodal embeddings (see the sketch after this list)
- Vision encoder: facebook/dinov2-giant (frozen, ~1.1B params)
- Text encoder: Qwen/Qwen3-Embedding-8B (frozen)
- Audio features: Wav2Vec 2.0 transcription + speech embeddings
- Output head: Linear regressor over VC-ROI voxels (subject-specific)
- Loss: MSE per voxel, 15-20 epochs with early stopping
- Framework: PyTorch Lightning
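The pieces above fit together roughly as follows. This is a minimal sketch under stated assumptions, not the released training code: the feature dimensions, fusion by summation, transformer depth, and module names are illustrative, and the frozen encoders are assumed to run offline so the model only sees precomputed per-timestep features. The per-subject linear heads correspond to the subject-specific output-head bullet; everything upstream of the heads is shared.

```python
# Sketch: frozen per-frame multimodal features -> temporal transformer ->
# subject-specific linear head over VC-ROI voxels, per-voxel MSE loss.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class BrainEncoderSketch(pl.LightningModule):
    def __init__(self, d_vision=1536, d_text=4096, d_audio=768,
                 d_model=1024, n_voxels=8000, n_subjects=10):
        super().__init__()
        # Project each precomputed (frozen-encoder) feature stream to a shared width.
        self.proj = nn.ModuleDict({
            "vision": nn.Linear(d_vision, d_model),
            "text": nn.Linear(d_text, d_model),
            "audio": nn.Linear(d_audio, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # One linear regression head per subject over the VC-ROI voxels.
        self.heads = nn.ModuleList([nn.Linear(d_model, n_voxels) for _ in range(n_subjects)])

    def forward(self, vision, text, audio, subject_id):
        # Each input: [batch, n_timepoints, d_*] features aligned to the TR grid.
        x = self.proj["vision"](vision) + self.proj["text"](text) + self.proj["audio"](audio)
        x = self.temporal(x)
        return self.heads[subject_id](x)  # [batch, n_timepoints, n_voxels]

    def training_step(self, batch, batch_idx):
        preds = self(batch["vision"], batch["text"], batch["audio"], batch["subject_id"])
        return nn.functional.mse_loss(preds, batch["bold"])  # per-voxel MSE

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```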
Training data
Trained across 10 public fMRI datasets:
- Algonauts 2025 BOLD: standardized fMRI prediction challenge (held-out test)
- BOLD5000: image-viewing fMRI (Chang et al.)
- Lahner 2024 BOLD: video-viewing fMRI
- FudanVideo CC2017/FCVID/WebVid: naturalistic video viewing
- GOD (Generic Object Decoding): image-viewing fMRI, perceptual experiments
- Narratives: story-listening fMRI (Nastase et al., Princeton)
- FilmFestival: cinematic fMRI
- MusicEmotion: music-listening fMRI
- PixarKids: animation-viewing fMRI
- PublicSpeaking: speech-listening fMRI
All source datasets are public research releases. No proprietary or identifiable subject data.
Files in this repo
- best.ckpt: best validation-loss checkpoint (~24 GB, PyTorch Lightning format)
- last.ckpt: final-epoch checkpoint (available on request due to HF storage limits)
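To fetch the checkpoint programmatically, a minimal sketch with huggingface_hub is below; the repo id is taken from the citation URL further down and is an assumption if you are working from a mirror.

```python
# Download best.ckpt from the Hub into the local cache and print its path.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="emiliogirard/pylox-neuro", filename="best.ckpt")
print(ckpt_path)  # local path to the ~24 GB Lightning checkpoint
```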
Usage
```python
from tribev2.main import TribeExperiment
import torch

# Load the best validation-loss checkpoint (PyTorch Lightning format)
model = TribeExperiment.load_from_checkpoint("best.ckpt", map_location="cuda")
model.eval()

with torch.no_grad():
    # video_tensor / audio_tensor: preprocessed stimulus inputs for one clip
    preds = model(video_tensor, audio_tensor, subject_id=42)
    # preds shape: [n_subjects, n_timepoints, n_voxels]
```
Alternatively, use the FastAPI backend in the pylox-neuro repo: upload an MP4 and get per-scene engagement heatmaps back.
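A hypothetical client call might look like the following. The host, route, form fields, and response format are illustrative assumptions rather than a documented API; check the pylox-neuro repo for the actual endpoints.

```python
# Hypothetical upload of an MP4 to an assumed local deployment of the backend.
import requests

with open("ad_cut_v2.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/predict",  # assumed host and route
        files={"file": ("ad_cut_v2.mp4", f, "video/mp4")},
        data={"subject_id": 42},
    )
resp.raise_for_status()
print(resp.json())  # e.g. per-scene engagement scores / heatmap references
```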
Hardware
Trained on an NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB UMA, sm_121). All training was performed on Pylox Labs on-prem hardware: zero cloud GPU usage, zero third-party API exposure to the raw fMRI data.
License
CC-BY-NC-4.0 (non-commercial research use). For commercial licensing (ad-testing platforms, media research firms), contact inquiries@pyloxforge.com.
Citation
If you use this model in research:
```bibtex
@misc{pylox-neuro-tribev2,
  title  = {Pylox Neuro: Brain Engagement Encoder on TRIBE v2},
  author = {Pylox Labs},
  year   = {2026},
  url    = {https://huggingface.co/emiliogirard/pylox-neuro}
}
```
Underlying TRIBE v2 architecture: Meta FAIR, 2025.