Pylox Neuro – Brain Engagement Encoder (TRIBE v2)

A trained multimodal brain-engagement model that predicts fMRI BOLD signal responses in the visual cortex from naturalistic video + speech stimuli. Built on Meta's TRIBE v2 architecture (Temporally Recurrent Image-text Brain Encoder), fine-tuned across 10 public fMRI datasets covering film, narrative speech, music, and naturalistic video viewing.

This is the Pylox Labs brain-encoding research artifact. For the LLM fine-tuning consultancy, see Pylox Forge.


What it does

Given a video (with audio) and a subject identifier, the model predicts per-voxel BOLD activation across the visual cortex ROI (VC-ROI) – the brain regions active while watching the stimulus. Outputs are time-aligned voxel maps at ~1.5 s TR resolution.
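
For orientation, the snippet below relates clip length to the number of predicted timepoints. The 1.5 s TR is stated above; the exact rounding and onset alignment are assumptions that depend on the preprocessing pipeline.

import math

# Assumed: one predicted voxel map per TR of ~1.5 s (see above).
# Rounding/onset handling here is an assumption, not documented model behavior.
TR_SECONDS = 1.5
clip_seconds = 60.0                                    # e.g. a one-minute stimulus

n_timepoints = math.floor(clip_seconds / TR_SECONDS)   # 40 TRs for a 60 s clip
print(n_timepoints)

# Predictions then come back as one VC-ROI voxel map per TR:
#   preds.shape == (n_subjects, n_timepoints, n_voxels)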

Use cases:

  • Ad effectiveness testing – predict audience engagement per scene without human panels
  • Film/TV pre-screening – identify high-arousal segments, predict viewer retention
  • UX research on video content – A/B variants scored by predicted cortex activation
  • Academic neuroscience – out-of-distribution BOLD prediction baseline

Architecture

  • Backbone: TRIBE v2 (Meta FAIR, 2025) – transformer over per-frame multimodal embeddings
  • Vision encoder: facebook/dinov2-giant (frozen, ~1.1B params)
  • Text encoder: Qwen/Qwen3-Embedding-8B (frozen)
  • Audio features: Wav2Vec 2.0 transcription + speech embeddings
  • Output head: Linear regressor over VC-ROI voxels (subject-specific; sketched after this list)
  • Loss: MSE per voxel; trained for 15–20 epochs with early stopping
  • Framework: PyTorch Lightning
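
A minimal sketch of the subject-specific readout and per-voxel MSE loss named above, assuming the backbone produces one fused embedding per TR. The class, layer, and dimension choices are illustrative assumptions, not the actual TRIBE v2 implementation.

import torch
import torch.nn as nn

class VoxelReadout(nn.Module):
    """Illustrative subject-specific linear readout over VC-ROI voxels."""
    def __init__(self, embed_dim: int, n_voxels: int, n_subjects: int):
        super().__init__()
        # One linear regressor per subject, mirroring a subject-specific head.
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, n_voxels) for _ in range(n_subjects)]
        )

    def forward(self, fused: torch.Tensor, subject_id: int) -> torch.Tensor:
        # fused: [n_timepoints, embed_dim] -> [n_timepoints, n_voxels]
        return self.heads[subject_id](fused)

# Per-voxel MSE, averaged over timepoints and voxels
loss_fn = nn.MSELoss()

readout = VoxelReadout(embed_dim=1024, n_voxels=8000, n_subjects=4)
fused = torch.randn(40, 1024)      # 40 TRs of fused multimodal embeddings
target = torch.randn(40, 8000)     # measured BOLD for the same TRs
loss = loss_fn(readout(fused, subject_id=0), target)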

Training data

Trained across 10 public fMRI datasets:

  • Algonauts 2025 BOLD – standardized fMRI prediction challenge (held-out test)
  • BOLD5000 – image-viewing fMRI (Chang et al.)
  • Lahner 2024 BOLD – video-viewing fMRI
  • FudanVideo CC2017/FCVID/WebVid – naturalistic video viewing
  • GOD (Generic Object Decoding) – image-viewing fMRI, perceptual experiments
  • Narratives – story-listening fMRI (Nastase et al., Princeton)
  • FilmFestival – cinematic fMRI
  • MusicEmotion – music-listening fMRI
  • PixarKids – animation-viewing fMRI
  • PublicSpeaking – speech-listening fMRI

All source datasets are public research releases. No proprietary or identifiable subject data.

Files in this repo

  • best.ckpt – best validation-loss checkpoint (~24 GB, PyTorch Lightning format; see the inspection snippet after this list)
  • last.ckpt – final-epoch checkpoint (available on request due to Hugging Face storage limits)
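
To sanity-check the download before loading the full model, you can inspect the checkpoint directly. The keys shown are standard PyTorch Lightning checkpoint fields; anything beyond that is an assumption about this particular file.

import torch

# Load the checkpoint dict on CPU without instantiating the model.
ckpt = torch.load("best.ckpt", map_location="cpu")
print(ckpt.keys())                    # typically 'state_dict', 'hyper_parameters', ...
print(list(ckpt["state_dict"])[:5])   # first few parameter names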

Usage

from tribev2.main import TribeExperiment
import torch

# Load the best validation-loss checkpoint (PyTorch Lightning format)
model = TribeExperiment.load_from_checkpoint("best.ckpt", map_location="cuda")
model.eval()

# video_tensor / audio_tensor: preprocessed stimulus tensors for one clip
# preds shape: [n_subjects, n_timepoints, n_voxels]
with torch.no_grad():
    preds = model(video_tensor, audio_tensor, subject_id=42)

Alternatively, use the FastAPI backend in the pylox-neuro repo: upload an MP4 and get per-scene engagement heatmaps back. A minimal client sketch follows.
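
The sketch below shows one way to call such a backend from Python. The host, route, and form-field name are assumptions for illustration only; check the pylox-neuro repo for the actual endpoints.

import requests

# Hypothetical endpoint and field name; not the documented pylox-neuro API.
with open("ad_spot.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/score",                      # assumed route
        files={"file": ("ad_spot.mp4", f, "video/mp4")},
    )
resp.raise_for_status()
heatmaps = resp.json()  # expected: per-scene engagement scores / heatmaps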

Hardware

Trained on an NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory, sm_121). All training was performed on Pylox Labs on-prem hardware: no cloud GPUs and no third-party API exposure of the raw fMRI data.

License

CC-BY-NC-4.0 – non-commercial research use. For commercial licensing (ad-testing platforms, media research firms): contact inquiries@pyloxforge.com.

Citation

If you use this model in research:

@misc{pylox-neuro-tribev2,
  title = {Pylox Neuro: Brain Engagement Encoder on TRIBE v2},
  author = {Pylox Labs},
  year = {2026},
  url = {https://huggingface.co/emiliogirard/pylox-neuro}
}

Underlying TRIBE v2 architecture: Meta FAIR, 2025.
