Assessing AI models trained on brain signals has historically been a disorganized and inconsistent area of research. Various research teams employ different preprocessing methods, train their models on distinct datasets, and report outcomes on a limited range of tasks — making it extremely difficult to determine which model truly performs best, or in what context. A new framework from the Meta AI team aims to address this issue.
Meta researchers have introduced NeuralBench, a comprehensive, open-source framework for evaluating AI models of brain activity. Its initial release, NeuralBench-EEG v1.0, represents the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures assessed through a single standardized interface.

The Problem NeuralBench Solves
The broader field of NeuroAI — where deep learning intersects with neuroscience — has grown rapidly in recent years. Self-supervised learning techniques originally created for language, speech, and images are now being adapted to build brain foundation models: large-scale models pretrained on unlabeled brain recordings and fine-tuned for downstream applications ranging from clinical seizure detection to decoding what a person is seeing or hearing.
However, the evaluation landscape has been severely fragmented. Existing benchmarks like MOABB cover up to 148 brain-computer interface (BCI) datasets but restrict evaluation to only 5 downstream tasks. Other initiatives — EEG-Bench, EEG-FM-Bench, AdaBrain-Bench — each have their own limitations. For modalities such as magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), no systematic benchmark exists at all.
As a result, claims about foundation models being “generalizable” or “foundational” are often based on selectively chosen tasks with no shared reference point for comparison.
What is NeuralBench?
NeuralBench is built on three core Python packages that together form a modular pipeline.
NeuralFetch manages dataset acquisition, pulling curated data from public repositories including OpenNeuro, DANDI, and NEMAR. NeuralSet organizes data into PyTorch-ready dataloaders, integrating existing neuroscience tools like MNE-Python and nilearn for preprocessing, along with HuggingFace for extracting stimulus embeddings (for tasks involving images, speech, or text). NeuralTrain offers modular training code built on PyTorch-Lightning, Pydantic, and the exca execution and caching library.
Once installed via pip install neuralbench, the framework is operated through a command-line interface (CLI). Running a task takes just three commands: download the data, prepare the cache, and execute. Every task is configured through a lightweight YAML file that defines the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.
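To make that list of configuration fields concrete, here is a minimal Python sketch of the information such a task file carries. It is not NeuralBench's actual schema: the class name TaskConfig, every field name, and every value are hypothetical and chosen only for illustration. The real YAML keys should be taken from the repository.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class TaskConfig:
    """Hypothetical stand-in for the fields a task YAML file defines."""
    data_source: str                                     # where the dataset comes from
    split: str                                           # train/validation/test strategy
    preprocessing: list = field(default_factory=list)    # ordered preprocessing steps
    target_processing: str = "none"                      # how targets are transformed
    hyperparameters: dict = field(default_factory=dict)  # training settings
    metrics: list = field(default_factory=list)          # evaluation metrics


# All values below are illustrative placeholders, not real NeuralBench keys.
example = TaskConfig(
    data_source="openneuro",
    split="cross_subject",
    preprocessing=["bandpass", "resample"],
    target_processing="label_encode",
    hyperparameters={"lr": 1e-4, "max_epochs": 50},
    metrics=["balanced_accuracy"],
)

config_dict = asdict(example)
print(sorted(config_dict))
```

Keeping the whole task definition in one declarative object like this is what lets the same three commands drive every task.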


What NeuralBench-EEG v1.0 Covers
This initial release centers on EEG data and spans eight distinct task categories: cognitive decoding (covering image, sentence, speech, typing, video, and word decoding), brain-computer interfacing (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous.
Three categories of models are benchmarked:
- Task-specific architectures (roughly 1.5K–4.2M parameters, trained from the ground up): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
- EEG foundation models (approximately 3.2M–157.1M parameters, pretrained then fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
- Handcrafted feature baselines: sklearn-style pipelines that use symmetric positive definite (SPD) matrix representations fed into logistic or Ridge regression classifiers.
Every foundation model is fine-tuned end-to-end following a common training protocol — AdamW optimizer, a learning rate of 10⁻⁴, weight decay set to 0.5, cosine-annealing scheduling with 10% warmup, training for up to 50 epochs with early stopping (patience=10). The one exception is BENDR, where the learning rate is reduced to 10⁻⁵ and gradient clipping at 0.5 is applied to achieve stable training curves. By design, this uniform setup strips away model-specific optimization techniques — such as layer-wise learning rate decay, two-stage probing, or LoRA — ensuring that what’s truly being assessed is the architecture and pretraining approach.
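The schedule and stopping rule in this protocol can be sketched in plain Python. This is only an illustration under stated assumptions: the article does not say whether warmup is linear, whether the schedule is stepped per epoch or per batch, or what the final learning rate is, so the sketch assumes linear warmup, per-epoch steps, and decay to zero. The function names lr_at and stopping_epoch are hypothetical, and the AdamW optimizer and weight decay are omitted because they need PyTorch.

```python
import math

BASE_LR = 1e-4          # learning rate stated in the article
MAX_EPOCHS = 50         # upper bound on training length
WARMUP_FRACTION = 0.1   # 10% warmup
PATIENCE = 10           # early-stopping patience


def lr_at(step, total_steps, base_lr=BASE_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to base_lr, then cosine annealing towards zero (assumed shapes)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def stopping_epoch(val_losses, patience=PATIENCE):
    """Return the epoch index at which early stopping would halt training."""
    best = float("inf")
    stale = 0  # epochs since the validation loss last improved
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: trained to the end


schedule = [lr_at(epoch, MAX_EPOCHS) for epoch in range(MAX_EPOCHS)]
print(f"peak lr: {max(schedule):.1e}")
```

For BENDR, the same sketch would apply with base_lr=1e-5.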
Data splitting strategies vary by task type to mirror real-world generalization challenges: predefined splits are used when provided by the original dataset authors, leave-concept-out splitting is applied for cognitive decoding tasks (all subjects appear in training, but a held-out set of stimuli is reserved for testing), cross-subject splits are used for most clinical and BCI tasks, and within-subject splits are adopted for datasets with very limited participant counts. Each model is trained three times per task using three different random seeds.
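The difference between the cross-subject and leave-concept-out strategies is easiest to see on a toy trial table. The sketch below is not NeuralBench code: the function names, the dictionary keys, and the subject and stimulus identifiers are all hypothetical.

```python
def cross_subject_split(trials, test_subjects):
    """Hold out whole subjects: no test subject appears in training."""
    train = [t for t in trials if t["subject"] not in test_subjects]
    test = [t for t in trials if t["subject"] in test_subjects]
    return train, test


def leave_concept_out_split(trials, test_stimuli):
    """Hold out stimuli: every subject may appear in training, but the
    held-out stimuli appear only in the test set."""
    train = [t for t in trials if t["stimulus"] not in test_stimuli]
    test = [t for t in trials if t["stimulus"] in test_stimuli]
    return train, test


# Toy trial table: 3 subjects x 4 stimuli (hypothetical identifiers).
trials = [
    {"subject": subject, "stimulus": stimulus}
    for subject in ("sub-01", "sub-02", "sub-03")
    for stimulus in ("cat", "dog", "car", "tree")
]

cs_train, cs_test = cross_subject_split(trials, {"sub-03"})
lco_train, lco_test = leave_concept_out_split(trials, {"tree"})

print(len(cs_train), len(cs_test))
print(len(lco_train), len(lco_test))
```

In the first split the model is tested on an unseen person. In the second it is tested on unseen stimuli from people it has already seen.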
Evaluation metrics are standardized according to task type: balanced accuracy for binary and multiclass classification, macro F1-score for multilabel classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. All results are also reported as normalized scores (s̃), where 0 represents dummy-level performance and 1 represents perfect performance, allowing fair comparisons across tasks regardless of the original metric scale.
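As a worked example, the sketch below computes balanced accuracy on a toy binary problem and rescales it. The article gives only the two endpoints of the normalized score (0 for dummy-level, 1 for perfect), so the linear rescaling used here is an assumption rather than the benchmark's documented formula. The function names and toy labels are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def normalized_score(score, dummy_score, perfect_score=1.0):
    """Rescale so dummy-level performance maps to 0 and perfect to 1.
    Linear rescaling is an assumption; the article gives only the endpoints."""
    return (score - dummy_score) / (perfect_score - dummy_score)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

bacc = balanced_accuracy(y_true, y_pred)            # (3/4 + 1/2) / 2 = 0.625
s_tilde = normalized_score(bacc, dummy_score=0.5)   # (0.625 - 0.5) / 0.5 = 0.25
print(bacc, s_tilde)
```

A dummy classifier on a two-class task has a balanced accuracy of 0.5, so a raw score of 0.625 covers only a quarter of the distance to perfect performance.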
One key methodological consideration: some EEG foundation models were pretrained on datasets that overlap with NeuralBench’s downstream evaluation sets. Rather than removing these results, the benchmark marks them with hatched bars in result figures so readers can spot potential pretraining data leakage. No strong pattern indicating that leakage inflates performance was detected, but the overlap is flagged for transparency.
The benchmark comes in two versions: NeuralBench-EEG-Core v1.0, which uses a single representative dataset per task for broad coverage, and NeuralBench-EEG-Full v1.0, which expands to as many as 24 datasets per task to examine within-task variability across recording hardware, laboratories, and subject populations. A Kendall’s τ of 0.926 (p < 0.001) between Core and Full rankings confirms that the Core version is a dependable proxy — though a few model rankings do shift, including CTNet surpassing LUNA when additional datasets are factored in.
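To show what a rank correlation of this kind measures, here is a small pure-Python Kendall's τ on toy rankings. The article does not say which τ variant the benchmark uses, so the tie-free tau-a below is an assumption. The toy rankings cover only five models: the CTNet/LUNA swap mirrors the article, while the remaining positions in the Full ranking are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (no ties assumed)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        product = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if product > 0:
            concordant += 1   # the pair is ordered the same way in both rankings
        elif product < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy rankings (1 = best); illustrative, not the benchmark's full results.
core = {"REVE": 1, "LaBraM": 2, "LUNA": 3, "CTNet": 4, "SimpleConvTimeAgg": 5}
full = {"REVE": 1, "LaBraM": 2, "CTNet": 3, "LUNA": 4, "SimpleConvTimeAgg": 5}

tau = kendall_tau(core, full)  # 10 pairs, 1 discordant: (9 - 1) / 10 = 0.8
print(tau)
```

A single adjacent swap among five models already lowers τ to 0.8. The reported 0.926 is computed over all benchmarked models, where one swap weighs far less.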


Two Key Findings
Finding 1: Foundation models offer only a slight edge over task-specific models. The overall top performers are REVE (69.2M parameters, mean normalized rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). Yet, several task-specific models trained from scratch — CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43) — are close behind. Notably, CTNet surpasses the LUNA foundation model to claim third place in the Full variant, despite having roughly 270× fewer parameters. This indicates the performance gap is narrow enough that simply adding more datasets could shift the global rankings.
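The parameter ratios behind this comparison can be checked from the counts quoted above. Because those counts are rounded (150K for CTNet, for instance), the ratios are approximate.

```python
# Parameter counts as quoted in the article.
params = {
    "REVE": 69.2e6,
    "LaBraM": 5.8e6,
    "LUNA": 40.4e6,
    "CTNet": 150e3,
}

# How many times larger each foundation model is than CTNet.
ratios = {
    name: count / params["CTNet"]
    for name, count in params.items()
    if name != "CTNet"
}

for name, ratio in ratios.items():
    print(f"{name} has about {ratio:.0f}x the parameters of CTNet")
```

LUNA comes out at roughly 269 times CTNet's size, in line with the "roughly 270×" figure, while REVE is roughly 461 times larger.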
Finding 2: Many tasks are still genuinely difficult. Cognitive decoding tasks — which involve reconstructing detailed representations of images, speech, sentences, video, or words from brain signals — are especially tough, with even the best models scoring well below the maximum possible. Tasks like mental imagery, sleep arousal, psychopathology decoding, and cross-subject motor imagery and P300 classification often yield results close to random chance. These tasks are ideal for stress-testing the next wave of EEG foundation models.
Tasks nearing peak performance include SSVEP classification, pathology detection, seizure detection, sleep stage classification, and phenotyping tasks like age regression and sex classification.
Beyond EEG: MEG and fMRI
Even in this initial EEG-focused release, NeuralBench already includes MEG and fMRI tasks as a proof of concept. Impressively, the REVE model — pretrained solely on EEG data — achieves the top score among all tested models on the typing decoding task in MEG. This is a promising early sign that EEG-pretrained representations can effectively transfer across different brain recording methods, a hypothesis the framework is built to rigorously test in future updates.
The infrastructure is explicitly designed to expand to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).
How to Get Started
Installation is simple: pip install neuralbench. To run the audiovisual stimulus classification task on EEG, use these commands:
neuralbench eeg audiovisual_stimulus --download # Download data
neuralbench eeg audiovisual_stimulus --prepare # Prepare cache
neuralbench eeg audiovisual_stimulus # Run the task
To run all 36 tasks against all 14 EEG models, use the -m all_classic all_fm flag. Full benchmark storage needs are significant: approximately 11 TB total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB logged results), with one GPU of at least 32 GB VRAM per job, though average peak GPU usage across experiments is only ~1.3 GB (maximum ~30.3 GB).
The complete NeuralBench-EEG-Full v1.0 run requires approximately 1,751 GPU-hours across 4,947 experiments.
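A quick back-of-the-envelope check of these resource figures, using only the numbers quoted above. Decimal units (1 TB = 1000 GB) are assumed, and the per-experiment figure is a plain average that says nothing about how the cost is spread across models or tasks.

```python
# Storage components quoted in the article, in TB.
raw_tb = 3.2
cache_tb = 7.8
results_tb = 333 / 1000   # 333 GB expressed in TB (decimal units assumed)
total_tb = raw_tb + cache_tb + results_tb

# Compute budget for the Full variant.
gpu_hours = 1751
experiments = 4947
minutes_per_experiment = gpu_hours / experiments * 60

print(f"total storage: about {total_tb:.1f} TB")
print(f"average cost: about {minutes_per_experiment:.0f} GPU-minutes per experiment")
```

The three storage components add up to about 11.3 TB, consistent with the "approximately 11 TB" figure, and the average experiment takes a little over 21 GPU-minutes.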
Key Takeaways
- Meta AI’s NeuralBench-EEG v1.0 is an open EEG benchmark — 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.
- EEG foundation models like REVE (69.2M parameters) only slightly outperform lightweight task-specific models like CTNet (150K parameters), and in the Full variant CTNet overtakes LUNA despite having roughly 270× fewer parameters.
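- Cognitive decoding tasks (image, speech, video, sentence, and word decoding from brain activity) remain highly challenging, and tasks such as mental imagery, sleep arousal, and psychopathology decoding often score near random chance. Other clinical tasks, including seizure and pathology detection, are by contrast close to peak performance.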
- REVE, pretrained only on EEG data, outperformed all models on MEG typing decoding — an early sign of meaningful cross-modality transfer.
- NeuralBench is MIT-licensed.



