Automatic detection of sleep stages
Datasets: The IS-RC dataset includes one polysomnography (PSG) recording from each of 70 women, collected as part of a research study focused on sleep-disordered breathing in women aged 40 to 57 years18. All 70 recordings were scored by ten trained experts using the American Academy of Sleep Medicine (AASM) guidelines8, with annotations from six of these experts made publicly available. We combined these annotations into a single consensus using the method described by Stephansen et al.19. One recording was excluded because the filenames of its annotations did not match the corresponding EEG data.
The BD (Bipolar EEG Dataset) consists of one PSG recording each from 25 healthy individuals and 23 patients diagnosed with bipolar disorder17. These 48 recordings were each scored by a single expert in accordance with AASM standards. Because the EEG channel configurations varied across recordings, we limited our analysis to the channels present in all recordings: F3-A2, F4-A1, C3-A2, C4-A1, O1-A2, O2-A1, and A1-A2. Sampling rates differed across recordings and channels, ranging from 100 Hz to 500 Hz.
All recordings were segmented into 30-second intervals (epochs) and classified into one of five sleep stages: Wake, N1, N2, N3, or REM, based on expert annotations.
Model: For automated sleep staging, we employed RobustSleepNet (RSN), a deep learning model designed to work regardless of the number, type, or arrangement of PSG channels11. Guillot et al. released multiple pre-trained versions of RSN, each trained on EEG, EOG, and EMG data from various datasets35. We selected the version trained on 659 recordings from the MESA36, MrOS37, SHHS38, DODO39, DODH39, and MASS40 datasets. The model processes sequences of 21 consecutive sleep epochs (equivalent to 10.5 minutes) to provide enough context for accurate classification. For each input sequence, it generates probability scores for all five sleep stages for every epoch. Thanks to its flexible architecture, the model can handle input sequences with any number of EEG channels.
In line with standard RSN usage protocols11, each recording was preprocessed before being analyzed by the model. This involved applying a 4th-order Butterworth bandpass filter (0.3–30 Hz), downsampling to 60 Hz, and normalizing the signal by subtracting the median and dividing by the interquartile range. Any signal values outside the range of –20 to 20 were clipped to these limits. To avoid edge effects, we padded both the start and end of each recording with 20 epochs of zeros. Input sequences were then generated by sliding a 21-epoch window across the padded recording in steps of one epoch, meaning each epoch appeared in 21 different input sequences. The 21 resulting probability predictions for each epoch were combined using the geometric mean to produce the final stage prediction11.
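The normalization and the combination of overlapping window predictions described above can be sketched as follows. This is a minimal illustration, not the RSN implementation: the function names `robust_scale` and `aggregate_windows`, the array layout, and the small constant added before taking logarithms are all assumptions made for the example, and filtering and resampling are left out.

```python
import numpy as np

def robust_scale(signal, clip=20.0):
    """Subtract the median, divide by the interquartile range, then clip."""
    q25, q50, q75 = np.percentile(signal, [25, 50, 75])
    return np.clip((signal - q50) / (q75 - q25), -clip, clip)

def aggregate_windows(window_probs, n_epochs, window=21):
    """Combine overlapping window predictions with a geometric mean.

    window_probs has shape (n_windows, window, n_stages). Window i is assumed
    to cover padded epochs i .. i + window - 1, where the recording was padded
    with window - 1 epochs on each side (hypothetical layout).
    """
    window_probs = np.asarray(window_probs, dtype=float)
    pad = window - 1
    n_stages = window_probs.shape[2]
    log_sum = np.zeros((n_epochs + 2 * pad, n_stages))
    for start, probs in enumerate(window_probs):
        # Summing logs and exponentiating later gives the geometric mean.
        log_sum[start:start + window] += np.log(probs + 1e-12)
    # Drop the padding; each real epoch was covered by exactly `window` windows.
    geo_mean = np.exp(log_sum[pad:pad + n_epochs] / window)
    return geo_mean.argmax(axis=1), geo_mean
```

With 20 epochs of padding on each side and a step of one epoch, a recording of n epochs yields n + 20 windows, which is the number of rows `aggregate_windows` expects.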
Evaluation: To assess agreement between two sets of annotations for a given recording, we computed the Macro F1 score as follows. For each sleep stage, we first identified true positives (TP): epochs to which both annotations assigned that stage. False positives (FP) were epochs assigned that stage by the second annotation but a different stage by the first. False negatives (FN) were epochs assigned that stage by the first annotation but a different stage by the second. Precision was calculated as TP / (TP + FP), and recall as TP / (TP + FN). The F1 score for each stage was then computed as 2 × (precision × recall) / (precision + recall). Because swapping the two annotations only exchanges FP and FN, the F1 score does not depend on which annotation is treated as the reference. The overall Macro F1 score was obtained by averaging the F1 scores across all five sleep stages.
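A minimal sketch of this Macro F1 computation is given below. The function name `macro_f1` and the string labels are illustrative, and scoring a stage as 0 when its precision or recall is undefined is an assumption that the text does not specify.

```python
STAGES = ("Wake", "N1", "N2", "N3", "REM")

def macro_f1(first, second, stages=STAGES):
    """Macro F1 between two per-epoch stage sequences of equal length."""
    f1_scores = []
    for stage in stages:
        pairs = list(zip(first, second))
        tp = sum(a == stage and b == stage for a, b in pairs)
        fp = sum(a != stage and b == stage for a, b in pairs)
        fn = sum(a == stage and b != stage for a, b in pairs)
        # Assumption: undefined precision/recall/F1 are counted as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```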
Automatic detection of sleep spindles
We identified sleep spindles using SUMOv2, an improved version of the publicly available SUMO model16. Our enhancements specifically targeted greater robustness to differences in signal amplitude, enabling more consistent spindle detection across varied datasets.
Datasets: The MASS dataset includes 200 PSG recordings from 200 mostly healthy participants (15 of whom had mild cognitive impairment), all sampled at 256 Hz40. The MODA dataset provides spindle annotations for selected segments from 180 of these MASS recordings21,41. Each recording was divided into either 10 (for 30 recordings) or 3 (for 150 recordings) non-overlapping 115-second blocks of artifact-free N2 sleep. These blocks were annotated for spindles by up to seven human experts, who used either the C3-A2 or C3-LE EEG channel depending on the recording. In total, the MODA dataset contains 749 blocks scored by 47 experts (one block was not reviewed by any expert). Individual annotations were merged into a consensus based on each expert’s reported confidence level21. For convenience, we refer to the combined MASS and MODA data simply as the MODA dataset.
The DREAMS dataset consists of eight 30-minute EEG segments from eight individuals with various sleep disorders (including dysomnia, restless legs syndrome, insomnia, and apnoea/hypopnoea syndrome), sampled at 50, 100, or 200 Hz20. These segments were extracted from full-night recordings without considering sleep stages or artifact presence. Each segment was annotated for spindles by two human experts, except for the last two segments, which were scored only by the first expert. Depending on the segment, the experts used either the C3-A1 or CZ-A1 EEG channel and were unaware of the sleep stage labels, which were assigned by a separate, independent expert. Using the sleep stage annotations provided by this independent expert, we excluded any spindle markings that occurred outside of N2 sleep.
The BD dataset (also described in the “Automatic detection of sleep stages” section) includes spindle annotations for artifact-free periods of N1, N2, and N3 sleep17. Consistent with the original study17, we focused our analysis on spindles occurring during N2 sleep. Spindle annotations were created by one expert and verified by a second expert when uncertainty arose. Although the BD dataset does not explicitly state which EEG channels were used for annotation, it is reasonable to assume they matched those analyzed in the original study: F3-A2, F4-A1, C3-A2, and C4-A117.
Model: Following the approach used in the original SUMO study, we treated spindle detection as a segmentation task applied to individual EEG channels16. We adopted the same U-Net architecture for SUMOv2, featuring two encoder and two decoder blocks. SUMOv2 can process input sequences of any length and produces two output segmentation masks, one for the presence and one for the absence of a spindle. A data point was classified as spindle-positive when the first mask held the larger of the two values, and as spindle-negative when the second mask did. Consecutive data points flagged as spindle-positive were merged into final spindle events, each defined by a start time and duration.
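The conversion from the two output masks to spindle events can be sketched as below. This is an illustrative reading of the description, not the SUMOv2 code: the function name `masks_to_events` and the choice to express start and duration in seconds are assumptions.

```python
import numpy as np

def masks_to_events(spindle_mask, background_mask, sfreq):
    """Return (start_s, duration_s) for each run of spindle-positive samples."""
    positive = np.asarray(spindle_mask) > np.asarray(background_mask)
    # Pad with False so that runs touching either edge are still closed.
    padded = np.concatenate(([False], positive, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # first sample of each run
    stops = np.flatnonzero(edges == -1)    # one past the last sample
    return [(s / sfreq, (e - s) / sfreq) for s, e in zip(starts, stops)]
```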
To detect spindles or train SUMOv2, we first preprocessed the data by isolating continuous segments of N2 sleep based on sleep stage labels. Each N2 segment was then filtered using a 20th-order Butterworth high-pass filter at 0.3 Hz and a 20th-order Butterworth low-pass filter at 30 Hz, downsampled to 100 Hz, and normalized by subtracting the median amplitude and dividing by the interquartile range. Signal amplitudes outside the –20 to 20 range were clipped to these limits.
Training: To train SUMOv2, we divided the MODA dataset into training and test sets. The test set included data from 36 subjects, each contributing 3 blocks of 115 seconds (refer to the SUMO study for details on how these subjects were chosen16). The training set was further divided into six cross-validation folds. Each fold contained data from five subjects with 10 blocks of 115 seconds each and 19 subjects with 3 blocks of 115 seconds each. After optimizing hyperparameters using this cross-validation setup, we retrained the final model on the full training set, setting aside 10% of the data for early stopping. The DREAMS and BD datasets were excluded from training and validation and were used solely for testing.
We trained the model using the Adam optimizer with a batch size of 12, a learning rate of 0.005, and a generalized Dice loss—a variant of the Dice loss that handles class imbalance more effectively42. To avoid overfitting, we stopped training when the IoU-F1 score (described in the next section) on the validation set did not improve for 300 consecutive epochs or when 800 epochs were completed.
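For orientation, a generalized Dice loss can be written as in the sketch below. It uses the widely used formulation in which each class is weighted by the inverse of its squared reference size; whether this matches the exact variant used for SUMOv2 is an assumption, as are the function name and the NumPy (rather than deep learning framework) implementation.

```python
import numpy as np

def generalized_dice_loss(probs, target, eps=1e-8):
    """probs, target: arrays of shape (n_classes, n_samples); target is one-hot."""
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target, dtype=float)
    # Inverse squared class size: rare classes (spindles) receive larger weights.
    weights = 1.0 / (target.sum(axis=1) ** 2 + eps)
    intersection = (weights * (probs * target).sum(axis=1)).sum()
    total = (weights * (probs + target).sum(axis=1)).sum()
    return 1.0 - 2.0 * intersection / (total + eps)
```

The class weights are what distinguish this loss from the plain Dice loss: without them, the much more frequent non-spindle samples would dominate the sum.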
We observed that the original SUMO model was sensitive to changes in EEG amplitude, which made it difficult to apply to datasets with different amplitude patterns, such as those from patients with weaker spindle activity or different recording setups. To improve robustness in SUMOv2, we introduced data augmentation by randomly rescaling EEG amplitudes during training. Each sample was randomly multiplied by a factor between 1 and 2 (upscaled) or between 0.5 and 1 (downscaled), helping the model adapt to varying amplitude distributions.
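A minimal sketch of this amplitude augmentation is shown below. The text gives only the two factor ranges, so the equal probability of upscaling and downscaling, the uniform sampling within each range, and the function name `random_rescale` are assumptions.

```python
import numpy as np

def random_rescale(signal, rng):
    """Scale one training sample up (factor 1 to 2) or down (factor 0.5 to 1)."""
    if rng.random() < 0.5:              # assumed 50/50 choice of direction
        factor = rng.uniform(1.0, 2.0)  # upscale
    else:
        factor = rng.uniform(0.5, 1.0)  # downscale
    return signal * factor, factor
```

A call such as `random_rescale(segment, np.random.default_rng(0))` returns the rescaled sample together with the factor that was drawn.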
Evaluation: When testing SUMOv2 on the BD dataset, we applied the model independently to each of the four EEG channels and combined the detected spindles across channels by merging overlapping annotations and keeping non-overlapping ones separate.
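Combining detections across channels amounts to taking the union of time intervals, which can be sketched as follows. The function name `merge_channels`, the (start, end) representation in seconds, and treating intervals that merely touch as overlapping are assumptions for illustration.

```python
def merge_channels(events_per_channel):
    """Union of (start, end) spindle intervals detected on several channels."""
    intervals = sorted(ev for channel in events_per_channel for ev in channel)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]
```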
Following standard methods from the MODA study21, we postprocessed all detected spindles by merging those shorter than 0.3 seconds if they were less than 0.1 seconds apart, and then removed any spindles shorter than 0.3 seconds or longer than 2.5 seconds.
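The postprocessing step can be sketched as below. The wording leaves open whether one or both neighbouring spindles must be shorter than 0.3 seconds for a merge; this sketch assumes both, and the function name `postprocess` and the (start, end) representation in seconds are likewise illustrative.

```python
def postprocess(events, min_dur=0.3, max_dur=2.5, max_gap=0.1):
    """Merge short, close spindles, then drop those outside the duration limits."""
    merged = []
    for start, end in sorted(events):
        if merged:
            prev_start, prev_end = merged[-1]
            # Assumption: both neighbours must be shorter than min_dur.
            both_short = (prev_end - prev_start) < min_dur and (end - start) < min_dur
            if both_short and (start - prev_end) < max_gap:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return [(s, e) for s, e in merged if min_dur <= (e - s) <= max_dur]
```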
To assess model performance, we used the Intersection-over-Union (IoU) F1 score. For two sets of spindle annotations, the IoU-F1 score was calculated per spindle. Each spindle in the first set was paired with the closest spindle in the second set. If the overlap between two paired spindles, divided by the total duration covered by the two spindles together (their union), exceeded 20%, the pair was counted as a true positive (TP). Unmatched spindles in either set were labeled as false positives (FP) or false negatives (FN). TPs, FPs, and FNs were summed across all jointly annotated recordings or segments (e.g., when comparing two experts, we summed these values across all recordings that both had annotated). The IoU-F1 score was then computed as 2 × TP / (2 × TP + FP + FN).
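The IoU-F1 computation can be sketched as follows. Several details are assumptions rather than the authors' procedure: "closest" is interpreted as the spindle with the highest IoU, each spindle can be matched at most once, unmatched spindles of the first set count as FN and those of the second set as FP, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def iou_counts(first, second, threshold=0.2):
    """Count TP, FP, FN for one recording or segment."""
    unmatched = list(second)
    tp = 0
    for spindle in first:
        if not unmatched:
            break
        best = max(unmatched, key=lambda other: iou(spindle, other))
        if iou(spindle, best) > threshold:
            tp += 1
            unmatched.remove(best)   # one-to-one matching (assumption)
    return tp, len(second) - tp, len(first) - tp

def iou_f1(pairs, threshold=0.2):
    """pairs: iterable of (first_events, second_events), one per recording."""
    tp = fp = fn = 0
    for first, second in pairs:
        t, p, n = iou_counts(first, second, threshold)
        tp, fp, fn = tp + t, fp + p, fn + n
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Summing the counts before computing the score, rather than averaging per-recording scores, gives recordings with more spindles more influence on the result.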
For pairwise inter-rater agreement on the MODA dataset, we only included expert pairs who had annotated at least five blocks together, totaling about 9.5 minutes of EEG data (280 expert pairs met this requirement).
Spindle characteristics: For each detected spindle, we calculated density, duration, frequency, and amplitude. Spindle density (spindles per minute, SPM) was the number of spindles in N2 sleep divided by the total N2 sleep duration. Spindle duration was the time from the first to the last sample of each spindle. To determine frequency and amplitude, we first applied a 4th-order Butterworth band-pass filter (10–16 Hz) to the raw EEG signal. Spindle frequency was the average of instantaneous frequencies, calculated as half the reciprocal of zero-crossing intervals in the filtered signal. Spindle amplitude was the mean absolute value of the Hilbert-transformed filtered signal.
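The frequency and amplitude estimates can be sketched with SciPy as below. This is an illustration under stated assumptions: zero-phase filtering with `sosfiltfilt`, zero crossings located at sample resolution without interpolation, amplitude taken as the mean magnitude of the analytic signal, and the function name `spindle_features` are choices made for the example. Note also that `butter(4, ...)` with a band-pass yields an 8th-order filter overall, which may or may not correspond to the "4th-order" filter in the text.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def spindle_features(eeg, sfreq, start_s, duration_s):
    """Return (frequency_hz, amplitude) for one spindle in a raw EEG channel."""
    sos = butter(4, [10.0, 16.0], btype="bandpass", fs=sfreq, output="sos")
    filtered = sosfiltfilt(sos, eeg)
    envelope = np.abs(hilbert(filtered))      # magnitude of the analytic signal
    first = int(round(start_s * sfreq))
    last = first + int(round(duration_s * sfreq))
    segment = filtered[first:last]
    # Sample indices after which the sign of the filtered signal flips.
    sign = np.signbit(segment).astype(np.int8)
    crossings = np.flatnonzero(np.diff(sign) != 0)
    intervals = np.diff(crossings) / sfreq    # seconds between zero crossings
    # Each interval spans half a cycle, so frequency = 1 / (2 * interval).
    frequency = float(np.mean(0.5 / intervals)) if intervals.size else float("nan")
    amplitude = float(envelope[first:last].mean())
    return frequency, amplitude
```

Spindle density follows directly from the event list: the number of spindles detected in N2 sleep divided by the N2 duration in minutes.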



