Introduction

Decoding brain signals from non‑invasive or invasive recordings is one of the most ambitious frontiers in modern computational neuroscience. Electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and electrocorticography (ECoG) produce high‑dimensional, noisy, and non‑stationary streams of data. Deciphering the neural code from such signals has the potential to revolutionise clinical diagnostics, brain‑computer interfaces (BCIs), and neurorehabilitation. Over the past decade, deep learning – and in particular novel neural network architectures – has dramatically improved the accuracy, robustness, and generalisation of brain signal decoding systems. This article surveys the key architectural advances, from early multilayer perceptrons to modern transformers, and examines how each paradigm addresses the unique challenges of neural data.

Brain signals are characterised by low signal‑to‑noise ratios, high inter‑subject variability, and complex spatio‑temporal dynamics. Traditional machine learning pipelines required extensive hand‑crafted feature extraction, such as power spectral densities or common spatial patterns. Deep learning models, by contrast, can learn hierarchies of relevant features directly from raw or minimally pre‑processed recordings. This shift has enabled state‑of‑the‑art results in seizure detection, motor imagery classification, sleep staging, and even the decoding of imagined speech. The following sections detail the evolution of network architectures that have made these breakthroughs possible.

Traditional Neural Network Approaches

Early attempts to apply neural networks to brain signals relied on multilayer perceptrons (MLPs). These fully connected feedforward networks take a vector of features – typically engineered from the raw signal – and learn a non‑linear mapping to a desired output. For EEG‑based BCIs, common features included band‑power estimates from multiple frequency bands (delta, theta, alpha, beta, gamma) and spatial filters derived from common spatial pattern (CSP) algorithms. While MLPs could outperform classical classifiers such as linear discriminant analysis (LDA) or support vector machines (SVMs) in certain tasks, they suffered from several fundamental limitations when applied to neural data.
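
As an illustration of the band‑power features mentioned above, the following sketch estimates mean spectral power per band from a single channel using a plain periodogram. The band edges, the synthetic test signal, and the function name band_powers are assumptions made for this example; real pipelines typically use Welch averaging and lab‑specific band definitions.

```python
import numpy as np

# Band edges in Hz; conventions vary between labs, so treat these as assumptions.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_powers(signal, fs):
    """Mean periodogram power per frequency band for a 1-D signal."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / (fs * signal.size)
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].mean())
            for name, (lo, hi) in BANDS.items()}

# Synthetic single-channel example: a 10 Hz rhythm plus weak noise.
fs = 250
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

features = band_powers(signal, fs)
strongest_band = max(features, key=features.get)
print(strongest_band)
```

Stacking these five values for every electrode gives the kind of feature vector that an MLP or a linear classifier would receive.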

First, MLPs treat the input as an unordered feature vector, ignoring the spatial topology of electrodes and the temporal ordering of samples. Second, the number of parameters grows quickly with input dimensionality, leading to a high risk of overfitting on the relatively small datasets typical of BCI experiments (often fewer than a thousand trials per subject). Third, MLPs are sensitive to noise and exhibit limited capacity to model the high‑variance, non‑stationary dynamics of brain signals. As a consequence, although MLPs provided a proof‑of‑concept for end‑to‑end learned decoding, their practical utility was circumscribed. Researchers soon turned to architectures that could exploit the inherent structure of neural recordings.
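
To make the parameter‑growth argument concrete, the sketch below counts the weights and biases of a one‑hidden‑layer MLP applied to a flattened raw EEG trial. The channel count, trial length, sampling rate, and layer width are illustrative assumptions rather than values from any particular study.

```python
def mlp_param_count(n_inputs, hidden_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# Hypothetical trial: 64 channels, 4 s at 250 Hz, flattened to one vector.
n_channels, n_samples = 64, 4 * 250
n_inputs = n_channels * n_samples  # 64,000 input features

n_params = mlp_param_count(n_inputs, hidden_sizes=[256], n_outputs=4)
print(f"{n_inputs} inputs -> {n_params:,} parameters")
```

Under these assumptions the network has 16,385,284 parameters, which is more than sixteen thousand per trial when only a thousand trials are available.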

Limitations of Fully Connected Networks

Beyond the issues above, fully connected layers ignore the locality of information. In an EEG montage, adjacent electrodes often record correlated activity from nearby cortical sources. A convolutional layer that respects this local connectivity can leverage such spatial structure much more efficiently. Likewise, the temporal succession of samples contains critical information about event‑related potentials (ERPs) and oscillatory phase dynamics, which a standard MLP cannot capture without explicit time‑series feature engineering. These insights motivated the adoption of convolutional and recurrent designs.

Convolutional Neural Networks (CNNs) for Spatial Feature Extraction

Convolutional neural networks have become a cornerstone of brain signal decoding, especially for EEG and ECoG data. CNNs apply learnable filters across the spatial or spatio‑temporal dimensions of the input, preserving local relationships while drastically reducing the number of parameters compared to a fully connected network. Two main families of CNN architectures have emerged: those using 1D convolutions over time (temporal convolutions) and those using 2D convolutions over electrode channels arranged in a 2D grid (spatial convolutions). Many successful models combine both.

EEGNet and Shallow/Deep ConvNets

One of the most influential architectures is EEGNet, a compact CNN designed specifically for EEG classification. EEGNet uses depthwise and separable convolutions to learn subject‑specific spatial filters while keeping the model light enough to train on small datasets. The architecture consists of a temporal convolution (to learn frequency filters), followed by a depthwise convolution across channels (to learn spatial patterns), and then a separable convolution to combine feature maps. EEGNet has been shown to achieve competitive performance across multiple BCI paradigms (motor imagery, P300, error‑related negativity) with only a few thousand parameters. The original EEGNet paper (Lawhern et al., 2018) provides detailed benchmarks.
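
The compactness of this design can be checked with simple arithmetic. The sketch below counts only the convolution weights of an EEGNet‑style stack; the filter counts, kernel lengths, and the 22‑channel montage are assumptions chosen for illustration, and biases, batch normalisation, and the final classifier are deliberately left out.

```python
def eegnet_conv_weights(n_channels, f1=8, d=2, f2=16,
                        k_temporal=64, k_separable=16):
    """Count convolution weights in an EEGNet-style stack (weights only)."""
    temporal = f1 * k_temporal            # F1 temporal filters shared by all channels
    depthwise = f1 * d * n_channels       # D spatial filters per temporal feature map
    sep_depthwise = f1 * d * k_separable  # one temporal kernel per feature map
    sep_pointwise = f1 * d * f2           # 1x1 convolution mixing the feature maps
    counts = {
        "temporal": temporal,
        "depthwise_spatial": depthwise,
        "separable": sep_depthwise + sep_pointwise,
    }
    counts["total"] = sum(counts.values())
    return counts

counts = eegnet_conv_weights(n_channels=22)

# For comparison: a standard convolution in place of the separable one
# needs (input maps) x (output maps) x (kernel length) weights.
standard_conv = (8 * 2) * 16 * 16

print(counts, standard_conv)
```

With these settings the separable convolution uses 512 weights where a standard convolution of the same shape would use 4,096, and the whole convolutional stack stays below 1,500 weights.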

Other widely used CNN variants include Shallow ConvNet and Deep ConvNet, proposed by Schirrmeister et al. (2017). The shallow variant processes raw EEG with a temporal convolution and a spatial convolution across channels, followed by a squaring non‑linearity, mean pooling over time, a logarithm, and a final dense layer. This sequence resembles a learned log band‑power pipeline, which makes it effective for motor imagery tasks where frequency‑band power changes are prominent. The deep variant stacks multiple convolutional blocks with batch normalisation, achieving higher accuracy on more complex discrimination tasks but requiring more data to train. These models have become baselines in many BCI studies.
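
The squaring, mean‑pooling, and logarithm steps that Schirrmeister et al. describe for the shallow network can be sketched in a few lines of NumPy. The function name log_power_pool, the random stand‑in for convolution outputs, and the pooling window and stride are assumptions for this example; the learned convolutions themselves are not shown.

```python
import numpy as np

def log_power_pool(features, pool_size, stride):
    """Square, average-pool along time, then take the logarithm.

    features: array of shape (n_filters, n_times), standing in for the
    output of the temporal and spatial convolutions.
    """
    squared = features ** 2
    n_windows = 1 + (squared.shape[1] - pool_size) // stride
    pooled = np.stack(
        [squared[:, i * stride:i * stride + pool_size].mean(axis=1)
         for i in range(n_windows)],
        axis=1,
    )
    return np.log(np.clip(pooled, 1e-6, None))  # clipping avoids log(0)

rng = np.random.default_rng(0)
features = rng.standard_normal((40, 1000))  # 40 filters, 4 s at 250 Hz
log_power = log_power_pool(features, pool_size=75, stride=15)
print(log_power.shape)
```

Because the output is a log of averaged squared amplitude, doubling the amplitude of a filter output shifts its value by log 4, which is the behaviour expected of a log band‑power feature.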

CNNs for Seizure Detection and Beyond

In clinical applications, CNNs have demonstrated strong performance in automated seizure detection from continuous EEG recordings. By treating the multichannel signal as a 2D image (channels × time), a CNN can learn to recognise the spatio‑temporal signatures of ictal and interictal activity. Large‑scale studies, such as those using the Temple University Hospital EEG Dataset, have reported CNN‑based detectors with sensitivity and specificity exceeding 90% while operating in real time. Similarly, CNNs have been applied to detect slow‑wave sleep, predict medication response, and identify biomarkers for disorders like schizophrenia and Alzheimer’s disease.
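
For reference, sensitivity and specificity can be computed from per‑window predictions as in the sketch below. The labels are made up for illustration, and scoring each window independently is a simplifying assumption; clinical evaluations often score whole seizure events and false alarms per hour instead.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = seizure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels for ten consecutive EEG windows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Since seizures are rare in continuous recordings, both figures should be read together: a detector can reach high specificity while still raising many false alarms over a full day of monitoring.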

Recurrent Neural Networks (RNNs) and LSTM for Temporal Dynamics

Brain signals are intrinsically sequential: the state of the brain at time t is heavily influenced by its recent history. Recurrent neural networks (RNNs) are designed to process sequences by maintaining a hidden state that is updated at each time step. Early RNNs struggled with the vanishing gradient problem, which prevented them from learning long‑range dependencies – exactly the type of structure present in event‑related potentials, slow oscillations, and motor planning activity. The introduction of Long Short‑Term Memory (LSTM) and later Gated Recurrent Units (GRUs) addressed this issue by incorporating gating mechanisms that control the flow of information.
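
The gating idea can be shown with a single LSTM time step written in NumPy. This is a textbook formulation with one bias vector; the weight layout, the sizes, and the random inputs are assumptions for this example, and deep learning frameworks organise their parameters differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D), U: (4*H, H), b: (4*H,), with rows stacked in the order
    [input gate, forget gate, candidate, output gate].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # how much new information to write
    f = sigmoid(z[H:2 * H])        # how much of the old cell state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate cell update
    o = sigmoid(z[3 * H:4 * H])    # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H, T = 8, 16, 50                # 8 input features, 16 hidden units, 50 steps
W = 0.1 * rng.standard_normal((4 * H, D))
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((T, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The additive update of the cell state c is what lets gradients flow across many time steps: when the forget gate stays close to one, earlier information is carried forward largely unchanged.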

LSTMs for Continuous EEG Classification

LSTMs have been employed for tasks where temporal context is paramount, such as classifying cognitive load, sleep stages, or imagined speech from EEG. For instance, in sleep staging, each 30‑second epoch of polysomnography data is scored in the context of patterns that evolve over minutes; an LSTM operating on a sequence of epochs can encode these longer‑range dependencies. Bidirectional LSTMs (Bi‑LSTMs) further improve performance by processing the signal both forward and backward in time, so that the representation of each time point draws on the context both before and after it.

One of the challenges of using LSTMs for brain signals is the high sampling rate (often 250 Hz or higher), which results in very long input sequences. To mitigate this, many architectures first downsample the raw signal or apply a convolutional frontend to reduce the temporal resolution before feeding the features to the recurrent layers. The combination of a CNN and an LSTM, usually called a CNN‑LSTM (and distinct from the ConvLSTM cell, which places convolutions inside the LSTM gates), has been especially effective for tasks like motor imagery, where the CNN extracts spatial patterns from each short time window and the LSTM tracks their evolution over several seconds. A comprehensive review of LSTM‑based EEG decoding can be found in a 2021 survey in Pattern Recognition.
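
The effect of a convolutional frontend on sequence length follows from the standard output‑length formula for a strided convolution. In the sketch below, the trial duration and the two (kernel, stride) pairs are hypothetical values picked only to show the arithmetic.

```python
def conv_output_length(n_in, kernel_size, stride=1, padding=0):
    """Number of output samples of a 1-D convolution or pooling layer."""
    return (n_in + 2 * padding - kernel_size) // stride + 1

fs, seconds = 250, 4
n_samples = fs * seconds                 # 1,000 time steps per trial

# Hypothetical frontend: two strided temporal layers as (kernel, stride).
frontend = [(25, 5), (10, 5)]

lengths = [n_samples]
for kernel_size, stride in frontend:
    lengths.append(conv_output_length(lengths[-1], kernel_size, stride))

print(lengths)  # [1000, 196, 38]
```

Here the recurrent layers would see 38 steps instead of 1,000, which shortens both training time and the distance over which gradients must propagate.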

GRU and Attention-Augmented RNNs

GRUs offer a simpler gating mechanism than LSTMs, with fewer parameters, and have shown comparable performance on many EEG datasets. Researchers have also augmented RNNs with attention, allowing the network to focus on the most informative time steps. For example, in a go/no‑go task, the model can learn to attend to the moment of stimulus presentation rather than the baseline preceding it. Attention mechanisms within RNN frameworks often serve as a stepping stone toward the fully attention‑based transformer architectures described next.
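
The parameter saving of a GRU relative to an LSTM can be worked out directly. The sketch below uses the textbook formulation with a single bias vector per block, so the counts are an assumption that will differ slightly from framework implementations that keep separate input and recurrent biases; the layer sizes are also illustrative.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Parameters of a single gated recurrent layer (textbook form).

    Each block has an input matrix, a recurrent matrix, and one bias vector.
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

input_size, hidden_size = 32, 64

# LSTM: input, forget, and output gates plus the candidate cell update.
lstm_params = gated_rnn_params(input_size, hidden_size, n_blocks=4)
# GRU: update and reset gates plus the candidate hidden state.
gru_params = gated_rnn_params(input_size, hidden_size, n_blocks=3)

print(lstm_params, gru_params, gru_params / lstm_params)
```

For these sizes the LSTM has 24,832 parameters and the GRU 18,624, a ratio of exactly three quarters that holds for any layer size under this formulation.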

Transformers and Attention Mechanisms

The transformer architecture, originally developed for natural language processing, has been adapted for brain signal decoding with impressive results. At its core, the transformer replaces recurrence with a self‑attention mechanism that computes weighted sums of all input positions, allowing the model to directly capture long‑range dependencies without the sequential bottleneck of RNNs. For EEG and MEG signals, transformers can attend to both spatial (across channels) and temporal (across time points) relationships, making them highly expressive.
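
A minimal single‑head version of scaled dot‑product self‑attention is sketched below in NumPy. The token count, the embedding sizes, and the random projection matrices are assumptions for this example; a full transformer layer adds multiple heads, residual connections, normalisation, and a feedforward block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head).
    Returns the attended values and the attention weight matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)       # each row sums to one
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 20, 32, 16       # e.g. 20 time windows of an EEG trial
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Every output row is a weighted sum over all input positions, so the first and last windows of a trial interact in a single step. The cost is the n_tokens × n_tokens weight matrix, which grows quadratically with sequence length.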

EEG‑Transformer and Variants

One of the earliest adaptations is the EEG‑Transformer, which tokenises the multichannel time series into patches (e.g., 1‑second windows) and then applies a standard encoder‑only transformer with positional embeddings. This architecture achieved state‑of‑the‑art results on motor imagery and emotion recognition benchmarks. To reduce computational complexity, many works apply a convolutional frontend to reduce the sequence length before the transformer layers. The EEG‑Transformer paper provides a full description of the design.
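
Patch tokenisation of this kind amounts to a reshape. The sketch below splits a channels × time array into non‑overlapping 1‑second windows and flattens each into a token; the function name patchify, the 22‑channel montage, and the 250 Hz sampling rate are assumptions, and the linear projection and positional embeddings that would follow are omitted.

```python
import numpy as np

def patchify(eeg, patch_len):
    """Split (n_channels, n_times) into non-overlapping temporal patches.

    Returns an array of shape (n_patches, n_channels * patch_len);
    trailing samples that do not fill a whole patch are dropped.
    """
    n_channels, n_times = eeg.shape
    n_patches = n_times // patch_len
    trimmed = eeg[:, :n_patches * patch_len]
    patches = trimmed.reshape(n_channels, n_patches, patch_len)
    return patches.transpose(1, 0, 2).reshape(n_patches, n_channels * patch_len)

rng = np.random.default_rng(0)
fs = 250
eeg = rng.standard_normal((22, 4 * fs))   # 22 channels, 4 s of data

tokens = patchify(eeg, patch_len=fs)      # 1-second windows
print(tokens.shape)
```

A 4‑second trial becomes a sequence of only four tokens, which shows why patching (or a convolutional frontend) keeps the quadratic cost of attention manageable.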

Another notable variant is the Vision Transformer (ViT) adapted for EEG by treating the 2D EEG image (channels × time) as a patch grid. The ViT approach has been used for cross‑subject seizure detection and sleep stage classification. More recent models, such as SPEECH‑TCNet and EEG‑Conformer, combine convolutional modules with transformer heads to balance local and global feature extraction.

Multi‑Modal Transformers and Cross‑Attention

Brain signal decoding often benefits from fusing multiple modalities, such as EEG and fMRI, or EEG and eye‑tracking. Transformers naturally support multi‑modal inputs by using separate encoders for each modality and then cross‑attending between them. For instance, a recent study on emotion recognition used a transformer to fuse EEG signals and peripheral physiological signals (ECG, GSR) via cross‑modal attention, achieving significantly higher accuracy than single‑modality baselines. This paradigm is promising for building robust BCIs that work in real‑world, noisy environments.

Hybrid Architectures: Combining CNNs, RNNs, and Transformers

No single architecture dominates all brain signal decoding tasks. The most effective contemporary models are often hybrids that exploit the strengths of each building block. A typical hybrid pipeline first applies a set of convolutional layers to extract local spatial features (e.g., from electrode montages) and short‑term temporal patterns (e.g., frequency bands). The output of the CNN is then fed into a recurrent or attention‑based module that captures longer‑range dependencies. Finally, a global pooling or dense layer produces the classification or regression output.

CNN‑LSTM and CNN‑Transformer

One of the most successful hybrid frameworks is the CNN‑LSTM. For example, on the public BCI Competition IV dataset 2a, a CNN with depthwise separable convolutions followed by a Bi‑LSTM achieved a kappa value above 0.70, significantly outperforming a CNN‑only baseline. Similarly, the CNN‑Transformer (sometimes called the Conformer) has been applied to raw EEG for speech decoding. The convolutional frontend reduces the sequence length and enriches local features, while the transformer captures global context. This design is particularly effective for high‑density EEG arrays (128 channels or more).
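
Since kappa is less familiar than accuracy, a short sketch of Cohen's kappa may help. It is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels below are invented for illustration.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) if chance agreement equals one.
    """
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical four-class example with balanced classes.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 3, 3, 2]

kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 3))
```

For a four‑class problem with balanced classes, chance agreement is 0.25, so a kappa of 0.70 corresponds to an accuracy of 0.70 × 0.75 + 0.25 = 0.775.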

Ensemble and Multi‑Scale Approaches

Hybrid models can also incorporate multi‑scale processing. For instance, a model might process the signal at three temporal resolutions (e.g., the original sampling rate plus versions downsampled by factors of 2 and 4) and combine the features via attention. This mimics the way the brain itself processes information at multiple timescales, from fast spiking to slow cortical rhythms. Ensembles of diverse architectures have also been used in BCI competitions, though they come at the cost of increased inference time – a factor often critical for real‑time systems.
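
The multi‑resolution input stage can be sketched as follows. Block averaging is used here as a simple stand‑in for proper decimation with an anti‑aliasing filter, and the function names and array sizes are assumptions; the attention‑based fusion of the three branches is not shown.

```python
import numpy as np

def downsample_mean(x, factor):
    """Average non-overlapping blocks of `factor` samples along time."""
    n_channels, n_times = x.shape
    n_keep = (n_times // factor) * factor
    return x[:, :n_keep].reshape(n_channels, -1, factor).mean(axis=2)

def multiscale_views(x, factors=(1, 2, 4)):
    """Return one view of the signal per temporal resolution."""
    return {factor: downsample_mean(x, factor) for factor in factors}

rng = np.random.default_rng(0)
trial = rng.standard_normal((8, 1000))    # 8 channels, 1,000 samples

views = multiscale_views(trial)
for factor, view in views.items():
    print(factor, view.shape)
```

Each view would typically pass through its own convolutional branch, with the coarser views covering a longer time span per filter tap.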

Future Directions and Open Challenges

Despite the impressive progress, several challenges remain before neural network decoding of brain signals becomes clinically and commercially viable.

Transfer Learning and Generalisation

A major bottleneck is the poor generalisation across subjects, sessions, and devices. Most deep learning models are trained and evaluated on data from a single subject (or small cohort) and fail when applied to a new user. Transfer learning methods aim to adapt a pre‑trained model to a new subject with minimal fine‑tuning. Approaches include domain adaptation (e.g., aligning feature distributions via adversarial training or maximum mean discrepancy) and parameter‑efficient fine‑tuning from large foundation models. Large public corpora such as the TUH EEG Corpus are enabling the pre‑training of general‑purpose EEG encoders, analogous to BERT or GPT for text. The "EEG‑BERT" proposal is one early example of this trend.
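
As an example of the distribution‑alignment idea, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two sets of feature vectors using an RBF kernel. The kernel width, the synthetic "source" and "target" features, and the function names are assumptions; in a real system this quantity would be added to the training loss so that the encoder learns subject‑invariant features.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = ((a ** 2).sum(axis=1)[:, None]
                + (b ** 2).sum(axis=1)[None, :]
                - 2 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return float(rbf_kernel(source, source, gamma).mean()
                 + rbf_kernel(target, target, gamma).mean()
                 - 2 * rbf_kernel(source, target, gamma).mean())

rng = np.random.default_rng(0)
source = rng.standard_normal((200, 4))          # features from the training subject
same = rng.standard_normal((200, 4))            # new subject, same distribution
shifted = rng.standard_normal((200, 4)) + 1.0   # new subject, shifted features

mmd_same = mmd2(source, same, gamma=0.1)
mmd_shifted = mmd2(source, shifted, gamma=0.1)
print(mmd_same, mmd_shifted)
```

The estimate is close to zero when the two feature sets come from the same distribution and grows as they drift apart, which is what makes it usable as an alignment penalty.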

Unsupervised and Self‑Supervised Learning

Labelled brain signal data is expensive and time‑consuming to acquire. Unsupervised and self‑supervised methods learn meaningful representations from unlabelled signals, which can then be transferred to downstream tasks with fewer labels. Contrastive learning, predictive coding, and masked autoencoding have all been applied to EEG and MEG. For example, the SimCLR‑EEG framework learns invariances to noise and electrode shifts, enabling robust features for sleep stage classification with less than 10% of the original labelled data. This direction is crucial for widening the applicability of deep neural networks in real‑world clinical settings.
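
The contrastive objective used by SimCLR‑style methods, the normalised temperature‑scaled cross‑entropy (NT‑Xent) loss, can be sketched in NumPy as below. The embeddings are random stand‑ins for encoder outputs, and the temperature and sizes are assumptions; the EEG‑specific augmentations and the encoder itself are outside the scope of this example.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for N positive pairs.

    z1[i] and z2[i] are embeddings of two augmented views of trial i;
    all other embeddings in the batch act as negatives.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), positives]))

rng = np.random.default_rng(0)
z1 = rng.standard_normal((16, 8))
z2_aligned = z1 + 0.01 * rng.standard_normal((16, 8))  # views that agree
z2_random = rng.standard_normal((16, 8))               # unrelated views

loss_aligned = nt_xent(z1, z2_aligned)
loss_random = nt_xent(z1, z2_random)
print(loss_aligned, loss_random)
```

The loss is lower when the two views of each trial map to nearby embeddings, so minimising it pushes the encoder to ignore whatever the augmentations change, such as added noise or small electrode shifts.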

Real‑Time and Low‑Latency Decoding

For many BCI applications, such as cursor control or neuroprosthetic limbs, the decoding must happen in real time with minimal latency. Large transformer models can be computationally demanding. Researchers are exploring efficient architectures, including spiking neural networks that mimic biological neurons and operate on event‑based data, and knowledge distillation to compress large models into lightweight versions suitable for mobile or embedded hardware. Additionally, neuromorphic chips (e.g., Intel Loihi, BrainChip Akida) promise to run brain signal decoding at extremely low power, making wearable BCIs feasible.
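
Knowledge distillation trains a small student model to match the softened output distribution of a large teacher. The sketch below implements the commonly used distillation term, a KL divergence between temperature‑softened distributions scaled by the squared temperature. The logits and the temperature are made‑up values, and in practice this term is usually combined with the ordinary cross‑entropy on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical four-class logits for one trial.
teacher = [4.0, 1.0, 0.5, -1.0]
student = [2.0, 1.5, 0.0, -0.5]

loss = distillation_loss(student, teacher)
print(loss)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the incorrect classes rather than only from its top prediction.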

Explainability and Trust

Clinicians and end‑users need to trust a decoding system, especially when it influences medical decisions. Deep learning models are often called black boxes, but recent work in explainable AI (XAI) for brain signals has produced saliency maps, layer‑wise relevance propagation (LRP), and attention visualisations that highlight which time points or electrodes the model is focusing on. Such explanations not only build trust but can also reveal physiologically relevant features – for example, that a detector is focusing on a specific frequency band or a particular brain region. Standardised evaluation of explainability methods for neural decoding is an active research area.
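
One simple way to obtain an electrode‑level relevance map is occlusion, a perturbation‑based alternative to the gradient‑based methods named above that needs no access to the model's internals. The sketch below uses a toy decoder whose score depends only on the power of one channel; that decoder, the function names, and the zero baseline are assumptions made purely for demonstration.

```python
import numpy as np

def occlusion_saliency(predict, trial, baseline=0.0):
    """Channel-wise occlusion saliency.

    predict: function mapping a (n_channels, n_times) trial to a scalar score.
    Returns the drop in score when each channel is replaced by `baseline`.
    """
    reference = predict(trial)
    saliency = np.zeros(trial.shape[0])
    for ch in range(trial.shape[0]):
        occluded = trial.copy()
        occluded[ch] = baseline
        saliency[ch] = reference - predict(occluded)
    return saliency

def toy_decoder(trial):
    """Hypothetical decoder: the score is the mean power of channel 3."""
    return float((trial[3] ** 2).mean())

rng = np.random.default_rng(0)
trial = rng.standard_normal((8, 250))     # 8 channels, 1 s at 250 Hz

saliency = occlusion_saliency(toy_decoder, trial)
print(np.round(saliency, 3))
```

Only channel 3 receives a non‑zero score here, as expected for this toy decoder. With a trained network the same loop can be run over electrodes, time windows, or frequency bands, at the cost of one forward pass per occluded region.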

Multi‑Task and Multi‑Subject Modelling

Current models are usually trained for a single task (e.g., motor imagery or seizure detection). Multi‑task learning, where a shared representation is trained on several related tasks simultaneously, has shown promise for improving individual task performance. Similarly, multi‑subject models that learn a common anatomical and functional representation across subjects could enable zero‑shot transfer: decoding a novel subject without any calibration data. This would be a paradigm shift for BCIs, eliminating the tedious calibration sessions currently required.

Conclusion

Neural network architectures for brain signal decoding have evolved from simple fully connected networks to sophisticated hybrids of convolutional, recurrent, and attention‑based designs. CNNs excel at capturing spatial patterns; RNNs and LSTMs model temporal dependencies; transformers provide powerful global context and support multi‑modal fusion. The most effective modern models combine these elements, often with transfer and self‑supervised learning to overcome data scarcity. Challenges such as generalisation, real‑time performance, and interpretability remain, but the pace of innovation is accelerating. As larger labelled datasets and computational resources become available, we can expect neural network decoders to become a standard tool in neuroscience research and a key component of next‑generation brain‑computer interfaces. The ultimate goal – a robust, real‑time, and subject‑independent decoder – is within reach.