EmoNet 2026: Speaker-Aware Transformers For Emotion Recognition — Lessons And Future Improvements

I completed my master’s thesis on Emotion Recognition in Conversation (ERC). My model, EmoNet, achieved a Weighted F1 of 39.18 on EmoryNLP — a competitive result compared to the PapersWithCode leaderboard at the time, placing it between TUCORE-GCN_RoBERTa (39.24) and S+PAGE (39.14). It also outperformed my selected baseline model, CoMPM, by +1.81 F1.

Two years on, I revisited the field to see how things had evolved. The leaderboard looked completely different. The top entries were no longer encoder-only models with intricate attention mechanisms. They had become LLaMA-2-7B-based systems using LoRA fine-tuning and retrieval-augmented prompting: InstructERC, CKERC, BiosERC, LaERC-S. The techniques were different. The computational demands were different. The overall approach was different.

And yet, after closely reading these newer papers, I found the core ideas I introduced in EmoNet present within them, just applied at a different level of the architecture. This is the story of what I built, how it measured up, and what I would do differently if I were starting today.

Understanding ERC and the challenge of text-only approaches

Emotion Recognition in Conversation involves assigning an emotion label to each statement in a multi-turn dialogue. It differs from standard sentiment analysis on isolated sentences in a crucial way: how someone feels in a given moment depends on what was said before and on who is speaking.

Take this exchange from the EmoryNLP dataset (drawn from the TV show Friends):

Monica: Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy! [Mad]

Rachel: Who was that? [Neutral]

Monica: Wendy bailed. I have no waitress. [Mad]

Taken alone, “Who was that?” carries no particular emotion. The Neutral label only makes sense in context — it sits between two frustrated utterances from Monica, and an ERC model needs to understand that conversational dynamic to label it correctly.

There’s another layer of difficulty: cues from voice, facial expressions, and body language are absent. In real conversations, tone and body language convey a huge portion of emotional meaning. Text-only ERC removes all of that. The same words — “Oh, great.” — can be genuine or sarcastic, and written text alone often can’t resolve the ambiguity.

This loss of information is the fundamental obstacle. Models must detect emotion from a signal that is far less expressive than what humans naturally rely on.

The 2024 ERC landscape

When I began my thesis in late 2023, the EmoryNLP leaderboard was filled with transformer-based models, each with its own creative twist. A brief overview:

– KET (Zhong et al., 2019) — a knowledge-augmented transformer using affective graph attention; the first work to apply transformers to ERC.

– DialogueGCN (Ghosal et al., 2019) — a graph convolutional network that reframed dialogues as node-classification tasks.

– RGAT (Ishiwatari et al., 2020) — a relation-aware graph attention model using relational position encoding to capture speaker relationships.

– DialogXL (Shen et al., 2020) — an XLNet adaptation incorporating utterance recurrence and dialogue-level self-attention.

– HiTrans (Li et al., 2020) — a hierarchical transformer that used pairwise speaker verification as a secondary training objective.

– TUCORE-GCN (Lee & Choi, 2021) — a heterogeneous dialogue graph paired with speaker-aware BERT.

– CoMPM (Lee & Lee, 2021) — combined dialogue context with a pre-trained memory module that tracked each speaker’s state.

I selected CoMPM as my starting point for two reasons. First, it explicitly represented each speaker’s pre-trained memory in a dedicated module — which aligned with my belief that who is talking is just as important as what they’re saying. Second, its modular design made it easy to extend without redesigning the whole system. The original CoMPM paper demonstrated that bolting pre-trained speaker memory onto the context model produced a clear improvement — but speaker identity was still confined to a single dialogue. As soon as a new conversation started, everything the model had learned about a speaker was thrown away.

That felt like a gap worth addressing.

Three core contributions, explained intuitively

1. Global Speaker Identity

The issue. In CoMPM and most earlier approaches, speaker IDs existed only within a single dialogue. Speaker A in scene 1 had no connection to Speaker A in scene 14, even when they referred to the same character. Each dialogue started with a blank slate.

The reasoning. People express emotions in consistent, recognizable ways. Monica tends to get upset about particular situations; Phoebe is almost always upbeat; Ross has a predictable streak of self-doubt. If a model could retain knowledge about this particular speaker across different conversations, it could make more informed predictions whenever that speaker showed up again.

The solution. Every unique speaker across the entire dataset receives a consistent, dataset-wide ID. The first time Monica Geller appears, she gets assigned an ID — say, ID 7 — and keeps it. Every later appearance — spanning different episodes, seasons, or scenes — still maps to ID 7. This lets the model learn speaker-specific emotional patterns that carry over.

This may seem obvious in hindsight. In 2024, none of the leading models on the leaderboard worked this way.
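To make the contrast concrete, here is a minimal sketch of dataset-wide versus per-dialogue speaker IDs. It is an illustration only, not the thesis code: the class name `GlobalSpeakerRegistry`, the helper `local_ids`, and the two toy scenes are all hypothetical.

```python
class GlobalSpeakerRegistry:
    """Assigns each speaker name one ID that persists across dialogues."""

    def __init__(self):
        self._ids = {}

    def get_id(self, name):
        # The first appearance allocates the next free ID; later ones reuse it.
        if name not in self._ids:
            self._ids[name] = len(self._ids)
        return self._ids[name]


def local_ids(speakers):
    """Per-dialogue IDs: numbering restarts from zero in every dialogue."""
    seen = {}
    return [seen.setdefault(name, len(seen)) for name in speakers]


# Toy speaker sequences for two scenes (hypothetical data).
scene_1 = ["Monica Geller", "Rachel Green", "Monica Geller"]
scene_14 = ["Ross Geller", "Monica Geller"]

registry = GlobalSpeakerRegistry()
global_scene_1 = [registry.get_id(name) for name in scene_1]
global_scene_14 = [registry.get_id(name) for name in scene_14]

print(global_scene_1, global_scene_14)                # [0, 1, 0] [2, 0]
print(local_ids(scene_1), local_ids(scene_14))        # [0, 1, 0] [0, 1]
```

With local numbering, Monica is speaker 0 in the first scene and speaker 1 in the second, so nothing ties the two together. With the registry she is 0 in both.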

2. Speaker Behaviour Module

The issue. A Global Speaker Identity by itself is just an identifier. To make it meaningful, the model must use the speaker’s accumulated experience. How do you give a transformer access to “everything Monica has ever said across this dataset,” without exceeding the context window or making training impossibly slow?

The reasoning. A GRU is well-suited for compressing a speaker’s past utterances into a single fixed-size vector. More recent contributions carry more weight; older ones fade over time. A configurable sliding window limits how many of the most recent utterances by a speaker feed into the GRU — keeping computational costs and memory usage manageable.
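The sliding window can be sketched with a bounded queue per speaker. This is an assumption about one reasonable way to do it, not a description of the thesis implementation: `WINDOW_SIZE`, `speaker_history`, and both helper functions are hypothetical names, and the window length of 3 is arbitrary.

```python
from collections import defaultdict, deque

WINDOW_SIZE = 3  # hypothetical value; the window length is configurable

# One bounded buffer per global speaker ID. When a buffer is full,
# appending a new utterance silently drops the oldest one.
speaker_history = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))


def recent_utterances(speaker_id):
    """Return the stored utterances for a speaker, oldest first."""
    return list(speaker_history[speaker_id])


def record_utterance(speaker_id, utterance):
    """Add an utterance to the speaker's history after it has been processed."""
    speaker_history[speaker_id].append(utterance)


for i in range(1, 6):
    record_utterance(7, f"utterance {i}")
record_utterance(2, "other speaker")

print(recent_utterances(7))  # ['utterance 3', 'utterance 4', 'utterance 5']
```

In a real pipeline the caller would read `recent_utterances` before calling `record_utterance` for the current turn, so the history never includes the utterance being classified.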

The solution. Each utterance is independently encoded by a pre-trained RoBERTa backbone. The resulting embeddings are passed into a unidirectional GRU, whose final hidden state, referred to as `kt`, serves as a snapshot of the speaker’s behavioral signature at that point in the dialogue. This representation is then projected into the same dimensional space as the dialogue context output, where the two are merged. The combined signal is passed to the final classification layer.

In terms of structure, this architecture bears a close resemblance to the pre-trained memory module found in CoMPM, with two distinct modifications: the speaker-history pool is global (rather than being restricted to the current conversation), and the GRU is designed to explicitly model temporal decay.
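The data flow described above can be sketched in PyTorch as follows. Treat it as a rough illustration under stated assumptions rather than the actual EmoNet code: random tensors stand in for RoBERTa utterance embeddings, the merge is assumed to be element-wise addition (the text only says the two signals are merged), the layer names `W_p` and `W_o` are borrowed from the figure caption with my own guess at which is which, and the class name is hypothetical.

```python
import torch
import torch.nn as nn


class SpeakerBehaviourSketch(nn.Module):
    """Illustrative only: GRU over a speaker's recent utterance embeddings."""

    def __init__(self, embed_dim=1024, num_classes=7):
        super().__init__()
        # Unidirectional GRU; its final hidden state plays the role of k_t.
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # W_p: projects k_t into the dialogue-context space (assumed same size).
        self.W_p = nn.Linear(embed_dim, embed_dim)
        # W_o: final classification layer over the merged representation.
        self.W_o = nn.Linear(embed_dim, num_classes)

    def forward(self, context_vec, history_embs):
        # context_vec:  (batch, embed_dim), the dialogue context output
        # history_embs: (batch, window, embed_dim), oldest utterance first
        _, h_n = self.gru(history_embs)         # h_n: (1, batch, embed_dim)
        k_t = h_n[-1]                           # (batch, embed_dim)
        merged = context_vec + self.W_p(k_t)    # assumption: merge by addition
        return self.W_o(merged)                 # unnormalised class logits


torch.manual_seed(0)
model = SpeakerBehaviourSketch(embed_dim=32, num_classes=7)
context = torch.randn(2, 32)      # stand-in for the context module output
history = torch.randn(2, 5, 32)   # stand-in for 5 encoded past utterances
logits = model(context, history)
print(logits.shape)               # torch.Size([2, 7])
```

Because the GRU reads the window oldest-first, the most recent utterances are the last to update the hidden state, which is where the recency bias comes from.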

Figure: EmoNet Architecture (Image by author). This model is composed of two primary components: a Dialogue Context Embedding Module and a Speaker Behaviour Module. The diagram illustrates an example of predicting the emotion of utterance u6 within a 6-turn dialogue context. A, D, and Y denote participants in the conversation, where SA = Su1 = Su4 = Su6, SD = Su2, and SY = Su3 = Su5. Wo and Wp represent linear transformation matrices.

3. Weighted Cross-Entropy Loss

The problem. The EmoryNLP dataset suffers from significant class imbalance — the Neutral class outnumbers Sad by approximately 4.5 to 1. The standard approach in most papers involves data augmentation or under-sampling techniques. However, conversational data is inherently sequential: removing or repeating utterances disrupts the natural progression of emotional dynamics, which is precisely the signal the model needs to capture.

The intuition. If the data itself cannot be safely altered, the loss function should be adjusted instead. By assigning higher weights to underrepresented classes, misclassifying a Sad instance incurs a proportionally greater penalty than misclassifying a Neutral one.

The implementation. The loss is standard cross-entropy with per-class weights computed from inverse class frequencies and then normalized. The technique itself is not novel, but grounding it in the conversational-sequence argument turns it from an arbitrary choice into a well-motivated design decision.
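A small worked example may help show what inverse-frequency weighting does to the loss. The class counts below are invented toy numbers (only the 4.5-to-1 Neutral-to-Sad ratio mirrors the text), and normalising the weights so that they average to 1 is one common choice that I am assuming here; the exact normalisation used in the thesis is not stated. Function names are hypothetical.

```python
import math


def inverse_frequency_weights(counts):
    """Weight each class by 1/count, then rescale so the mean weight is 1."""
    inverse = {label: 1.0 / n for label, n in counts.items()}
    scale = len(inverse) / sum(inverse.values())
    return {label: w * scale for label, w in inverse.items()}


def weighted_cross_entropy(probs, gold, weights):
    """Loss for one example: -w[gold] * log p(gold)."""
    return -weights[gold] * math.log(probs[gold])


# Toy counts, NOT the real EmoryNLP statistics.
toy_counts = {"Neutral": 450, "Joy": 300, "Sad": 100}
weights = inverse_frequency_weights(toy_counts)

# The same predicted probability for the gold class costs more when the
# gold class is rare.
probs = {"Neutral": 0.2, "Joy": 0.6, "Sad": 0.2}
loss_if_gold_sad = weighted_cross_entropy(probs, "Sad", weights)
loss_if_gold_neutral = weighted_cross_entropy(probs, "Neutral", weights)

print(weights["Sad"] / weights["Neutral"])        # about 4.5
print(loss_if_gold_sad / loss_if_gold_neutral)    # about 4.5
```

In PyTorch, a weight vector like this can be passed through the `weight` argument of `torch.nn.CrossEntropyLoss`, so no custom loss code is needed.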

Results: What Worked and What Surprised Me

Below is the ablation study table extracted from the thesis:

The finding that genuinely surprised me — and what I consider the most candid aspect of this entire work — is visible in the second row. Incorporating Global Speaker Identity on its own caused a significant performance degradation (F1 dropped from 37.85 to 29.43). At first glance, this appeared to be a clear failure.

However, it was not. Global Speaker Identity is fundamentally a capability — it equips the model with the capacity to learn long-term speaker behavioral patterns. Without additional structural support, that capability introduced a representational burden that the rest of the model was unable to handle. It was only after the Speaker Behaviour module was integrated — providing the model with a structured mechanism for leveraging those global identities — that the underlying value of the feature emerged. By the time the full configuration was assembled, EmoNet had not only recovered but exceeded the CoMPM baseline by 1.81 points in F1 score.

This is the central takeaway from the ablation study: a feature does not deliver value in isolation; its value is realized only when paired with the mechanisms designed to consume it. Research papers reporting “this component contributed +X%” frequently omit ablation rows where the component, taken alone, actively harmed performance. I deliberately chose to retain that row.

The complete model performed well on Neutral, Joy, and Scared. Powerful proved to be the most challenging class — in part due to its scarcity, and in part because Powerful and Joy are nearly impossible to distinguish through textual cues alone without acoustic information. This is, at its core, a multimodal challenge disguised as a purely textual problem.

Reflection (2026): The Field Moved, and So Should We

Looking back two years later, the EmoryNLP leaderboard has been completely reshaped. The top-performing systems today include:

– InstructERC (Lei et al., 2023) — reframes Emotion Recognition in Conversation as a generative large language model task. It leverages retrieval-augmented instruction templates along with auxiliary objectives such as speaker identification and emotion inference, enabling the model to better capture dialogue roles and emotional dynamics.

– CKERC (Fu, 2024) — introduces commonsense-augmented ERC. For each utterance, an LLM produces commonsense annotations regarding speaker intent and probable listener reactions, supplying implicit social and emotional reasoning that extends beyond the surface-level dialogue context.

– BiosERC (Xue et al., 2024) — integrates LLM-generated biographical speaker profiles into the ERC pipeline, enabling the model to reason not only over conversational context but also over speaker-specific personality traits.

– LaERC-S (Fu et al., 2025) — employs a two-stage instruction-tuning process. Stage 1: instills speaker-specific characteristics into the LLM. Stage 2: leverages those learned characteristics during the core ERC task.

Examine those last two entries carefully.

BiosERC’s speaker biographical profiles are, conceptually, a scaled-up version of Global Speaker Identity — rather than a simple integer ID, they offer a rich textual description that the LLM can directly attend to. LaERC-S’s speaker characteristics are, in essence, an evolution of the Speaker Behaviour module — providing the model with historical speaker behavioral patterns — but implemented through instruction tuning rather than as a dedicated GRU component.

The underlying architectural intuitions from EmoNet have held up well. What has changed is the implementation layer.

This is the aspect I find genuinely compelling. While developing EmoNet in 2024, my thinking was firmly rooted in the encoder-only transformer paradigm: “how do I bolt another module onto this architecture?” The 2024–2025 papers operate within the LLM paradigm: “how do I encode this concept into instruction tuning or retrieval context?” The core ideas are closely related; the leverage points have shifted.

If I were to rebuild EmoNet today, I would not begin with RoBERTa-large. I would start from a compact open-source LLM (such as LLaMA-3.2-3B, Qwen-2.5-3B, or Phi-3.5) and apply LoRA fine-tuning on EmoryNLP, following methodologies from the InstructERC family. Global Speaker Identity would be realized as a textual speaker biography retrieved from a vector database. The Speaker Behaviour module would take the form of a few-shot prompt incorporating the speaker’s most recent emotional trajectory. The Weighted Loss would remain largely unchanged, since class imbalance is agnostic to the choice of model architecture.
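As a rough idea of what the prompt side of that rebuild could look like, here is a hypothetical prompt builder. Nothing in it comes from an existing system: the function name, the prompt wording, and the placeholder biography are all invented for illustration, a plain dictionary stands in for the vector database, and the label names are simply the ones used in this post.

```python
def build_erc_prompt(speaker, biography, trajectory, context, target, labels):
    """Assemble a plain-text prompt from speaker profile, history, and dialogue."""
    lines = [
        f"Speaker profile for {speaker}: {biography}",
        f"Recent emotions of {speaker}, oldest first: {', '.join(trajectory)}",
        "Conversation so far:",
    ]
    lines += [f"{name}: {text}" for name, text in context]
    lines += [
        f"Target utterance by {speaker}: {target}",
        f"Choose one label from: {', '.join(labels)}",
        "Emotion:",
    ]
    return "\n".join(lines)


# Stand-in for a retrieval step; the biography text is a placeholder.
biographies = {"Monica": "Tends to get upset when plans fall through."}

prompt = build_erc_prompt(
    speaker="Monica",
    biography=biographies["Monica"],
    trajectory=["Mad"],  # label of her earlier turn in the example dialogue
    context=[
        ("Monica", "Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy!"),
        ("Rachel", "Who was that?"),
    ],
    target="Wendy bailed. I have no waitress.",
    labels=["Neutral", "Joy", "Scared", "Powerful", "Mad", "Sad"],
)
print(prompt)
```

The profile line plays the role of Global Speaker Identity and the trajectory line plays the role of the Speaker Behaviour module, with the LLM's attention doing the merging that a dedicated projection layer used to do.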

The architecture diagram would look entirely different on the surface. Yet the conceptual lineage traceable back to the 2024 thesis would still be visible to anyone who knows where to look.

Revisiting the field taught me that research ideas have a much longer half-life than I had anticipated: concepts endure across paradigm shifts even when their specific implementations do not.

Where Things Stand Now

EmoNet is now publicly archived under DOI 10.5281/zenodo.20048006, with the full thesis, defense slides, and PyTorch implementation available on GitHub. I am currently working on the modernized version — a LoRA-fine-tuned LLM augmented with retrieval-based speaker context — as a follow-up project that I plan to document in an upcoming write-up.

If you are actively working in conversational AI, applied NLP, or LLM fine-tuning, I would love to hear about what you are building.
