On 22 June 2023, Earkick proudly presented its work at the prestigious Conference on Computer Vision and Pattern Recognition, CVPR’23, in Vancouver, Canada.
Alongside us, the conference brought together many of the most influential researchers in AI, sharing pioneering ideas on taking machine learning (ML) systems to the next level and shaping the future of human-machine interaction.
The work we presented is a data-centric AI approach to improving the accuracy and reliability of speech-based emotion recognition systems. The term “data-centric AI” was coined by one of the most prestigious names in AI, Andrew Ng, a professor at Stanford and co-founder of Coursera. In a data-centric AI approach, models are constantly upgraded, and the data used to create them is continuously scrutinized.
Affective Computing and Unveiling Dataset Insights
At Earkick we are building state-of-the-art systems for mental health estimation, in real-time and “in-the-wild”. For that, we measure mental health from different modalities, such as video, speech, and text (our CVPR paper focuses on the speech component of our system, as depicted in the illustration below).

To do so, we began our research by looking into affective computing, a field of machine learning where algorithms try to infer the behavioral state of a person, e.g. their emotional state, as observed in their speech (generally, audio) signals.
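To give a rough feel for what such a system does under the hood, the sketch below turns speech clips into fixed-size feature vectors (mean-pooled MFCCs) and fits a simple classifier on emotion labels. It is a minimal stand-in, not our production pipeline: the random waveforms, the label set, and the choice of MFCCs plus logistic regression are all illustrative assumptions.

```python
# Minimal sketch: classify emotion from speech using mean-pooled MFCC features.
# The waveforms below are random stand-ins; in practice you would load real
# audio clips (e.g., with librosa.load) and use their annotated emotion labels.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

SR = 16000  # assumed sampling rate

def featurize(waveform: np.ndarray, sr: int = SR) -> np.ndarray:
    """Mean-pool MFCCs over time to get a fixed-size clip descriptor."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=40)
    return mfcc.mean(axis=1)

# Hypothetical training set: 3-second random clips with made-up labels.
rng = np.random.default_rng(0)
labels = ["happy", "sad", "angry", "neutral"] * 5
clips = [rng.standard_normal(3 * SR).astype(np.float32) for _ in labels]

X = np.stack([featurize(c) for c in clips])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```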
By examining four prominent datasets in speech emotion recognition, including CMU-MOSEI, CREMA-D, RAVDESS, and Aff-Wild2, we identified intriguing features of emotion labels that demanded further investigation.
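As a flavor of that examination, here is a minimal label-audit sketch: it prints the emotion class distribution and an average per-clip annotator agreement. The tiny toy table stands in for a real label file; each of the four corpora ships its annotations in its own format, so the clip_id/annotator_id/emotion layout is an assumption for illustration.

```python
# Minimal sketch of the kind of label audit we ran on each corpus: how are the
# emotion classes distributed, and how often do annotators agree on a clip?
# The toy table below stands in for a real annotation file.
import pandas as pd

toy_labels = pd.DataFrame({
    "clip_id":      ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "annotator_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "emotion":      ["happy", "happy", "neutral",
                     "sad", "sad", "sad",
                     "angry", "neutral", "happy"],
})

# Class distribution across all annotations.
print(toy_labels["emotion"].value_counts(normalize=True))

# Per-clip majority agreement: fraction of annotators voting for the top label.
agreement = (
    toy_labels.groupby("clip_id")["emotion"]
              .agg(lambda votes: votes.value_counts(normalize=True).iloc[0])
)
print("mean majority agreement:", round(agreement.mean(), 2))
```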
Challenging the Norm: Critical Questions Answered
Driven by our commitment to advancing the field, we meticulously re-labeled 1% of the aforementioned datasets; a sketch of how the new labels can be compared against the originals follows the list below. This endeavor aimed to shed light on crucial questions such as:
- Are there emotions that are easier to recognize than others?
- Is it easier to recognize acted emotions or real emotions?
- How important is emotion strength in our ability to recognize it? Is there a “threshold” for detecting emotion reliably?
- Would an ML model trained on acted data generalize to detecting emotions in-the-wild, and vice versa?
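The sketch below shows one way a re-labeled subset can be compared against the original annotations, using Cohen’s kappa as the agreement measure. The toy labels are made up; in our study the new labels came from a manual re-annotation pass over roughly 1% of each dataset, and the exact tooling differs from this illustration.

```python
# Minimal sketch of comparing a re-labeled subset against the original
# annotations, using Cohen's kappa as the agreement measure. The toy labels
# below are made up for illustration only.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

original = pd.DataFrame({
    "clip_id": ["a", "b", "c", "d", "e", "f"],
    "emotion": ["happy", "sad", "angry", "neutral", "happy", "sad"],
})
relabeled = pd.DataFrame({
    "clip_id": ["a", "b", "c", "d", "e", "f"],
    "emotion": ["happy", "sad", "neutral", "neutral", "happy", "angry"],
})

merged = original.merge(relabeled, on="clip_id", suffixes=("_orig", "_new"))
kappa = cohen_kappa_score(merged["emotion_orig"], merged["emotion_new"])
print(f"agreement with original labels (Cohen's kappa): {kappa:.2f}")
```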
Data-Centric AI in Action: Unlocking Fascinating Insights
Adhering to Andrew Ng’s data-centric approach, we not only addressed these pivotal questions but also assessed the consistency of our datasets, identified potential issues with input and output definitions, and uncovered limitations of public datasets.
Surprisingly, our findings revealed no substantial difference in our ability to judge emotion in acted speech versus naturally evoked, in-the-wild emotion. This was further reflected in the similar degree of knowledge transfer from acted to in-the-wild datasets and vice versa.
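The sketch below illustrates the kind of cross-corpus check behind this observation: train on features from an acted corpus, evaluate on an in-the-wild corpus, then swap the roles. The synthetic feature matrices, the logistic-regression model, and balanced accuracy as the metric are stand-ins for illustration, not the models or numbers from the paper.

```python
# Minimal sketch of a cross-corpus transfer check: fit on one corpus,
# score on the other, then swap. The random arrays stand in for precomputed
# clip features (e.g., MFCC descriptors) and their emotion labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X_acted, y_acted = rng.standard_normal((200, 40)), rng.integers(0, 4, 200)
X_wild,  y_wild  = rng.standard_normal((150, 40)), rng.integers(0, 4, 150)

def transfer_score(X_train, y_train, X_test, y_test) -> float:
    """Train on one corpus, report balanced accuracy on the other."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return balanced_accuracy_score(y_test, model.predict(X_test))

print("acted -> wild:", round(transfer_score(X_acted, y_acted, X_wild, y_wild), 2))
print("wild -> acted:", round(transfer_score(X_wild, y_wild, X_acted, y_acted), 2))
```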
Discovering the Sweet Spot
The most exciting revelation of our work, however, was pinpointing the sweet spot between the number of samples used to build our models and the amount of noise in those samples. This addresses a critical aspect of data-centric AI approaches: identifying and removing these noisy examples led to a remarkable 10% improvement over the baseline model that utilizes all examples.
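A minimal sketch of that filtering step, assuming a per-clip agreement score is available alongside each label: clips whose agreement falls below a threshold are dropped before training, and the remaining clean subset feeds the same pipeline as the baseline. The toy table and the 0.5 threshold are illustrative assumptions, not the exact criterion or numbers from the paper.

```python
# Minimal sketch of the data-centric filtering step: keep only training clips
# whose annotation agreement clears a threshold, then retrain on the clean
# subset. The toy table and threshold below are illustrative assumptions.
import pandas as pd

labels = pd.DataFrame({
    "clip_id":   ["a", "b", "c", "d", "e"],
    "emotion":   ["happy", "sad", "angry", "neutral", "happy"],
    "agreement": [1.0, 0.66, 0.33, 0.9, 0.4],  # fraction of annotators agreeing
})

THRESHOLD = 0.5  # assumed cut-off for "reliable enough" labels
clean = labels[labels["agreement"] >= THRESHOLD]
print(f"kept {len(clean)}/{len(labels)} clips after removing noisy labels")
# clean[["clip_id", "emotion"]] then replaces the full table when retraining;
# the baseline simply uses all of `labels`.
```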
Access the Full Report
Curious to learn how we did this and how you too can take a data-centric approach to improve your datasets and models? Then read our full report “Analysis of Emotion Annotation Strength Improves Generalization in Speech Emotion Recognition Models”, freely available online.