
Our mission at Earkick is to support everyone’s mental health by assessing it and providing guidance for improvement whenever needed.
Since our app serves as a tool for positive change—in other words, an intervention that influences members’ thought patterns or behavior—we aim to measure these changes, track progress, and share our findings publicly to ensure full transparency.
The most reliable scientific method for validating a mental health app is a “randomized controlled trial” (RCT). In an RCT, study participants are randomly assigned to one of two groups: one group receives the full app experience (the treatment), while the other serves as a control, receiving either no app or a simplified version lacking the main features of interest. During random assignment, measures are taken to balance participant attributes such as age, gender, and socio-economic status, so that the two groups are comparable.
Over a set period (e.g. two weeks), participants use either the app or the placebo/control app. Behavioral outcomes are measured before and after the intervention period and compared to obtain an effect size. RCTs are considered the gold standard for determining the causal effect of an intervention at the population level.
We’re eager to conduct an RCT as soon as possible. In the meantime, we focus on exploring our data to uncover insights and trends that help us understand how our app members are responding to the app experience.
While these insights are still observational and correlational rather than causal, they can offer valuable perspectives.
Trends In Mood And Anxiety
When app members complete a “check-in”, we use machine learning models to assess their mood (on a five-point Likert scale, from terrible to great) and anxiety level (on a ten-point Likert scale, from very low to very high). We also predict emotions, topics, and symptoms to provide a quantifiable snapshot of how they are feeling at that moment. Our first goal was to understand how mood and anxiety levels change over time for members who used the app between 1 January 2023 and 10 September 2024.
Figure 1 shows the normalized mood and anxiety trends over time, with each point representing the average across members. Each step along the time axis corresponds to a day when an app member completed a check-in, so the time isn’t absolute; members may have joined the app at different points and checked in at varying intervals. Representing time by days of check-ins allows us to show all members on the same timeline without complex time resampling and windowing approaches.
As some members stop using the app over time, fewer members contribute data as days progress. To keep the variance low in the mean mood and anxiety estimates, we only include days with at least 50 check-ins. This resulted in data spanning from 1 to 421 days of use. The most “engaged” members are those who continued performing check-ins up to the final days of the timeframe.
Figure 1 shows that average mood gradually improves, while the average anxiety decreases over time, but only for members who continued using the app and checking in for at least 100 days. To inspect the trend, we plot a smoothed curve for both the mood and anxiety data points using a “loess” transformation.
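As an illustration, a loess-style smoothed curve over the daily averages can be computed with the `lowess` function from statsmodels; the per-day mood values below are synthetic stand-ins, not the real Earkick data:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic stand-in for the per-day averages: day index and mean normalized
# mood across members (an upward trend plus noise, NOT the real data).
days = np.arange(1, 422)
rng = np.random.default_rng(0)
mean_mood = 0.52 + 0.13 * (days / 421) + 0.01 * rng.standard_normal(421)

# LOESS smoothing: `frac` is the fraction of points used in each local fit.
smoothed = lowess(mean_mood, days, frac=0.3, return_sorted=False)
```

With `return_sorted=False`, `lowess` returns one smoothed value per input day, ready to plot alongside the raw averages.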
To estimate the change in mood and anxiety over a given timeframe, we compute the average value of each variable over the first 3 days of the timeframe and over the last 3 days, and then take the difference between these averages.
Over the complete timeframe, from days 1 to 3 to days 419 to 421 (inclusive), mood increased from 0.5213 to 0.6554, a 25.72% rise, and anxiety decreased from 0.3459 to 0.1503, a 56.53% reduction. Notably, substantial improvements began early, with mood rising 12.25% and anxiety decreasing 28.45% within the first 90 days of check-ins.
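As a minimal sketch, this endpoint comparison can be computed as follows (the mood series here is a synthetic placeholder, not the actual data):

```python
import numpy as np

def endpoint_change(series, k=3):
    """Average the first k and last k values; return (start, end, % change)."""
    start = series[:k].mean()
    end = series[-k:].mean()
    return start, end, (end - start) / start * 100.0

# Synthetic daily mean mood values (normalized), only for illustration.
mood = np.linspace(0.5213, 0.6554, 421)
start, end, pct = endpoint_change(mood)
```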
These observations suggest that using the app may have a beneficial effect on mood and anxiety. However, this data does not come from a causal experiment, so we cannot conclude that using Earkick directly causes improvements in mood and anxiety. Those who continue using the app may be qualitatively different from those who stop very early (indicating groups of different types of people), perhaps due to factors such as genetics, environmental context, socio-economic status, or simply the lack of time and private spaces to use the app regularly.
Individual Effects
Can we do better than simply averaging mood and anxiety across all Earkick members? To avoid any grouping effect in our analysis, we can analyze each individual’s data separately and then calculate an aggregate statistic.
We decided to closely observe the mood and anxiety individually between consecutive check-ins. Since the app also contains tools such as routines (to help members build healthy habits), sessions (exercises such as breathing and meditation), and chats with Earkick AI, we can assess how using any of these tools – an app event – influences mood and anxiety shifts between consecutive check-ins. We also consider the number of days between consecutive check-ins as a relevant feature.
To conduct this analysis, we format each member’s data to extract a sequence of periods, with each period defined as the time between two check-ins. Formally, if a member checked in at times t0 and t1, then the data of that period P(t0, t1) is the tuple:
- m(t0): mood at time t0
- m(t1): mood at time t1
- a(t0): anxiety at time t0
- a(t1): anxiety at time t1
- r: number of “routine” completion events in the period
- s: number of sessions
- c: number of chats
- d: number of days between t0 and t1
Here m is normalized mood and a is normalized anxiety. Our primary outcome variables are the changes in mood, Δm = m(t1) − m(t0), and anxiety, Δa = a(t1) − a(t0).
Note that for a period to be valid, at least one day must elapse between the two consecutive check-ins.
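A minimal sketch of this period extraction, assuming check-ins arrive as (day, mood, anxiety) tuples sorted by day and per-day event counts live in a dict (both hypothetical shapes, not Earkick’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class Period:
    m0: float; m1: float  # mood at the start/end of the period
    a0: float; a1: float  # anxiety at the start/end
    r: int; s: int; c: int  # routines, sessions, chats in between
    d: int  # number of days between the two check-ins

def extract_periods(checkins, events):
    """Build periods from consecutive check-ins at least one day apart.

    `checkins`: list of (day, mood, anxiety) tuples sorted by day.
    `events`: dict day -> (routines, sessions, chats) counts (hypothetical).
    """
    periods = []
    for (d0, m0, a0), (d1, m1, a1) in zip(checkins, checkins[1:]):
        if d1 - d0 < 1:  # a valid period needs at least a one-day gap
            continue
        r = s = c = 0
        for day in range(d0, d1):  # sum event counts over the period
            er, es, ec = events.get(day, (0, 0, 0))
            r += er; s += es; c += ec
        periods.append(Period(m0, m1, a0, a1, r, s, c, d1 - d0))
    return periods
```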
Statistical Comparisons
For each app member, we compared changes in mood (Δm) and anxiety (Δa) between periods when the member completed one or more sessions between check-ins (set A) and periods when they completed none (set B). We found significantly higher, positive mood changes in set A than in set B (average Δm in set A = 0.045, average Δm in set B = −0.004, n = 2169 members; p < 10⁻¹⁵, t-test of independent means; Cohen’s d = 0.23). Anxiety reductions after sessions were also significant, with larger negative changes in set A than in set B (average Δa in set A = −0.065, average Δa in set B = −0.0003, n = 2169; p < 10⁻²¹, t-test of independent means; Cohen’s d = 0.29).
We then performed a similar comparison for chats. Here, set A represents periods with at least one chat, and set B represents periods without any chats. Mood changes were again significantly more positive in periods with at least one chat than in those without (average Δm in set A = 0.033, average Δm in set B = −0.013, n = 1473 members; p < 10⁻¹³, t-test of independent means; Cohen’s d = 0.27). Anxiety reductions were also significantly greater in set A than in set B (average Δa in set A = −0.048, average Δa in set B = 0.008, n = 1473; p < 10⁻¹⁴, t-test of independent means; Cohen’s d = 0.28).
Figure 2: Average mood and anxiety changes for periods with zero vs. at least one session (left) and zero vs. at least one chat (right) between consecutive check-ins (error bars show the standard error of the mean).
Surprisingly, we found no significant difference in mood and anxiety changes between periods where members completed at least one routine (set A) and those with no routines completed (set B). We’ll explore this further in the next section.
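The comparisons above can be sketched with SciPy: `ttest_ind` is SciPy’s independent-means t-test, and Cohen’s d is computed from the pooled standard deviation. The arrays here would hold one average Δ value per member; the synthetic data below is only for illustration:

```python
import numpy as np
from scipy import stats

def compare_sets(delta_a, delta_b):
    """Independent-means t-test plus Cohen's d via the pooled standard deviation."""
    t, p = stats.ttest_ind(delta_a, delta_b)
    na, nb = len(delta_a), len(delta_b)
    pooled_sd = np.sqrt(((na - 1) * delta_a.var(ddof=1)
                         + (nb - 1) * delta_b.var(ddof=1)) / (na + nb - 2))
    d = (delta_a.mean() - delta_b.mean()) / pooled_sd
    return p, d

# Synthetic example: set A shifted slightly upward relative to set B.
rng = np.random.default_rng(0)
p, d = compare_sets(rng.normal(0.05, 0.2, 2000), rng.normal(0.0, 0.2, 2000))
```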
Linear Regression
To understand how certain features (the number of sessions, routines, and chats, and the number of days between check-ins) influence mood and anxiety changes, we used linear regression analysis. We examined the coefficients of each feature within a simple linear model, with these features as independent variables. This approach assumes time stationarity, meaning we assume a member’s behavior does not change significantly over time, which is a major simplification. Importantly, each member was analyzed individually, and the average value of each coefficient was then reported across all members. We included only members with a minimum of 10 periods, resulting in a sample of 405 members.
Formally, our models were as follows:
Δm(t0, t1) = β_m,0 + β_m,r·r + β_m,s·s + β_m,c·c + β_m,d·d + ε_m, where ε_m ~ N(0, σ²) represents noise.
Δa(t0, t1) = β_a,0 + β_a,r·r + β_a,s·s + β_a,c·c + β_a,d·d + ε_a, where ε_a ~ N(0, σ²) represents noise.
Using Ridge regression from the scikit-learn library helped us prevent overfitting. Figure 3 shows the average value of each feature’s coefficient β.
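A per-member fit along these lines might look as follows; the (r, s, c, d, Δm) column layout is an assumption for illustration, not the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge

def member_coefficients(periods):
    """Fit delta-mood on (r, s, c, d) for one member; return the coefficients.

    `periods` is an (n, 5) array with columns r, s, c, d, delta_mood
    (a hypothetical layout, not Earkick's actual schema).
    """
    X, y = periods[:, :4], periods[:, 4]
    return Ridge(alpha=1.0).fit(X, y).coef_

def average_coefficients(members):
    """Average coefficients across members with at least 10 periods."""
    coefs = [member_coefficients(p) for p in members if len(p) >= 10]
    return np.mean(coefs, axis=0)
```

The same model with Δ-anxiety as the target yields the anxiety coefficients.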
The average variance explained by the models is 0.21 for mood and 0.24 for anxiety, which is relatively low, indicating that these models capture correlations rather than making precise predictions. With more data and more powerful model classes like kernel methods or gradient boosting, we could likely improve the explained variance.
The coefficients align with our earlier hypothesis-testing results. Notably, chatting between check-ins emerged as the most significant factor in reducing anxiety (β_a,c = −0.02) and improving mood (β_m,c = 0.013), indicating that each chat correlates with a 1.3% improvement in mood and a 2% decrease in anxiety. Similarly, sessions positively impacted mood and anxiety (β_m,s = 0.004, β_a,s = −0.008). Delays in check-ins were associated with worse outcomes, with mood decreasing and anxiety increasing as more days passed (β_m,d = −0.004, β_a,d = 0.01). Interestingly, routines appeared to have a counterintuitive effect, correlating with slightly worse mood and anxiety outcomes (β_m,r = −0.009, β_a,r = 0.008).
Conclusions
Our analysis highlights three key findings:
- Continued use of Earkick, along with continued measurement of a person’s state via check-ins, correlates positively with better mood and anxiety outcomes.
- Using additional app features such as sessions and chats correlates with improvements in mood and anxiety across app members.
- The longer the interval between check-ins, the less favorable the outcomes.
These results suggest that Earkick may be beneficial for members’ mental health. The unexpected results with routines, showing either no effect or a slightly negative impact on mental health outcomes, may be due to various factors. For instance, some routines, like drinking water or reading a page of a book, might occur multiple times a day, diluting their impact on mood or anxiety scores measured later. Additionally, members may track routines unrelated to mental health, such as completing homework or washing dishes. Deeper analysis is needed to understand what is driving these counterintuitive results for routines.
Caveats
As mentioned before, our findings come with two main limitations:
- These results are purely observational, which means that correlations exist between app use and beneficial outcomes, but these may not be causal. Causation cannot be established without a randomized controlled trial.
- The regression analysis assumes behavioral stationarity over time. However, earlier results suggest that members tend to feel better (better mood, lower anxiety) as time progresses and they perform more check-ins, indicating this assumption does not hold. To address this, we could explore time series models like autoregressive linear models or recurrent neural networks.