Detecting Manic State of Bipolar Disorder Based on Support Vector Machine and Gaussian Mixture Model Using Spontaneous Speech
Abstract
Objective
This study aimed to compare the accuracy of Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers in detecting the manic state of bipolar disorder (BD) in single patients and in multiple patients.
Methods
Twenty-one hospitalized BD patients (14 females, average age 34.5±15.3 years) were recruited after admission. Spontaneous speech was collected through a preloaded smartphone. First, the speech was preprocessed and speech features [pitch, formants, mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), gammatone frequency cepstral coefficients (GFCC), etc.] were extracted. Then, speech features were selected using their between-class variance and within-class variance. The manic state of patients was then detected by the SVM and GMM methods.
Results
LPCC demonstrated the best discrimination efficiency. The accuracy of manic state detection for single patients was much better using SVM method than GMM method. The detection accuracy for multiple patients was higher using GMM method than SVM method.
Conclusion
SVM provided an appropriate tool for detecting the manic state in single patients, whereas GMM worked better for manic state detection across multiple patients. Both could help doctors and patients achieve better diagnosis and mood state monitoring in different situations.
INTRODUCTION
Bipolar disorder (BD) is a common but severe mental illness characterized by cyclic mood variations with manic, depressive and euthymic states. BD is the sixth leading cause of disability worldwide and has a lifetime prevalence of about 3% in the general population; it is associated with high recurrence, morbidity and risk of suicide [1,2]. Because they rely mainly on clinicians’ interviews and patients’ self-reports, current BD diagnosis and treatment methods are often time-consuming and subject to a range of subjective biases. In clinical practice, approximately 25% of BD cases are misdiagnosed as major depression. The failure of timely diagnosis often leads to delayed treatment, increased costs and poor outcomes [3]. Therefore, an objective biomarker to help clinicians improve diagnosis and treatment is urgently needed.
It has long been known that the speech characteristics of patients with mental disorders differ from those of healthy individuals, and that the speech pattern can be influenced by patients’ mood and neurophysiological state [4]. To date, numerous studies have confirmed that the speech signal could be an objective biomarker to differentiate major depression from the normal state [5-7]. According to the speech production model, current speech features that are related to major depression can be grouped into three categories: glottal features [e.g., glottal timing (GT) and glottal frequency (GF)], spectral and cepstral features [e.g., spectral flux, spectral centroid, Mel-frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), shifted delta cepstrum (SDC), PLPCC and Gammatone Frequency Cepstral Coefficients (GFCC)], and prosodic features [e.g., pitch, the first three formants, jitter, shimmer, loudness, harmonic-to-noise ratio (HNR), log of energy (LogE) and Teager Energy Operator (TEO)] [8-10]. Accumulating evidence suggests that different feature types and classifiers could result in moderate to high accuracy rates of major depression detection [9-11].
Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) are well-known classifiers for speech, emotion and vision tasks and have previously been used in mood state detection. Automatic speech recognition approaches have been carried out using a variety of classifiers, both generative and discriminative. SVM is a discriminative classifier that achieves maximum discrimination by optimally placing the separating hyperplane between the borders of two classes. SVM solves non-linear problems by transforming the input feature vectors into a generally higher-dimensional feature space. GMM is a generative classifier that directly models low-level features regardless of speech duration. A comparative study of the detection accuracies of different classifiers, including GMM, SVM, and Multilayer Perceptron (MLP) neural networks, on a 60-subject spontaneous speech dataset for major depression detection showed that the hybrid GMM and SVM classifier performed well (accuracy 81.61%) [11]. In addition, Low et al. [12] found that although SVM yielded results very similar to those of GMM, SVM required more training time and was less efficient than GMM in their study of a 139-adolescent cohort. Given that SVM and GMM have different working mechanisms, the efficacy of these two machine learning techniques needs to be further studied, especially in the context of manic state detection in BD patients.
BD patients have more fluctuating emotions and speech pattern changes than those with major depression. Recently, a comparative study revealed that pitch and jitter showed statistically significant differences between different mood states among BD patients [13]. When BD patients are in the manic state, they often show emotional outbursts, repeat the same idea and show witty irritability. Studies have also shown that pauses, intonation and emotional tension during speech could help to detect whether a BD patient is in the manic state. Smartphones have been used for detecting different mood states and mood changes of BD by analyzing patients’ physiological activities, such as Heart Rate Variability (HRV) and Electro Dermal Response (EDR) [14,15], and behavioral activities, such as geospatial information and phone call activities [16]. Low-level speech features correlate with BD mood states, and depressive and manic states can be detected using smartphones, although the detection accuracy was moderate [17,18].
We aimed to establish and explore a speech recognition system that could be used to monitor and eventually predict manic state of BD to aid diagnosis, modulate therapy and avoid dangerous events. Here we presented our primary study endpoints of manic state detection accuracies of BD using SVM and GMM with spontaneous speech. The novelty of this work is to select speech features that can represent manic status as much as possible. Of particular note is that the basis of this article is our previous work. As far as we know, this is the first report to detect manic state using different classifiers with optimized speech features. Our findings could provide evidence that speech signals can be biomarkers and serve as assistant monitoring tools for BD manic state detection.
Methods
Patients
The study group consisted of 21 hospitalized patients (14 females and 7 males, average age 34.52±15.32 years). The patients were diagnosed with BD and were in a manic episode after admission. Recruited patients were aged between 18 and 65 years and were able and willing to operate modern smartphone devices. The ward’s psychiatrists selected patients who were capable of taking part in the study. Psychiatric assessment and the psychological state examination were performed in patients’ euthymic states at Shanghai Jiaotong University School of Medicine Mental Health Center (Shanghai, China) from October 2014 to January 2015. The study was approved by the Shanghai Jiaotong University School of Medicine Mental Health Center (approval number: 2011-15). All participants signed the informed consent and the study strictly followed the guidelines of the hospital.
BRMS score system
The Bech-Rafaelsen Mania Rating Scale (BRMS) was used by a psychiatrist to assess the patients and determine manic state. The BRMS scoring system was first developed by Bech et al. [19] in 1978. This 11-item system was developed to assess the severity of the manic state quantitatively and includes important items such as social contact, sleep and work activity [20]. The BRMS total score ranges from 0 points (not symptomatic) to 44 points (highly symptomatic). In order to classify extracted features, we binned the BRMS scores into two categories, manic and euthymic. Patients with scores under 6 points (the threshold) were euthymic, and those with scores above 22 were manic. All recruited patients were manic in this study. Table 1 shows the patients’ clinical and sociodemographic characteristics.
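The binning rule described above can be sketched in a few lines of Python. This is only an illustration of the two thresholds stated in the text; the function name `brms_label` and the decision to return `None` for intermediate scores are assumptions, since the text does not say how scores between 6 and 22 were handled.

```python
from typing import Optional

EUTHYMIC_BELOW = 6   # scores under 6 points are labelled euthymic (from the text)
MANIC_ABOVE = 22     # scores above 22 points are labelled manic (from the text)


def brms_label(score: int) -> Optional[str]:
    """Map a BRMS total score (0-44) to 'euthymic', 'manic', or None."""
    if not 0 <= score <= 44:
        raise ValueError("BRMS total score must lie between 0 and 44")
    if score < EUTHYMIC_BELOW:
        return "euthymic"
    if score > MANIC_ABOVE:
        return "manic"
    # Assumption: intermediate scores are left unlabelled in this sketch.
    return None


# Hypothetical scores, used only to exercise the rule.
labels = [brms_label(s) for s in (3, 15, 30)]
print(labels)
```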
Speech collection
Each patient was provided with a preloaded Samsung GALAXY Mega 6.3 (a sampling frequency of 44 kHz and a resolution of 32 bit, purchased from Samsung China, Shanghai, China). The clinician held a free, open conversation with the patient through the cellphone. In order to reduce noise interference, the patient sat comfortably alone in a double-layer sound-insulated glass room when talking to the clinician. Speech was recorded twice in each mood state (manic and euthymic) in the morning over 1–2 consecutive days. Each recording lasted for about 10–25 minutes. All collected speech was encrypted and transferred securely through Wi-Fi to a cloud database for further analysis. The implementation, data transfer and handling followed the security and encryption guidelines approved by the internal review board to ensure the integrity and privacy of the collected data. In this study, approximately 50% of the speech data were used to train the manic and euthymic models, and the rest were used for testing. The speech duration of the manic state recordings was 775 minutes in total.
Speech features and classifiers
The speech recognition system for manic state detection in this work consisted of two main parts, a front-end and a back-end. The workflow is illustrated in Figure 1. The speech features were extracted from the speech signal in the front-end. Detection was achieved through a stochastic modelling and matching process carried out in the back-end of the classifier. The training phase and the testing phase were both carried out in the back-end. In the training phase, a model was established to predict speaker mood state from a given input (labelled speech sample). In the testing phase, the model was used to detect the mood state.
Features extraction
We restricted ourselves to automatically extracted features that do not depend on the content of speech. The speech features explored in this study included prosodic features and spectral features. Each feature consisted of a number of sub-categories. All speech features were extracted using the publicly available openSMILE software (audEERING, Munich, Bavaria, Germany) [21]. During the preprocessing stage, only frames that contained vocal speech were concatenated. The frame size was set to 25 ms at a shift of 10 ms with a Hamming window. The main extracted features in the study were pitch, formants, LPCC, MFCC, and GFCC.
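To make the framing step concrete, the following Python sketch cuts a signal into 25 ms frames with a 10 ms shift and applies a Hamming window to each frame. It is not the openSMILE implementation; the function name `frame_signal`, the 16 kHz sample rate and the synthetic 200 Hz tone are assumptions chosen only for illustration.

```python
import math
from typing import List


def frame_signal(samples: List[float], sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> List[List[float]]:
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    # Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


# One second of a hypothetical 200 Hz tone at an assumed 16 kHz sample rate.
rate = 16000
tone = [math.sin(2.0 * math.pi * 200.0 * t / rate) for t in range(rate)]
frames = frame_signal(tone, rate)
print(len(frames), len(frames[0]))
```

Features such as MFCC or LPCC would then be computed per windowed frame; that step is left to a dedicated toolkit.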
Classifiers
The LIBSVM toolbox (developed by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University, Taipei, Taiwan) was used to implement SVM modeling [22]. We used the HTK toolbox (Speech Vision and Robotics Group, Cambridge University Engineering Department) [23] for GMM modeling. In the implementation, the expectation-maximization (EM) algorithm was used to estimate the mean, covariance, and mixture weight of each Gaussian component in the GMM. The complete SVM and GMM algorithms were described in detail in a previously published study [24]. In this work, SVM and GMM were used to classify patients’ mental states as “manic” or “euthymic,” and their manic state detection accuracies were compared.
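The generative approach can be sketched as follows: one GMM is fitted per mood state with EM, and a test recording is assigned to the state whose model gives the higher average log-likelihood. This is a simplified sketch, not the HTK pipeline used in the study. It assumes a single scalar feature per frame (real feature vectors are multi-dimensional), synthetic Gaussian data, and hypothetical names (`fit_gmm`, `classify`).

```python
import math
import random
from typing import List, Tuple

Component = Tuple[float, float, float]  # (mixture weight, mean, variance)


def log_gauss(x: float, mean: float, var: float) -> float:
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gmm(data: List[float], k: int = 2, iters: int = 50, seed: int = 0) -> List[Component]:
    """Fit a 1-D GMM with EM (means, variances and weights are re-estimated)."""
    rng = random.Random(seed)
    n = len(data)
    mu = sum(data) / n
    var0 = max(sum((x - mu) ** 2 for x in data) / n, 1e-6)
    comps = [(1.0 / k, m, var0) for m in rng.sample(data, k)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in data:
            logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in comps]
            top = max(logs)
            probs = [math.exp(l - top) for l in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: update weight, mean and variance of each component.
        new_comps = []
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
            new_comps.append((nj / n, mean, max(var, 1e-6)))  # variance floor
        comps = new_comps
    return comps


def avg_log_likelihood(data: List[float], comps: List[Component]) -> float:
    total = 0.0
    for x in data:
        logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in comps]
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(data)


def classify(frames: List[float], manic: List[Component], euthymic: List[Component]) -> str:
    """Pick the mood state whose GMM explains the frames better."""
    if avg_log_likelihood(frames, manic) > avg_log_likelihood(frames, euthymic):
        return "manic"
    return "euthymic"


# Synthetic, well-separated data; not values from the study.
rng = random.Random(1)
manic_train = [rng.gauss(2.0, 1.0) for _ in range(200)]
euthymic_train = [rng.gauss(-2.0, 1.0) for _ in range(200)]
manic_gmm = fit_gmm(manic_train)
euthymic_gmm = fit_gmm(euthymic_train)
test_frames = [rng.gauss(2.0, 1.0) for _ in range(50)]
prediction = classify(test_frames, manic_gmm, euthymic_gmm)
print(prediction)
```

Averaging the log-likelihood over frames keeps the decision independent of recording length, which matches the earlier remark that GMM models low-level features regardless of speech duration.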
Statistical analysis
Three patients were included in the single-patient experiments, whereas 21 patients were included in the multiple-patient experiments. Statistical analysis was performed using SPSS for Windows version 21.0 (IBM, Corp., Armonk, NY, USA). The manic state detection accuracies of SVM and GMM for single patients and multiple patients were compared using the Student’s t-test. All p values were two-tailed and statistical significance was accepted at p<0.05.
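For readers who want to see what the t-test computes, the sketch below evaluates the t statistic and degrees of freedom for two independent samples with pooled variance. It assumes the independent-samples, equal-variance form of the test; the text does not specify whether a paired or independent test was run in SPSS. The input numbers are placeholders, not accuracies from this study, and the two-tailed p value would still have to be read from a t distribution.

```python
import math
from statistics import fmean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Student's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled estimate of the common variance (sample variances use n - 1).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (fmean(a) - fmean(b)) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df


# Placeholder values for illustration only.
sample_a = [1.0, 2.0, 3.0]
sample_b = [4.0, 5.0, 6.0]
t_value, dof = pooled_t_statistic(sample_a, sample_b)
print(round(t_value, 4), dof)
```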
Results
Speech features with high ratio were used for manic state detection
To minimize the influence of noise and to maximize the detection accuracy, manic state-related features were selected first. In this work, optimal feature selection was achieved using the between-class variance (δ_b²) and within-class variance (δ_w²) of each feature [25]. Generally, optimized features have a larger δ_b² and a smaller δ_w². Considering the large fluctuation of the features, feature-level normalization was performed before calculating the variances. In the present study, the following features were selected: LPCC, first six formants (Frequency, Amplitude), MFCC (Mean and Variance), GFCC and pitch. The ratios of speech features were calculated according to function (1) and the results are shown in Table 2.
δ_b², between-class variance; δ_w², within-class variance.
We also hypothesized that a single speech feature may not reflect all characteristics of the manic state and therefore may not be suitable for detection. In order to verify this hypothesis, we compared the manic state detection accuracies of SVM and GMM using speech features extracted from a 3-minute-long speech sample of a single patient in the manic state. As shown in Table 3, for both SVM and GMM, the manic state detection accuracy obtained with single features was lower than that obtained with multiple features.
SVM performed better in manic state detection for single patients
We randomly chose the speech recordings of three patients to test the manic state detection accuracies of GMM and SVM for single patients. The speech features were further optimized by a genetic algorithm after being processed and normalized [26]. The patients’ IDs were protected for privacy during testing. The GMM and SVM detection accuracies are presented in Table 4. The SVM classifier showed better performance (88.56±8.56%) in the detection of the manic state for single patients than the GMM classifier (84.46±1.85%).
GMM performed better in manic state detection for multiple patients
The above results showed that SVM was highly effective at discriminating the manic state from the euthymic state for single patients. However, it remained to be determined whether SVM was also adept at detecting the manic state for multiple patients. Therefore, we trained SVM and GMM using the speech of the 3 patients chosen in the above experiment and compared the manic state detection performance of GMM and SVM on 21 patients. The manic state detection accuracies of SVM and GMM for multiple patients are summarized in Table 5. The manic state detection accuracy of GMM (72.27±6.90%) was higher than that of SVM (60.87±18.90%), indicating that GMM was more effective than the SVM classifier for manic state detection across multiple patients.
Discussion
Herein we presented a preliminary study of manic state detection in BD, selecting representative speech features and applying the SVM and GMM classifiers to spontaneous speech. Results showed that SVM performed well in manic state detection for single patients, while GMM was effective in manic state detection for multiple patients.
GMM and SVM classifiers have been popular partly because of their capacity to robustly handle smaller or sparse datasets and their relatively low computational cost. The two classifiers have been shown to be adept at depression detection in a previous study [27]. In this study, the experiment showed that SVM had higher manic state detection accuracy for a single patient than GMM. This finding suggested the effectiveness of SVM for a small sample size. To further explore the performance of SVM and GMM, the manic state detection accuracies for multiple patients were investigated, and GMM proved to be more accurate in this case. Our results demonstrated that the manic state could be effectively differentiated from the euthymic state using speech-based classifiers trained on unstructured smartphone recordings. This was in agreement with observations reported by Karam et al. that spontaneous speech could effectively differentiate the manic state from the euthymic state [17,18]. Evidence has shown that spontaneous speech (such as family conversation or clinical interview) has more variability and could yield higher depressive and manic mood state detection accuracies than fixed-session speech (such as text reading or picture commenting) [9-12,18,28]. In addition, speech collection in a natural environment highlights the applicability for autonomous, ecologically valid monitoring of BD. The optimal features were selected according to the ratio of between-class variance to within-class variance together with the genetic algorithm [29-31]. Five features (pitch, LPCC, first six formants, MFCC, and GFCC) were extracted and formed a feature set. Ratios of single features for manic state discrimination suggested that LPCC and GFCC contained more important mood information about the manic state than other features; GFCC in particular simulates the auditory characteristics of the human cochlea.
The results of our study could assist clinicians and patients in achieving better diagnosis and monitoring of mood state changes. Yet, certain limitations exist in our study. Firstly, the lack of a larger study cohort and of more types of speech features can be addressed in our future studies. Secondly, although BD is a severe mental illness characterized by cyclic mood variations with manic, depressive and euthymic states, we have only investigated the differentiation of the manic state from the euthymic state. Thus, we could conduct further studies to explore how to differentiate the depressive state from the euthymic state. In addition, further analysis can be done to verify whether other features, such as vocal tract features and glottal features, can be utilized to diagnose BD.
Our system did not aim to replace professional expertise, but to supplement it. In this respect, our results showed promising diagnostic accuracy in determining the BD manic state using SVM. Based on clinical findings and previous discoveries, spontaneous speech could serve as an effective tool for determining the manic state of BD. Specifically, SVM was adept at detecting the manic state for a single patient or a small sample, whereas GMM was more suitable for detecting the manic state across multiple patients.
Acknowledgements
This project was supported by the National Natural Science Foundation of China (NSFC) under Grant (No. 61271349, 61371147 and 11433002); the Shanghai Jiao Tong University Joint Research Fund for Biomedical Engineering under Grant (No. YG2012ZD04); the Shanghai Key Laboratory of Psychotic Disorders under Grant (No. 13dz2260500); and the National Key Research and Development Program of China under Grant (No. 2017YFC0909200).