
The paper investigates the long-run relationship between bitcoin's price and its marginal cost of production between July 2010 and July 2022. We derive Bitcoin's marginal cost of production from a model of Bitcoin mining grounded in the Bitcoin code, and show that this cost is a function of only two variables: the electricity price and the mining hardware efficiency. We then estimate a time-varying vector error correction model, as well as the cointegration between bitcoin's price and the Bitcoin network's hash rate, a commonly used proxy for production cost. Our results show that the time-varying cointegration between bitcoin's price and its hash rate is permanently in disequilibrium, except for a short interval between March 2017 and January 2018. Consequently, although bitcoin's price and the hash rate are cointegrated, the latter does not function as a stable long-run explanatory variable for bitcoin price dynamics. By contrast, we find that bitcoin's price and its marginal cost of production have been cointegrated since bitcoin's inception, and that their time-varying long-run relationship always reverts towards equilibrium (and often to equilibrium) after long periods of divergence. These results contrast with most of the empirical literature, which has attempted to model the relationship between bitcoin and its fundamentals in a time-invariant framework, but they are consistent with recent research showing a significant role for production cost in the determination of bitcoin's price dynamics.
Citation: Sylvia Gottschalk. Digital currency price formation: A production cost perspective[J]. Quantitative Finance and Economics, 2022, 6(4): 669-695. doi: 10.3934/QFE.2022030
Globally, the number of people suffering from mental health disorders is on the rise [1]. In 2015, an estimated 322 million people were living with depression worldwide [1]. With the recent COVID-19 pandemic, mental well-being was further challenged by fears of contracting an infection [2] and feelings of isolation [3]. Mental health conditions have been associated with stigma in society, causing individuals to perceive themselves as unacceptable [4],[5]. The impact of stigma often results in a reduced likelihood of seeking treatment [4],[6],[7]. In 2018, a USA survey reported that people suffering from depression were increasingly turning to the Internet for mental health-related support [8]. Among them, 90% had researched mental health information online, while 75% had accessed others' health stories through blogs, podcasts and videos [8]. Thus, it is not uncommon for people to opt for online support environments, including support groups and social media channels [5],[8].
In recent years, digital voice assistants (DVAs) have been increasingly adopted as digital health tools that provide information in response to health-related queries for various health conditions, including minor ailments [9], postpartum depression [10], vaccinations [11],[12], cancer screening [13] and smoking cessation advice [14]. Smartphone-based DVAs, such as Apple Siri and Google Assistant, have been particularly popular [15]. According to Google, 27% of Internet searches in 2018 came from the voice search feature on smartphones [16], and this trend is expected to grow. The artificial intelligence (AI) component in DVAs enables voice recognition and responses in natural language [10],[17], thereby enabling these DVAs to participate in two-way conversations with users [18]. Given the growing popularity of using DVAs to search for online health information [8], it is crucial that DVAs are able to provide relevant, appropriate and easy-to-understand responses to user queries relating to mental health literacy, such as symptom recognition, information sources, awareness of causes and risks and an understanding of treatment types [19],[20]. While there are quality assessment tools that evaluate the quality of online health information, such as the Health-on-the-Net Code (HONcode) [21], DISCERN [22] and the Quality Evaluation Scoring Tool (QUEST) [23], to our knowledge there are no existing tools for assessing DVAs. Moreover, studies that have evaluated the quality of information provided by DVAs [9]–[14] have not focused on mental health conditions.
As we move into a post-pandemic world, it is crucial that public mental health is not ignored [24]. There is a need to evaluate the quality of information provided by DVAs in the mental health domain. Studies have suggested that providing useful and comprehensive online information about mental health conditions in a user-friendly way can help consumers gain a better understanding of the disease, which in turn can help prevent and/or reduce the severity of the mental health disorder [25]. Furthermore, providing high-quality online information on mental health conditions can potentially reduce the stigma and prejudice attached to these disorders [25]. With the increasing popularity of consumers performing health information searches through DVAs, it is crucial that DVAs are able to provide high-quality information on mental health conditions through their responses. Our hypothesis is that DVAs are able to provide responses that are relevant, appropriate and easy to understand in relation to mental health queries. Thus, the primary objective of this study was to evaluate the quality of DVA responses to mental health-related queries by using an in-house-developed quality assessment rubric. In this study, DVAs are defined as inanimate programs enhanced with AI that interact with human users using speech commands; they are distinct from other technologies such as chatbots [26] or automated telephone-response systems [27],[28].
In this study, the quality of DVAs was defined as the degree of excellence to which a DVA could fulfill the needs of mental health-related queries [29]. This definition was represented by six quality domains: comprehension ability, relevance, comprehensiveness, accuracy, understandability and reliability. The quality domains were adapted from tools evaluating the quality of online health information or sources. The relevance domain was adapted from the DISCERN [22] and CRAAP (currency, relevance, authority, accuracy and purpose) [30],[31] tools. The accuracy and reliability domains were adapted from DISCERN [22], CRAAP [30],[31] and HONcode [21]. In addition, the reliability domain was also adapted from the Ensuring Quality Information for Patients (EQIP) tool [32], LIDA Minervation validation instrument [33], QUEST [23] and Quality Component Scoring System [34]. The comprehensiveness domain was adapted from DISCERN and EQIP [22],[32], and understandability was adapted from EQIP and LIDA [32],[33].
The quality domains evaluated three aspects of DVA quality: the DVAs themselves (comprehension ability), the DVAs' responses (relevance, comprehensiveness, accuracy and understandability) and the answer sources provided by the DVAs (reliability) (Figure 1). The composite score for all domains added up to a maximum of 32 points. All DVA responses were classified into four types: verbal response only, web response only, verbal and web response and no response. “Verbal response only” referred to a short verbal text that directly answered the question without providing a link. Conversely, a “web response only” referred to a link without any verbal explanation provided. A “verbal and web response” consisted of both of the aforementioned parts in a single response. If the DVA did not provide any response, it was classified as “no response”. Since understandability was evaluated for both the verbal and web responses, in cases where the DVA only provided one type of response, the maximum composite score would be 30 points instead.
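To illustrate how the response classification constrains the score ceiling, the minimal Python sketch below (our own naming, not part of the study's materials) maps a response to one of the four types and returns the corresponding maximum composite score; treating “no response” as earning zero points is our assumption, based on the evaluation protocol described later in this section.

```python
# Illustrative sketch (hypothetical helper names, not from the paper):
# classify a DVA response and derive the maximum attainable composite score.

def classify_response(has_verbal: bool, has_web: bool) -> str:
    """Map a DVA response to one of the four response types in the rubric."""
    if has_verbal and has_web:
        return "verbal and web response"
    if has_verbal:
        return "verbal response only"
    if has_web:
        return "web response only"
    return "no response"

def max_composite_points(response_type: str) -> int:
    """Maximum composite score for a given response type.

    Understandability is scored separately for verbal and web responses,
    so the ceiling drops from 32 to 30 when only one response type is given.
    A missing response earns no points (our assumption).
    """
    if response_type == "verbal and web response":
        return 32
    if response_type == "no response":
        return 0
    return 30
```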
The DVA's comprehension ability was assessed based on its ability to accurately recognize and transcribe the question posed to it. Relevance of the DVA's responses was assessed based on whether the response adequately addressed the question. For two questions, the DVAs were evaluated for their ability to successfully refer to a contact point in cases requiring immediate intervention. Comprehensiveness was assessed based on whether the DVA's response was complete and fulfilled all of the points in the answer sheet. In addition, two quality-of-life (QoL) criteria assessed whether the DVA described the impacts of treatment or treatment choices on day-to-day living or activities, and whether it supported shared decision-making regarding treatment choices. Accuracy assessed whether each point in the DVA's response correctly matched the corresponding point in the answer sheet. Understandability was assessed based on whether a layperson would easily understand the DVA response according to the Simple Measure of Gobbledygook (SMOG) readability test [35],[36], and whether it contained medical jargon/complex words. Lastly, the reliability of the answer sources provided by the DVAs was evaluated based on six criteria: credibility of the sources and reference citations, how current/updated the sources were, presence/absence of bias and advertisements and whether there was a disclaimer stating that the information provided did not replace a healthcare professional's advice. All DVA responses were evaluated regardless of whether they were verbal or web responses.
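For readers unfamiliar with the SMOG test, the sketch below is a hedged Python rendering of the standard SMOG grade formula, 1.0430 * sqrt(polysyllable count * 30 / sentence count) + 3.1291. The syllable counter is a crude vowel-group heuristic and the function names are ours; the study does not specify which SMOG calculator the evaluators used.

```python
# Hedged sketch of the SMOG readability grade used for the understandability
# domain. The syllable counter is a rough heuristic, not a linguistic tool.
import re

def count_syllables(word: str) -> int:
    """Approximate syllable count as the number of vowel groups in the word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    """Standard SMOG formula: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * (30 * polysyllables / len(sentences)) ** 0.5 + 3.1291
```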
A total of 66 questions on mental well-being and mental health conditions were compiled and categorized into five categories: general mental health, depression, anxiety, obsessive-compulsive disorder (OCD) and bipolar disorder. These conditions were chosen due to their rising prevalence in global and local data [1],[37]. Besides the section on general mental health, questions in the other sections on the specific mental health conditions were classified into three subcategories: disease state, symptoms and treatment (Appendix 1).
Questions and answers were sourced primarily from the American Psychiatric Association [38], National Institute of Mental Health [39], Medline Plus [40], World Health Organization [41], USA Centers for Disease Control and Prevention [42], Mayo Clinic [43], Cleveland Clinic [44], National Alliance on Mental Illness [45], Anxiety and Depression Association of America [46] and the International Obsessive-Compulsive Disorder Foundation [47]. In addition, questions were also sourced from AnswerThePublic [48] with the following keywords: “mental health”, “depression”, “anxiety”, “OCD” (obsessive-compulsive disorder) and “bipolar disorder”. Answers were also compiled from established clinical guidelines, including the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) [49] and the Singapore Ministry of Health Clinical Practice Guidelines [50]. The questions and answers were reviewed by three reviewers (VC, WLL, KY). Any differences in opinions were resolved through discussions until consensus was reached. Two reviewers (JC and LL) pilot-tested half of the questions to ensure that the evaluation rubric could be applied across different questions. Their feedback was used to refine the rubric for the actual evaluation.
Four smartphone DVAs were employed for evaluation: Apple Siri, Samsung Bixby, Google Assistant and Amazon Alexa. Siri and Google Assistant were accessed by using an iPhone 6 (iOS 14.7.1), while Bixby and Alexa were accessed by using a Samsung Galaxy Note 9 (OS10). All questions were posed to the DVAs in English by native English speakers, in the same order and in the exact way that the questions were phrased in Appendix 1. The evaluations and scoring were done independently on the same devices by three evaluators, each in a quiet room at home: VC (female), LSK (male) and AP (female). Each evaluator would ask all 66 questions to one DVA in one sitting, and would pose the questions to a different DVA in a separate sitting (i.e., four separate sessions). If the DVA was unable to capture the question and generate a response after three repeated attempts, the evaluation would end and no points would be awarded. Each evaluator completed the evaluation of all four DVAs within a week, after which the devices were transferred to the next evaluator, who would then evaluate the DVAs on the same devices over the following week. As such, all evaluations were completed within 3 weeks. The search and internet histories for the individual DVAs were reset before and after each round of evaluation. The location function was turned on as the DVAs were evaluated for their ability to refer to a contact point. If the DVA provided more than one web link, the first web link was taken for evaluation.
Descriptive statistics (numbers and percentages) were employed to report the types of responses, the proportion of successful responses and the sources cited by the DVAs. The quality scores were calculated for each mental health category (general mental health, depression, anxiety, OCD, bipolar disorder) and question subcategory (disease state, symptoms, treatment), as well as for each quality domain (comprehension ability, relevance, comprehensiveness, accuracy, understandability, reliability, overall quality), by dividing the sum of points awarded to each DVA by the maximum possible number of points in each mental health category, question subcategory and quality domain (Equation 1). This calculation was also performed across all questions to generate a composite quality score. All quality scores were converted to percentages and reported as medians and interquartile ranges (IQRs). All results were taken as averages of the three evaluators.
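The calculation behind Equation 1 can be summarised by the short Python sketch below; the variable and function names are ours, and the grouping logic is simplified to a single list of questions per grouping.

```python
# Hedged sketch of the quality-score calculation (Equation 1): points awarded
# to a DVA are summed and divided by the maximum possible points for the
# grouping, then expressed as a percentage and summarised as median (IQR).
import statistics

def quality_score(points_awarded: list[float], max_points: list[float]) -> float:
    """Percentage quality score for one DVA over one grouping of questions."""
    return 100.0 * sum(points_awarded) / sum(max_points)

def median_iqr(scores: list[float]) -> tuple[float, float, float]:
    """Median and the interquartile range bounds (Q1, Q3) of a set of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return median, q1, q3
```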
All statistical analyses were performed at a significance level of 0.05 by using the Statistical Package for Social Sciences (SPSS) software (version 27). Normality tests, including Shapiro-Wilk tests (n < 50) and Kolmogorov-Smirnov tests (n ≥ 50), were conducted before Kruskal-Wallis tests were applied to compare the results across all four DVAs. Post-hoc analyses using Wilcoxon rank sum tests with Bonferroni adjustments were subsequently performed for each possible pairwise comparison among the DVAs. Wilcoxon rank sum testing was also used to compare the understandability of verbal and web responses. Inter-rater reliability was calculated by using the intraclass correlation coefficient (ICC) [51] based on a mean rating of three evaluators, absolute agreement, a two-way mixed-effects model and a 95% confidence interval (95% CI).
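As a rough illustration of this workflow (the study used SPSS; this scipy-based sketch is only indicative and omits the normality screening and the ICC, which SPSS computes from a two-way mixed-effects model), the snippet below runs a Kruskal-Wallis test across the four DVAs followed by Bonferroni-adjusted pairwise Wilcoxon rank sum tests.

```python
# Indicative sketch only: Kruskal-Wallis across all DVAs, then pairwise
# Wilcoxon rank sum tests with a Bonferroni-adjusted significance threshold.
from itertools import combinations
from scipy import stats

def compare_dvas(scores_by_dva: dict[str, list[float]], alpha: float = 0.05) -> dict:
    groups = list(scores_by_dva.values())
    _, p_overall = stats.kruskal(*groups)          # omnibus test across DVAs
    pairs = list(combinations(scores_by_dva, 2))
    adjusted_alpha = alpha / len(pairs)            # 0.05 / 6 = 0.00833 for four DVAs
    pairwise = {}
    for a, b in pairs:
        _, p = stats.ranksums(scores_by_dva[a], scores_by_dva[b])
        pairwise[(a, b)] = {"p": p, "significant": p < adjusted_alpha}
    return {"kruskal_p": p_overall, "pairwise": pairwise}
```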
The majority of the responses by Siri were web responses (72.7%), while verbal responses formed the major proportion of responses by Alexa (62.1%) (Table 1). The largest proportion of responses from Google Assistant consisted of both verbal and web responses (78.8%). However, Bixby provided comparable proportions of verbal-only responses (36.4%) and combined verbal and web responses (42.4%).
| Number of responses (%), N = 66 a | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa |
| --- | --- | --- | --- | --- |
Types of responses by DVAs | ||||
Verbal response only b | 6 (9.1) | 24 (36.4) | 1 (1.5) | 41 (62.1) |
Web response only c | 48 (72.7) | 14 (21.2) | 13 (19.7) | 0 (0) |
Verbal and web response d | 11 (16.7) | 28 (42.4) | 32 (78.8) | 24 (36.4) |
No response | 1 (1.5) | 0 (0) | 0 (0) | 1 (1.5) |
Proportion of successful responses | ||||
Questions that were recognized e | 63 (95.5) | 46 (69.7) | 66 (100.0) | 60 (90.9) |
Relevant responses | 47 (71.2) | 38 (57.6) | 66 (100.0) | 44 (66.7) |
Proportion of sources provided in DVA responses | ||||
Tier A | 13 (19.7) | 19 (28.8) | 36 (54.5) | 20 (30.3) |
Tier B | 19 (28.8) | 18 (27.3) | 18 (27.3) | 8 (12.1) |
Tier C | 15 (22.7) | 1 (1.5) | 6 (9.1) | 15 (22.7) |
No sources provided, or sources that could not be evaluated | 19 (28.8) | 28 (42.4) | 6 (9.1) | 23 (34.8) |
Note: a Results were taken from the average of three evaluators. b A short verbal text that directly answered the question without providing a link. c A link was provided in response to the question without a verbal explanation. d Both a verbal explanation and a link were present in the response. e These were questions that were captured on the smartphone screen and induced a response by the DVA. Responses such as “I'm not sure I understood that” were classified as the DVA not recognizing the question.
The proportion of questions that were successfully recognized varied across the DVAs. A question was deemed successfully recognized if it was captured on the smartphone screen and the DVA provided a response; if the DVA answered with something like “I'm not sure I understood that”, the question was classified as not recognized. Likewise, a response was classified as relevant only if it adequately addressed the question posed. For the proportion of questions that were recognized, Google Assistant performed the best (100%), followed by Siri (95.5%), Alexa (90.9%) and Bixby (69.7%). The proportion of relevant responses followed the same trend, with Google Assistant performing the best (100%) and Bixby performing the worst (57.6%).
In terms of the credibility of the sources provided, Google Assistant (54.5%) and Siri (19.7%) had the highest and lowest proportions of Tier A sources, respectively. Over a quarter of the sources by Siri (28.8%), Bixby (27.3%) and Google Assistant (27.3%) were Tier B, while Siri and Alexa had the largest proportions of Tier C sources (22.7% each).
Across all 66 questions (Table 2), Google Assistant had the highest median composite quality score (78.9%) among the DVAs, while Alexa had the lowest median composite score (64.5%). Siri (83.9%), Bixby (87.7%) and Google Assistant (87.4%) scored the best for questions on depression, in contrast to Alexa (72.3%), which scored the best for OCD questions. Alexa scored significantly lower (63.0%, p < 0.001) than all other DVAs for questions on depression, and significantly lower (60.5%) than Bixby (75.9%, p < 0.001) and Google Assistant (76.4%, p = 0.004) for questions on anxiety. On the other hand, Bixby scored significantly lower than all other DVAs for questions on general mental health and OCD (0%, p < 0.001 each). Additionally, Siri scored significantly lower than Google Assistant for questions on OCD (61.7% versus 78.4%, p = 0.002).
Among the question subcategories, Siri (71.7%) and Google Assistant (80.5%) scored the best for questions on disease state, as compared to questions on symptoms and treatment (Table 2). On the other hand, Bixby had similar scores across all three subcategories of disease state, symptoms and treatment. In contrast, Alexa scored the highest for questions on symptoms (71.5%), but its score in the treatment subcategory (57.3%) was significantly lower than those of Bixby (78.3%, p < 0.001) and Google Assistant (77.3%, p < 0.001). Furthermore, Alexa's score was also significantly lower than Google Assistant's for questions in the disease state subcategory (69.6% versus 80.5%, p = 0.004).
Median Quality Scores of DVAs [% (IQR)]
| Classification of Questions | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-values* |
| --- | --- | --- | --- | --- | --- |
Across all questions | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 |
Mental Health Categories | |||||
General mental health | 77.1 (71.3–85.2) | 0 (0–16.7) | 80.5 (76.4–89.1) | 70.7 (57.1–79.7) | <0.001 |
Depression | 83.9 (76.0–86.9) | 87.7 (83.6–89.3) | 87.4 (79.4–88.7) | 63.0 (61.4–72.1) | <0.001 |
Anxiety | 71.8 (67.2–87.6) | 75.9 (71.3–80.7) | 76.4 (69.8–83.6) | 60.5 (42.5–68.4) | 0.006 |
Obsessive-compulsive disorder | 61.7 (53.7–69.8) | 0 (0–29.6) | 78.4 (73.6–85.7) | 72.3 (59.1–80.0) | <0.001 |
Bipolar disorder | 66.4 (44.4–70.4) | 77.5 (71.3–81.6) | 75.9 (70.1–81.6) | 63.0 (48.2–80.5) | 0.004 |
Question Subcategories | |||||
Disease state | 71.7 (66.9–79.0) | 71.6 (28.2–83.3) | 80.5 (73.8–84.8) | 69.6 (62.5–80.2) | 0.031 |
Symptoms | 66.7 (53.9–83.1) | 76.7 (25.0–80.9) | 77.5 (70.7–86.0) | 71.5 (57.7–80.4) | 0.239 |
Treatment | 60.5 (49.4–74.2) | 78.3 (63.0–85.0) | 77.3 (69.6–84.3) | 57.3 (30.6–62.1) | <0.001 |
Note: *Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833.
Across all quality domains, Google Assistant scored the highest while Alexa scored the lowest (Table 3). In terms of comprehension ability, Google Assistant scored significantly higher (100%, p < 0.001) than the other DVAs. In addition, Alexa (100%) scored significantly higher than Siri (88.9%, p < 0.001) and Bixby (94.5%, p = 0.03) in this domain. Google Assistant (100%) and Bixby (100%) also scored significantly higher than Siri (66.7%) and Alexa (75.0%) in terms of relevance. Only Google Assistant successfully identified situations requiring immediate intervention, and only for one evaluator (16.7%).
Alexa scored the worst among all DVAs in terms of comprehensiveness (22.2%, p < 0.001) and reliability (58.3%, p < 0.001). In addition, Alexa also performed the poorest when evaluated against the QoL criteria (10.0%), as compared to Bixby, which performed the best (76.7%). In contrast, Google Assistant scored the best (77.8%) in terms of comprehensiveness, but it had similar reliability scores as Bixby (75.0% each). In terms of accuracy, Alexa scored the lowest among the DVAs (75.0% versus 100% for other DVAs, p = 0.003). However, all DVAs had similar scores for understandability (50.0% each). The understandability of verbal responses was significantly lower than that of web responses (33.3% versus 50.0%, p = 0.004). Inter-rater reliability ranged from moderate to good for both the overall quality and the individual quality domains (Table 3).
Median Quality Scores of DVAs [% (IQR)]
| Quality Domains | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-value* | Intraclass Correlation Coefficient [ICC (95% CI)] a |
| --- | --- | --- | --- | --- | --- | --- |
Comprehension ability | 88.9 (70.8–100) | 94.5 (0–100) | 100 (100–100) | 100 (88.9–100) | <0.001 | 0.892 (0.868–0.913) |
Relevance | 66.7 (50.0–100) | 100 (66.7–100) | 100 (83.3–100) | 75.0 (33.3–100) | <0.001 | 0.753 (0.691–0.804) |
Comprehensiveness | 66.7 (44.4–83.3) | 66.7 (55.6–88.9) | 77.8 (55.6–88.9) | 22.2 (0–66.7) | <0.001 | 0.747 (0.660–0.812) |
Accuracy | 100 (75.0–100) | 100 (83.3–100) | 100 (83.3–100) | 75.0 (50.0–100) | 0.003 | 0.691 (0.593–0.769) |
Understandability | 50.0 (25.0–75.0) | 50.0 (33.3–68.8) | 50.0 (33.3–66.7) | 50.0 (25.0–75.0) | 0.724 | 0.672 (0.513–0.775) |
Reliability | 72.9 (63.2–83.3) | 75.0 (63.9–84.3) | 75.0 (66.7–84.3) | 58.3 (49.1–63.9) | <0.001 | 0.896 (0.863–0.922) |
Overall quality | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 | 0.848 (0.813–0.877) |
Note: * Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833. a ICC values and their 95% CIs were calculated using the SPSS platform based on the mean rating of three evaluators, absolute agreement and a two-way mixed-effects model. ICC values indicate moderate-to-good inter-rater reliability.
In relation to our hypothesis, this study has shown that DVAs are able to provide relevant and appropriate responses to mental health-related queries. However, the understandability of their responses was relatively low. Furthermore, not all DVAs fared the same in terms of the different quality domains, and they also varied across the various mental health conditions. Overall, Google Assistant performed the best among all DVAs, suggesting that it was able to comprehend the queries and provide responses that were relevant and accurate across the various mental health categories. In comparison, Bixby fared the worst in terms of responding to questions on general mental health and OCD. On the other hand, Alexa's responses were the least comprehensive and reliable across all questions, as well as in the categories of depression, anxiety and bipolar disorder.
All DVAs performed well in terms of comprehension ability. This result was similar to a study by Yang and colleagues, who investigated the abilities of Siri, Google Assistant, Alexa and Cortana in terms of responding to questions on postpartum depression [10]. In their study, all DVAs performed well in recognizing the postpartum depression questions, with scores ranging from 79% (Alexa) to 100% (Siri and Google Assistant). However, in our study, Siri and Bixby performed more poorly than Google Assistant and Alexa. For Bixby, over a quarter of the questions posed (27.3%, n = 18/66) were scored as 0%. In particular, Bixby often transcribed “OCD” as “o CD” (two separate words), resulting in a large proportion of questions failing to be recognized. In addition, while Bixby could accurately transcribe questions on general mental health, it could not generate responses for many of these questions (80%, n = 8/10) and frequently answered with “I'm not sure I understood that”. We postulate that these observations could be due to Bixby's primary design intent, which was to assist users in operating the phone via voice commands rather than to provide accurate responses to questions, as in the case of the other DVAs [52]. On the other hand, while Siri could successfully capture all questions, it was penalized for transcription errors. Siri tended to cut off the user before the entire question was posed, resulting in incomplete prompts being captured on the screen. Examples included “Can depression...” and “What is the difference between...”, when the entire questions meant to be asked were “Can depression be genetic?” and “What is the difference between normal behavior and OCD?”, respectively.
In regard to relevance, Siri and Alexa performed more poorly than Google Assistant and Bixby due to the irrelevant responses they provided. For example, Siri responded with answers about medications when the question posed was “How are anxiety disorders diagnosed?” Similarly, Alexa responded with the effects of bipolar disorder to the question “Who does bipolar disorder affect?” When the DVAs were evaluated for their ability to refer cases that required immediate intervention, only Google Assistant managed to respond appropriately, and only to one evaluator. Interestingly, our observations differed from a study by Kocaballi and colleagues [53], who reported that Siri scored the highest for safety-critical prompts when compared to Google Assistant, Bixby and Alexa. In another study by Miner et al. [17], even though Google Now and Samsung S Voice (the predecessor of Bixby) [54] managed to recognize queries on suicide as a cause for concern, Google Now did not recognize the cause for concern in queries on depression, while the responses from S Voice varied, with the cause for concern being recognized only in some instances. Nonetheless, the authors of both studies agreed that there was an inconsistency in the responses of the DVAs and that their abilities to recognize causes for concern should improve. It is unclear whether the inability of DVAs to respond to queries appropriately is due to system failure, a failure of natural language understanding, a misrecognized prompt, the DVA being unable to find a response or the DVA deliberately not responding to particular types of queries [53]. However, we agree with Kocaballi and colleagues and advocate that the DVAs' capabilities should be made more transparent to users so as to improve user experience and reduce confusion.
For comprehensiveness, Alexa performed the worst among the DVAs. It also scored significantly lower than Bixby and Google Assistant in terms of accuracy. In contrast, Alexa performed well in terms of comprehension ability, suggesting that, even though it could comprehend the questions being posed, it did not provide comprehensive and accurate responses. Our findings were consistent with a study by Alagha and Helbing, who evaluated the quality of responses to questions on vaccines by Google Assistant, Siri and Alexa [11]. In their study, the authors indicated that Alexa lacked in its ability to process health queries and generate responses from high-quality sources. Furthermore, in our study, Alexa performed significantly poorer than the other DVAs in terms of reliability. One reason was its tendency to only provide verbal responses, such as “Here's something I found on Mayo Clinic”, while the other DVAs provided specific links to webpages. In addition, Alexa provided invalid links to “reference.com”, which could not be accessed on several occasions. Our observations were also in line with the DVA vaccine information study by Alagha and Helbing [11], who reported that Google Assistant and Siri were more capable of directing the user to authoritative sources than Alexa, which did not provide answers from the same sources as the other DVAs. Hence, our recommendation is to supplement Alexa's responses to mental health queries with those of another DVA or other external resources so that any lack of or discrepancies in health-related information provided can be identified by the user.
There was a significant difference between the understandability of verbal responses and that of web responses. Verbal responses were less easily understood, according to the SMOG readability test, and contained more jargon than web responses. However, both types of responses scored poorly, indicating that DVA responses to mental health queries are unlikely to be well understood by a layperson. Our results concurred with a study assessing the readability of online health information, which showed that, among 12 health conditions, the information on dementia and anxiety was the hardest to read [55]. As the understandability of health-related information is important for raising one's awareness and knowledge of mental health issues and self-care, we advocate that the information provided by DVAs be complemented with other online information and shared between the patient and caregiver (or someone whom the patient trusts) in a close and private setting that is comfortable for the patient.
Across the mental health conditions, Siri, Bixby and Google Assistant scored the highest for questions on depression. Our results were similar to the study by Miner et al., which investigated the responses of Siri, Google Now, S Voice and Cortana to questions on depression [17]. In their study, the DVAs were generally able to recognize prompts, but they were not able to refer the user to a depression helpline. In contrast, a study by Kocaballi et al. showed that DVAs had the lowest ratio of appropriate responses to mental health prompts, including those on depression [53]. Even though there have been studies investigating the quality of conversational agents for mental health conditions [56],[57], these studies focused on other types of conversational agents, such as chatbots and mobile apps, instead of DVAs. To the best of our knowledge, there is a paucity of studies that explore the quality of DVAs in relation to mental health conditions, especially OCD and bipolar disorder. While Google Assistant seems to be one of the top two DVAs that can potentially be recommended for queries on OCD and bipolar disorder (Figure 2), its ability to answer questions on these two conditions may not be as well established as that for general mental health and depression queries. Interestingly, Siri did not perform as well on either of these mental health conditions. As such, we recommend that Apple users who seek information about OCD and/or bipolar disorder from Siri supplement its responses with other online resources, such as those from Google Assistant or Google searches. In any case, our study presents new insight into the quality of DVAs across these four mental health conditions: depression, anxiety, OCD and bipolar disorder.
The main limitation of this study is that we were only able to evaluate a subset of four DVAs and four mental health conditions. Therefore, our results might not be representative of the DVAs' performances for other mental health conditions, nor of the quality of other DVAs (e.g., Google Home Mini and Microsoft Cortana). Furthermore, as the location function of the DVAs was switched on during our evaluations, the search results might have been adapted to the local context, and minor variations could exist depending on the country and location of the user. Studies have shown that the responses of DVAs to the same questions can differ [17],[58]. Although the qualitative responses of the DVAs were not compared in this study, we tried to minimize this variability by having each evaluator use the same devices for their evaluations. In order to account for variations in the evaluation scores of the same DVA response by different evaluators, we calculated the ICC values for each quality domain (Table 3) to determine the inter-rater reliability; our results indicated moderate-to-good reliability. Similarly, the inter-rater reliability for the overall quality scores of the DVAs was good. Nonetheless, we acknowledge that this bias may exist in the DVA responses, and our study results should be interpreted with this limitation in mind. In addition, our evaluation protocol might not reflect real-life usage of DVAs by the layperson. In our study, when the question posed to the DVAs was not recognized on the first attempt, two more attempts were made before the evaluation ended. However, in real life, users might forgo repeatedly asking the same question if they encountered an unsuccessful response on their first try. Next, due to time limitations, only the first web link provided by the DVAs was evaluated in this study, but, in reality, users might access other links as well if more than one link was provided by the DVAs. Lastly, our results only capture the quality of the DVAs at a single point in time. With advancements in voice recognition technologies, natural language processing and other AI-based algorithms, we expect that the quality of the DVAs will also improve over time. As such, we advise caution when extrapolating the results of this study to other DVAs, other countries/states, other mental health conditions or over time.
Overall, Google Assistant performed the best in terms of responding to mental health-related queries, while Alexa performed the worst. In terms of specific mental health conditions, Bixby performed the worst for questions on general mental health and OCD. While the comprehension abilities of the DVAs were generally good, our study showed that the DVAs had differing performances in the domains of relevance, comprehensiveness, accuracy and reliability. Moreover, the responses of the DVAs were generally lacking in understandability. Based on our quality evaluations, we have provided a DVA recommendation list that users can potentially consider for the different mental health conditions (Figure 2). While Google Assistant generally works well across all of the included mental health conditions, Siri and Bixby can also be used for depression and anxiety. On the other hand, Alexa and Bixby may potentially be used for OCD and bipolar disorder, respectively. However, when relying on DVA responses to their mental health-related queries, we caution the general public to supplement the information provided by the DVAs with other online information from authoritative healthcare organizations, and to always seek the help and advice of a healthcare professional when managing their mental health condition(s). In light of many organizations adapting to the post-pandemic world, future research should focus on other types of mental health conditions (e.g., stress) in patients, caregivers and healthcare professionals resulting from specific circumstances, such as workplace disruptions, loss of healthcare services and the accumulation of new job roles as healthcare undergoes a major digital transformation worldwide. In addition, further research can also be done to evaluate the performance of other types of DVAs for mental health conditions that are relevant to the researchers' communities.