
In this work we explore a recently proposed biphasic cell-fluid chemotaxis-Stokes model which is able to represent two competing cancer cell migration mechanisms reported from experimental studies. Both mechanisms depend on the fluid flow but in completely different ways. One mechanism depends on chemical signaling and leads to migration in the downstream direction. The other depends on mechanical signaling and triggers cancer cells to go upstream. The primary objective of this paper is to explore an alternative numerical discretization of this model by borrowing ideas from [Qiao et al. (2020), M3AS 30]. Numerical investigations give insight into which parameters are critical for the ability to generate aggressive cancer cell behavior in terms of detachment of cancer cells from the primary tumor and creation of isolated groups of cancer cells close to the lymphatic vessels. The secondary objective is to propose a reduced model by exploiting the fact that the fluid velocity field is largely dictated by the draining fluid from the leaky tumor vasculature and collecting peritumoral lymphatics and is more weakly coupled to the cell phase. This suggests that the fluid flow equations to a certain extent might be decoupled from the cell phase equations. The resulting model, which represents a counterpart of the much studied chemotaxis-Stokes model proposed by [Tuval et al. (2005), PNAS 102], is explored by numerical experiments in a one-dimensional tumor setting. We find that the model largely coincides with the original as assessed through numerical solutions computed by discrete schemes. This model might be more amenable to further exploration and analysis. We also investigate how to exploit the weaker coupling between cell phase dynamics and fluid dynamics to do more efficient calculations with fewer updates of the fluid pressure and velocity field.
Citation: Yangyang Qiao, Qing Li, Steinar Evje. On the numerical discretization of a tumor progression model driven by competing migration mechanisms[J]. Mathematics in Engineering, 2022, 4(6): 1-24. doi: 10.3934/mine.2022046
Globally, the number of persons suffering from mental health disorders is on the rise [1]. In 2015, an estimated 322 million people were living with depression worldwide [1]. With the recent COVID-19 pandemic, mental well-being was further challenged by fears of contracting an infection [2] and feelings of isolation [3]. Mental health conditions have been associated with stigma in society, causing individuals to perceive themselves as unacceptable [4],[5]. The impact of stigma often results in a reduced likelihood of seeking treatment [4],[6],[7]. In 2018, a US survey reported that people suffering from depression were increasingly turning to the Internet for mental health-related support [8]. Among them, 90% had researched mental health information online, while 75% had accessed others' health stories through blogs, podcasts and videos [8]. Thus, it is not uncommon for people to opt for online support environments, including support groups and social media channels [5],[8].
In recent years, digital voice assistants (DVAs) have been increasingly adopted as digital health tools with the purpose of providing information regarding health-related queries for various health conditions, including minor ailments [9], postpartum depression [10], vaccinations [11],[12], cancer screening [13] and smoking cessation advice [14]. Smartphone-based DVAs, such as Apple Siri and Google Assistant, have been particularly popular [15]. According to Google, 27% of Internet searches in 2018 came from using the voice search feature on smartphones [16], with this trend posited to grow. The artificial intelligence (AI) component in DVAs enables voice recognition and responses in natural language [10],[17], thereby enabling these DVAs to participate in two-way conversations with users [18]. Given the growing popularity of using DVAs to search for online health information [8], it is crucial that DVAs are able to provide relevant, appropriate and easy-to-understand responses to queries by users in relation to mental health literacy, such as symptom recognition, information sources, awareness of causes and risks and an understanding of treatment types [19],[20]. While there are quality assessment tools that evaluate the quality of online health information, such as the Health-on-the-Net Code (HONcode) [21], DISCERN [22] and Quality Evaluation Scoring Tool (QUEST) [23], to our knowledge, no such tools exist for the purpose of assessing DVAs. Moreover, studies that have evaluated the quality of information provided by DVAs [9]–[14] have not focused on mental health conditions.
As we move into a post-pandemic world, it is crucial that public mental health should not be ignored [24]. There is a need to evaluate the quality of information provided by DVAs in the mental health domain. Studies have suggested that providing useful and comprehensive online information about mental health conditions in a user-friendly way can help consumers gain a better understanding of the disease, which in turn can help prevent and/or reduce the severity of the mental health disorder [25]. Furthermore, providing high-quality information online on mental health conditions can potentially reduce the stigma and prejudice attached to these disorders [25]. With the increased popularity of consumers performing health information searches through DVAs, it is crucial that DVAs are able to provide high-quality information on mental health conditions through their responses. Our hypothesis is that DVAs are able to provide responses that are relevant, appropriate and easy-to-understand in relation to mental health queries. Thus, the primary objective of this study was to evaluate the quality of DVA responses to mental health-related queries by using an in-house-developed quality assessment rubric. In this study, DVAs are defined as inanimate programs enhanced with AI that interact with human users using speech commands. These are different from other technologies such as chatbots [26] or automated telephone-response systems [27],[28].
In this study, the quality of DVAs was defined as the degree of excellence to which a DVA could fulfill the needs of mental health-related queries [29]. This definition was represented by six quality domains: comprehension ability, relevance, comprehensiveness, accuracy, understandability and reliability. The quality domains were adapted from tools evaluating the quality of online health information or sources. The relevance domain was adapted from the DISCERN [22] and CRAAP (currency, relevance, authority, accuracy and purpose) [30],[31] tools. The accuracy and reliability domains were adapted from DISCERN [22], CRAAP [30],[31] and HONcode [21]. In addition, the reliability domain was also adapted from the Ensuring Quality Information for Patients (EQIP) tool [32], LIDA Minervation validation instrument [33], QUEST [23] and Quality Component Scoring System [34]. The comprehensiveness domain was adapted from DISCERN and EQIP [22],[32], and understandability was adapted from EQIP and LIDA [32],[33].
The quality domains evaluated three aspects of DVA quality: the DVAs themselves (comprehension ability), the DVAs' responses (relevance, comprehensiveness, accuracy and understandability) and the answer sources provided by the DVAs (reliability) (Figure 1). The composite score for all domains added up to a maximum of 32 points. All DVA responses were classified into four types: verbal response only, web response only, verbal and web response and no response. “Verbal response only” referred to a short verbal text that directly answered the question without providing a link. Conversely, a “web response only” referred to a link without any verbal explanation provided. A “verbal and web response” consisted of both the aforementioned parts in a single response. If the DVA did not provide any responses, it would be classified as “no response”. Since understandability was evaluated for both the verbal and web responses, in cases where the DVA only provided one type of response, the composite score would be 30 points instead.
The DVA's comprehension ability was assessed based on its ability to accurately recognize and transcribe the question posed to it. Relevance of the DVA's responses was assessed based on whether the response had adequately addressed the question. For two questions, the DVAs were evaluated for their ability to successfully refer to a contact point in cases requiring immediate intervention. Comprehensiveness was assessed based on whether the DVA's response was complete and fulfilled all of the points in the answer sheet. In addition, two quality-of-life (QoL) criteria assessed whether the DVA described impacts of treatment or treatment choices on day-to-day living or activities, and whether it supported shared decision-making regarding treatment choices. Accuracy assessed whether each point in the DVA's response correctly matched the corresponding point in the answer sheet. Understandability was assessed based on whether a layperson would easily understand the DVA response according to the Simple Measure of Gobbledygook (SMOG) readability test [35],[36], and whether it contained medical jargon/complex words. Lastly, the reliability of answer sources provided by the DVAs was evaluated based on six criteria: credibility of the sources, reference citations, how current the sources were, presence/absence of bias, presence/absence of advertisements and whether there was a disclaimer stating that the information provided did not replace a healthcare professional's advice. All DVA responses were evaluated regardless of whether they were verbal or web responses.
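To illustrate the readability criterion, the sketch below computes a SMOG grade from two counts using McLaughlin's published formula. The function name `smog_grade` and the example counts are illustrative assumptions; the study does not state which tool or implementation was used to obtain its SMOG results, and counting syllables in real text requires a separate step that is not shown here.

```python
import math


def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Return the SMOG grade for a text sample.

    polysyllable_count: number of words with three or more syllables.
    sentence_count: number of sentences in the sample (normalised to 30).
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    # McLaughlin's formula: scale the polysyllable count to a 30-sentence sample.
    scaled = polysyllable_count * (30 / sentence_count)
    return 1.0430 * math.sqrt(scaled) + 3.1291


# Hypothetical sample: 30 polysyllabic words in 30 sentences.
print(round(smog_grade(30, 30), 2))
```

A higher grade corresponds to text that requires more years of education to read comfortably, so lower values indicate responses that are easier to understand.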
A total of 66 questions on mental well-being and mental health conditions were compiled and categorized into five categories: general mental health, depression, anxiety, obsessive-compulsive disorder (OCD) and bipolar disorder. These conditions were chosen due to their rising prevalence in global and local data [1],[37]. Besides the section on general mental health, questions in the other sections on the specific mental health conditions were classified into three subcategories: disease state, symptoms and treatment (Appendix 1).
Questions and answers were sourced primarily from the American Psychiatric Association [38], National Institute of Mental Health [39], Medline Plus [40], World Health Organization [41], USA Centers for Disease Control and Prevention [42], Mayo Clinic [43], Cleveland Clinic [44], National Alliance on Mental Illness [45], Anxiety and Depression Association of America [46] and the International Obsessive-Compulsive Disorder Foundation [47]. In addition, questions were also sourced from AnswerThePublic [48] with the following keywords: “mental health”, “depression”, “anxiety”, “OCD” (obsessive-compulsive disorder) and “bipolar disorder”. Answers were also compiled from established clinical guidelines, including the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) [49] and the Singapore Ministry of Health Clinical Practice Guidelines [50]. The questions and answers were reviewed by three reviewers (VC, WLL, KY). Any differences in opinions were resolved through discussions until consensus was reached. Two reviewers (JC and LL) pilot-tested half of the questions to ensure that the evaluation rubric could be applied across different questions. Their feedback was used to refine the rubric for the actual evaluation.
Four smartphone DVAs were employed for evaluation: Apple Siri, Samsung Bixby, Google Assistant and Amazon Alexa. Siri and Google Assistant were accessed by using an iPhone 6 (iOS 14.7.1), while Bixby and Alexa were accessed by using a Samsung Galaxy Note 9 (OS10). All questions were posed to the DVAs in English by native English speakers, in the same order and in the exact way that the questions were phrased in Appendix 1. The evaluations and scoring were done independently on the same devices by three evaluators in a quiet room at their homes: VC (female), LSK (male) and AP (female). Each evaluator posed all 66 questions to one DVA in one sitting and evaluated each of the other DVAs in a separate sitting (i.e., four separate sessions). If the DVA was unable to capture the question and generate a response after three repeated attempts, the evaluation would end and no points would be awarded. Each evaluator completed the evaluation of all four DVAs within a week, after which the devices were transferred to the next evaluator, who would then evaluate the DVAs on the same devices over the next consecutive week. As such, all evaluations were completed within 3 weeks. The search and internet histories for the individual DVAs were reset before and after each round of evaluation. The location function was turned on as the DVAs were evaluated for their ability to refer to a contact point. If the DVA provided more than one web link, the first web link was taken for evaluation.
Descriptive statistics (numbers and percentages) were employed to report the types of responses, proportion of successful responses and sources cited by the DVAs. The quality scores were calculated for each mental health category (general mental health, depression, anxiety, OCD, bipolar disorder) and question subcategory (disease state, symptoms, treatment), as well as for each quality domain (comprehension ability, relevance, comprehensiveness, accuracy, understandability, reliability, overall quality), by dividing the sum of points awarded to each DVA by the maximum possible number of points in each mental health category, question subcategory and quality domain (Equation 1). This calculation was also performed across all questions to generate a composite quality score. All quality scores were converted to percentages and reported as medians and interquartile ranges (IQRs). All results were taken as averages of the three evaluators.
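Written out from the description above, the score calculation takes the following form. This is a reconstruction from the prose; the wording of the terms is an assumption rather than the authors' exact notation.

```latex
\[
\text{Quality score (\%)} =
\frac{\text{sum of points awarded to the DVA}}
     {\text{maximum possible number of points}} \times 100
\]
```

The same ratio applies whether the points are summed over a mental health category, a question subcategory, a quality domain or all questions.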
All statistical analyses were performed at a significance level of 0.05 by using the Statistical Package for Social Sciences (SPSS) software (version 27). Normality tests, including Shapiro-Wilk tests (n < 50) and Kolmogorov-Smirnov tests (n ≥ 50), were conducted before Kruskal-Wallis tests were applied to compare the results across all four DVAs. Post-hoc analyses using Wilcoxon rank sum tests with Bonferroni adjustments were subsequently performed for each possible pairwise comparison among the DVAs. Wilcoxon rank sum testing was also used to compare the understandability of verbal and web responses. Inter-rater reliability was calculated by using the intraclass correlation coefficient (ICC) [51] based on a mean rating of three evaluators, absolute agreement, a two-way mixed-effects model and a 95% confidence interval (95% CI).
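The Bonferroni adjustment for the pairwise comparisons can be sketched as follows. With four DVAs there are six possible pairs, so the overall significance level of 0.05 is divided by six, which gives the post-hoc threshold of 0.00833 reported in the table notes. The variable names are illustrative; the study itself performed these analyses in SPSS.

```python
from itertools import combinations

dvas = ["Apple Siri", "Samsung Bixby", "Google Assistant", "Amazon Alexa"]
alpha = 0.05  # overall significance level stated in the text

# Every unordered pair of DVAs is one post-hoc comparison.
pairs = list(combinations(dvas, 2))

# Bonferroni adjustment: divide alpha by the number of comparisons.
adjusted_alpha = alpha / len(pairs)

print(len(pairs), round(adjusted_alpha, 5))
```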
The majority of the responses by Siri were web responses (72.7%), while verbal responses formed the major proportion of responses by Alexa (62.1%) (Table 1). The largest proportion of responses from Google Assistant consisted of both verbal and web responses (78.8%). However, Bixby had a comparable distribution of verbal responses only (36.4%) and verbal and web responses (42.4%).
Number of responses (%), N = 66 a | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa |
Types of responses by DVAs | | | | |
Verbal response only b | 6 (9.1) | 24 (36.4) | 1 (1.5) | 41 (62.1) |
Web response only c | 48 (72.7) | 14 (21.2) | 13 (19.7) | 0 (0) |
Verbal and web response d | 11 (16.7) | 28 (42.4) | 52 (78.8) | 24 (36.4) |
No response | 1 (1.5) | 0 (0) | 0 (0) | 1 (1.5) |
Proportion of successful responses | | | | |
Questions that were recognized e | 63 (95.5) | 46 (69.7) | 66 (100.0) | 60 (90.9) |
Relevant responses | 47 (71.2) | 38 (57.6) | 66 (100.0) | 44 (66.7) |
Proportion of sources provided in DVA responses | | | | |
Tier A | 13 (19.7) | 19 (28.8) | 36 (54.5) | 20 (30.3) |
Tier B | 19 (28.8) | 18 (27.3) | 18 (27.3) | 8 (12.1) |
Tier C | 15 (22.7) | 1 (1.5) | 6 (9.1) | 15 (22.7) |
No sources provided, or sources that could not be evaluated | 19 (28.8) | 28 (42.4) | 6 (9.1) | 23 (34.8) |
Note: a Results were taken from the average of three evaluators. b A short verbal text that directly answered the question without providing a link. c A link was provided in response to the question without a verbal explanation. d Both a verbal explanation and a link were present in the response. e These were questions that were captured on the smartphone screen and induced a response by the DVA. Responses such as “I'm not sure I understood that” were classified as the DVA not recognizing the question.
The proportion of questions that were successfully recognized varied across the DVAs. Questions were deemed to be recognized successfully if they were captured on the smartphone screen and a response was provided by the DVA. If the DVA provided a response like “I'm not sure I understood that”, the question would be classified as not being recognized. Similarly, a response was classified as relevant if it adequately addressed the question. For the proportion of questions that were recognized, Google Assistant performed the best (100%), followed by Siri (95.5%), Alexa (90.9%) and Bixby (69.7%). The proportion of relevant responses followed the same trend, with Google Assistant performing the best (100%) and Bixby performing the worst (57.6%).
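The percentages above follow directly from the counts out of 66 questions. The short sketch below reproduces them for the recognized questions using the counts reported in Table 1; the dictionary layout and function name are illustrative assumptions.

```python
TOTAL_QUESTIONS = 66

# Counts of recognized questions per DVA, as reported in Table 1.
recognized = {
    "Apple Siri": 63,
    "Samsung Bixby": 46,
    "Google Assistant": 66,
    "Amazon Alexa": 60,
}


def percentage(count: int, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a count into a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


recognized_pct = {dva: percentage(n) for dva, n in recognized.items()}

# Rank the DVAs from the highest to the lowest proportion.
ranking = sorted(recognized_pct, key=recognized_pct.get, reverse=True)
print(recognized_pct)
print(ranking)
```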
In terms of the credibility of the sources provided, Google Assistant (54.5%) and Siri (19.7%) had the highest and lowest proportions of Tier A sources, respectively. Over a quarter of the sources by Siri (28.8%), Bixby (27.3%) and Google Assistant (27.3%) were Tier B, while Siri and Alexa had the largest proportions of Tier C sources (22.7% each).
Across all 66 questions (Table 2), Google Assistant had the highest median composite quality score (78.9%) among the DVAs, while Alexa had the lowest median composite score (64.5%). Siri (83.9%), Bixby (87.7%) and Google Assistant (87.4%) scored the best for questions on depression, in contrast to Alexa (72.3%), which scored the best for OCD questions. Alexa scored significantly lower (63.0%, p < 0.001) than all other DVAs for questions on depression, and significantly lower (60.5%) than Bixby (75.9%, p < 0.001) and Google Assistant (76.4%, p = 0.004) for questions on anxiety. On the other hand, Bixby scored significantly lower than all other DVAs for questions on general mental health and OCD (0%, p < 0.001 each). Additionally, Siri scored significantly lower than Google Assistant for questions on OCD (61.7% versus 78.4%, p = 0.002).
Among the question subcategories, Siri (71.7%) and Google Assistant (80.5%) scored the best for questions on disease state, as compared to questions on symptoms and treatment (Table 2). On the other hand, Bixby had similar scores across all three subcategories of disease state, symptoms and treatment. In contrast, Alexa scored the highest for questions on symptoms (71.5%), but its score in the treatment subcategory (57.3%) was significantly lower than those of Bixby (78.3%, p < 0.001) and Google Assistant (77.3%, p < 0.001). Furthermore, Alexa's scores were also significantly lower than Google Assistant for questions in the subcategory of disease state (69.6% versus 80.5%, p = 0.004).
Median Quality Scores of DVAs [% (IQR)]
Classification of Questions | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-value* |
Across all questions | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 |
Mental Health Categories | | | | | |
General mental health | 77.1 (71.3–85.2) | 0 (0–16.7) | 80.5 (76.4–89.1) | 70.7 (57.1–79.7) | <0.001 |
Depression | 83.9 (76.0–86.9) | 87.7 (83.6–89.3) | 87.4 (79.4–88.7) | 63.0 (61.4–72.1) | <0.001 |
Anxiety | 71.8 (67.2–87.6) | 75.9 (71.3–80.7) | 76.4 (69.8–83.6) | 60.5 (42.5–68.4) | 0.006 |
Obsessive-compulsive disorder | 61.7 (53.7–69.8) | 0 (0–29.6) | 78.4 (73.6–85.7) | 72.3 (59.1–80.0) | <0.001 |
Bipolar disorder | 66.4 (44.4–70.4) | 77.5 (71.3–81.6) | 75.9 (70.1–81.6) | 63.0 (48.2–80.5) | 0.004 |
Question Subcategories | | | | | |
Disease state | 71.7 (66.9–79.0) | 71.6 (28.2–83.3) | 80.5 (73.8–84.8) | 69.6 (62.5–80.2) | 0.031 |
Symptoms | 66.7 (53.9–83.1) | 76.7 (25.0–80.9) | 77.5 (70.7–86.0) | 71.5 (57.7–80.4) | 0.239 |
Treatment | 60.5 (49.4–74.2) | 78.3 (63.0–85.0) | 77.3 (69.6–84.3) | 57.3 (30.6–62.1) | <0.001 |
Note: * Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833.
In terms of overall quality across the domains, Google Assistant scored the highest while Alexa scored the lowest (Table 3). In terms of comprehension ability, Google Assistant scored significantly higher (100%, p < 0.001) than the other DVAs. In addition, Alexa (100%) scored significantly higher than Siri (88.9%, p < 0.001) and Bixby (94.5%, p = 0.03) in this domain. Google Assistant (100%) and Bixby (100%) also scored significantly higher than Siri (66.7%) and Alexa (75.0%) in terms of relevance. Only Google Assistant successfully identified situations that required immediate intervention, and only for one evaluator (16.7%).
Alexa scored the worst among all DVAs in terms of comprehensiveness (22.2%, p < 0.001) and reliability (58.3%, p < 0.001). In addition, Alexa also performed the worst when evaluated against the QoL criteria (10.0%), as compared to Bixby, which performed the best (76.7%). In contrast, Google Assistant scored the best (77.8%) in terms of comprehensiveness, but its reliability score was the same as Bixby's (75.0% each). In terms of accuracy, Alexa scored the lowest among the DVAs (75.0% versus 100% for other DVAs, p = 0.003). However, all DVAs had similar scores for understandability (50.0% each). The understandability of verbal responses was significantly lower than that of web responses (33.3% versus 50.0%, p = 0.004). Inter-rater reliability ranged from moderate to good for both the overall quality and the individual quality domains (Table 3).
Median Quality Scores of DVAs [% (IQR)]
Quality Domains | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-value* | Intraclass Correlation Coefficient [ICC (95% CI)] a |
Comprehension ability | 88.9 (70.8–100) | 94.5 (0–100) | 100 (100–100) | 100 (88.9–100) | <0.001 | 0.892 (0.868–0.913) |
Relevance | 66.7 (50.0–100) | 100 (66.7–100) | 100 (83.3–100) | 75.0 (33.3–100) | <0.001 | 0.753 (0.691–0.804) |
Comprehensiveness | 66.7 (44.4–83.3) | 66.7 (55.6–88.9) | 77.8 (55.6–88.9) | 22.2 (0–66.7) | <0.001 | 0.747 (0.660–0.812) |
Accuracy | 100 (75.0–100) | 100 (83.3–100) | 100 (83.3–100) | 75.0 (50.0–100) | 0.003 | 0.691 (0.593–0.769) |
Understandability | 50.0 (25.0–75.0) | 50.0 (33.3–68.8) | 50.0 (33.3–66.7) | 50.0 (25.0–75.0) | 0.724 | 0.672 (0.513–0.775) |
Reliability | 72.9 (63.2–83.3) | 75.0 (63.9–84.3) | 75.0 (66.7–84.3) | 58.3 (49.1–63.9) | <0.001 | 0.896 (0.863–0.922) |
Overall quality | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 | 0.848 (0.813–0.877) |
Note: * Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833. a ICC values and their 95% CIs were calculated using the SPSS platform based on the mean rating of three evaluators, absolute agreement and a two-way mixed-effects model. ICC values indicate moderate-to-good inter-rater reliability.
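As a rough illustration of how the ICC point estimates map onto the "moderate-to-good" description, the sketch below labels each value using commonly cited cut-offs (below 0.50 poor, 0.50 to below 0.75 moderate, 0.75 to below 0.90 good, 0.90 and above excellent). These cut-offs are an assumption made here for illustration; the text does not state which thresholds the authors applied, and a stricter reading would also consider the confidence intervals.

```python
def icc_label(icc: float) -> str:
    """Label an ICC point estimate using assumed, commonly cited cut-offs."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# ICC point estimates as reported in Table 3.
icc_values = {
    "Comprehension ability": 0.892,
    "Relevance": 0.753,
    "Comprehensiveness": 0.747,
    "Accuracy": 0.691,
    "Understandability": 0.672,
    "Reliability": 0.896,
    "Overall quality": 0.848,
}

labels = {domain: icc_label(value) for domain, value in icc_values.items()}
for domain, label in labels.items():
    print(f"{domain}: {label}")
```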
In relation to our hypothesis, this study has shown that DVAs are able to provide relevant and appropriate responses to mental health-related queries. However, the understandability of their responses was relatively low. Furthermore, not all DVAs fared the same in terms of the different quality domains, and they also varied across the various mental health conditions. Overall, Google Assistant performed the best among all DVAs, suggesting that it was able to comprehend the queries and provide responses that were relevant and accurate across the various mental health categories. In comparison, Bixby fared the worst in terms of responding to questions on general mental health and OCD. On the other hand, Alexa's responses were the least comprehensive and reliable across all questions, as well as in the categories of depression, anxiety and bipolar disorder.
All DVAs performed well in terms of comprehension ability. This result was similar to a study by Yang and colleagues, who investigated the abilities of Siri, Google Assistant, Alexa and Cortana in terms of responding to questions on postpartum depression [10]. In their study, all DVAs performed well in terms of recognizing the postpartum depression questions, with scores ranging from 79% (Alexa) to 100% (Siri and Google Assistant). However, in our study, Siri and Bixby performed worse than Google Assistant and Alexa. For Bixby, over a quarter of the questions posed (27.3%, n = 18/66) were scored as 0%. In particular, Bixby often transcribed “OCD” as “o CD” (two separate words), resulting in a large proportion of questions failing to be recognized. In addition, while Bixby could accurately transcribe questions on general mental health, it could not generate responses for many of these questions (80%, n = 8/10) and frequently answered with “I'm not sure I understood that”. We postulate that our observations could be due to Bixby's primary design intent, which was to assist users in operating the phone via voice commands, rather than to provide accurate responses to questions, as in the case of other DVAs [52]. On the other hand, while Siri could successfully capture all questions, it was penalized for transcribing errors. Siri tended to cut off the user before the entire question was posed, resulting in incomplete prompts being captured on the screen. Examples included “Can depression...” and “What is the difference between...”, when the entire questions that were meant to be asked were “Can depression be genetic?” and “What is the difference between normal behavior and OCD?”, respectively.
With regard to relevance, Siri and Alexa performed more poorly than Google Assistant and Bixby due to the irrelevant responses provided. For example, Siri responded with answers about medications when the question posed was “How are anxiety disorders diagnosed?” Similarly, Alexa responded with the effects of bipolar disorder to the question of “Who does bipolar disorder affect?” When the DVAs were evaluated for their ability to refer cases that required immediate intervention, only Google Assistant managed to respond appropriately to one evaluator. Interestingly, our observations differed from a study by Kocaballi and colleagues [53], who reported that Siri scored the highest for safety-critical prompts when compared to Google Assistant, Bixby and Alexa. In another study by Miner et al. [17], even though Google Now and Samsung S Voice (predecessor of Bixby) [54] managed to recognize queries on suicide as a cause for concern, Google Now did not recognize the cause for concern for queries on depression, while the responses from S Voice varied, with the cause for concern being recognized only in some instances. Nonetheless, the authors of both studies agreed that there was an inconsistency in the responses of the DVAs and that their abilities to recognize causes for concern should improve. It is unclear whether the inability of DVAs to respond to queries appropriately is due to system failure, a failure of the natural language understanding, a misrecognized prompt, the DVA being unable to find a response or the DVA deliberately not responding to particular types of queries [53]. However, we agree with Kocaballi and colleagues and advocate that the DVAs' capabilities should be made more transparent to users to improve the user experience and reduce confusion.
For comprehensiveness, Alexa performed the worst among the DVAs. It also scored significantly lower than Bixby and Google Assistant in terms of accuracy. In contrast, Alexa performed well in terms of comprehension ability, suggesting that, even though it could comprehend the questions being posed, it did not provide comprehensive and accurate responses. Our findings were consistent with a study by Alagha and Helbing, who evaluated the quality of responses to questions on vaccines by Google Assistant, Siri and Alexa [11]. In their study, the authors indicated that Alexa lacked the ability to process health queries and generate responses from high-quality sources. Furthermore, in our study, Alexa performed significantly worse than the other DVAs in terms of reliability. One reason was its tendency to provide only verbal responses, such as “Here's something I found on Mayo Clinic”, while the other DVAs provided specific links to webpages. In addition, on several occasions Alexa provided invalid links to “reference.com” that could not be accessed. Our observations were also in line with the DVA vaccine information study by Alagha and Helbing [11], who reported that Google Assistant and Siri were more capable of directing the user to authoritative sources than Alexa, which did not provide answers from the same sources as the other DVAs. Hence, we recommend supplementing Alexa's responses to mental health queries with those of another DVA or other external resources so that the user can identify any gaps or discrepancies in the health-related information provided.
There was a significant difference between the understandability of verbal responses and that of web responses. According to the SMOG readability test, verbal responses were less easily understood and contained more jargon than web responses. However, both types of responses scored poorly, indicating that the responses of the DVAs to mental health queries are unlikely to be understood by a layperson. Our results concurred with a study assessing the readability of online health information, which showed that, among 12 health conditions, the information on dementia and anxiety was the hardest to read [55]. As the understandability of health-related information is important for raising one's awareness and knowledge of mental health issues and self-care, we advocate that the information provided by DVAs be complemented with other information online and shared between the patient and caregiver (or someone whom the patient trusts) in a close and private setting that is comfortable for the patient.
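To make the readability measure concrete, the following is a minimal sketch of the standard SMOG grade formula, grade = 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291. It is an illustration only and not the tool used in this study: the function names `count_syllables` and `smog_grade` are hypothetical, and the syllable counter is a crude vowel-group heuristic that a dedicated readability tool would replace.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(text: str) -> float:
    """Estimate the SMOG grade of a text using the standard formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    words = re.findall(r"[A-Za-z']+", text)
    # Polysyllabic words are those with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    plain = "The cat sat. The dog ran."
    jargon = "Obsessive compulsive disorder necessitates psychotherapy."
    print(round(smog_grade(plain), 4))
    print(smog_grade(jargon) > smog_grade(plain))
```

A higher grade indicates that more years of education are needed to understand the text, so jargon-heavy responses with many polysyllabic words score worse on understandability.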
Across the mental health conditions, Siri, Bixby and Google Assistant scored the highest for questions on depression. Our results were similar to those of the study by Miner et al., which investigated the responses of Siri, Google Now, S Voice and Cortana to questions on depression [17]. In their study, the DVAs were generally able to recognize prompts, but they were not able to refer the user to a depression helpline. In contrast, a study by Kocaballi et al. showed that DVAs had the lowest ratio of appropriate responses to mental health prompts, including those on depression [53]. Even though there have been studies investigating the quality of conversational agents on mental health conditions [56],[57], these studies focused on other types of conversational agents, such as chatbots and mobile apps, instead of DVAs. To the best of our knowledge, there is a paucity of studies that explore the quality of DVAs in relation to mental health conditions, especially OCD and bipolar disorder. While Google Assistant seems to be one of the top two DVAs that can potentially be recommended for queries on OCD and bipolar disorder (Figure 2), its ability to answer questions on these two conditions may not be as well established as that for general mental health and depression queries. Interestingly, Siri did not perform as well on either of these mental health conditions. As such, we recommend that Apple users who seek information about OCD and/or bipolar disorder from Siri supplement its responses with other online resources, such as Google Assistant or Google searches. In any case, our study presents new insight into the quality of DVAs across four mental health conditions: depression, anxiety, OCD and bipolar disorder.
The main limitation of this study is that we were only able to evaluate a subset of four DVAs and four mental health conditions. Therefore, our results might not be representative of the DVAs' performances for other mental health conditions, nor of the quality of other DVAs (e.g., Google Home Mini and Microsoft Cortana). Furthermore, as the location function of the DVAs was switched on during our evaluations, the search results might have been adapted to the local context, and minor variations could exist depending on the country and location of the user. Studies have shown that the responses of DVAs to the same questions can differ [17],[58]. Although the qualitative responses of the DVAs were not compared in this study, we tried to minimize this variability by having each evaluator use the same devices for their evaluations. To account for variations in the scores given to the same DVA response by different evaluators, we calculated the ICC values for each quality domain (Table 3) to determine the inter-rater reliability; our results indicated moderate-to-good reliability. Similarly, inter-rater reliability for the overall quality scores of the DVAs was good. Nonetheless, we acknowledge that this bias may exist in the DVA responses, and our study results should be interpreted with this limitation in mind. In addition, our evaluation protocol might not reflect real-life usage of DVAs by the layperson. In our study, when the question posed to the DVAs was not recognized on the first attempt, two more attempts were made before the evaluation ended. However, in real life, users might not repeat the same question if their first try was unsuccessful. Next, due to time limitations, only the first web link provided by the DVAs was evaluated in this study; in reality, users might access other links as well if the DVAs provided more than one.
Lastly, our results only capture the quality of the DVAs at a single point in time. With advancements in voice recognition technologies, natural language processing and other AI-based algorithms, we expect that the quality of the DVAs will also improve over time. As such, we advise caution when extrapolating the results of this study to other DVAs, other countries/states, other mental health conditions or other points in time.
Overall, Google Assistant performed the best in responding to mental health-related queries, while Alexa performed the worst. In terms of specific mental health conditions, Bixby performed the worst for questions on general mental health and OCD. While the comprehension abilities of the DVAs were generally good, our study showed that the DVAs differed in performance in the domains of relevance, comprehensiveness, accuracy and reliability. Moreover, the responses of the DVAs generally lacked understandability. Based on our quality evaluations, we have provided a DVA recommendation list that users can consider for the different mental health conditions (Figure 2). While Google Assistant generally works well across all of the included mental health conditions, Siri and Bixby can also be used for depression and anxiety. On the other hand, Alexa and Bixby may potentially be used for OCD and bipolar disorder, respectively. However, we caution members of the general public who depend on DVA responses to their mental health-related queries to supplement the information provided by the DVAs with other online information from authoritative healthcare organizations, and to always seek the help and advice of a healthcare professional when managing their mental health condition(s). As many organizations adapt to the post-pandemic world, future research should focus on other types of mental health conditions (e.g., stress) in patients, caregivers and healthcare professionals resulting from specific circumstances, such as workplace disruptions, loss of healthcare services and the accumulation of new job roles as healthcare undergoes a major digital transformation worldwide. In addition, further research can evaluate the performance of other types of DVAs for mental health conditions that are relevant to the researchers' communities.
[1] T. Black, Global very weak solutions to a chemotaxis-fluid system with nonlinear diffusion, SIAM J. Math. Anal., 50 (2018), 4087–4116. doi: 10.1137/17M1159488
[2] H. M. Byrne, M. R. Owen, A new interpretation of the Keller-Segel model based on multiphase modelling, J. Math. Biol., 49 (2004), 604–626. doi: 10.1007/s00285-004-0276-4
[3] X. Cao, Fluid interaction does not affect the critical exponent in a three-dimensional Keller-Segel-Stokes model, Z. Angew. Math. Phys., 71 (2020), 61. doi: 10.1007/s00033-020-1285-x
[4] A. Chhetri, J. V. Rispoli, S. A. Lelievre, 3D cell culture for the study of microenvironment-mediated mechanostimuli to the cell nucleus: An important step for cancer research, Front. Mol. Biosci., 8 (2021), 628386. doi: 10.3389/fmolb.2021.628386
[5] D. A. Drew, S. L. Passman, Theory of multicomponent fluids, Springer, 1999.
[6] M. Di Francesco, A. Lorz, P. Markowich, Chemotaxis-fluid coupled model for swimming bacteria with nonlinear diffusion: global existence and asymptotic behavior, DCDS, 28 (2010), 1437–1453.
[7] S. Evje, An integrative multiphase model for cancer cell migration under influence of physical cues from the microenvironment, Chem. Eng. Sci., 165 (2017), 240–259. doi: 10.1016/j.ces.2017.02.045
[8] S. Evje, J. O. Waldeland, How tumor cells can make use of interstitial fluid flow in a strategy for metastasis, Cell. Mol. Bioeng., 12 (2019), 227–254. doi: 10.1007/s12195-019-00569-0
[9] S. Evje, H. Wen, A Stokes two-fluid model for cell migration that can account for physical cues in the microenvironment, SIAM J. Math. Anal., 50 (2018), 86–118. doi: 10.1137/16M1078185
[10] S. Evje, M. Winkler, Mathematical analysis of two competing cancer cell migration mechanisms driven by interstitial fluid flow, J. Nonlinear Sci., 30 (2020), 1809–1847. doi: 10.1007/s00332-020-09625-w
[11] G. Follain, D. Herrmann, S. Harlepp, V. Hyenne, N. Osmani, S. C. Warren, et al., Fluids and their mechanics in tumour transit: shaping metastasis, Nat. Rev. Cancer, 20 (2020), 107–124. doi: 10.1038/s41568-019-0221-x
[12] U. Haessler, J. C. M. Teo, D. Foretay, P. Renaud, M. A. Swartz, Migration dynamics of breast cancer cells in a tunable 3D interstitial flow chamber, Integr. Biol., 4 (2012), 401–409. doi: 10.1039/c1ib00128k
[13] A. Lorz, Coupled Keller-Segel-Stokes model: global existence for small initial data and blow-up delay, Commun. Math. Sci., 10 (2012), 555–574. doi: 10.4310/CMS.2012.v10.n2.a7
[14] S. Mishra, A machine learning framework for data driven acceleration of computations of differential equations, Math. Eng., 1 (2019), 118–146.
[15] J. A. Pedersen, S. Lichter, M. A. Swartz, Cells in 3D matrices under interstitial flow: effects of extracellular matrix alignment on cell shear stress and drag forces, J. Biomech., 43 (2010), 900–905. doi: 10.1016/j.jbiomech.2009.11.007
[16] W. J. Polacheck, J. L. Charest, R. D. Kamm, Interstitial flow influences direction of tumor cell migration through competing mechanisms, Proc. Natl. Acad. Sci. U.S.A., 108 (2011), 11115–11120. doi: 10.1073/pnas.1103581108
[17] W. J. Polacheck, A. E. German, A. Mammoto, D. E. Ingber, R. D. Kamm, Mechanotransduction of fluid stresses governs 3D cell migration, Proc. Natl. Acad. Sci. U.S.A., 111 (2014), 2447–2452. doi: 10.1073/pnas.1316848111
[18] Y. Qiao, P. Ø. Andersen, S. Evje, D. C. Standnes, A mixture theory approach to model co-and counter-current two-phase flow in porous media accounting for viscous coupling, Adv. Water Resour., 112 (2018), 170–188. doi: 10.1016/j.advwatres.2017.12.016
[19] Y. Qiao, S. Evje, A general cell–fluid Navier-Stokes model with inclusion of chemotaxis, Math. Models Methods Appl. Sci., 30 (2020), 1167–1215. doi: 10.1142/S0218202520400096
[20] Y. Qiao, S. Evje, A compressible viscous three-phase model for porous media flow based on the theory of mixtures, Adv. Water Resour., 141 (2020), 103599. doi: 10.1016/j.advwatres.2020.103599
[21] Y. Qiao, H. Wen, S. Evje, Viscous two-phase flow in porous media driven by source terms: analysis and numerics, SIAM J. Math. Anal., 51 (2019), 5103–5140. doi: 10.1137/19M1252491
[22] K. R. Rajagopal, On a hierarchy of approximate models for flows of incompressible fluids through porous solids, Math. Models Methods Appl. Sci., 17 (2007), 215–252. doi: 10.1142/S0218202507001899
[23] G. S. Rosalem, E. B. L. Casas, T. P. Lima, L. A. Gonzalez-Torres, A mechanobiological model to study upstream cell migration guided by tensotaxis, Biomech. Mod. Mech., 19 (2020), 1537–1549. doi: 10.1007/s10237-020-01289-5
[24] J. D. Shields, M. E. Fleury, C. Yong, A. A. Tomei, G. J. Randolph, M. A. Swartz, Autologous chemotaxis as a mechanism of tumor cell homing to lymphatics via interstitial flow and autocrine CCR7 signaling, Cancer Cell, 11 (2007), 526–538. doi: 10.1016/j.ccr.2007.04.020
[25] D. C. Standnes, S. Evje, P. Ø. Andersen, A novel relative permeability model based on mixture theory approach accounting for solid–fluid and fluid–fluid interactions, Tran. Por. Med., 119 (2017), 707–738. doi: 10.1007/s11242-017-0907-z
[26] M. A. Swartz, M. E. Fleury, Interstitial flow and its effects in soft tissues, Annu. Rev. Biomed. Eng., 9 (2007), 229–256. doi: 10.1146/annurev.bioeng.9.060906.151850
[27] Y. Tao, M. Winkler, Global existence and boundedness in a Keller-Segel-Stokes model with arbitrary porous medium diffusion, DCDS, 32 (2012), 1901–1914. doi: 10.3934/dcds.2012.32.1901
[28] Y. Tao, M. Winkler, Locally bounded global solutions in a three-dimensional chemotaxis-Stokes system with nonlinear diffusion, Ann. Inst. H. Poincaré Anal. Non Linéaire, 30 (2013), 157–178. doi: 10.1016/j.anihpc.2012.07.002
[29] I. Tuval, L. Cisneros, C. Dombrowski, C. W. Wolgemuth, J. O. Kessler, R. E. Goldstein, Bacterial swimming and oxygen transport near contact lines, Proc. Natl. Acad. Sci. U.S.A., 102 (2005), 2277–2282. doi: 10.1073/pnas.0406724102
[30] J. O. Waldeland, S. Evje, A multiphase model for exploring cancer cell migration driven by autologous chemotaxis, Chem. Eng. Sci., 191 (2018), 268–287. doi: 10.1016/j.ces.2018.06.076
[31] J. O. Waldeland, S. Evje, Competing tumor cell migration mechanisms caused by interstitial fluid flow, J. Biomech., 81 (2018), 22–35. doi: 10.1016/j.jbiomech.2018.09.011
[32] H. Wiig, M. A. Swartz, Interstitial fluid and lymph formation and transport: physiological regulation and roles in inflammation and cancer, Physiol Rev., 92 (2012), 1005–1060. doi: 10.1152/physrev.00037.2011
[33] M. Winkler, Stabilization in a two-dimensional chemotaxis-Navier-Stokes system, Arch. Rational Mech. Anal., 211 (2014), 455–487. doi: 10.1007/s00205-013-0678-9
[34] M. Winkler, Does fluid interaction affect regularity in the three-dimensional Keller-Segel system with saturated sensitivity?, J. Math. Fluid Mech., 20 (2018), 1889–1909. doi: 10.1007/s00021-018-0395-0
[35] M. Winkler, Small-mass solutions in the two-dimensional Keller-Segel system coupled to the Navier–Stokes equations, SIAM J. Math. Anal., 52 (2020), 2041–2080. doi: 10.1137/19M1264199
[36] M. Winkler, Global weak solutions in a three-dimensional Keller-Segel-Navier-Stokes system with gradient-dependent flux limitation, Nonlinear Anal. Real, 59 (2021), 103257. doi: 10.1016/j.nonrwa.2020.103257
[37] Y. S. Wu, Multiphase fluid flow in porous and fractured reservoirs, Elsevier, 2016.
[38] H. Zhou, P. Lei, T. P. Padera, Progression of metastasis through lymphatic system, Cells, 10 (2021), 627. doi: 10.3390/cells10030627
Number of responses (%), N = 66 (a)

| | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa |
| --- | --- | --- | --- | --- |
| Types of responses by DVAs | | | | |
| Verbal response only (b) | 6 (9.1) | 24 (36.4) | 1 (1.5) | 41 (62.1) |
| Web response only (c) | 48 (72.7) | 14 (21.2) | 13 (19.7) | 0 (0) |
| Verbal and web response (d) | 11 (16.7) | 28 (42.4) | 32 (78.8) | 24 (36.4) |
| No response | 1 (1.5) | 0 (0) | 0 (0) | 1 (1.5) |
| Proportion of successful responses | | | | |
| Questions that were recognized (e) | 63 (95.5) | 46 (69.7) | 66 (100.0) | 60 (90.9) |
| Relevant responses | 47 (71.2) | 38 (57.6) | 66 (100.0) | 44 (66.7) |
| Proportion of sources provided in DVA responses | | | | |
| Tier A | 13 (19.7) | 19 (28.8) | 36 (54.5) | 20 (30.3) |
| Tier B | 19 (28.8) | 18 (27.3) | 18 (27.3) | 8 (12.1) |
| Tier C | 15 (22.7) | 1 (1.5) | 6 (9.1) | 15 (22.7) |
| No sources provided, or sources that could not be evaluated | 19 (28.8) | 28 (42.4) | 6 (9.1) | 23 (34.8) |

Note: (a) Results were taken from the average of three evaluators. (b) A short verbal text that directly answered the question without providing a link. (c) A link was provided in response to the question without a verbal explanation. (d) Both a verbal explanation and a link were present in the response. (e) These were questions that were captured on the smartphone screen and induced a response by the DVA. Responses such as “I'm not sure I understood that” were classified as the DVA not recognizing the question.
Median Quality Scores of DVAs [% (IQR)]

| Classification of Questions | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-value* |
| --- | --- | --- | --- | --- | --- |
| Across all questions | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 |
| Mental Health Categories | | | | | |
| General mental health | 77.1 (71.3–85.2) | 0 (0–16.7) | 80.5 (76.4–89.1) | 70.7 (57.1–79.7) | <0.001 |
| Depression | 83.9 (76.0–86.9) | 87.7 (83.6–89.3) | 87.4 (79.4–88.7) | 63.0 (61.4–72.1) | <0.001 |
| Anxiety | 71.8 (67.2–87.6) | 75.9 (71.3–80.7) | 76.4 (69.8–83.6) | 60.5 (42.5–68.4) | 0.006 |
| Obsessive-compulsive disorder | 61.7 (53.7–69.8) | 0 (0–29.6) | 78.4 (73.6–85.7) | 72.3 (59.1–80.0) | <0.001 |
| Bipolar disorder | 66.4 (44.4–70.4) | 77.5 (71.3–81.6) | 75.9 (70.1–81.6) | 63.0 (48.2–80.5) | 0.004 |
| Question Subcategories | | | | | |
| Disease state | 71.7 (66.9–79.0) | 71.6 (28.2–83.3) | 80.5 (73.8–84.8) | 69.6 (62.5–80.2) | 0.031 |
| Symptoms | 66.7 (53.9–83.1) | 76.7 (25.0–80.9) | 77.5 (70.7–86.0) | 71.5 (57.7–80.4) | 0.239 |
| Treatment | 60.5 (49.4–74.2) | 78.3 (63.0–85.0) | 77.3 (69.6–84.3) | 57.3 (30.6–62.1) | <0.001 |

Note: * Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833.
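The post-hoc threshold of p < 0.00833 in the note above follows from dividing the overall significance level of 0.05 by the number of pairwise comparisons among four DVAs. The short sketch below reproduces that arithmetic only; the variable names are illustrative and it does not rerun the Kruskal-Wallis or Wilcoxon tests.

```python
from itertools import combinations

# The four DVAs compared in the study.
dvas = ["Apple Siri", "Samsung Bixby", "Google Assistant", "Amazon Alexa"]

# Every unordered pair of DVAs gets one post-hoc comparison.
pairs = list(combinations(dvas, 2))

# Bonferroni adjustment: divide the overall alpha by the number of comparisons.
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

if __name__ == "__main__":
    print(len(pairs))                # 6 pairwise comparisons
    print(round(adjusted_alpha, 5))  # 0.00833
```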
Median Quality Scores of DVAs [% (IQR)]

| Quality Domains | Apple Siri | Samsung Bixby | Google Assistant | Amazon Alexa | p-value* | Intraclass Correlation Coefficient [ICC (95% CI)] (a) |
| --- | --- | --- | --- | --- | --- | --- |
| Comprehension ability | 88.9 (70.8–100) | 94.5 (0–100) | 100 (100–100) | 100 (88.9–100) | <0.001 | 0.892 (0.868–0.913) |
| Relevance | 66.7 (50.0–100) | 100 (66.7–100) | 100 (83.3–100) | 75.0 (33.3–100) | <0.001 | 0.753 (0.691–0.804) |
| Comprehensiveness | 66.7 (44.4–83.3) | 66.7 (55.6–88.9) | 77.8 (55.6–88.9) | 22.2 (0–66.7) | <0.001 | 0.747 (0.660–0.812) |
| Accuracy | 100 (75.0–100) | 100 (83.3–100) | 100 (83.3–100) | 75.0 (50.0–100) | 0.003 | 0.691 (0.593–0.769) |
| Understandability | 50.0 (25.0–75.0) | 50.0 (33.3–68.8) | 50.0 (33.3–66.7) | 50.0 (25.0–75.0) | 0.724 | 0.672 (0.513–0.775) |
| Reliability | 72.9 (63.2–83.3) | 75.0 (63.9–84.3) | 75.0 (66.7–84.3) | 58.3 (49.1–63.9) | <0.001 | 0.896 (0.863–0.922) |
| Overall quality | 70.4 (60.9–79.3) | 72.8 (0–81.6) | 78.9 (73.9–85.2) | 64.5 (57.7–76.7) | <0.001 | 0.848 (0.813–0.877) |

Note: * Kruskal-Wallis test was performed among all four DVAs with statistical significance defined as p < 0.05. Post-hoc analyses using the Wilcoxon rank sum test with Bonferroni adjustment were performed for each possible pairwise comparison among the DVAs, with statistical significance defined as p < 0.00833. (a) ICC values and their 95% CIs were calculated using the SPSS platform based on the mean rating of three evaluators, absolute agreement and a two-way mixed-effects model. ICC values indicate moderate-to-good inter-rater reliability.
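For readers unfamiliar with the ICC variant named in the note above, the following sketch computes an average-measures, absolute-agreement ICC from two-way ANOVA mean squares, ICC(A,k) = (MSR − MSE) / (MSR + (MSC − MSE)/n). It is an illustration under stated assumptions: the study's values came from SPSS, the function name `icc_absolute_agreement_mean` is hypothetical, the ratings below are made-up numbers rather than study data, and confidence intervals are not computed.

```python
def icc_absolute_agreement_mean(ratings):
    """Average-measures ICC with absolute agreement, ICC(A,k).

    `ratings` is a list of rows, one per rated response, each holding
    one score per evaluator (same number of evaluators in every row).
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of evaluators
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: responses (rows), evaluators (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


if __name__ == "__main__":
    # Hypothetical scores (%) from three evaluators for four responses.
    example = [[70, 75, 72], [40, 45, 50], [90, 88, 92], [60, 58, 65]]
    print(round(icc_absolute_agreement_mean(example), 3))
```

Because this variant measures absolute agreement, a constant offset between evaluators lowers the coefficient even when they rank the responses identically.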