
Hardware-friendly compression and hardware acceleration for transformer: A survey


  • The transformer model has recently been a milestone in artificial intelligence. The algorithm has enhanced the performance of tasks such as Machine Translation and Computer Vision to a level previously unattainable. However, alongside its strong performance, the transformer model also requires substantial memory overhead and enormous computing power. This significantly hinders the deployment of an energy-efficient transformer system. Due to the high parallelism, low latency, and low power consumption of field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), they demonstrate higher energy efficiency than Graphics Processing Units (GPUs) and Central Processing Units (CPUs). Therefore, FPGAs and ASICs are widely used to accelerate deep learning algorithms. Several papers have addressed the issue of deploying the Transformer on dedicated hardware for acceleration, but there is a lack of comprehensive studies in this area. Therefore, we summarize the transformer model compression algorithm based on the hardware accelerator and its implementation to provide a comprehensive overview of this research domain. This paper first introduces the transformer model framework and computation process. Secondly, a discussion of hardware-friendly compression algorithms based on self-attention and Transformer is provided, along with a review of a state-of-the-art hardware accelerator framework. Finally, we consider some promising topics in transformer hardware acceleration, such as a high-level design framework and selecting the optimum device using reinforcement learning.

    Citation: Shizhen Huang, Enhao Tang, Shun Li, Xiangzhan Ping, Ruiqi Chen. Hardware-friendly compression and hardware acceleration for transformer: A survey[J]. Electronic Research Archive, 2022, 30(10): 3755-3785. doi: 10.3934/era.2022192




    Social robots, known for their human-friendly interactions, are becoming common in a myriad of domains, from healthcare and education to the comfort of our homes [1,2,3,4]. These robots are often designed with the dual purpose of executing functional tasks while also establishing a dynamic rapport through communication and interaction. At the heart of social robotics lies the principle of human-robot interaction (HRI), which has advanced significantly over the years, adapting to the complexities of human communication and behavior [5]. However, a critical component of this progress is the development and integration of language translation and understanding capabilities, which is foundational to a truly versatile and effective social robot [6,7]. This advancement not only enables these robots to cross the barrier of language, facilitating interaction with humans in a more context-specific and nuanced manner, but also resonates with the multicultural and multilingual reality of our global society. The need for effective and inclusive communication with robots, irrespective of language barriers, has never been more relevant [8].

    However, as we strive to advance the frontier of social robot interaction, we are met with significant challenges. Primary among these is the inherent limitation of monolingual interaction. Many current social robots can only communicate in a single language, often English, which restricts their functionality and universal appeal. Expanding their linguistic repertoire is therefore a critical step towards ensuring more inclusive and effective interactions. While efforts have been made to incorporate multilingual capabilities into social robots, these attempts have been plagued by translation inaccuracies [9]. Understanding the subtleties and nuances inherent in human language, such as idioms, metaphors or cultural references, can pose significant difficulties for AI systems and thus hamper effective HRI. Another hurdle is the difficulty of understanding context and human intent. Human communication is rarely devoid of context; the same set of words can convey drastically different meanings depending on the surrounding conversation and nonverbal cues. While humans are naturally adept at grasping such subtleties, replicating this skill in social robots proves challenging [10]. Additionally, understanding human intent extends beyond the spoken words. It requires the ability to decipher indirect communication, sarcasm and cultural nuances, areas that are yet to be fully explored in social robotics.

    With these challenges in mind, the urgency and significance of further exploration into AI-based language translation and understanding in social robot interaction come into clear focus. In the face of our multicultural, multilingual world, the ability of robots to understand and interact in multiple languages can break down barriers and foster a more universal adoption of social robots across diverse cultural contexts [11]. For instance, consider the case of healthcare robots deployed in eldercare facilities where residents may come from diverse linguistic backgrounds. These robots could provide comfort, monitor health and even assist in therapeutic activities [12]. However, the effectiveness of such robots would be drastically limited if they could not understand or respond accurately to the multilingual needs of the residents. Moreover, in an educational setting, a social robot that understands the language and culture of the learners can offer personalized assistance, helping to bridge the educational gap in linguistically diverse classrooms. Our study, through this comprehensive review, seeks to address these challenges by systematically examining the existing literature in the field, identifying gaps in current knowledge and presenting opportunities for further research. This work is driven by the belief that advancements in AI-based language translation and understanding can fundamentally transform the interaction between humans and robots, paving the way for a future where social robots are not merely tools but companions capable of understanding and communicating with us in the language we speak, the way we speak it.

    The remainder of this paper is structured as follows. Section 2 explores language translation and understanding methods. Section 3 provides an overview of social robot interaction, covering text-based and language translation-based human-robot interaction. Section 4 showcases cutting-edge applications of language translation and understanding in various domains, such as domestic assistance, education, customer service and cross-cultural collaboration. Then, Section 5 discusses current challenges and future directions in the field. Finally, in Section 6, we summarize our key findings, emphasize the impact of AI in machine translation and human-robot interaction and highlight opportunities for future research.

    Figure 1 illustrates the process of speech interaction, encompassing speech input processing, language processing, dialogue management, and speech output processing. It begins with automatic speech recognition (ASR) to convert speech into text, followed by machine translation (MT) and natural language understanding (NLU) for language processing. Dialogue management involves user intent recognition, contextual understanding, and response generation, enabling a deeper understanding of the user's input. Finally, text-to-speech synthesis (TTS) converts the response text into synthesized speech output, allowing the robot to effectively communicate its response to the user. This section provides detailed explanations of four parts: speech recognition, machine translation, sentiment analysis, and natural language understanding and generation (NLU/NLG).

    Figure 1.  Natural language processing workflow.
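    To make the workflow in Figure 1 concrete, the following minimal Python sketch chains the stages in order. The stub functions are hypothetical placeholders rather than components of any system surveyed here; each returns canned data so that the end-to-end dataflow can be run and inspected.

```python
# A minimal, self-contained sketch of the Figure 1 workflow. The stub
# functions below are hypothetical placeholders for real ASR/MT/NLU/NLG/TTS
# services; each returns canned data so the dataflow can be run end to end.

def recognize_speech(audio: bytes, language: str) -> str:           # ASR: speech -> text
    return "¿dónde está la farmacia?"

def translate(text: str, src: str, tgt: str) -> str:                # MT: language -> language
    return "where is the pharmacy?" if tgt == "en" else "la farmacia está a la izquierda"

def parse_intent(text: str) -> tuple[str, dict]:                    # NLU: intent + entities
    return "ask_location", {"place": "pharmacy"}

def generate_response(intent: str, slots: dict) -> str:             # dialogue management / NLG
    return f"the {slots['place']} is on the left"

def synthesize_speech(text: str, language: str) -> bytes:           # TTS: text -> "audio"
    return text.encode("utf-8")  # stand-in for an audio waveform

def handle_utterance(audio: bytes, user_lang: str = "es", robot_lang: str = "en") -> bytes:
    text = recognize_speech(audio, language=user_lang)
    text_rl = translate(text, src=user_lang, tgt=robot_lang)
    intent, slots = parse_intent(text_rl)
    reply_rl = generate_response(intent, slots)
    reply = translate(reply_rl, src=robot_lang, tgt=user_lang)
    return synthesize_speech(reply, language=user_lang)

if __name__ == "__main__":
    print(handle_utterance(b"<raw audio>"))
```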

    Speech recognition serves as a cornerstone for communication between humans and social robots. This technology's origin can be traced back to the 1960s, with IBM's Shoebox being one of the earliest systems capable of recognizing spoken digits and a limited set of words [13]. Over the decades, speech recognition technology has evolved significantly, with advancements fueled by machine learning and deep learning techniques. In the context of social robot interaction, speech recognition forms the first line of processing in language translation and understanding [14]. It converts spoken language into written text, facilitating the robot's comprehension and subsequent response generation. This process must contend with challenges such as the diversity of human languages, accents and the presence of ambient noise, to name a few [15,16,17].

    Advanced speech recognition algorithms, powered by deep learning methodologies, have shown remarkable capabilities in handling these challenges. For instance, Google's speech recognition system has been instrumental in providing accurate transcription services, even in noisy environments, paving the way for more efficient HRI. Similarly, Apple's Siri relies on robust speech recognition technology, enabling the assistant to understand and execute a wide range of user commands. Furthermore, Microsoft's work in this arena, especially with products like Azure Speech Services, has contributed immensely to enhancing the accuracy and versatility of speech recognition systems across different applications, augmenting the linguistic capacities of social robots.
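    As a concrete, hedged illustration of how a robot might call such services, the snippet below transcribes a short audio file with the third-party Python SpeechRecognition package, which wraps Google's free Web Speech API among other back ends. The file name and locale codes are illustrative, and network access is assumed.

```python
# Hedged example: transcribing an audio file with the third-party
# SpeechRecognition package (pip install SpeechRecognition). The file name
# and locale codes are illustrative.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:      # a short recorded utterance
    audio = recognizer.record(source)

for lang in ("en-US", "es-ES"):                  # try several locales
    try:
        print(lang, "->", recognizer.recognize_google(audio, language=lang))
    except sr.UnknownValueError:
        print(lang, "-> speech not intelligible in this locale")
```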

    Machine translation (MT) has been a pivotal component in the evolution of HRI, enabling robots to understand and generate language beyond their programmed instructions [18]. Since its inception in the 1950s, machine translation has passed through several stages of development, from rule-based systems to statistical and, more recently, to neural network approaches [19,20,21]. Statistical machine translation (SMT) is a prominent example of the early stage of MT. Introduced in the late 1980s, SMT models rely on the analysis of bilingual text corpora to predict translations [22]. The introduction of neural machine translation (NMT) marked a significant leap forward. With its deep learning architecture, NMT provides end-to-end learning and can generate more natural and accurate translations. NMT models are capable of capturing the context and semantics of sentences, contributing to a more nuanced and effective translation [23]. In the context of social robot interaction, machine translation plays a crucial role in bridging the gap between different languages. Once the speech is recognized and converted into text, MT steps in to convert the text into a language that the robot understands. Subsequently, the robot's responses are translated back into the human user's language. The seamless integration of speech recognition and machine translation technologies enables social robots to communicate effectively and naturally with users of different languages [24].
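    The following minimal sketch, assuming the Hugging Face Transformers library and a pretrained MarianMT model from the Helsinki-NLP opus-mt collection (the model name and sentence are illustrative), shows how recognized text could be translated before being passed to the robot's dialogue manager.

```python
# Sketch of neural machine translation with Hugging Face Transformers and a
# pretrained MarianMT model; any opus-mt language pair works the same way.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"        # illustrative English->German model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Please hand me the red cup."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```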

    Sentiment analysis, also known as opinion mining, is a subfield of natural language processing (NLP) that identifies and extracts subjective information from source materials [25,26,27]. It is primarily used to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The technology has been broadly applied in text analysis, business analytics and social media monitoring. For social robots, sentiment analysis plays a crucial role in understanding human emotions, which is essential for effective HRI. By analyzing the sentiment of the user's input, social robots can adjust their responses accordingly, leading to more engaging and personalized interactions. For instance, if a user's input is detected as negative, the robot might respond in a way that shows empathy or attempts to uplift the user's mood [28].
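    The snippet below sketches this idea with the Hugging Face pipeline API and its default English sentiment model; the utterances are illustrative, and a robot could branch on the predicted label to choose an empathetic response.

```python
# Minimal sentiment-analysis sketch using the Hugging Face pipeline API with
# its default English model; the utterances are illustrative.
from transformers import pipeline

analyzer = pipeline("sentiment-analysis")
for utterance in ["I had a wonderful day!", "I'm feeling really lonely tonight."]:
    result = analyzer(utterance)[0]
    print(utterance, "->", result["label"], round(result["score"], 3))
```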

    Natural language understanding (NLU) and generation (NLG) are two critical aspects of NLP that deal with machine reading comprehension and the production of human-like text, respectively. NLU is the process of understanding and interpreting human language in a valuable way, which enables the social robot to understand and interpret the user's commands, questions and statements [29]. On the other hand, NLG is the task of converting information from computer databases or semantic intents into readable human language, which allows the robot to generate responses that are coherent, relevant and human-like [30,31]. Together, NLU and NLG form the backbone of the conversational capabilities of social robots, enabling them to carry out meaningful and natural interactions with users.
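    A hedged sketch of this division of labor is shown below: NLU is approximated by zero-shot intent classification with a Hugging Face pipeline, and NLG is reduced to a small template table. The candidate intents and templates are illustrative rather than drawn from any surveyed system.

```python
# Hedged sketch separating NLU (intent recognition via zero-shot classification)
# from NLG (template-based response generation); intents and templates are illustrative.
from transformers import pipeline

nlu = pipeline("zero-shot-classification")
intents = ["set reminder", "ask weather", "small talk"]

def understand(utterance: str) -> str:
    # Return the highest-scoring candidate intent for the utterance.
    return nlu(utterance, candidate_labels=intents)["labels"][0]

def respond(intent: str) -> str:
    # A trivial NLG stage: map each intent to a canned reply.
    templates = {
        "set reminder": "Sure, I will remind you.",
        "ask weather": "Let me check the forecast for you.",
        "small talk": "I'm happy to chat!",
    }
    return templates.get(intent, "Could you rephrase that?")

print(respond(understand("Can you remind me to take my pills at 8 pm?")))
```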

    The evolution of AI-based translation and understanding can be traced through several key applications that have significantly influenced the field. One of the earliest and most well-known is Google Translate. Launched in 2006, Google Translate initially used statistical machine translation, which translates text based on the analysis of bilingual text corpora. However, in 2016, Google introduced a neural machine translation system, which translates entire sentences at a time rather than piece by piece, providing more fluent and natural sounding translations [32]. Following Google Translate, iFlytek Translator made its debut. Developed by iFlytek, a Chinese information technology company, it is renowned for its high accuracy in speech recognition and translation, especially between English and Mandarin. The device uses deep learning technologies and can support translation between 50 languages [33].

    In the realm of personal assistants, Apple's Siri is a notable example. Introduced in 2011, Siri uses natural language processing to interpret voice commands, answer questions, make recommendations and perform actions by delegating requests to a set of Internet services [34]. Since then, Siri has relied on robust language translation and understanding technology, enabling the assistant to understand and execute a wide range of user commands. Currently, Siri supports a multitude of languages and dialects, and can adapt to users' individual language usage and search preferences over time. Most recently, OpenAI's ChatGPT has emerged as a state-of-the-art language model. Trained on a diverse range of internet text, ChatGPT generates human-like text based on the input provided. It can translate languages, answer questions, write essays and even generate creative content like poetry [35,36].

    In the early days of computing, the primary mode of social robot interaction was through textual commands and responses. This form of interaction, although seemingly primitive by today's standards, laid the foundation for more complex forms of HRI that we see today [37,38,39]. One of the earliest examples of text-based social robot interaction is the ELIZA program developed by Weizenbaum at MIT in the 1960s [40]. ELIZA was a computer program that emulated a psychotherapist by using pattern matching and substitution methodology to simulate conversation. Despite its simplicity, ELIZA was able to demonstrate the illusion of understanding, which marked a significant milestone in the field of artificial intelligence and social robot interaction. However, text-based interaction lacks the richness of non-verbal cues, such as facial expressions and body language, which play a crucial role in human communication [41].

    The evolution of social robot interaction took a significant leap with the introduction of speech-based interaction. This development was largely facilitated by advancements in speech recognition technology, which allowed robots to understand and respond to spoken language. Compared to text-based social robot interaction, the interaction based on speech can provide a more natural and convenient interactive experience for users.

    Among the variety of languages, English is the most commonly used for social robot interaction. Replika is a chatbot application designed to provide users with conversation and emotional support [42]; its initial version supported only English communication. In addition, many chatbots are used to enhance students' English language learning. In [43], Kanda et al. examined the potential for robots to form relationships with children and facilitate learning. A field trial at a Japanese elementary school involving English-speaking Robovie robots showed that initial interactions were frequent but declined over time. However, continued interaction during the second week predicted improvements in English skills, especially for children with prior proficiency or interest in English. Zakos [44] invented CLIVE, an artificially intelligent chatbot designed to facilitate English language learning through engaging and natural conversations. Unlike other tutoring systems, CLIVE offered an open and diverse range of topics, providing users with a lifelike and immersive language learning experience. In [45], Mini, a social robot, was designed to assist and accompany the elderly in various aspects of their daily lives. The robot offered services in personal assistance, entertainment, safety and stimulation, supporting cognitive and mental tasks.

    However, while the use of English as the primary language for speech-based interaction has its advantages, such as a large user base and extensive research and resources, it also presents significant limitations. The primary limitation is the exclusion of non-English speakers, who constitute a large portion of the global population [46]. This has led to a growing recognition of the need for multilingual capabilities in social robot interaction. Moreover, even within English speech-based interaction, there are challenges related to understanding accents, dialects and cultural nuances. This highlights the need for more advanced language understanding capabilities that can cater to the diversity of users.

    The development of language translation and understanding technologies has significantly broadened the scope of social robot interaction [47]. Unlike single-language-based virtual chatbots, most physical social robots are designed for multilingual interaction to cater to diverse user populations, thereby enhancing the user's understanding and engagement. This multilingual capability is particularly beneficial in multicultural and multilingual settings, where users may speak different languages.

    One notable example of a social robot that leverages language translation and understanding is SoftBank's NAO robot, which has been used to teach English to native speakers of Dutch, German and Turkish as well as Dutch or German to Turkish-speaking children living in the Netherlands or Germany [48]. The NAO robot, through its ability to produce speech in various languages, provides a personalized, one-on-one tutoring experience that is both engaging and effective. Pepper, another social robot developed by SoftBank, utilizes natural language processing and speech recognition technologies, enabling it to recognize and understand text and speech inputs in multiple languages [49]. It supports various commonly used languages, including English, French, German, Italian, Spanish, Japanese and more. Therefore, users can communicate with Pepper in their familiar language of choice. In linguistically diverse L2 classrooms, social robots, which have been programmed to communicate in multiple languages, were used to assist L2 vocabulary learning [50]. Surprisingly, providing L1 translations through the robot did not demonstrate a facilitating effect on bilingual children's L2 word learning, contrary to initial predictions.

    In recent years, the field of social robotics has witnessed significant advancements with the introduction of AI models like ChatGPT by OpenAI. ChatGPT, a large-scale language model, has been instrumental in enhancing the language translation and understanding capabilities of social robots. In military settings, ChatGPT is expected to play a role in various applications, including military robotics, battle space autonomy, automated target recognition and language translation [51]. Specifically, ChatGPT could be utilized to translate messages between several languages to improve understanding and communication between military units or between local communities and the military in operational regions. In the industrial domain, Ye et al. [52] investigated the impact of incorporating ChatGPT in a human-robot assembly task, where a robot arm controlled by RoboGPT assisted human operators. The study demonstrated that integrating ChatGPT significantly enhanced trust, attributed to improved communication and the robot's ability to understand and respond appropriately to human language. However, it is important to note that while ChatGPT represents a significant step forward, there are still challenges to overcome, such as ensuring the accuracy and appropriateness of its responses, and improving its ability to understand and respond to nonverbal cues.

    In order to achieve natural and effective HRIs, social robots also need to possess the ability to perceive and identify complex emotional body language as well as display their own behaviors using similar communication modes [53,54]. In [55], McColl and Nejat focused on the design of emotional body language for Brian 2.0, a human-like social robot, by incorporating various body postures and movements found in human emotion research. In a more recent study, Hong et al. [56] presented a novel multimodal emotional HRI architecture that combined body language and vocal intonation to detect user affect. Not only can the social robot interact with a human user in English but it can also determine its own emotional behavior based on user affect. For deaf and hard-of-hearing children, sign language plays a more critical role in their lives than spoken languages. In [57], Meghdari et al. presented RASA (Robot Assistant for Social Aims), an educational social robot specifically designed for teaching Persian Sign Language (PSL) to children with hearing disabilities. RASA is characterized by its interactive social functionality, the ability to perform PSL with a dexterous upper-body and its cost-effectiveness. These examples demonstrate that the integration of language translation and understanding technologies in social robots holds great promise for the future of HRI.

    In this section, we explore the forefront applications of AI-driven language translation and understanding in social robot interactions, spanning family assistance, educational support, service provision, travel guidance and cross-cultural collaboration. The documents referenced in this section have been meticulously chosen based on specific criteria. The majority of these were sourced from Google Scholar, while a minor fraction was curated from the broader internet, particularly from dedicated robot websites featuring news reports. It is imperative to highlight that the articles searched in Google Scholar underwent peer-review processes, ensuring their credibility and relevance. The content of these papers predominantly pertains to applications of robots equipped with multilingual comprehension and translation capabilities in HRIs, encompassing family, education, service, travel guide and cross-cultural collaboration domains. In terms of temporal relevance, the literature incorporated herein has been predominantly published within the last decade, with a significant emphasis on studies and advancements from the past five years. Figure 2 presents the various applications of social robot interaction based on language translation and understanding.

    Figure 2.  Various applications of social robot interaction based on language translation and understanding.

    In order to enhance overall family well-being and convenience, there is an emerging demand for intelligent family robots. These assistive robots, with outstanding language abilities and cute appearances, can assist with various aspects of family life, including companionship, communication, home automation and care support.

    Assistive robots in the family are represented by companion chatbots. By extracting a formal meaning representation from natural language utterances, Atzeni and Atzori [58] proposed a language-independent approach for creating smart personal assistants and chatbots dubbed AskCo, which supported multiple languages. The system enables easy extensibility through Java source code and eliminates the need for training on large datasets, making it a flexible and efficient solution. Jelly, an AI-based chatbot developed using Facebook's Blenderbot, overcame language barriers by conversing with users in their native language, Nepali. It aimed to provide a comfortable and engaging conversation experience for those who struggle with English bots. The use of powerful text generation models enabled Jelly to understand romanized Nepali with English alphabets [59]. Regarding the Buddy robot [60], as shown in Figure 3, it is designed as an affordable family companion aimed at facilitating communication, ensuring home security, providing educational entertainment and even assisting with eldercare. Buddy is capable of autonomous movement and interacts with the environment through its integrated sensors, enabling object and facial recognition as well as language understanding and generation. It comes with pre-set languages of French and English and supports additional language downloads such as Japanese, Mandarin, Korean, etc.

    Figure 3.  Appearance design of Buddy [60].

    With the increasing development of the Internet of Things (IoT) and home automation systems, there is a growing need for multilingual support to overcome language barriers, particularly for non-English speakers. In 2017, Eric et al. [61] explored the integration of voice control into a smart home automation system by leveraging voice recognition tools. The authors discussed different architectures for voice-enabled systems and evaluated available speech-to-text and text-to-speech engines, with a focus on the Google Cloud Speech API, which supported multiple languages. Two years later, noting that smartphones are commonly used to manage multiple remote controllers, Bajpai and Radha [62] proposed a solution using a smartphone and an Arduino microcontroller to create a universal remote controller for cost-effective and convenient home automation. The study focused on developing a voice recognition system to control electronic appliances in a signal-based smart home network, enhancing the ease of use and accessibility for users. In 2021, a multilanguage IoT home automation system was developed, specifically targeting elderly individuals in Malaysia, enabling them to control their home appliances using voice commands in their preferred language [63]. This research contributed to enhancing accessibility and usability for individuals with physical disabilities and older adults, who may face language challenges in utilizing smart home technologies. More recently, Soni et al. [64] introduced a novel approach to remotely control home appliances using smartphones, leveraging IoT technology. The system allows users to control appliances through voice commands in multiple languages, addressing the language barrier and enhancing system robustness and user convenience. The experimental results demonstrated a high level of performance, with an average success rate of 95.4%.

    Smart and multipurpose voice recognition guiding robots have been developed to assist disabled people. For visually impaired individuals, Kalpana et al. [65] proposed an RTOS-enabled smart and multipurpose voice recognition guiding robot, which supported multiple regional languages. The robot, designed in the form of a dog, utilized the Google Voice Recognition API to recognize user commands and employed light detection signals and a corner crossing algorithm for obstacle avoidance. It also included a watchdog mode for abnormal movement detection and a self-charging feature using photovoltaic cells. Another project developed a multi-language reading device to aid visually impaired individuals in accessing information from regular books. The device utilized conversational AI technology, including image-to-text, translation and text-to-speech modules from Google Cloud. It supported multiple languages and could be used in public areas [66].

    Educational applications have been a primary driver of multilingual social robots. Students from diverse linguistic backgrounds have different needs for language learning and educational experiences. Tutoring robots, especially NAO, can be applied to various educational tasks, including language tutoring, STEM training, metacognition tutoring, geometrical thinking training, oral proficiency development and facilitating communication and engagement in hybrid language classrooms. Developing multilingual tutoring robots has therefore become a trend. Table 1 compares multilingual assistive robots in education applications.

    Table 1.  Comparison of multilingual assistive robots in education applications. (processed by authors).
    Ref. | Robots | Languages | Subjects | Applications
    [67] | NAO robot | English, Dutch | 194 children | Tutoring children in English vocabulary
    [68] | Keepon robot | English, Spanish | First-graders | Personalized robot tutoring
    [69] | Mobile robot | Korean, Vietnamese, English | N/A | STEM training; user interaction
    [70] | Chatbot | 26 languages | 51 people | Exam preparation
    [71] | NAO robot | English, German | 40 participants | Tutoring foreign language words
    [72] | NAO robot | Chinese, English | Preschoolers | Teaching preschoolers to read and spell
    [73] | NAO robot | English, Chinese, Japanese and Korean | 19 college students | Individual tutoring and interactive learning experiences for students
    [74] | NAO robot | Norwegian, English | 20 children | Children's language learning progress in Norwegian day-care centers
    [75] | NAO robot | Chinese, English | 24 primary school students | English teaching
    [76] | Telepresence robot | Finnish, German, Swedish and English | 10–20 students | Classroom interaction; supporting remote students
    [77] | Telepresence robot | Japanese, English | More than 50 children | International communication between distant classrooms
    [78] | EngSISLA | English, Hindi, Punjabi | Speakers of different age groups | Translating speech to Indian Sign Language


    Various studies have explored the effectiveness of social robots and mobile robots in language tutoring and STEM training. For example, Vogt et al. [67] implemented a large-scale experiment using a social robot to tutor young children in English vocabulary. Figure 4 illustrates the iconic gestures the tutoring robot used. The robot, capable of translating between English and Dutch, was compared to a tablet application in terms of teaching new words. The results indicated that children were equally able to acquire and retain vocabulary from both the robot and the tablet. In another study, Leyzberg et al. [68] investigated the effectiveness of a personalization system for social robot tutors in a five-session English language learning task with native Spanish-speaking first-graders. The system, based on an adaptive Hidden Markov Model, ordered the curriculum to target individual skill proficiencies (Figure 5). The results demonstrated that participants who received personalized lessons from the robot tutor outperformed those who received non-personalized lessons. More recently, [69] explored the application of mobile robots in STEM training and proposed models that combined a mobile robot with an Android OS tablet for user interaction and voice control. They conducted experiments using an AI Processor to control the robot through voice commands in three languages (Korean, Vietnamese and English). The results showed high average confidence levels, providing a foundation for developing systems that support student learning through voice interaction with multi-language mobile robots. Furthermore, Schlippe et al. [70] developed a multilingual interactive conversational AI tutoring system for exam preparation. The system utilized a multilingual bidirectional encoder representations from transformers (M-BERT) model to automatically score free-text answers in 26 languages. It leveraged learning analytics, crowdsourcing and gamification to enhance the learning experience and adapt the system.

    Figure 4.  Examples of iconic gestures used in this study, photographed from the learner's perspective [67].
    Figure 5.  (a) A first-grade student interacts with the robot tutor. The caption here is an English translation of what the robot is saying in Spanish. (b) A Keepon robot [68].

    Several studies have explored the use of NAO robots in language tutoring and educational settings, showcasing their potential to personalize tutoring, engage learners, enhance language proficiency and address educational challenges such as teacher shortages. In their novel approach, Schodde et al. [71] utilized a Bayesian knowledge tracing model combined with tutoring actions to personalize language tutoring in HRI, using word pairs in the artificial language Vimmi to prevent associations with known words or languages. Evaluation results demonstrated the superior effectiveness of the adaptive model in facilitating successful L2 word learning compared to randomized training. Another study by He et al. [72] focused on educational purposes and introduced a multi-language robot system that employed voice interaction and automatic questioning to engage learners in metacognition tutoring and geometrical thinking training. In the context of enhancing oral English proficiency, Lin et al. [73] developed the English oral training robot tutor system (EOTRTS), utilizing NAO, a social robot, to provide individual tutoring and interactive learning experiences for students in Taiwan. This system also had the potential to facilitate the learning of other foreign languages such as Japanese and Korean. Furthermore, a study in 2021 transformed the language shower program into a digital solution using a smartphone/tablet app and an NAO robot, demonstrating its positive impact on children's language learning progress in Norwegian day-care centers [74]. This highlights the potential of social robots in enhancing language learning. Addressing the shortage of English teachers in Taiwan, the modular English teaching multi-robot system (METMRS) employed NAO as the main teacher and Zenbo Junior robots as assistants, offering an innovative solution for English education [75].

    Telepresence robots have emerged as valuable tools in educational settings, facilitating communication and engagement across language barriers and enhancing the learning experience for remote students. Jakonen and Jauni [76] explored the use of telepresence robots in hybrid language classrooms, where remote students participate through videoconferencing technology. Their findings highlight how telepresence robots enhance remote students' engagement and contribute to the multimodal meaning-making in hybrid language teaching. Similarly, Tanaka et al. [77] discussed the outcomes of a JST PRESTO project that utilized child-operated telepresence robots to facilitate international communication between classrooms, demonstrating the effectiveness of the system in enabling young children to communicate across language barriers.

    Emerging technologies in education and communication, such as multilingual tutoring and speech-to-sign language translation systems, are transforming learning experiences and facilitating effective communication across language barriers. For example, Roybi, one of the most popular multilingual tutoring robots on the market, offers children an individualized educational experience through the use of AI. This interactive robot introduces children to technology, mathematics and science while engaging with them in various languages, including Spanish, French, English and Mandarin. In a similar field, a system called SISLA was proposed in [78], which utilizes a 3D avatar to translate speech into Indian Sign Language. With impressive accuracy rates of 91% for English and 89% for Punjabi and Hindi, usability testing confirms its effectiveness for educational and communication purposes, particularly for the hearing impaired.

    Driven by the need to improve customer experiences, various assistive robots in service have been developed. For instance, interactive information support systems and banking robots have been designed to enhance concierge services. Additionally, assistive robots in healthcare, rehabilitation and mental health have emerged as innovative solutions, providing support in hospitals, monitoring emotional well-being and improving accessibility for individuals with disabilities or elderly individuals. These cutting-edge applications demonstrate the transformative impact of language translation and understanding in social robot interaction across various domains.

    To upgrade concierge services in hotels, Yamamoto et al. [79] proposed an interactive information support system utilizing smart devices and robot partners. The system comprises robot partners for communication and interaction with users and informationally structured space servers for data processing and personalized recommendations. It should be noted that the robots can select their communication language by recognizing the language of the user's greeting. The basic conversation flow is shown in Figure 6.

    Figure 6.  Scene transition [79].

    Advancements in automatic speech recognition (ASR) and humanoid robot technologies are transforming the banking industry, enhancing customer experiences and overcoming language barriers. In a study conducted in Greece [80], researchers developed innovative methodologies for voice activity detection and noise elimination in budget robots, enabling effective ASR in challenging acoustically quasi-stationary environments. Furthermore, showcasing the potential of AI-driven robots in the banking sector, Pepper, a multi-linguistic humanoid robot, has made a positive impact at BBBank [81]. With its friendly and helpful demeanor, Pepper has assisted customers in various tasks, such as blocking stolen credit cards and providing relevant information. This successful integration of Pepper highlights its ability to enhance customer experience and illustrates the growing significance of robots in the banking industry.

    Leveraging recent advancements in mobile speech translation and cognitive architectures, multilingual promotional robots have emerged with great potential. In the pursuit of user-friendly and adaptable speech-to-speech translation systems for mobile devices, Yun et al. [82] developed a robust system by leveraging a large language and speech database. This research showcased the successful creation of a mobile-based, multi-language translation system capable of operating in real-world environments. Building upon this, Romero et al. [83] introduced the CORTEX cognitive architecture for social robots. By integrating different levels of abstraction into a unified deep space representation (DSR), this architecture facilitated agent interaction and behavior execution. The utilization of Microsoft's Kinect program in a separate embedded computer further enhanced the system's multi-language and multi-OS capabilities, exemplifying the potential of such technology in robotics.

    In recent years, multilingual service robots have gained popularity, empowering diverse user groups with multimodal capabilities and extensive language support. Therefore, many innovative solutions have been proposed. The PaeLife project, conducted in 2015, aimed to develop AALFred, a multimodal and multilingual virtual personal life assistant for senior citizens [84]. The project focused on various aspects, including collecting elderly speech corpora, optimizing speech recognition for elderly speakers, designing a reusable speech modality component and enabling automatic grammar translation to support multiple languages. A few years later, a software robot named Xiaomingbot was introduced, with multilingual and multimodal capabilities that allowed it to generate news, perform translation, read and animate avatars [85]. Voice cloning technology was utilized to synthesize speech in multiple languages, and Xiaomingbot achieved significant popularity on social media platforms by writing a substantial number of articles. Another notable research effort by Doumbouya et al. [86] addressed the challenge of providing speech recognition technology to illiterate populations. They explored unsupervised speech representation learning using noisy radio broadcasting archives and released datasets such as the West African Radio Corpus and West African Virtual Assistant Speech Recognition Corpus. Their work introduced the West African wav2vec speech encoder, which showed promising performance in multilingual speech recognition and language identification tasks.

    In addition, according to [87], dependency on internet connectivity and language constraints hinders the effectiveness of smart assistants such as Google Assistant, Siri and Alexa. To address these issues, a multilingual voice assistant system was developed using a Raspberry Pi, enabling offline access to various languages and allowing users to access information and perform tasks in their preferred language.

    As the aging population grows and caregiver resources become limited, the demand for innovative technologies to assist and care for the elderly is on the rise. Socially assistive robots emerge as promising solutions for long-term elderly care. In 2016, Nuovo et al. [88] conducted an evaluation and development of a multi-modal user interface (MMUI) to enhance the usability and acceptance of assistive robot systems among elderly users. The experimental results demonstrated the effectiveness of the MMUI in improving flexibility and naturalness in interactions with the elderly. They also implemented multi-language speech recognition and text-to-speech (TTS) modules to facilitate communication using Nuance- and Acapela-VAAS, respectively. Later, in 2018, a group of researchers further discussed the implementation of a user-friendly and acceptable service robotic system for the elderly, focusing on a web-based multi-modal user interface. Notably, it supported multiple languages, such as English, Italian and Swedish, to enhance the flexibility, naturalness and acceptability of elderly-robot interaction [89]. In order to assist elderly individuals in adhering to their medication regimen, a novel robotic system was designed and evaluated using the NAO robot, which supported multiple languages [90] (Figure 7). This system utilized computer vision and a database to identify medication packaging, detect the intended recipient, and ensure timely administration. Additionally, Giorgi et al. [91] enhanced HRI by developing human-like verbal and nonverbal behaviors in an NAO robot companion. It is worth noting that the robot served as a communicator in community activities with the elderly, offering multi-language translation capabilities through Cloud Services.

    Figure 7.  Robotic system overview (by: crisostomo) [90].

    Advancements in voice-controlled robots and robust voice control systems have revolutionized the healthcare and rehabilitation sectors, providing potential support to hospitals and rehabilitation centers. In [92], Pramanik and his colleagues introduced a fully voice-controlled robot designed for hospitals, addressing staff overload and worker shortage situations. The robot's flexibility in movement, user-friendly characteristics and ability to accommodate diverse voices and languages make it suitable for satisfying the needs of both hospitals and patients. To meet the needs of patients with amputation, paralysis, and quadriplegia, a robust voice control system for rehabilitation robots was developed [93]. The system utilized advanced voice-recognition algorithms, such as hidden Markov model and dynamic time warping, to enhance accuracy and reduce errors (Figure 8). Its effectiveness in diverse noise environments and with multiple languages was demonstrated through validation experiments. In [94], the development of CLARA, a socially assistive robot (SAR), was discussed. Its appearance is presented in Figure 9. CLARA offers potential support for caregivers through its proactive, autonomous and adaptable nature. The integration of a multi-language interface using the Microsoft SDK enhances CLARA's perceptive and reactive capabilities, making it effective in various healthcare settings.

    Figure 8.  Voice-recognition modules implemented (by:Ruzaij) [93].
    Figure 9.  (a) One of the CLARA robots with the two RFID antennas. (b) External aspect of CLARA after adding (left) a first version and (right) the second version of the external housing [94].

    Assistive technologies for individuals with physical disabilities, such as multi-input control systems and bilingual social networking service robots, improve accessibility and communication. These innovations empower users, enhancing their mobility and facilitating connections with peers and medical professionals. For instance, Ruzaij et al. [95] introduced a novel multi-input control system for rehabilitation applications, specifically designed for individuals with limited arm mobility. By integrating a voice controller unit and a head orientation control unit, the system employed various voice recognition algorithms and MEMS sensors to facilitate wheelchair control through user commands and head movements. This hybrid voice controller not only enhanced voice recognition accuracy but also provided language flexibility, offering a promising solution for individuals with diverse needs. In a related context, Kobayashi et al. [96] proposed a bilingual social networking service (SNS) agency robot aimed at assisting individuals with physical disabilities in using tablets or smartphones for communication. Notably, this robot incorporated a voice user interface, enabling users to interact with others who share similar conditions or communicate with medical specialists in their native languages.

    Recent advancements in healthcare robotics have introduced innovative solutions for evaluating mental health and monitoring emotional well-being in elderly individuals. In 2020, Yvanoff-Frenchin et al. [97] introduced a multi-language robot interface, implemented on an embedded device for edge computing, that helped evaluate the mental health of elderly people through question-and-answer interaction. The robot could interact with the user in an appropriate language, process the answers and then, with the guidance of an expert, direct the questions and answers in the desired direction of treatment. The device could also filter out environmental noise and was suitable for placement anywhere in the home. In the same year, a robotic interface with multi-language capability was proposed [98] for monitoring and assessing the emotional health of elderly individuals through extended conversations. The system utilized a voice interface and expert supervision to engage in automated conversations with clients, and the proposed method demonstrated compatibility with embedded platforms. One year later, Jibo, the social robot developed by NTT Disruption, found application in the medical field as an empathetic healthcare companion [99] (Figure 10). Leveraging Microsoft Azure Cognitive Services, Jibo utilized AI capabilities to recognize people, understand moods and provide information and support for treatments. Notably, Jibo's multilingual communication abilities enabled it to engage with patients in their preferred language, offering companionship, proactive assistance, video calling capabilities and reminders for treatment plans and exercises. During the COVID-19 pandemic, Pepper was deployed at Hořovice Hospital in the Czech Republic to assist in the fight against the pandemic [100]. With its ability to work tirelessly and be easily disinfected, Pepper helped enforce social distancing measures by detecting patients' temperatures and encouraging hand sanitization. The robot was well-received by both staff and patients, improving the hospital experience and easing the burden on medical personnel.

    Figure 10.  Jibo robot for empathetic healthcare companion [99].

    Nowadays, travel is increasingly popular among both domestic and international tourists. Considering this trend, assistive robots for travel guidance show bright prospects.

    In [101], a multi-lingual service system utilizing the iBeacon network for service robots was developed. By leveraging users' personal information stored in a dedicated app and the iBeacon region, the system enabled robots to understand users' language and provide personalized services. The collaborative nature of the system allowed for efficient resource utilization and had the potential to be applied in various domains, such as Olympic Game guidance. Additionally, Sun et al. [102] presented the "Concierge robot", a sightseeing support robot partner designed to recommend shops, restaurants and sightseeing spots to hotel visitors. The robot incorporated intelligent devices, a body and a four-wheel robot base, providing guide services through interactive multi-language utterances and a touch interface. In addition, Jeanpierre et al. [103] developed a robust system of autonomous robots that operated independently in complex public spaces, interacted with humans and assisted them in various environments. Equipped with a speech server with Microsoft Speech Recognition, the system demonstrated impressive effectiveness in interacting naturally with visitors. In 2019, Yoshiuchi et al. [104] explored the use of data analysis technology in service robot systems to improve business operations. By modifying service scenarios and analyzing collected data, the study demonstrated an 8.1% potential increase in business improvement, particularly in areas such as communication, image processing and multi-language processing.

    Later, a voice-based attender robot with line-following capabilities and speech recognition was designed for university settings to assist with tasks such as passing circulars, interacting with parents and providing navigation assistance [105]. It could communicate with humans in spoken natural language, specifically English and Kannada. The results verified the robot's effectiveness in facilitating communication with users, making it applicable not only to universities but also to other environments like railway stations, bus stations and factories. Recently, Zhang et al. [106] presented a voice control system for the LoCoBot WX250 robot, utilizing machine learning models and the BERT model for improved intent classification and keyword recognition. The system enhanced the interaction experience between humans and the robot, enabling it to act as a tour guide in museums. It could communicate with visitors via speaker and microphone, respond to instructions and even switch languages to accommodate foreign tourists. Pepper, another pioneering AI robot, revolutionized the tourism industry by breaking down language barriers and offering multi-linguistic communication and knowledge sharing [107] (Figure 11). With its touch of emotion and surprise, Pepper enhanced the museum experience, shared anecdotes and engaged visitors on a deeper level. Its interactive and proactive nature made it an invaluable tool for attracting and guiding visitors, creating a truly immersive and memorable cultural experience.

    Figure 11.  A guiding robot in library [107].
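    The language-switching behavior described above can be approximated with off-the-shelf language identification. The sketch below, which assumes the third-party langdetect package and an illustrative table of canned replies, detects the visitor's language and answers in kind.

```python
# Hedged sketch of language switching for a tour-guide robot: detect the
# visitor's language, then reply in that language. Uses the third-party
# langdetect package (pip install langdetect); the reply table is illustrative.
from langdetect import detect

REPLIES = {
    "en": "The impressionist gallery is on the second floor.",
    "fr": "La galerie impressionniste se trouve au deuxième étage.",
    "ja": "印象派のギャラリーは2階にあります。",
}

def reply_in_visitor_language(utterance: str) -> str:
    lang = detect(utterance)                 # e.g., 'en', 'fr', 'ja'
    return REPLIES.get(lang, REPLIES["en"])  # fall back to English

print(reply_in_visitor_language("Où se trouve la galerie impressionniste ?"))
```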

    The development of assistive robots capable of speaking multiple languages in cross-cultural collaboration is essential for fostering effective communication and collaboration among individuals from diverse cultural backgrounds. These robots enable seamless interaction, understanding and cooperation between people who speak different languages, facilitating cross-cultural collaboration in various domains.

    The integration of advanced technologies in robotics and AI is transforming industries, with applications ranging from industrial automation to agriculture. These innovations enhance productivity and efficiency, revolutionizing processes and addressing industry-specific challenges. Lin et al. [108] invented an automatic sorting system for industrial robots that integrates 3D visual perception, natural language interaction and automatic programming. Notably, the robot utilizes the open-source speech synthesis system (Ekho) for generating speech, supporting multiple languages and different platforms. In the same vein, Birch et al. [109] evaluated the effectiveness of a novel human-robot-interface for machine hole drilling, considering environmental factors on speech recognition accuracy. The developed speech recognition method, displayed in Figure 12, enabled HRI through a unique integration approach, employing DTW and distance comparison for word identification and language translation. Likewise, in [110], a mobile application that utilized AI and voice bot technology was developed to assist farmers in the agriculture sector. It featured a multi-linguistic voice bot for querying agricultural information and a suggestion bot for providing versatile suggestions related to weather, crops, fertilizers and soil. This AI-based system enhanced farming practices, increased agricultural production and addressed unknown issues faced by farmers.

    Figure 12.  Algorithm flow chart (from Birch et al.) [109].
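
    The distance-comparison step in such an interface can be illustrated with a compact dynamic time warping routine. The sketch below is a generic template-matching example and only assumes that acoustic features (e.g., MFCC frames) have already been extracted upstream; it is not the implementation from [109].

```python
# Compact DTW sketch for template-based word identification, standing in for
# the distance-comparison approach described in [109]. Feature extraction
# (e.g., MFCCs) is assumed to have been done upstream.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW alignment cost between two (frames x features) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def identify_word(utterance: np.ndarray, templates: dict) -> str:
    """Return the template word whose DTW distance to the utterance is smallest."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Toy example with random feature sequences standing in for real audio features.
rng = np.random.default_rng(0)
templates = {"drill": rng.normal(size=(40, 13)), "stop": rng.normal(size=(25, 13))}
print(identify_word(rng.normal(size=(38, 13)), templates))
```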

    In a different context, Hong et al. [111] focused on implementing natural language-based communication between humans and fire-fighting robots using ontological semantic technology (OST), which enabled a comprehensible understanding of meanings across multiple languages. The study expanded the application of OST to the domain of fire fighting, specifically addressing communication between robots and humans in Korean and English. To improve the office environment, a dialog agent that could understand natural language instructions from naive users was presented by Thomason et al. [112]. The agent incorporated a learning mechanism that induced training data from user paraphrases, enabling it to adapt to language variation without requiring large annotated corpora. Experimental results from web interfaces and a mobile robot deployed in an office environment demonstrated improved user satisfaction through learning from conversations. On top of that, Contreras et al. [113] explored the use of domain-based speech recognition to control drones in a more natural and human-like manner. By implementing an algorithm for command interpretation in both Spanish and English, the study demonstrated the effectiveness of voice instructions for drone control in a simulated domestic environment. The results showed improved accuracy in speech-to-action recognition, particularly with the phoneme matching approach, achieving high accuracy for both languages. In [114], a remote center of motion (RCM) based nasal robot was designed to assist in nasal surgery (Figure 13). A voice-based control method was proposed in which surgeons gave direction instructions based on their analysis of endoscopic images, and a commercial speech recognition interface was used to build an offline grammar-control word library compatible with both English and Chinese, as shown in Figure 14.

    Figure 13.  The overall structure of the nasal endoscopic surgical robot [114].
    Figure 14.  Offline speech recognition process of robot motion instructions [114].
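
    A simplified view of bilingual command interpretation, loosely in the spirit of the Spanish/English drone control in [113], is sketched below; the command lexicon and action names are purely illustrative assumptions.

```python
# Illustrative sketch of bilingual command-to-action mapping, loosely modelled
# on the Spanish/English command interpretation in [113]. The vocabulary and
# action names are assumptions for illustration only.
COMMANDS = {
    "take off": "TAKEOFF", "despega": "TAKEOFF",
    "land": "LAND", "aterriza": "LAND",
    "turn left": "YAW_LEFT", "gira a la izquierda": "YAW_LEFT",
    "move forward": "FORWARD", "avanza": "FORWARD",
}

def interpret(transcript: str) -> str:
    """Match the recognized transcript against the bilingual command lexicon."""
    text = transcript.lower().strip()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            return action
    return "UNKNOWN"

print(interpret("Por favor, gira a la izquierda"))  # -> YAW_LEFT
print(interpret("Now land slowly"))                 # -> LAND
```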

    Although social robots are applied in many fields and have achieved technological breakthroughs in recent years, they still face challenges in language translation and understanding, as shown in Figure 15.

    Figure 15.  The challenges in language translation and understanding.

    (1) Interlingual semantic understanding

    Interlingual semantic understanding constitutes a critical aspect of AI-based language translation and understanding in social robot interaction. As robots are designed to communicate seamlessly with humans, their ability to understand semantics, not just literal translations, across multiple languages is crucial.

    Interlingual semantic understanding typically involves techniques such as neural machine translation (NMT), where the system learns to translate by being trained on large amounts of text in both the source and target language. Moreover, models like BERT and GPT have enhanced semantic understanding by emphasizing the context of words. These models leverage deep learning and large-scale language representation to understand the semantics in one language and then generate the appropriate semantics in the target language.
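
    As a concrete illustration of the NMT step, the sketch below runs a publicly available MarianMT checkpoint through the Hugging Face transformers pipeline; the model name is one example checkpoint and is not tied to any particular robot platform.

```python
# Minimal sketch of neural machine translation with a pre-trained model,
# illustrating the NMT step described above. The MarianMT checkpoint named
# here is one publicly available example, chosen only for illustration.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result = translator("Please follow me to the conference room.")
print(result[0]["translation_text"])  # Spanish rendering of the instruction
```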

    Two main difficulties are associated with interlingual semantic understanding. The first is word sense disambiguation: determining the meaning of a word from its context. This becomes particularly challenging when a word or phrase in one language has multiple possible meanings in another. The second is understanding and correctly translating idioms, metaphors or cultural references, which remains a formidable task for AI systems. These linguistic features often have no direct equivalents across languages and require a deep understanding of both languages' cultures.
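
    The contribution of contextual models to word sense disambiguation can be seen by embedding an ambiguous word in two different sentences and comparing the resulting vectors. The sketch below uses a standard BERT encoder purely for illustration; a full WSD system would of course go further than this.

```python
# Hedged sketch: how contextual embeddings separate word senses. We embed the
# ambiguous word "bank" in two sentences with a BERT encoder and compare the
# vectors; the similarity drops because the contexts differ.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    word_id = tokenizer.convert_tokens_to_ids(word)
    idx = (enc["input_ids"][0] == word_id).nonzero()[0, 0]
    return hidden[idx]

v_river = word_vector("We sat on the bank of the river.", "bank")
v_money = word_vector("She deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```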

    (2) Data scarcity and quality

    Given that machine learning models are heavily reliant on large, high-quality datasets for training, the scarcity and inferior quality of data can pose substantial impediments to their performance.

    The specific challenges can be summarized as follows. First, data are unevenly distributed across languages: while extensive, high-quality datasets exist for widely spoken languages such as English, many minority languages suffer from severe data scarcity. The consequence is an inherent bias in AI systems towards languages for which abundant data are available, resulting in less accurate translation and understanding for underrepresented languages. Second, even when ample data are available, their quality, including accuracy, consistency and relevance, may be compromised. For instance, training data may contain errors, be inconsistently annotated or simply not be representative of the diversity and complexity of real-world language use.

    Therefore, subsequent research is expected to develop techniques that improve AI performance even with scarce or lower-quality data. These include transfer learning, where a model pre-trained on a large dataset is fine-tuned on a smaller, task-specific dataset, and data augmentation techniques that synthetically expand existing datasets.
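
    As one concrete example of such augmentation, back-translation creates paraphrased training sentences by round-tripping text through a pivot language. The sketch below uses publicly available MarianMT checkpoints for illustration; the choice of pivot language and models is an assumption, not a prescription.

```python
# Hedged sketch of back-translation data augmentation for a low-resource
# setting: translate a sentence to a pivot language and back again to obtain
# a paraphrase. Model names are illustrative public checkpoints.
from transformers import pipeline

to_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
from_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(sentence: str) -> str:
    """Create a paraphrase by round-tripping the sentence through German."""
    pivot = to_pivot(sentence)[0]["translation_text"]
    return from_pivot(pivot)[0]["translation_text"]

seed = "The robot greets every visitor at the entrance."
print(back_translate(seed))  # a slightly reworded synthetic training example
```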

    (3) Cultural adaptability and diversity

    As social robots are envisioned to operate in multicultural societies and interact with people from different cultural backgrounds, their ability to adapt to various cultural norms and understand cultural diversity is essential.

    The challenges associated with cultural adaptability and diversity are multifaceted. Language is deeply rooted in culture, carrying idiomatic expressions and metaphors that might be culturally exclusive. Another challenge is cultural bias. Training data used for AI systems often reflects the cultural characteristics of the regions where the data is sourced, which can inadvertently lead to cultural biases in AI models. Such biases could manifest as AI systems performing better for certain cultures while struggling with others. Furthermore, social etiquette, norms and expectations vary widely across different cultures. Designing social robots that can adapt to such a wide range of cultural expectations is an intricate challenge. Hence, addressing these challenges necessitates an interdisciplinary approach, combining insights from linguistics, anthropology, sociology and AI.

    The GPT language model, particularly its latest iterations such as GPT-3 and GPT-4, has greatly impacted the field of language translation and understanding, and its positive impact on social human-robot interaction is evident. By providing social robots with the ability to understand and generate human-like responses, GPT has facilitated more nuanced and meaningful interactions. For instance, Nishihara et al. [115] developed an online algorithm for robots to acquire knowledge of natural language and object concepts by connecting recognized words to concepts. The model took into account the interdependence of words and concepts, enabling the robot to develop a more accurate language model and object concepts through unsupervised word segmentation and multimodal information. By reviewing the principles of ChatGPT, He and Mary [116] analyzed various aspects of robot perception and intelligence (excluding intrapersonal intelligence) and proposed a multimodal approach using GPT-3 to implement seven types of robot intelligence. The proposed framework, called RobotGPT, paves the way for smarter robotic systems.
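
    To show what plugging a generative model into a robot's dialog loop looks like in code, the sketch below uses an open GPT-2 checkpoint as a stand-in, since GPT-3/4 are served through a commercial API; the prompt format and parameters are illustrative assumptions only.

```python
# Hedged sketch of wiring a generative language model into a robot's dialog
# loop. An open GPT-2 checkpoint stands in here purely to show the shape of
# the interaction; a deployed system would use a stronger model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def robot_reply(user_utterance: str) -> str:
    """Generate one conversational turn in response to the visitor."""
    prompt = f"Visitor: {user_utterance}\nRobot:"
    out = generator(prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
    # Keep only the robot's turn, not any continuation of the dialogue.
    return out[0]["generated_text"][len(prompt):].split("\n")[0].strip()

print(robot_reply("Which exhibit should I see first?"))
```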

    Currently, many social robots are still largely restricted to English interaction. Those that do support multilingual interaction are often confined to specific domains such as language teaching, healthcare or social companionship, leaving broader applications, particularly work-related collaboration, underexplored. Looking ahead, a future in which social robots shatter this language barrier, mastering not only multiple languages but also dialects, is an exhilarating prospect. The study in [117] suggests that a robot communicating in regional dialects or using a relaxed conversational style might be more warmly received. Andrist et al. [118] explored the impact of language and cultural context on the credibility of robot speech. Comparing Arabic-speaking robots in Lebanon with English-speaking robots in the USA, they revealed cultural differences in the importance of rhetorical cues and practical knowledge. These findings informed the design of culturally sensitive HRIs, particularly in relation to dialect usage.

    Presently, interactions with social robots are primarily command-based, with the robots responding to explicit instructions from users. In the future, we envision social robots evolving from mere AI assistants to empathetic companions that truly understand human emotions, needs and desires, creating meaningful and enriching interactions. Imagine returning home from a stressful day at work: instead of merely offering to perform its usual tasks, the robot senses your stress and suggests relaxing activities, such as playing soothing music or initiating a calming meditation session. In a different scenario, imagine a tutoring robot assisting in a classroom. Beyond just answering questions or teaching language, the robot could gauge students' level of understanding from their facial expressions, the confusion in their voices or the hesitation in their answers, and then adjust its teaching speed or method to better accommodate the students' learning pace. In summary, the future of social robots lies in moving beyond command-based interaction to truly understanding and empathizing with human users.

    In this literature review, we have explored the progression and current state of language translation and understanding in social robots, focusing particularly on the areas of multilingual capabilities and application in diverse domains. Our primary finding is that while social robots have shown promise in their ability to interact in one or two languages, there are still significant deficiencies, especially when it comes to broad multilingual interactions. Additionally, the application of multilingual social robots is mainly limited to areas like language teaching, healthcare and social companionship, with less prevalent use in sectors such as smart manufacturing or robot-assisted surgery.

    This review provides a comprehensive look at the advancements made in the past decade from the perspective of social robot applications. We have detailed the current challenges faced in this domain, including interlingual semantic understanding, data scarcity and quality, and cultural adaptability and diversity. By outlining these challenges, we hope to contribute to the research field by identifying the areas in need of focus and further development.

    In conclusion, this literature review captures the evolution of language translation and understanding in social robots, summarizing the major challenges faced and outlining a roadmap for future research directions. As we continue to advance in AI and robotics, we expect that this review will serve as a reference point for subsequent research aimed at enhancing the multilingual capabilities and empathetic interactions of social robots.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the Shandong Province Social Science Planning Fund Program (No. 23CYYJ13) and 2021 Top-notch Student Cultivation Program 2.0 in Basic Discipline (No. 20212060).

    All authors declare no conflicts of interest in this paper.


