Research article

A self-attention based neural architecture for Chinese medical named entity recognition

  • The combination of the medical field and big data has led to explosive growth in the volume of electronic medical records (EMRs), which contain information of guiding significance for diagnosis. How to extract this information from EMRs has become a hot research topic. In this paper, we propose an ELMo-ET-CRF model to extract medical named entities from Chinese electronic medical records (CEMRs). First, a domain-specific ELMo model is fine-tuned from a general-domain ELMo model on 4679 raw CEMRs. We then use the encoder from Transformer (ET) as our model's encoder to alleviate the long context dependency problem, with a CRF as the decoder. Finally, we compare BiLSTM-CRF and ET-CRF models with word2vec and ELMo embeddings on CEMRs to validate the effectiveness of the ELMo-ET-CRF model. With the same training and test data, ELMo-ET-CRF achieves an 85.59% F1-score, outperforming all the other model architectures mentioned in this paper, which indicates the effectiveness of the proposed architecture; the performance is also competitive on the CCKS2019 leaderboard.

    Citation: Qian Wan, Jie Liu, Luona Wei, Bin Ji. A self-attention based neural architecture for Chinese medical named entity recognition[J]. Mathematical Biosciences and Engineering, 2020, 17(4): 3498-3511. doi: 10.3934/mbe.2020197




    With the combination of the medical field and big data, more and more consultation data and disease information are recorded in the form of electronic medical records (EMRs), which are gradually becoming an important basis for assisting doctors in therapeutic diagnosis. EMRs record a large amount of diagnostic information about patients: hospital records, course records, doctor's orders, case data and so on, including key entity information such as diseases, surgeries and drugs. This information is a decisive factor for doctors when making treatment plans for patients [1]. It is therefore of great significance to study how to extract key entity information from massive EMRs efficiently and accurately through intelligent methods.

    Named entity recognition (NER) is a vital part of natural language processing (NLP) that meets the aforementioned requirements [2]. Its purpose is to recognize various named entities, e.g., names, places, organizations, etc., from raw text. Extracted entities are informative in their own right and also pave the way for other NLP tasks, such as relation extraction and knowledge graph construction. Recently, with the rise of deep learning, deep neural networks have been applied to medical NER and have attracted much research attention.

    So far, NER still faces substantial problems in the field of Chinese electronic medical records (CEMRs). The main reasons are as follows: first of all, an entity may have multiple names due to undefined text labeling standards [3]; secondly, the meaning of the same word or character may be completely different in different contexts, which causes confusion in Chinese semantics; last but not least, Chinese has no natural vocabulary boundaries (spaces) as English does, so strict and correct vocabulary boundaries are hard to determine. In previous NER research, the BiLSTM-CRF, a bi-directional Long Short-Term Memory (LSTM) network joined with a conditional random field (CRF) layer, shows advanced performance and has become a prevalent architecture for various NER tasks [4,5]. This architecture outperforms traditional methods in that it eliminates the inefficient and complex process of manually designing feature templates and utilizes a recurrent neural network (RNN) to automatically capture text features. However, CEMR texts are generally longer than traditional texts, often containing several hundred Chinese tokens. Although LSTM can capture long-term contextual dependencies [6], it presents inferior performance when text length exceeds a certain step size [7]. In addition, at the beginning of an NLP task, each token in the text is represented by a low-dimensional dense vector [8]. Because the medical field is highly domain-specialized, universal pre-trained models are hardly applicable in practical tasks, and the lack of Chinese medical corpora makes it even more difficult to pre-train medical domain-specific language models. In most cases, vector representations are simply randomly initialized, leading to a context-independent representation for each token [9]; as a result, they cannot tackle polysemy, and the limitations are obvious.

    Our work focuses on Chinese medical NER in CEMRs, which has been a subtask of numerous influential academic conferences in the NLP domain, e.g., the China Conference on Knowledge Graph and Semantic Computing (CCKS) and the China Health Information Processing Conference (CHIP). These tasks not only accelerate Chinese medical NER research, but also provide several precious corpora for it. In this paper, we first collect 4679 CEMRs, which are used to fine-tune a Chinese Embeddings from Language Models (ELMo) model [9] trained on general-domain corpora, yielding a pre-trained model that can dynamically generate context-dependent character embeddings for Chinese characters. Secondly, the encoder from Transformer (ET) [10] is utilized as the model encoder instead of the traditional bi-directional Long Short-Term Memory (BiLSTM) network, giving the proposed model the ability to efficiently capture long-term dependencies in ultra-long CEMR texts. In ET, the distance between any two tokens is one, so dependencies are not lost because of long distances between tokens. Our contributions are summarized as follows.

    1) We fine-tuned a Chinese medical domain-specific ELMo model, which provides an authentic pre-trained language model for further research. A Chinese medical corpus of 4679 real-world CEMRs is constructed, containing about 1.8 million Chinese characters. A medical domain-specific ELMo model is then fine-tuned through the efficient application of this corpus to the publicly available Chinese ELMo model.

    2) We realize Chinese medical NER in CEMRs with an ET-CRF model, whose self-attention mechanism handles long context dependencies better than the BiLSTM model; to the best of our knowledge, this is the first application of an ET-CRF model to Chinese medical NER.

    3) Owing to the contributions above, the proposed ELMo-ET-CRF model achieves the best performance among all model architectures mentioned in this paper on the CCKS2019 dataset, and the final F1-score is competitive with the current state-of-the-art performance.

    NER has become one of the important tasks of information retrieval, data mining and NLP owing to its extraordinary significance [11], and various solutions have been proposed in the existing literature.

    Matching entities through handwritten rules was the main method for NER tasks in the early stage [12]. However, the construction of rules requires a certain level of expertise, and even a domain expert cannot enumerate rules that model all entities. In addition, rules cannot be migrated because they rely on particular datasets, so the same set of rules may not work on different datasets. This kind of handcrafted approach always leads to a relatively high system engineering cost.

    Statistical machine learning methods treat NER as a sequence labeling problem: given an input sequence, they output the predicted optimal tag sequence. Traditional methods include hidden Markov models [13,14], maximum entropy Markov models [15], conditional random fields [16,17] and support vector machines [18]. The most common implementation is a feature template with a CRF, where different feature templates can be combined to form new ones. However, these statistical machine learning methods rely heavily on hand-crafted features, which incurs considerable overhead in finding the most appropriate features.

    In recent years, with the increase of computing power, training deep neural networks has become simple and feasible. The proposal of word embeddings (e.g., word2vec, GloVe) made the use of deep neural networks for NLP tasks a research focus [8,19]. Different from traditional statistical machine learning, the training process of a neural network is an end-to-end process, which can automatically learn data features and avoid extra overhead such as feature engineering.

    More recently, the model structure that uses a bidirectional LSTM to encode data with a CRF as the decoder has achieved the most advanced results in medical NER tasks [4,5], and the effect is significantly better than that of traditional statistical models. This model architecture was first proposed by Collobert et al. [20]; Huang et al. [21] and Lample et al. [22] first used LSTM-CRF to deal with sentence-level annotation problems. Ma et al. [23] applied an LSTM-CRF structure to the English NER task and achieved promising results, and Dong et al. [24] first used LSTM-CRF to handle Chinese NER. LSTM has a cell structure and gate mechanism that allow the model to effectively capture long-term dependencies with certain forgetting capabilities [6], and the advantage of appending a CRF after the encoder layer is that it can use information that has already occurred during sequence generation to ensure that the output is an optimal solution sequence. In addition, Zhang et al. [28] investigated a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Zhu et al. [29] proposed a convolutional attention layer to extract implicit local context features from character sequences. Liu et al. [30] proposed a global context enhanced deep transition architecture for sequence labeling. Qiu et al. [3] proposed the RD-CNN-CRF model, which effectively reduces training time without losing accuracy. Regarding label inconsistency, Ji et al. [1] proposed a hybrid model based on the attention mechanism, which effectively alleviates the decline in model accuracy caused by inconsistent labels.

    In our method, raw CEMRs are used as the input of ELMo to obtain vector representations of sentences; sentence features are then extracted by ET before being decoded by the CRF to generate the annotation sequence. Figure 1 shows the structure of the model, which is detailed in this section.

    Figure 1.  Architecture of ET-CRF model.

    In early NLP tasks, the input of an RNN was a set of word embeddings generated by word2vec [8], GloVe [19], etc. However, these embeddings are context-independent, i.e., each word corresponds to a unique static vector that does not change with the context, so these methods are limited in cases of polysemy. The emergence of ELMo effectively solves this problem by applying a stacked BiLSTM to model the entire sentence from both directions and mapping the sentence into a sequence of vectors. Since LSTM can capture context dependencies, the output embedding sequence is correlated from front to back.

    Given a sequence $S$ of $N$ tokens $\{w_1, w_2, \ldots, w_N\}$, the tokens first go through a token layer, which transfers the original character embeddings to the input dimension of the BiLSTM according to a weight matrix; the output of the token layer is then sent to a stacked BiLSTM to build a language model, as shown in Figure 2. Let $N$ be the sentence length and $L$ the number of BiLSTM layers. At each position $t$, each layer $l$ of the LSTM outputs a context-dependent hidden vector $h_{t,l}^{LM} = [\overrightarrow{h}_{t,l}^{LM}; \overleftarrow{h}_{t,l}^{LM}]$, where $t = 1, 2, \ldots, N$ and $l = 1, 2, \ldots, L$. The hidden vector output by the last LSTM layer is used by a softmax layer to predict the token at the next position.

    Figure 2.  Architecture of ELMo model.

    When the training process finishes, each token $t$ has $2L+1$ representations $R_t$, where $L$ is the number of layers in the model:

    $$R_t = \left\{ x_t^{LM},\ \overrightarrow{h}_{t,l}^{LM},\ \overleftarrow{h}_{t,l}^{LM} \mid l = 1, \ldots, L \right\} = \left\{ h_{t,l}^{LM} \mid l = 0, \ldots, L \right\} \quad (1)$$

    where $h_{t,0}^{LM}$ is the output of the token layer and $h_{t,l}^{LM} = [\overrightarrow{h}_{t,l}^{LM}; \overleftarrow{h}_{t,l}^{LM}]$ for each BiLSTM layer.
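    To make the layer bookkeeping in Eq. (1) concrete, the following PyTorch sketch builds a stacked bidirectional LSTM and returns every layer's output for each token. It is illustrative only: the real ELMo trains separate forward and backward language models with a character-CNN token layer, and all sizes here are placeholders rather than the authors' configuration.

```python
# Minimal sketch of collecting the L+1 layer representations of Eq. (1).
# Assumptions: a plain embedding token layer, jointly bidirectional LSTMs,
# and illustrative dimensions (not the authors' actual ELMo setup).
import torch
import torch.nn as nn

class StackedBiLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=128, num_layers=2):
        super().__init__()
        assert embed_dim == 2 * hidden_dim  # keep all layers the same width
        self.token_layer = nn.Embedding(vocab_size, embed_dim)
        self.layers = nn.ModuleList(
            nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
            for _ in range(num_layers))

    def forward(self, token_ids):              # token_ids: (batch, N)
        h = self.token_layer(token_ids)        # h_{t,0}^{LM}: token-layer output
        reps = [h]
        for lstm in self.layers:
            h, _ = lstm(h)                     # [h_fwd; h_bwd] at every position
            reps.append(h)
        return reps                            # L+1 tensors of shape (batch, N, 256)

lm = StackedBiLM(vocab_size=5000)
layer_reps = lm(torch.randint(0, 5000, (1, 20)))   # 2L+1 vectors per token overall
```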

    Our model uses ET as the encoder layer. Compared with RNNs and CNNs, it takes a different approach that relies entirely on the self-attention mechanism to extract context features, which makes the encoding process highly parallel.

    Suppose the input $S$ is a sequence $\{w_1, w_2, \ldots, w_N\}$ with $S \in \mathbb{R}^{N \times d_{model}}$, where $N$ is the length of the sequence and $d_{model}$ is the dimension of the input vectors. We use multiple scaled dot-product attention components inside the multi-head attention layer to enhance the model's ability to encode sequences internally. We first add position information to the input sequence $S$, following the approach of Vaswani et al. [10], then perform matrix transformations on $S$ to obtain the query matrix $Q$, key matrix $K$ and value matrix $V$. Finally, the output representation of a single self-attention head is obtained by scaled dot-product attention:

    $$Q, K, V = SW^Q, SW^K, SW^V \quad (2)$$
    $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (3)$$

    where $W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k}$ are learnable parameters and $\mathrm{softmax}(\cdot)$ is performed row-wise.
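    As a concrete illustration of Eqs. (2) and (3), the following minimal PyTorch sketch implements a single scaled dot-product self-attention head; the dimensions $d_{model}$ and $d_k$ are illustrative placeholders.

```python
# Sketch of Eqs. (2)-(3): one scaled dot-product self-attention head.
import math
import torch
import torch.nn as nn

class SelfAttentionHead(nn.Module):
    def __init__(self, d_model=512, d_k=64):
        super().__init__()
        self.d_k = d_k
        self.W_q = nn.Linear(d_model, d_k, bias=False)   # W^Q
        self.W_k = nn.Linear(d_model, d_k, bias=False)   # W^K
        self.W_v = nn.Linear(d_model, d_k, bias=False)   # W^V

    def forward(self, S):                                # S: (batch, N, d_model)
        Q, K, V = self.W_q(S), self.W_k(S), self.W_v(S)  # Eq. (2)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        return torch.softmax(scores, dim=-1) @ V         # Eq. (3), row-wise softmax

head = SelfAttentionHead()
out = head(torch.randn(2, 100, 512))                     # shape (2, 100, 64)
```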

    A single attention head may limit the model's ability to jointly attend to information from different representation subspaces at different positions, so our model uses the multi-head attention mechanism [10]:

    $$\mathrm{MultiHead}(S) = [\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_n]W^O \quad (4)$$

    where $\mathrm{head}_i = \mathrm{Attn}(Q_i, K_i, V_i)$ and $W^O$ is a learnable parameter.

    Multi-head attention is followed by a feedforward neural network, and the output of each sublayer in the encoder has a residual connection and layer normalization, as shown in Figure 3.

    Figure 3.  Architecture of ET.

    Given the input $S$ of the encoder block, the final output of the encoder is calculated as follows:

    $$S' = \mathrm{layernorm}(S + \mathrm{MultiHead}(S)) \quad (5)$$
    $$S'' = \mathrm{layernorm}(S' + \mathrm{FFN}(S')) \quad (6)$$

    where $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ and $\mathrm{layernorm}(\cdot)$ represents layer normalization [25].
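    Putting Eqs. (4)-(6) together, a minimal PyTorch sketch of one ET encoder block might look as follows. It uses torch.nn.MultiheadAttention for brevity; the head count matches the best configuration reported in the experiments (8 heads), while d_ff is an assumed placeholder.

```python
# Sketch of one ET encoder block implementing Eqs. (4)-(6).
import torch
import torch.nn as nn

class ETBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                  # FFN(x) = max(0, xW1+b1)W2+b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, S):
        attn_out, _ = self.attn(S, S, S)           # MultiHead(S), Eq. (4)
        S1 = self.norm1(S + attn_out)              # Eq. (5): residual + layernorm
        return self.norm2(S1 + self.ffn(S1))       # Eq. (6)

block = ETBlock()
out = block(torch.randn(2, 100, 512))              # same shape as the input
```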

    In the NER task on CEMRs, the output tag sequence is strictly ordered. The CRF layer maintains a state transition matrix that stores the transition probability from the previous state to the current state, ensuring that the tag prediction process has inner dependency.

    Given an input sequence $X = \{x_1, x_2, \ldots, x_N\}$, the score of an output tag sequence $Y = \{y_1, y_2, \ldots, y_N\}$ is defined as follows:

    $$s(X, Y) = \sum_{i=0}^{N} A_{y_i, y_{i+1}} + \sum_{i=1}^{N} P_{i, y_i} \quad (7)$$

    where $A$ is a matrix of transition scores, with $A_{ij}$ the score of a transition from tag $i$ to tag $j$. $P \in \mathbb{R}^{N \times k}$ is a matrix of tag scores obtained from the output of ET through a fully connected network, where $N$ is the length of the sequence, $k$ is the number of labels, and $P_{ij}$ is the score of the $j$th label at the $i$th position in the sequence. $y_0$ and $y_{N+1}$ are represented by <bos> and <eos>.

    The model uses softmax to calculate the probability over all tag sequences that may be generated from the input sequence $X$, and maximizing the log-probability of the correct annotated sequence is the goal of model optimization [22]:

    $$p(Y|X) = \frac{e^{s(X,Y)}}{\sum_{Y' \in Y_{ALL}} e^{s(X,Y')}} \quad (8)$$
    $$\log p(Y|X) = s(X,Y) - \log\!\left(\sum_{Y' \in Y_{ALL}} e^{s(X,Y')}\right) \quad (9)$$

    where $Y_{ALL}$ represents all possible tag sequences for an input sentence $X$.
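    For concreteness, a compact sketch of Eqs. (7)-(9) in PyTorch is given below: the numerator score follows Eq. (7), and the log-partition over $Y_{ALL}$ is computed with the standard forward algorithm in log space. The <bos>/<eos> boundary transitions are omitted for brevity, and all shapes are illustrative.

```python
# Sketch of the CRF sequence score and log-likelihood, Eqs. (7)-(9).
import torch

def crf_log_likelihood(P, A, y):
    """P: (N, k) emission scores from ET + linear layer.
       A: (k, k) transition scores, A[i, j] = score of tag i -> tag j.
       y: (N,) gold tag indices. Boundary tags omitted for brevity."""
    N, k = P.shape
    # s(X, Y): emissions plus transitions along the gold path, Eq. (7).
    score = P[torch.arange(N), y].sum() + A[y[:-1], y[1:]].sum()
    # Forward algorithm: alpha[j] = log-sum over all prefixes ending in tag j.
    alpha = P[0]
    for t in range(1, N):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + P[t]
    log_Z = torch.logsumexp(alpha, dim=0)      # log of the sum over Y_ALL, Eq. (8)
    return score - log_Z                       # Eq. (9)

P, A = torch.randn(6, 5), torch.randn(5, 5)
print(crf_log_likelihood(P, A, torch.tensor([0, 1, 1, 2, 4, 3])))
```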

    The Chinese ELMo model used in this paper comes from the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology. It was trained on the Xinhua portion of Chinese Gigaword v5 and took roughly 3 days on an NVIDIA P100 GPU [26,27]. Since the dataset used in this paper belongs to the medical field, the corpus distribution has a strongly specialized background, and the texts also contain some uncommon characters. Therefore, a pre-trained ELMo for the general domain cannot be directly applied to this task, and we fine-tuned the ELMo model with the medical corpus.

    We collected 4679 CEMRs, containing about 1.8 million characters, from CCKS2018, CCKS2019 and CHIP2018, and routinely preprocessed the text. In particular, it should be noted that the proposed neural network model is based on character embeddings. The hyperparameter settings of the fine-tuning process are shown in Table 1.

    Table 1.  Hyperparameter settings.

    | Hyperparameter name | Setting |
    |---|---|
    | Character embedding | 300 |
    | LSTM layers | 2 |
    | LSTM cell size | 4096 |
    | LSTM hidden size | 1024 |
    | Batch size | 1 |
    | Optimizer | Adam |
    | Learning rate | 0.001 |
    | Annealing rate | 0.9t/4679 |
    | Gradient clipping | 5 |
    | Dropout | 0.1 |
    | Max epochs | 100 |


    After fine-tuning is completed, we freeze the parameters of ELMo and use a weighted summation over the ELMo layers as the ELMo output vector, finally taking the concatenation of the ELMo output and the original character embedding as the downstream model input:

    $$V_t^{ELMo} = E(R_t; \Theta) = \lambda \sum_{l=0}^{L} \theta_l h_{t,l}^{LM} \quad (10)$$
    $$c_t^{embed} = [c_t; V_t^{ELMo}] \quad (11)$$

    where the $\theta_l$ are softmax-normalized weights and $\lambda$ is a scalar that scales the vector by a certain ratio; both are learnable and are updated during downstream model training. $c_t$ represents the original character embedding generated by word2vec, and $c_t^{embed}$ represents the final vector representation of the character at position $t$ in a sequence, which is the input of ET.
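    A minimal sketch of Eqs. (10) and (11) (often called a scalar mix) is shown below, assuming frozen ELMo layer outputs of width 1024 and 300-dimensional word2vec character embeddings as stated above; the module and variable names are our own, not the authors'.

```python
# Sketch of Eqs. (10)-(11): learnable weighted sum over frozen ELMo layers,
# concatenated with a static word2vec character embedding.
import torch
import torch.nn as nn

class ScalarMixConcat(nn.Module):
    def __init__(self, num_layers=3):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # softmax-normalized theta_l
        self.gamma = nn.Parameter(torch.ones(1))         # scaling factor lambda

    def forward(self, elmo_layers, c):
        # elmo_layers: (L+1, batch, N, 1024) frozen ELMo outputs h_{t,l}^{LM}
        # c: (batch, N, 300) static word2vec character embeddings
        w = torch.softmax(self.s, dim=0)                 # theta in Eq. (10)
        v = self.gamma * (w.view(-1, 1, 1, 1) * elmo_layers).sum(dim=0)
        return torch.cat([c, v], dim=-1)                 # Eq. (11), fed to ET

mix = ScalarMixConcat()
x = mix(torch.randn(3, 2, 50, 1024), torch.randn(2, 50, 300))   # (2, 50, 1324)
```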

    This paper focuses on Chinese medical NER, which aims to detect entity boundaries and categorize entities into pre-defined categories. The dataset used in this paper comes from CCKS2019 and is provided by Yiducloud (Beijing) Technology Co., Ltd.

    The training data contains 1000 manually annotated CEMRs, and the test data contains 379 manually annotated CEMRs. Each record contains two parts: the raw CEMR and annotation information. The annotation information consists of several triples formed of an entity start index, entity end index and entity category; through the start and end indices, we can extract the entity from the CEMR. All entities are categorized into six categories: disease and diagnosis, imaging examination, laboratory test, surgery, medicine and anatomy. The entities contained in this dataset are shown in Table 2.

    Table 2.  Entity categories of the dataset.

    | | Disease and diagnosis | Imaging examination | Laboratory test | Surgery | Medicine | Anatomy | Total |
    |---|---|---|---|---|---|---|---|
    | Training data | 4193 | 966 | 1194 | 1027 | 1814 | 8231 | 17425 |
    | Test data | 1310 | 344 | 586 | 162 | 483 | 2938 | 5823 |


    We use the standard evaluation criteria to validate the effectiveness of the model, namely precision, recall and micro F1-score, which are calculated as follows:

    $$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (12)$$
    $$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (13)$$
    $$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (14)$$

    where $TP$, $TN$, $FP$ and $FN$ are true positives, true negatives, false positives and false negatives, respectively.
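    Under the strict index, an entity counts as a true positive only if both its boundaries and its category exactly match the gold annotation. A minimal sketch of this computation is given below; the entity triples and category names are illustrative.

```python
# Sketch of strict (exact-match) entity-level evaluation, Eqs. (12)-(14).
def strict_prf(gold, pred):
    """gold, pred: sets of (start, end, category) triples."""
    tp = len(gold & pred)
    fp, fn = len(pred) - tp, len(gold) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0     # Eq. (12)
    recall = tp / (tp + fn) if tp + fn else 0.0        # Eq. (13)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)              # Eq. (14)
    return precision, recall, f1

gold = {(0, 4, "Anatomy"), (10, 13, "Medicine")}
pred = {(0, 4, "Anatomy"), (10, 12, "Medicine")}       # one boundary mismatch
print(strict_prf(gold, pred))                          # (0.5, 0.5, 0.5)
```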

    In this paper, CEMRs are encoded in the BIO format (Begin, Inside, Outside). A token is labeled B-X if it is the beginning of a named entity of category X, I-X if it is inside such an entity, and O otherwise. As can be seen from Figure 4, the 'O' tags are negative samples, and the other tags are positive samples.

    Figure 4.  BIO tagging schema.
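    As an illustration, the sketch below converts the dataset's (start index, end index, category) annotation triples into character-level BIO tags; we assume an exclusive end index here, which would need to be checked against the actual corpus convention.

```python
# Sketch: annotation triples -> character-level BIO tags.
# Assumption: the end index is exclusive (verify against the corpus).
def triples_to_bio(text, entities):
    tags = ["O"] * len(text)
    for start, end, category in entities:
        tags[start] = f"B-{category}"
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"
    return tags

text = "左肺上叶见结节"
print(triples_to_bio(text, [(0, 4, "Anatomy")]))
# ['B-Anatomy', 'I-Anatomy', 'I-Anatomy', 'I-Anatomy', 'O', 'O', 'O']
```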

    We first use the training data to train the BiLSTM-CRF. In the experiment, the batch size was 10, Adam was selected as the optimizer, and the learning rate was set to 0.001. The initial character embeddings were generated by word2vec with a dimension of 300. The BiLSTM had 1 layer with a hidden vector dimension of 300.

    For ET-CRF, since the number of layers and self-attention heads has a critical impact on performance, we trained six models with different numbers of layers or heads and compared their effects on the test set. The results are shown in Table 3; the best model has 2 layers and 8 heads. The model's input consists of 512-dimensional character embeddings, also generated by word2vec with the same experimental configuration as BiLSTM-CRF.

    Table 3.  F1-score with different numbers of ET layers and heads.

    | Layers | 2 | 2 | 4 | 4 | 6 | 6 |
    |---|---|---|---|---|---|---|
    | Heads | 4 | 8 | 4 | 8 | 4 | 8 |
    | F1 | 83.95 | 84.05 | 83.68 | 84.01 | 83.55 | 83.98 |


    The results of WV-BiLSTM-CRF and WV-ET-CRF are shown in Table 4 (WV denotes word2vec). The F1-scores on the medicine and imaging examination categories are good for BiLSTM-CRF, but disease and diagnosis recognition is the model's weak point, with an F1-score of only 76.23%. We speculate that medicine and anatomy entities are generally short and have no multiple naming standards, whereas entities such as disease and diagnosis are tagged subjectively, are context-dependent, and are generally long, so judging their boundaries is quite difficult for BiLSTM. The total F1-score of ET-CRF is 84.05%, 2.19% higher than that of BiLSTM-CRF, meaning that about 382 medical entities are rectified or newly extracted. For entities such as disease and diagnosis, ET-CRF performs remarkably well. This is an intuitive result: we speculate that because the ET encoding process effectively shortens the direct distance between characters, context-dependent relationships are well captured and preserved, ensuring a good recognition rate for longer entities.

    Table 4.  Results of the two models on the test dataset (strict index, %).

    | Entity name | WV-BiLSTM-CRF P | R | F1 | WV-ET-CRF P | R | F1 |
    |---|---|---|---|---|---|---|
    | Disease and diagnosis | 74.47 | 78.07 | 76.23 | 77.86 | 81.04 | 79.42 |
    | Imaging examination | 82.22 | 89.31 | 85.62 | 77.78 | 96.55 | 86.15 |
    | Laboratory test | 82.35 | 84.13 | 83.23 | 82.56 | 85.34 | 83.92 |
    | Surgery | 79.04 | 81.27 | 80.14 | 80.00 | 87.63 | 83.64 |
    | Anatomy | 81.34 | 84.02 | 82.55 | 82.83 | 85.78 | 84.28 |
    | Medicine | 89.24 | 90.28 | 89.76 | 91.80 | 93.80 | 92.79 |
    | Total | 80.39 | 83.39 | 81.86 | 82.08 | 86.12 | 84.05 |


    To verify the actual effect of ELMo on the NER task in CEMRs, we add ELMo components to the two models above, BiLSTM-CRF and ET-CRF. As a result, the model input becomes dynamic context-dependent character embeddings carrying contextual information. The test results are shown in Table 5. For ELMo-BiLSTM-CRF, the total F1-score is 2.64% higher than BiLSTM-CRF, meaning that about 460 medical entities are rectified or newly extracted. The addition of ELMo also improves the recognition of disease and diagnosis, whose F1-score rises by 2.24%. For ELMo-ET-CRF, the total F1-score is 1.54% higher than ET-CRF, corresponding to about 268 rectified or newly extracted medical entities. In summary, the addition of pre-trained ELMo enriches the information contained in the character embeddings, which effectively improves the accuracy of the model.

    Table 5.  Test results of the two models with ELMo (strict index, %).

    | Entity name | ELMo-LSTM-CRF P | R | F1 | ELMo-ET-CRF P | R | F1 |
    |---|---|---|---|---|---|---|
    | Disease and diagnosis | 76.93 | 80.07 | 78.47 | 79.57 | 82.53 | 81.02 |
    | Imaging examination | 84.38 | 93.10 | 88.52 | 82.35 | 96.55 | 88.89 |
    | Laboratory test | 84.94 | 86.78 | 85.85 | 83.72 | 86.54 | 85.11 |
    | Surgery | 81.79 | 84.10 | 82.93 | 80.65 | 88.34 | 84.32 |
    | Medicine | 84.74 | 86.36 | 85.54 | 84.75 | 87.92 | 86.31 |
    | Anatomy | 91.06 | 92.13 | 91.59 | 90.32 | 93.80 | 92.03 |
    | Total | 83.32 | 85.72 | 84.50 | 83.65 | 87.61 | 85.59 |


    Table 6 illustrates that ELMo outperforms word2vec thanks to its dynamic context-dependent model input, and that the performance of ET-CRF in Chinese medical named entity recognition is significantly better than that of BiLSTM-CRF. Due to the excellent long context dependency capture capability of the self-attention mechanism, ET-CRF's ability to recognize long entities is significantly better than BiLSTM-CRF's. In addition, we found that ET-CRF converges significantly faster than BiLSTM-CRF. The table also shows that the final F1-score of the best model in this paper, ELMo-ET-CRF, is 85.59%, which is competitive with the top three in the CCKS2019 competition (https://www.biendata.com/competition/ccks_2019_1/final-leaderboard/).

    Table 6.  Results compared with CCKS2019 (strict index, %).

    | Model | P | R | F1 |
    |---|---|---|---|
    | WV-LSTM-CRF | 80.39 | 83.39 | 81.86 |
    | WV-ET-CRF | 82.08 | 86.12 | 84.05 |
    | ELMo-LSTM-CRF | 83.32 | 85.72 | 84.50 |
    | ELMo-ET-CRF | 83.65 | 87.61 | 85.59 |
    | CCKS2019-No.1 | - | - | 85.62 |
    | CCKS2019-No.2 | - | - | 85.59 |
    | CCKS2019-No.3 | - | - | 85.16 |


    In this paper, we first fine-tune a medical domain-specific ELMo model on a small medical corpus containing 4679 CEMRs. We then apply the ET-CRF model to Chinese medical NER on CEMRs. The resulting ELMo-ET-CRF model uses dynamic context-dependent ELMo character embeddings to incorporate more lexical, syntactic and semantic information, and alleviates the long context dependency problem. Under the strict evaluation index, the F1-score of ELMo-ET-CRF on the test set is 85.59%, which is competitive with the state-of-the-art on this dataset and indicates the effectiveness of the proposed model architecture.

    We thank the anonymous reviewers for their constructive comments. This work was supported by grants from the National Key Research and Development Program of China (2017YFB0202104). Publication costs are funded by a grant of the National Key Research and Development Program of China (2017YFB0202104).

    All authors declare no conflicts of interest in this paper.



    [1] B. Ji, R. Liu, S. Li, J. Yu, Q. Wu, Y. Tan, et al., A hybrid approach for named entity recognition in Chinese electronic medical record, BMC Med. Inform. Decis. Mak., 19 (2019), 64.
    [2] C. Zong, Statistical Natural Language Processing, Tsinghua University Press, 2013.
    [3] J. Qiu, Y. Zhou, Q. Wang, T. Ruan, J. Gao, Chinese clinical named entity recognition using residual dilated convolutional neural network with conditional random field, IEEE Trans. Nanobiosci., 18 (2019), 306-315. doi: 10.1109/TNB.2019.2908678
    [4] L. Li, L. Jin, Z. Jiang, D. Song, D. Huang, Biomedical named entity recognition based on extended recurrent neural networks, 2015 IEEE International Conference on Bioinformatics and Biomedicine, 2015. Available from: https://ieeexplore.ieee.org/abstract/document/7359761/.
    [5] R. Leaman, C. H. Wei, Z. Lu, tmChem: A high performance approach for chemical named entity recognition and normalization, J. Cheminf., 7 (2015), S3.
    [6] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput., 9 (1997), 1735-1780.
    [7] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. Available from: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewPaper/9745.
    [8] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.
    [9] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, et al., Deep contextualized word representations, arXiv preprint arXiv:1802.05365, 2018.
    [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, Advances in Neural Information Processing Systems, 2017. Available from: http://papers.nips.cc/paper/7181-attention-is-all-you-need.
    [11] C. Lyu, B. Chen, Y. Ren, D. Ji, Long short-term memory RNN for biomedical named entity recognition, BMC Bioinform., 18 (2017), 462.
    [12] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, et al., Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications, J. Am. Med. Inf. Assoc., 17 (2010), 507-513.
    [13] G. Zhou, J. Su, Named entity recognition using an HMM-based chunk tagger, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002. Available from: https://dl.acm.org/doi/10.3115/1073083.1073163.
    [14] M. Song, H. Yu, W. S. Han, Developing a hybrid dictionary-based bio-entity recognition technique, BMC Med. Inform. Decis. Mak., 15 (2015), S9.
    [15] A. McCallum, D. Freitag, F. C. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML, 2000. Available from: http://cseweb.ucsd.edu/~elkan/254spring02/gidofalvi.pdf.
    [16] A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, 2003. Available from: https://dl.acm.org/doi/10.3115/1119176.1119206.
    [17] M. Skeppstedt, M. Kvist, G. H. Nilsson, H. Dalianis, Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study, J. Biomed. Inform., 49 (2014), 148-158.
    [18] Z. Ju, J. Wang, F. Zhu, Named entity recognition from biomedical text using SVM, 2011 5th International Conference on Bioinformatics and Biomedical Engineering, 2011. Available from: https://ieeexplore.ieee.org/abstract/document/5779984/.
    [19] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014. Available from: https://www.aclweb.org/anthology/D14-1162.pdf.
    [20] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res., 12 (2011), 2493-2537.
    [21] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991, 2015.
    [22] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360, 2016.
    [23] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, arXiv preprint arXiv:1603.01354, 2016.
    [24] C. Dong, J. Zhang, C. Zong, M. Hattori, H. Di, Character-based LSTM-CRF with radical-level features for Chinese named entity recognition, in Natural Language Understanding and Intelligent Applications, Springer, (2016), 239-250.
    [25] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450, 2016.
    [26] W. Che, Y. Liu, Y. Wang, B. Zheng, T. Liu, Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation, arXiv preprint arXiv:1807.03121, 2018.
    [27] A. Kutuzov, M. Fares, S. Oepen, E. Velldal, Word vectors, reuse, and replicability: Towards a community repository of large-text resources, Proceedings of the 58th Conference on Simulation and Modelling, 2017. Available from: https://www.duo.uio.no/handle/10852/65205.
    [28] Y. Zhang, J. Yang, Chinese NER using lattice LSTM, arXiv preprint arXiv:1805.02023, 2018.
    [29] Y. Zhu, G. Wang, CAN-NER: Convolutional attention network for Chinese named entity recognition, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. Available from: https://www.aclweb.org/anthology/N19-1342.pdf.
    [30] Y. Liu, F. Meng, J. Zhang, J. Xu, Y. Chen, J. Zhou, GCDT: A global context enhanced deep transition architecture for sequence labeling, arXiv preprint arXiv:1906.02437, 2019.
    © 2020 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)