Enhancing named entity recognition in Turkish by integrating external knowledge and extra layers into transformer-based models

Çarık, Buse (2022) Enhancing named entity recognition in Turkish by integrating external knowledge and extra layers into transformer-based models. [Thesis]

PDF: 10513284.pdf

Download (1MB)

Abstract

Named Entity Recognition (NER) is a core component of information extraction that aims to detect and classify named entities, such as person and location names, in raw text from various domains. Categorizing news articles, anonymizing texts to protect privacy, and identifying diseases and drugs in electronic health records are some of its application areas. However, each domain has its own challenges and knowledge requirements. One of the most challenging domains for NER is social media, because of its noisy nature and lack of context. In addition, this domain introduces new named entity classes that cover ambiguous and complex entities such as book or movie titles. Because of these issues, models perform poorly in this domain compared to well-written texts such as news articles. In this work, we aim to improve model performance, particularly on complex entities and context-poor inputs, by integrating external information from a knowledge base, such as Wikipedia, into a transformer-based model in an unsupervised manner. We propose two methods for selecting the external context and adding it to the BERT model. In the first approach, two pipelines, called ELBERT and ELMultiBERT, attempt to find possible named entities on Wikipedia and use the retrieved pages as external information. Our second method, ELSemantic, improves on the first by emphasizing contextually closer pages, since detecting every named entity in Wikipedia is not always possible. With ELBERT and ELMultiBERT, we achieved a significant improvement over vanilla transformer-based models on the MultiCoNER dataset, which contains many short samples and complex entities. Moreover, by incorporating semantically similar content in ELSemantic, we outperformed the BERTurk model on all datasets with noisy text. Since the existing social media datasets for Turkish NER are either outdated or insufficient, we first constructed a new Twitter dataset. Moreover, since the existing social media datasets had not been evaluated with transformer-based models, we trained variations of these models and compared them with the BiLSTM-CRF architecture on social media datasets. We also implemented CRF and BiLSTM layers on top of transformer-based models to improve their performance by capturing dependencies among labels. The BERT-CRF model outperformed our pipelines with external knowledge; however, it fell behind them on MultiCoNER, the dataset full of short samples and complex entities. The BERT-BiLSTM-CRF model, on the other hand, performed poorly and lagged behind the other transformer-based approaches.
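As a rough illustration of the idea behind ELSemantic (not the thesis's actual pipeline), the sketch below ranks candidate Wikipedia passages by cosine similarity to the input sentence and passes the closest ones to a Turkish BERT token classifier as a second segment. The model identifiers, the label count, and the helper name select_context are placeholders; in practice the token-classification head would be fine-tuned on an NER dataset before use.

```python
# Minimal sketch, assuming candidate Wikipedia passages have already been retrieved.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Multilingual sentence encoder used only to score semantic similarity.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_context(sentence: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the passages semantically closest to the input sentence."""
    sent_emb = encoder.encode(sentence, convert_to_tensor=True)
    pass_emb = encoder.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, pass_emb)[0]          # one score per passage
    best = torch.topk(scores, k=min(top_k, len(passages))).indices
    return [passages[i] for i in best]

# Turkish BERT (BERTurk) with a token-classification head; untrained here,
# so the predictions below are meaningful only after fine-tuning for NER.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=9
)

sentence = "Yeni albümü geçen hafta çıktı."
candidates = ["...Wikipedia passage 1...", "...Wikipedia passage 2..."]
context = " ".join(select_context(sentence, candidates))

# Encode sentence and selected context as a sentence pair, so the external
# knowledge is visible to self-attention but kept in a separate segment.
inputs = tokenizer(sentence, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label_ids = logits.argmax(dim=-1)
```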
Item Type: Thesis
Uncontrolled Keywords: Information Extraction. -- Knowledge Base. -- Wikipedia. -- Twitter. -- Bilgi Çıkarma. -- Bilgi Bankası. -- Vikipedi. -- Twitter
Subjects: T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800-8360 Electronics > TK7885-7895 Computer engineering. Computer hardware
Divisions: Faculty of Engineering and Natural Sciences > Academic programs > Computer Science & Eng.
Faculty of Engineering and Natural Sciences
Depositing User: Dila Günay
Date Deposited: 11 Jul 2023 16:19
Last Modified: 11 Jul 2023 16:19
URI: https://research.sabanciuniv.edu/id/eprint/47482
