Identifying non-elliptical entity mentions in a coordinated NP with ellipses

Jeongmin Chae, Younghee Jung, Taemin Lee, Soon Young Jung, Chan Huh, Gilhan Kim, Hyeoncheol Kim, Heungbum Oh

Research output: Contribution to journal, Article

6 Citations (Scopus)

Abstract

Named entities in the biomedical domain are often written using a Noun Phrase (NP) along with a coordinating conjunction such as 'and' and 'or'. In addition, repeated words among named entity mentions are frequently omitted. It is often difficult to identify named entities. Although various Named Entity Recognition (NER) methods have tried to solve this problem, these methods can only deal with relatively simple elliptical patterns in coordinated NPs. We propose a new NER method for identifying non-elliptical entity mentions with simple or complex ellipses using linguistic rules and an entity mention dictionary. The GENIA and CRAFT corpora were used to evaluate the performance of the proposed system. The GENIA corpus was used to evaluate the performance of the system according to the quality of the dictionary. The GENIA corpus comprises 3434 non-elliptical entity mentions in 1585 coordinated NPs with ellipses. The system achieves 92.11% precision, 95.20% recall, and 93.63% F-score in identification of non-elliptical entity mentions in coordinated NPs. The accuracy of the system in resolving simple and complex ellipses is 94.54% and 91.95%, respectively. The CRAFT corpus was used to evaluate the performance of the system under realistic conditions. The system achieved 78.47% precision, 67.10% recall, and 72.34% F-score in coordinated NPs. The performance evaluations of the system show that it efficiently solves the problem caused by ellipses, and improves NER performance. The algorithm is implemented in PHP and the code can be downloaded from https://code.google.com/p/medtextmining/.
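As a rough illustration of the task described in the abstract, the sketch below expands a coordinated NP whose conjuncts share an omitted head (for example, "B and T cells") into full entity mentions by checking candidates against a dictionary. It is not the authors' algorithm, which is implemented in PHP and handles both simple and complex ellipses with linguistic rules; the function name `expand_coordinated_np`, the splitting pattern, and the toy dictionary are all assumptions made for this example, and only the simple right-shared-head pattern is covered.

```python
import re

# Assumed pattern: split on commas and on the conjunctions 'and' / 'or'.
_COORD = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

def expand_coordinated_np(np_text, dictionary):
    """Expand 'A and B head' into ['A head', 'B head'] when the dictionary allows it.

    Only a shared, right-hand head is handled. A conjunct that cannot be
    resolved against the dictionary is returned unchanged.
    """
    conjuncts = [c for c in _COORD.split(np_text.strip()) if c]
    if len(conjuncts) < 2:
        return conjuncts

    last = conjuncts[-1]
    last_tokens = last.split()
    mentions = []
    for conj in conjuncts[:-1]:
        resolved = conj
        if conj not in dictionary:
            # Borrow trailing words of the last conjunct, longest suffix first.
            for k in range(1, len(last_tokens)):
                candidate = conj + " " + " ".join(last_tokens[k:])
                if candidate in dictionary:
                    resolved = candidate
                    break
        mentions.append(resolved)
    mentions.append(last)
    return mentions

# Hypothetical toy dictionary, not taken from GENIA or CRAFT.
TOY_DICTIONARY = {
    "B cells", "T cells",
    "alpha chains", "beta chains", "gamma chains",
}

if __name__ == "__main__":
    print(expand_coordinated_np("B and T cells", TOY_DICTIONARY))
    print(expand_coordinated_np("alpha, beta, and gamma chains", TOY_DICTIONARY))
```

The dictionary lookup is what keeps the expansion from over-generating: a conjunct is completed only when the completed form is a known entity mention, which is one plausible reading of why the abstract reports results "according to the quality of the dictionary".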

Original language: English
Pages (from-to): 139-152
Number of pages: 14
Journal: Journal of Biomedical Informatics
Volume: 47
DOI: 10.1016/j.jbi.2013.10.002
Publication status: Published - 2014 Jan 1

Fingerprint

Glossaries
Linguistics

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Identifying non-elliptical entity mentions in a coordinated NP with ellipses. / Chae, Jeongmin; Jung, Younghee; Lee, Taemin; Jung, Soon Young; Huh, Chan; Kim, Gilhan; Kim, Hyeoncheol; Oh, Heungbum.

In: Journal of Biomedical Informatics, Vol. 47, 01.01.2014, p. 139-152.

Chae, Jeongmin ; Jung, Younghee ; Lee, Taemin ; Jung, Soon Young ; Huh, Chan ; Kim, Gilhan ; Kim, Hyeoncheol ; Oh, Heungbum. / Identifying non-elliptical entity mentions in a coordinated NP with ellipses. In: Journal of Biomedical Informatics. 2014 ; Vol. 47. pp. 139-152.
@article{14372caabed24eafbebbfa2223b01fd8,
title = "Identifying non-elliptical entity mentions in a coordinated NP with ellipses",
abstract = "Named entities in the biomedical domain are often written using a Noun Phrase (NP) along with a coordinating conjunction such as 'and' and 'or'. In addition, repeated words among named entity mentions are frequently omitted. It is often difficult to identify named entities. Although various Named Entity Recognition (NER) methods have tried to solve this problem, these methods can only deal with relatively simple elliptical patterns in coordinated NPs. We propose a new NER method for identifying non-elliptical entity mentions with simple or complex ellipses using linguistic rules and an entity mention dictionary. The GENIA and CRAFT corpora were used to evaluate the performance of the proposed system. The GENIA corpus was used to evaluate the performance of the system according to the quality of the dictionary. The GENIA corpus comprises 3434 non-elliptical entity mentions in 1585 coordinated NPs with ellipses. The system achieves 92.11{\%} precision, 95.20{\%} recall, and 93.63{\%} F-score in identification of non-elliptical entity mentions in coordinated NPs. The accuracy of the system in resolving simple and complex ellipses is 94.54{\%} and 91.95{\%}, respectively. The CRAFT corpus was used to evaluate the performance of the system under realistic conditions. The system achieved 78.47{\%} precision, 67.10{\%} recall, and 72.34{\%} F-score in coordinated NPs. The performance evaluations of the system show that it efficiently solves the problem caused by ellipses, and improves NER performance. The algorithm is implemented in PHP and the code can be downloaded from https://code.google.com/p/medtextmining/.",
keywords = "Ellipsis resolution, Named entity recognition, Text mining",
author = "Chae, Jeongmin and Jung, Younghee and Lee, Taemin and Jung, {Soon Young} and Huh, Chan and Kim, Gilhan and Kim, Hyeoncheol and Oh, Heungbum",
year = "2014",
month = "1",
day = "1",
doi = "10.1016/j.jbi.2013.10.002",
language = "English",
volume = "47",
pages = "139--152",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Identifying non-elliptical entity mentions in a coordinated NP with ellipses

AU - Chae, Jeongmin

AU - Jung, Younghee

AU - Lee, Taemin

AU - Jung, Soon Young

AU - Huh, Chan

AU - Kim, Gilhan

AU - Kim, Hyeoncheol

AU - Oh, Heungbum

PY - 2014/1/1

Y1 - 2014/1/1

AB - Named entities in the biomedical domain are often written using a Noun Phrase (NP) along with a coordinating conjunction such as 'and' and 'or'. In addition, repeated words among named entity mentions are frequently omitted. It is often difficult to identify named entities. Although various Named Entity Recognition (NER) methods have tried to solve this problem, these methods can only deal with relatively simple elliptical patterns in coordinated NPs. We propose a new NER method for identifying non-elliptical entity mentions with simple or complex ellipses using linguistic rules and an entity mention dictionary. The GENIA and CRAFT corpora were used to evaluate the performance of the proposed system. The GENIA corpus was used to evaluate the performance of the system according to the quality of the dictionary. The GENIA corpus comprises 3434 non-elliptical entity mentions in 1585 coordinated NPs with ellipses. The system achieves 92.11% precision, 95.20% recall, and 93.63% F-score in identification of non-elliptical entity mentions in coordinated NPs. The accuracy of the system in resolving simple and complex ellipses is 94.54% and 91.95%, respectively. The CRAFT corpus was used to evaluate the performance of the system under realistic conditions. The system achieved 78.47% precision, 67.10% recall, and 72.34% F-score in coordinated NPs. The performance evaluations of the system show that it efficiently solves the problem caused by ellipses, and improves NER performance. The algorithm is implemented in PHP and the code can be downloaded from https://code.google.com/p/medtextmining/.

KW - Ellipsis resolution

KW - Named entity recognition

KW - Text mining

UR - http://www.scopus.com/inward/record.url?scp=84895484578&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84895484578&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2013.10.002

DO - 10.1016/j.jbi.2013.10.002

M3 - Article

C2 - 24153413

AN - SCOPUS:84895484578

VL - 47

SP - 139

EP - 152

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

ER -