Mut2Vec: Distributed representation of cancerous mutations

Sunkyu Kim, Heewon Lee, Keonwoo Kim, Jaewoo Kang

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Background: Embedding techniques that convert high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. In deep learning models, embeddings are commonly used and have proven more effective than naive binary representations. However, no attempt has yet been made to embed highly sparse mutation profiles into dense distributed representations. Because a binary representation does not capture biological context, its use is limited in many applications, such as discovering novel driver mutations. Additionally, training distributed representations of mutations is challenging because the amount of available biological data is relatively small compared with the large text corpora available in text mining.

Methods: We introduce Mut2Vec, a novel computational pipeline for creating distributed representations of cancerous mutations. Mut2Vec is trained on cancer profiles using Skip-Gram, since cancer can be characterized by a series of co-occurring mutations. We also augmented our pipeline with existing information from the biomedical literature and protein-protein interaction networks to compensate for the data insufficiency.

Results: To evaluate our models, we conducted two experiments involving the following tasks: (a) visualizing driver and passenger mutations, and (b) identifying novel driver mutations using a clustering method. Our visualization showed a clear distinction between passenger mutations and driver mutations. We also found driver mutation candidates and verified, based on our literature survey, that these were true driver mutations. The pre-trained mutation vectors and the candidate driver mutations are publicly available at http://infos.korea.ac.kr/mut2vec.

Conclusions: We introduce Mut2Vec, which can be used to generate distributed representations of mutations, and experimentally validate the efficacy of the generated mutation representations. Mut2Vec can be used in various deep learning applications such as cancer classification and drug sensitivity prediction.
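
The Skip-Gram idea in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: it only shows one plausible way to turn per-sample mutation profiles into (target, context) training pairs, treating each sample as a "sentence". It assumes that mutations within a sample are unordered, so every other mutation in the same sample counts as context. The function name `skipgram_pairs`, the variable names, and the toy profiles are all hypothetical, and the gene symbols are illustrative tokens rather than data from the paper.

```python
from collections import Counter
from itertools import permutations


def skipgram_pairs(profiles):
    """Yield (target, context) pairs from per-sample mutation profiles.

    Assumption (not from the paper): mutations in a sample have no order,
    so the context window spans the whole sample.
    """
    for mutations in profiles:
        # set() drops repeated tokens; sorted() makes the output deterministic.
        unique = sorted(set(mutations))
        # Every ordered pair of distinct mutations is one training example.
        for target, context in permutations(unique, 2):
            yield target, context


# Toy profiles: one list of mutation tokens per tumor sample (made up).
toy_profiles = [
    ["TP53", "KRAS", "TTN"],
    ["TP53", "KRAS"],
    ["PIK3CA", "TTN"],
]

pairs = list(skipgram_pairs(toy_profiles))
pair_counts = Counter(pairs)
```

Pairs that recur across samples, here ("TP53", "KRAS"), are the co-occurrence signal a Skip-Gram model would learn from. In practice such pairs would be passed to a word2vec-style trainer, a step this sketch leaves out.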

Original language: English
Article number: 33
Journal: BMC Medical Genomics
Volume: 11
DOIs: 10.1186/s12920-018-0349-7
Publication status: Published - 2018 Apr 20

Fingerprint

Mutation
Learning
Protein Interaction Maps
Neoplasms
Data Mining
Korea
Cluster Analysis
Research
Pharmaceutical Preparations

Keywords

  • Cancer
  • Deep learning
  • Distributed representation
  • Mut2Vec
  • Mutation embedding

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

Mut2Vec: Distributed representation of cancerous mutations. / Kim, Sunkyu; Lee, Heewon; Kim, Keonwoo; Kang, Jaewoo.

In: BMC Medical Genomics, Vol. 11, 33, 20.04.2018.

Kim, Sunkyu; Lee, Heewon; Kim, Keonwoo; Kang, Jaewoo. / Mut2Vec: Distributed representation of cancerous mutations. In: BMC Medical Genomics. 2018; Vol. 11.
@article{063f2e043f334b16b501be68bed8e803,
title = "Mut2Vec: Distributed representation of cancerous mutations",
keywords = "Cancer, Deep learning, Distributed representation, Mut2Vec, Mutation embedding",
author = "Sunkyu Kim and Heewon Lee and Keonwoo Kim and Jaewoo Kang",
year = "2018",
month = "4",
day = "20",
doi = "10.1186/s12920-018-0349-7",
language = "English",
volume = "11",
journal = "BMC Medical Genomics",
issn = "1755-8794",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Mut2Vec

T2 - Distributed representation of cancerous mutations

AU - Kim, Sunkyu

AU - Lee, Heewon

AU - Kim, Keonwoo

AU - Kang, Jaewoo

PY - 2018/4/20

Y1 - 2018/4/20

KW - Cancer

KW - Deep learning

KW - Distributed representation

KW - Mut2Vec

KW - Mutation embedding

UR - http://www.scopus.com/inward/record.url?scp=85045845251&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045845251&partnerID=8YFLogxK

U2 - 10.1186/s12920-018-0349-7

DO - 10.1186/s12920-018-0349-7

M3 - Article

C2 - 29697361

AN - SCOPUS:85045845251

VL - 11

JO - BMC Medical Genomics

JF - BMC Medical Genomics

SN - 1755-8794

M1 - 33

ER -