Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification

Honglan Li, Yoon Sung Joh, Hyunwoo Kim, Eunok Paek, Sang-Won Lee, Kyu Baek Hwang

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Background: Proteogenomics is a promising approach for tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such database inflation has the potential to identify novel peptides. However, it also raises concerns about sensitive and reliable peptide identification. Spurious peptides included in target databases may lead to an underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification because of the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have been scarce.

Results: To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability of proteogenomic search. However, no single method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches.

Conclusions: We propose using a set of search result validation methods with separate filtering for sensitive and reliable identification of peptides in proteogenomic search.
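The difference between combined and separate filtering can be illustrated with a minimal sketch. It is not the authors' implementation: the `PSM` record, the function names, the toy scores, and the simple decoys/targets FDR estimate are all assumptions chosen for illustration, and the 0.15 limit is used only to keep the toy numbers away from rounding edge cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PSM:
    """Hypothetical peptide-spectrum match record (not from the paper)."""
    score: float    # search-engine score, higher is better
    is_decoy: bool  # True if the match comes from the decoy database
    is_novel: bool  # True if it belongs to the added (non-reference) part


def score_cutoff(psms: List[PSM], fdr_limit: float) -> Optional[float]:
    """Lowest score at which the estimated FDR (decoys / targets) stays
    within fdr_limit, or None when no cutoff qualifies."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    best = None
    for i, p in enumerate(ranked):
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only at the end of a run of tied scores.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1].score != p.score
        if last_of_tie and targets > 0 and decoys / targets <= fdr_limit:
            best = p.score
    return best


def accept(psms: List[PSM], fdr_limit: float) -> List[PSM]:
    """Target matches that pass the cutoff computed on this set alone."""
    cutoff = score_cutoff(psms, fdr_limit)
    if cutoff is None:
        return []
    return [p for p in psms if not p.is_decoy and p.score >= cutoff]


def combined_filter(psms: List[PSM], fdr_limit: float) -> List[PSM]:
    """One FDR threshold for known and novel matches together."""
    return accept(psms, fdr_limit)


def separate_filter(psms: List[PSM], fdr_limit: float) -> List[PSM]:
    """Independent FDR thresholds for the known and the novel class."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return accept(known, fdr_limit) + accept(novel, fdr_limit)


def toy_psms() -> List[PSM]:
    """Made-up data: ten strong known targets, one weak novel target,
    one novel decoy that outscores it, and one low-scoring known decoy."""
    known_targets = [PSM(float(s), False, False) for s in range(11, 21)]
    return known_targets + [
        PSM(10.5, True, True),   # decoy from the added sequences
        PSM(10.0, False, True),  # candidate novel peptide
        PSM(5.0, True, False),   # decoy from the reference sequences
    ]


if __name__ == "__main__":
    data = toy_psms()
    for name, method in (("combined", combined_filter),
                         ("separate", separate_filter)):
        kept = method(data, 0.15)
        print(name, len(kept), "accepted,",
              sum(p.is_novel for p in kept), "novel")
```

In this toy data the combined filter accepts the single novel match because the many high-scoring known targets dilute the decoy count, whereas the separate filter rejects it: within the novel class alone, one decoy against one target gives an estimated FDR of 1.0.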

Original language: English
Article number: 1031
Journal: BMC Genomics
Volume: 17
DOI: 10.1186/s12864-016-3327-5
Publication status: Published - 2016 Dec 22

Fingerprint

Economic Inflation
Databases
Peptides
Molecular Sequence Annotation
Proteogenomics
Tandem Mass Spectrometry
Yeasts
Guidelines

Keywords

  • False discovery rate
  • Model-based approach
  • Proteogenomic search
  • Separate false discovery rate analysis
  • Simulation
  • Target-decoy approach

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification. / Li, Honglan; Joh, Yoon Sung; Kim, Hyunwoo; Paek, Eunok; Lee, Sang-Won; Hwang, Kyu Baek.

In: BMC Genomics, Vol. 17, 1031, 22.12.2016.

@article{3487797aadad4026a2d592bf9bbb69f6,
title = "Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification",
keywords = "False discovery rate, Model-based approach, Proteogenomic search, Separate false discovery rate analysis, Simulation, Target-decoy approach",
author = "Honglan Li and Joh, {Yoon Sung} and Hyunwoo Kim and Eunok Paek and Sang-Won Lee and Hwang, {Kyu Baek}",
year = "2016",
month = "12",
day = "22",
doi = "10.1186/s12864-016-3327-5",
language = "English",
volume = "17",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
}

TY - JOUR

T1 - Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification

AU - Li, Honglan

AU - Joh, Yoon Sung

AU - Kim, Hyunwoo

AU - Paek, Eunok

AU - Lee, Sang-Won

AU - Hwang, Kyu Baek

PY - 2016/12/22

Y1 - 2016/12/22

KW - False discovery rate

KW - Model-based approach

KW - Proteogenomic search

KW - Separate false discovery rate analysis

KW - Simulation

KW - Target-decoy approach

UR - http://www.scopus.com/inward/record.url?scp=85006700540&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85006700540&partnerID=8YFLogxK

U2 - 10.1186/s12864-016-3327-5

DO - 10.1186/s12864-016-3327-5

M3 - Article

C2 - 28155652

AN - SCOPUS:85006700540

VL - 17

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

M1 - 1031

ER -