Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research

Junhee Seok, Weihong Xu, Hui Jiang, Ronald W. Davis, Wenzhong Xiao

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

While most transcriptome analyses in high-throughput clinical studies focus on gene-level expression, alternative isoforms of gene transcripts are a major source of the functional diversity of the human genome. It is therefore essential to annotate transcript isoforms in genome-wide transcriptome studies. Recently developed mRNA sequencing technology presents an unprecedented opportunity to discover new forms of transcripts, but it also brings bioinformatic challenges because of its short read length and incomplete coverage of transcripts. In this work, we proposed a computational approach that reconstructs new mRNA transcripts from short sequencing reads using reference information on known transcripts in existing databases. The prior knowledge helped to define exon boundaries and to fill in transcript regions not covered by sequencing data. The approach was demonstrated on a deep sequencing data set of human muscle tissue, with RefSeq transcript annotations as prior knowledge. We identified 2,973 junctions, 7,471 exons, and 7,571 transcripts not previously annotated in RefSeq. Of these new transcripts, 73% were supported by UCSC Known Genes, Ensembl, or EST transcript annotations. In addition, the reconstructed transcripts were much longer than those produced by de novo approaches that assume no prior knowledge. These previously unannotated transcripts can be integrated with known transcript annotations to improve both the design of microarrays and follow-up analyses of isoform expression. Overall, the results demonstrated that incorporating transcript annotations from genomic databases significantly helps the reconstruction of novel transcripts from short sequencing reads for transcriptome research.
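One step implied by the abstract, separating splice junctions already present in a reference annotation from those that are new, can be sketched as follows. This is only an illustrative sketch and not the authors' implementation: the `Junction` type, the function names, the coordinate convention, and the toy coordinates are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Junction(NamedTuple):
    """Hypothetical splice-junction record (not taken from the paper)."""
    chrom: str
    donor: int     # last base of the upstream exon
    acceptor: int  # first base of the downstream exon


Exon = Tuple[int, int]  # (start, end), both inclusive


def annotated_junctions(
    transcripts: Iterable[Tuple[str, List[Exon]]]
) -> Set[Junction]:
    """Collect the junctions implied by consecutive exons of known transcripts."""
    known: Set[Junction] = set()
    for chrom, exons in transcripts:
        ordered = sorted(exons)
        # Each pair of neighbouring exons defines one annotated junction.
        for (_, end), (start, _) in zip(ordered, ordered[1:]):
            known.add(Junction(chrom, end, start))
    return known


def novel_junctions(
    observed: Iterable[Junction], known: Set[Junction]
) -> List[Junction]:
    """Return observed junctions that are absent from the annotation."""
    return sorted(set(observed) - known)


# Toy data: one known three-exon transcript and two junctions seen in reads.
known_transcripts = [("chr1", [(100, 200), (300, 400), (500, 600)])]
observed = [Junction("chr1", 200, 300), Junction("chr1", 200, 500)]

novel = novel_junctions(observed, annotated_junctions(known_transcripts))
print(novel)
```

In this toy case the junction from position 200 to 500 skips the middle exon, so it is reported as novel, while the junction from 200 to 300 matches the annotation.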

Original language: English
Article number: e31440
Journal: PLoS One
Volume: 7
Issue number: 2
DOIs: 10.1371/journal.pone.0031440
Publication status: Published - 2012 Feb 1
Externally published: Yes

Fingerprint

transcriptomics
Transcriptome
Protein Isoforms
Genes
Messenger RNA
Exons
Research
transcriptome
Databases
exons
High-Throughput Nucleotide Sequencing
genes
Biodiversity
Expressed Sequence Tags
Gene Expression Profiling
Human Genome
Computational Biology
application coverage
genome
muscle tissues

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)

Cite this

Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research. / Seok, Junhee; Xu, Weihong; Jiang, Hui; Davis, Ronald W.; Xiao, Wenzhong.

In: PLoS One, Vol. 7, No. 2, e31440, 01.02.2012.

Seok, Junhee ; Xu, Weihong ; Jiang, Hui ; Davis, Ronald W. ; Xiao, Wenzhong. / Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research. In: PLoS One. 2012 ; Vol. 7, No. 2.
@article{3f45d5b3cceb448495f25f22ef65e535,
title = "Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research",
author = "Junhee Seok and Weihong Xu and Hui Jiang and Davis, {Ronald W.} and Wenzhong Xiao",
year = "2012",
month = "2",
day = "1",
doi = "10.1371/journal.pone.0031440",
language = "English",
volume = "7",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "2",

}

TY - JOUR

T1 - Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research

AU - Seok, Junhee

AU - Xu, Weihong

AU - Jiang, Hui

AU - Davis, Ronald W.

AU - Xiao, Wenzhong

PY - 2012/2/1

Y1 - 2012/2/1

UR - http://www.scopus.com/inward/record.url?scp=84856417163&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84856417163&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0031440

DO - 10.1371/journal.pone.0031440

M3 - Article

VL - 7

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 2

M1 - e31440

ER -