Microarray data mining using landmark gene-guided clustering

Pankaj Chopra, Jaewoo Kang, Jiong Yang, Hyung Jun Cho, Heenam Kim, Min-Goo Lee

Research output: Contribution to journal, Article

21 Citations (Scopus)

Abstract

Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters, independent of the biological context of the analysis. This is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple sets of clusters from a single dataset, each of which highlights a different aspect of the dataset.

Results: By applying our SigCalc algorithm to three yeast (Saccharomyces cerevisiae) datasets, we show two results. First, different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, many of these newly found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters.

Conclusion: We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose co-ordinates are genes (called landmark genes) known to belong to a Biological Process. The projected space is not a true vector space in mathematical terms; we use the term subspace to refer to one of the virtually unlimited number of projected spaces that our proposed method can produce. By changing the biological process, and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license. The source code is available as additional material [see Additional File 1], and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.
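
The idea of re-describing each gene by its relationship to a chosen set of landmark genes can be illustrated with a small sketch. This is not the authors' SigCalc algorithm, which the abstract does not specify, and it is written in Python rather than the authors' R. It assumes, purely for illustration, that a gene's coordinate for a landmark is the Pearson correlation of their expression profiles, and that grouping means assigning each gene to its most correlated landmark. The gene names, expression values, and function names are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A flat profile has no defined correlation; treat it as 0 here.
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def project(expression, landmarks):
    """Map each non-landmark gene to its correlations with the landmark genes.

    Assumption: correlation is used as the coordinate; the paper may differ.
    """
    return {
        gene: [pearson(profile, expression[lm]) for lm in landmarks]
        for gene, profile in expression.items()
        if gene not in landmarks
    }


def group_by_nearest_landmark(projected, landmarks):
    """Toy grouping: assign each gene to its most correlated landmark."""
    clusters = {lm: [] for lm in landmarks}
    for gene, coords in projected.items():
        best = max(range(len(landmarks)), key=lambda i: coords[i])
        clusters[landmarks[best]].append(gene)
    return clusters


# Hypothetical expression profiles (one value per experimental condition).
expression = {
    "g1": [2, 4, 6, 8],
    "g2": [8, 6, 4, 2],
    "g3": [1, 5, 2, 6],
    "g4": [1, 2, 3, 4],
    "g5": [4, 3, 2, 1],
    "g6": [1, 3, 1, 3],
}

# Two hypothetical landmark sets, standing in for two biological processes.
set1 = ["g4", "g5"]
set2 = ["g4", "g6"]

by_set1 = group_by_nearest_landmark(project(expression, set1), set1)
by_set2 = group_by_nearest_landmark(project(expression, set2), set2)
print(by_set1)
print(by_set2)
```

In this toy data, g1 and g3 fall into the same group under the first landmark set but into different groups under the second, which mirrors the abstract's point that changing the landmark genes changes the resulting clusters.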

Original language: English
Article number: 92
Journal: BMC Bioinformatics
Volume: 9
DOI: 10.1186/1471-2105-9-92
Publication status: Published - 2008 Feb 11

Fingerprint

Data Mining
Microarrays
Landmarks
Microarray Data
Cluster Analysis
Genes
Clustering
Gene
Biological Phenomena
Subspace
Multigene Family
Yeast
Licensure
Microarray Analysis
Microarray Data Analysis
Clustering Analysis
Saccharomyces cerevisiae
Term

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Microarray data mining using landmark gene-guided clustering. / Chopra, Pankaj; Kang, Jaewoo; Yang, Jiong; Cho, Hyung Jun; Kim, Heenam; Lee, Min-Goo.

In: BMC Bioinformatics, Vol. 9, 92, 11.02.2008.

@article{f0ea5926aa8e4c3e8e7dec4da145b5b3,
title = "Microarray data mining using landmark gene-guided clustering",
abstract = "Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters, independent of the biological context of the analysis. This is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple sets of clusters from a single dataset, each of which highlights a different aspect of the dataset. Results: By applying our SigCalc algorithm to three yeast (Saccharomyces cerevisiae) datasets, we show two results. First, different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, many of these newly found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters. Conclusion: We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose co-ordinates are genes (called landmark genes) known to belong to a Biological Process. The projected space is not a true vector space in mathematical terms; we use the term subspace to refer to one of the virtually unlimited number of projected spaces that our proposed method can produce. By changing the biological process, and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license. The source code is available as additional material [see Additional File 1], and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.",
author = "Pankaj Chopra and Jaewoo Kang and Jiong Yang and Cho, {Hyung Jun} and Heenam Kim and Min-Goo Lee",
year = "2008",
month = "2",
day = "11",
doi = "10.1186/1471-2105-9-92",
language = "English",
volume = "9",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
pages = "92",
}

TY  - JOUR
T1  - Microarray data mining using landmark gene-guided clustering
AU  - Chopra, Pankaj
AU  - Kang, Jaewoo
AU  - Yang, Jiong
AU  - Cho, Hyung Jun
AU  - Kim, Heenam
AU  - Lee, Min-Goo
PY  - 2008/2/11
Y1  - 2008/2/11
AB  - Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters, independent of the biological context of the analysis. This is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple sets of clusters from a single dataset, each of which highlights a different aspect of the dataset. Results: By applying our SigCalc algorithm to three yeast (Saccharomyces cerevisiae) datasets, we show two results. First, different sets of clusters can be generated from the same dataset using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, many of these newly found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in gene clusters. Conclusion: We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose co-ordinates are genes (called landmark genes) known to belong to a Biological Process. The projected space is not a true vector space in mathematical terms; we use the term subspace to refer to one of the virtually unlimited number of projected spaces that our proposed method can produce. By changing the biological process, and thus the landmark genes, we can change this subspace. We have shown how clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL license. The source code is available as additional material [see Additional File 1], and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analysis.
UR  - http://www.scopus.com/inward/record.url?scp=40449140216&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=40449140216&partnerID=8YFLogxK
U2  - 10.1186/1471-2105-9-92
DO  - 10.1186/1471-2105-9-92
M3  - Article
VL  - 9
JO  - BMC Bioinformatics
JF  - BMC Bioinformatics
SN  - 1471-2105
M1  - 92
ER  - 