Complex diseases, such as cancer, heart disease, stroke, and diabetes, have been the leading causes of death in the United States for the last half-century. The underlying molecular mechanisms of these diseases are still poorly understood. Recent advancements in high-throughput sequencing and molecular profiling technologies have provided an unprecedented opportunity for complex disease studies. Comprehensive characterization of genomic, epigenomic, transcriptomic, proteomic, and metabolomic changes in disease states may eventually help elucidate the molecular mechanisms underlying these diseases. Moreover, correlating molecular measurements with clinical characterizations holds great potential for improving disease diagnosis, prognosis, and personalized treatment. A main obstacle to these advances is our limited ability to effectively manage, integrate, and interpret the large volume of heterogeneous data. The long-term goal of our research is to develop computational and statistical approaches that help translate multidimensional omics data into biological and clinical insights.

Recent work in my group focuses on the following three areas: 1) Proteogenomics; 2) Network medicine; and 3) Democratization of bioinformatics.

1. Proteogenomics

Over the past decade, mass spectrometry (MS)-based shotgun proteomics has emerged as a powerful technology for protein profiling in complex samples, with remarkable applications in elucidating cellular and subcellular proteomes, mapping protein interaction networks, and discovering disease biomarkers. Previously, we developed methods for peptide identification and protein assembly (Zhang, et al., 2007), comparative proteomics (Zhang, et al., 2006), protein complex detection (Zhang, et al., 2008), and signaling network identification (Zhang, et al., 2011).

Our recent work has focused on cancer proteogenomics, i.e., the integration of genomic data and proteome analyses to characterize tumor proteomes. Although proteins reflecting the genomic changes in cancer have the potential to become clinically useful biomarkers, their discovery and validation have proven challenging. Large-scale cancer genomic projects such as The Cancer Genome Atlas (TCGA) project have identified a large number of genomic alterations in different types of cancer. By integrating genomic variation data from various public resources, we have built a human Cancer Proteome Variation Database (CanProVar) (Li, et al., 2010), which enables variant peptide identification in cancer proteomics studies (Li, et al., 2011). More recently, we proposed deriving sample-specific protein sequence databases from RNA-Seq data and demonstrated that customized databases significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants (Wang, et al., 2012; Wang and Zhang, 2013). In addition to protein identification, integrative analysis of quantitative transcriptome and proteome data has started to provide novel insights into gene expression regulation (Liu, et al., 2013). In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) project funded by the National Cancer Institute (NCI), in collaboration with our colleagues Drs. Dan Liebler, Rob Slebos, and Dave Tabb, we are working on the integrated proteogenomic characterization of human colorectal cancer. Results from the initial phase of this project were recently published in Nature (Zhang, et al., 2014). Some of my thoughts on proteogenomics can be found in this interview.
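To illustrate why a customized sequence database makes variant peptides detectable, here is a minimal Python sketch. It is not the CanProVar or customProDB implementation. The function names, the toy protein sequence, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made for this example.

```python
import re


def tryptic_peptides(protein, min_len=6):
    """In silico digestion: cleave after K or R unless the next residue is P.

    Simplified rule with no missed cleavages; peptides shorter than
    min_len are dropped.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def apply_substitution(protein, position, ref, alt):
    """Return the protein with one amino acid substitution (1-based position)."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return protein[:position - 1] + alt + protein[position:]


def variant_only_peptides(reference, variant, min_len=6):
    """Peptides that appear only when the variant sequence is in the database."""
    return tryptic_peptides(variant, min_len) - tryptic_peptides(reference, min_len)


# Toy reference sequence (hypothetical, for illustration only).
reference = "MTEYKLVVVGAGGVGKSALTIQR"
variant = apply_substitution(reference, 12, "G", "D")

print(sorted(variant_only_peptides(reference, variant)))
# ['LVVVGADGVGK']
```

A search against the reference sequence alone could never match a spectrum to the substituted peptide, because that peptide is not in the search space. Adding the variant sequence to the database puts it there.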

2. Network medicine

A major goal in the post-genomic era is to understand how genes, proteins, and metabolites are integrated in cellular systems, how disturbances in these systems may lead to disease, and what might be done to restore disturbed systems to their normal functions. Network biology offers a holistic way to study cellular systems and plays a critical role in answering these questions. During the past several years, my group has developed methods for network construction (Shi, et al., 2010), network module identification (Shi, et al., 2010; Shi, et al., 2013; Zhang, et al., 2008), network-based gene prioritization (Shi and Zhang, 2011; Zhang, et al., 2011), and network-based prediction of protein function and expression (Li, et al., 2009).

In the project “Systems approach to the biological basis of colon cancer metastases” funded by the National Institutes of Health (NIH), we apply the above network-based approaches to colon cancer studies. First, to address the clinically important question of predicting recurrence risk in colon cancer patients, we identified a network-based gene expression signature with both biological and clinical relevance (Shi, et al., 2012). Prognosis models based on this signature robustly predicted colon cancer recurrence and stratified patients into two subgroups with markedly different responses to adjuvant chemotherapy. This signature holds potential for the future development of clinical assays to guide treatment decisions on adjuvant chemotherapy for colon cancer patients. Second, we have established a co-expression module-based bioinformatics workflow for predicting transcriptional regulators of disease phenotypes (Shi, et al., 2010). The framework has been demonstrated through the identification of a novel transcriptional regulator of colon cancer metastasis, which was experimentally validated in both murine colon cancer cells and human colon cancer specimens by colleagues in Dr. Dan Beauchamp’s group (Tripathi, et al., 2014). Finally, we have defined three clinically relevant transcriptional subtypes in human colon cancer. Our integrative analysis of somatic mutation, copy number variation, gene expression, and signaling network information suggests that highly heterogeneous genomic alterations converge on a limited number of distinct mechanisms that drive unique cancer biology in different transcriptional subtypes (Zhu, et al., 2013). This work provides a coherent and integrated picture of human colon cancer that links genomic alterations to molecular and clinical consequences, which helps shed light on the development of targeted therapeutic strategies for different colon cancer subtypes.
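The general idea behind a co-expression module can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the workflow published in Shi, et al., 2010. It assumes that genes are linked when the Pearson correlation of their expression profiles reaches a threshold, and that modules are the connected components of the resulting network. The function names, the threshold, and the toy expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coexpression_modules(expr, threshold=0.9):
    """Link genes with correlation >= threshold; return connected components."""
    neighbors = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if pearson(expr[g1], expr[g2]) >= threshold:
            neighbors[g1].add(g2)
            neighbors[g2].add(g1)

    modules, seen = [], set()
    for gene in expr:
        if gene in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [gene], []
        seen.add(gene)
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in neighbors[current] - seen:
                seen.add(nxt)
                stack.append(nxt)
        modules.append(sorted(component))
    return sorted(modules)


# Toy expression profiles across four samples (hypothetical values).
expr = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [5, 1, 4, 2],
    "geneD": [10, 2, 8, 4],
}
print(coexpression_modules(expr))
# [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Once modules are defined, each one can be tested for association with a phenotype or examined for shared regulators, which is the role modules play in the workflow described above.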


3. Democratization of bioinformatics

Technological advances have widened the gap between data generation and investigators’ ability to interpret the data. Until computational tools are available for biologists and clinicians to independently interpret the vast amount of interconnected data, the potential of these data will be severely underexploited. To fill this gap, we develop bioinformatics tools that can be used directly by biologists and clinical researchers. In the WebGestalt project (Wang, et al., 2013; Zhang, et al., 2005), we develop gene-centric databases, statistical methods, and a user-friendly web interface to help translate gene lists derived from genomic, transcriptomic, and proteomic studies into underlying biological processes, pathways, and regulatory mechanisms. WebGestalt has served the biology community for more than eight years, averaging more than 300 uses per day during the past year. In our newly initiated NetGestalt project (Shi, et al., 2013), we are developing a novel data integration framework that allows simultaneous presentation of large-scale experimental and annotation data from many sources in the context of biological networks to facilitate data visualization, analysis, interpretation, and hypothesis generation. The tool achieves high scalability by exploiting the inherent hierarchical modular architecture of biological networks, and thus helps reveal important biological insights that cannot be readily obtained with existing visualization tools. We have received funding from NCI to develop a NetGestalt-based data portal that allows biologists to perform integrative analysis of cancer genomics and proteomics data.
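One common statistical approach for relating a gene list to a pathway or biological process is over-representation analysis with a hypergeometric test. The sketch below shows that calculation in Python. It is an illustration of the general technique and makes no claim about how WebGestalt implements it; the function name, parameter names, and counts are assumptions made for this example.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_background: genes in the reference (background) list
    n_in_set:     background genes annotated to the pathway / gene set
    n_selected:   genes in the user's list
    n_overlap:    user's genes that are annotated to the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 40 of 200 listed genes fall in a 500-gene pathway,
# against a background of 20,000 genes.
p = hypergeom_enrichment_p(20000, 500, 200, 40)
print(f"enrichment p-value: {p:.3g}")
```

In practice one such test is run per gene set, so the resulting p-values need a multiple-testing correction before they are interpreted.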


Li, J., Duncan, D.T. and Zhang, B. CanProVar: a human cancer proteome variation database. Hum Mutat 2010;31(3):219-228.

Li, J., et al. A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics 2011;10(5):M110.006536.

Li, J., et al. Network-assisted protein identification and data interpretation in shotgun proteomics. Mol Syst Biol 2009;5:303.

Liu, Q., et al. Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation. Mol Cell Proteomics 2013;12(7):1900-1911.

Shi, M., Beauchamp, R.D. and Zhang, B. A network-based gene expression signature informs prognosis and treatment for colorectal cancer patients. PLoS One 2012;7(7):e41292.

Shi, Z., Derow, C.K. and Zhang, B. Co-expression module analysis reveals biological processes, genomic gain, and regulatory mechanisms associated with breast cancer progression. BMC Syst Biol 2010;4:74.

Shi, Z., Wang, J. and Zhang, B. NetGestalt: integrating multidimensional omics data over biological networks. Nat Methods 2013;10(7):597-598.

Shi, Z. and Zhang, B. Fast network centrality analysis using GPUs. BMC Bioinformatics 2011;12:149.

Tripathi, M.K., et al. NFAT transcriptional activity is associated with metastatic capacity in colon cancer. Cancer Res 2014.

Wang, J., et al. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Res 2013;41(Web Server issue):W77-83.

Wang, X., et al. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res 2012;11(2):1009-1017.

Wang, X. and Zhang, B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 2013;29(24):3235-3237.

Zhang, B., Chambers, M.C. and Tabb, D.L. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J Proteome Res 2007;6(9):3549-3557.

Zhang, B., Kirov, S. and Snoddy, J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res 2005;33(Web Server issue):W741-748.

Zhang, B., et al. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 2008;24(7):979-986.

Zhang, B., et al. Relating protein adduction to gene expression changes: a systems approach. Mol Biosyst 2011;7(7):2118-2127.

Zhang, B., et al. Detecting differential and correlated protein expression in label-free shotgun proteomics. J Proteome Res 2006;5(11):2909-2918.

Zhang, B., et al. Proteogenomic characterization of human colon and rectal cancer. Nature 2014;513(7518):382-387.

Zhu, J., et al. Deciphering genomic alterations in colorectal cancer through transcriptional subtype-based network analysis. PLoS One 2013;8(11):e79282.