Mission: Translating Cancer Data into Biological and Clinical Insights

We are an interdisciplinary research lab focusing on computational and translational oncology. Our team uses high-throughput sequencing, mass spectrometry, and informatics to advance cancer research and treatment. At the core of our mission is a simple yet profound aim: to unlock the hidden potential within extensive cancer omics datasets. Our approach is data-driven, and our ultimate goal is to accelerate progress in the fight against cancer.

Research Directions: Three Key Domains

We explore three key domains that contribute to our central mission: computational proteomics, cancer proteogenomics, and data democratization.

1. Computational Proteomics: Decoding Proteome Complexity

Proteins are the fundamental building blocks of cells and primary targets for therapeutic intervention. Mass spectrometry (MS)-based shotgun proteomics empowers us to comprehensively identify and quantify proteins and their modifications within biological samples. However, analyzing shotgun proteomics data poses computational challenges at several stages, including peptide identification, protein assembly, and functional interpretation.

Peptide identification is a pivotal step in the analysis of MS proteomics data. To enhance the sensitivity of peptide identification, we have developed deep learning-based algorithms (1-2). These algorithms are tailored for immunopeptidomic and post-translational modification (PTM) profiling studies, where peptide identification is most challenging. To identify novel disease-specific peptides for potential use as biomarkers and therapeutic targets, we have pioneered a customized protein database approach (3-4). By integrating this approach with immunopeptidomics or human leukocyte antigen (HLA) binding prediction in a computational workflow named NeoFlow, we streamline the process of proteogenomics-based neoantigen discovery (5).

For assembling identified peptides into proteins, we have introduced a bipartite graph model that effectively represents the intricate relationships between peptides and proteins (6). Expanding upon this model, we have created SEPepQuant (7), a tool designed to enhance isoform characterization. This advancement enables the detection of protein isoform regulations, which play important roles in both normal and disease processes.
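To make the bipartite graph idea concrete, here is a minimal sketch in Python. It assumes the graph is given as a mapping from each peptide to the set of proteins that contain it, and it uses a simple greedy set-cover heuristic to pick a small protein list that explains every peptide. The function name, the input format, and the greedy strategy are illustrative assumptions, not the published algorithm (6).

```python
from collections import defaultdict


def parsimonious_proteins(peptide_to_proteins):
    """Greedily pick proteins until every identified peptide is explained.

    peptide_to_proteins: dict mapping a peptide sequence to the set of
    protein IDs containing it (the edges of the bipartite graph).
    Illustrative heuristic only; not the method from reference 6.
    """
    # Build the protein side of the bipartite graph.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    unexplained = set(peptide_to_proteins)
    selected = []
    while unexplained:
        # Choose the protein covering the most unexplained peptides;
        # sorting first makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
        )
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides map to no protein
        selected.append(best)
        unexplained -= covered
    return selected


# Hypothetical toy example: PEPB and PEPC are shared between proteins.
toy_graph = {
    "PEPA": {"P1"},
    "PEPB": {"P1", "P2"},
    "PEPC": {"P2", "P3"},
    "PEPD": {"P3"},
}
toy_result = parsimonious_proteins(toy_graph)
```

In this toy graph, P2 is supported only by shared peptides, so a parsimonious list can explain all four peptides without it.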

To facilitate the interpretation of genes and proteins identified from proteomics and other omics studies, we integrate these findings with existing knowledge about pathways and biological networks to gain a systematic understanding (8-9). Because knowledge at the level of PTM sites is often sparsely curated, we leverage recent advancements in deep learning-based natural language processing (10). This approach allows us to extract insights from published literature on PTM sites, enriching our comprehension of findings from PTM studies.
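One common way to relate a gene or protein list to pathway knowledge is over-representation analysis, which asks whether a pathway contains more of the identified genes than expected by chance. The sketch below computes the one-sided hypergeometric p-value using only the Python standard library. The function and parameter names are hypothetical, and this is a generic illustration of the statistic rather than the implementation of any specific tool cited above.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_hits, n_overlap):
    """Probability of seeing at least n_overlap pathway genes among n_hits
    genes drawn without replacement from n_background genes, of which
    n_pathway belong to the pathway (upper-tail hypergeometric test).
    """
    total = comb(n_background, n_hits)
    upper = min(n_pathway, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# 200 identified genes, 8 of which fall in the pathway.
p = ora_p_value(20000, 100, 200, 8)
```

In practice the p-values for many pathways would also be corrected for multiple testing, which this sketch omits.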

2. Cancer Proteogenomics: Advancing Functional Precision Oncology

Cancer is a disease of genetic aberrations, but many processes downstream of the genome may influence cancer phenotypes. Cancer proteogenomics aims to integrate next-generation sequencing-based genomics and transcriptomics with MS-based proteomics to gain a comprehensive understanding of cancer, ultimately enhancing cancer diagnosis and treatment (11).

Our journey in this field started with a groundbreaking colon cancer study published in 2014 (12), which introduced the concept of cancer proteogenomics. Since then, this approach has been applied to over a dozen cancer types, with more on the horizon. Our team has actively contributed to more than ten of these studies, assuming a leading role in investigations related to colon cancer, uterine cancer, breast cancer, head and neck cancer, pancreatic cancer, and lung cancer (13-19). Together, these studies have demonstrated that integrated proteogenomic analysis provides functional context to interpret genomic abnormalities, and that proteogenomics holds great potential to enable new advances in cancer biology, diagnostics and therapeutics.

3. Data Democratization: Making Data Accessible to All

While thousands of proteomics datasets with billions of MS spectra have been deposited into public data repositories, their use is largely restricted to computational proteomics researchers because MS data are difficult to comprehend, retrieve, analyze, and interpret. Inspired by the BLAST algorithm, we have developed PepQuery (20-21), a peptide-centric algorithm that enables users to query a peptide sequence of interest (e.g., a mutant peptide) against a collection of MS/MS spectra to identify statistically significant peptide-spectrum matches. PepQuery has a wide range of applications, such as detecting proteomic evidence for genomically predicted novel peptides, validating novel or known peptides identified using traditional spectrum-centric database searching, prioritizing tumor-specific antigens, identifying missing proteins, and selecting proteotypic peptides for targeted proteomics experiments (21).
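As a rough illustration of what "peptide-centric" means, the sketch below starts from a single peptide sequence, computes its monoisotopic mass, and selects the spectra whose precursor mass matches within a tolerance. This covers only a plausible candidate-selection step; scoring and statistical validation of peptide-spectrum matches are omitted. The spectrum record layout, function names, and the 10 ppm default are assumptions for illustration and do not reflect PepQuery's actual implementation or interface.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of the charge-carrying proton


def peptide_mass(sequence):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def candidate_spectra(sequence, spectra, tol_ppm=10.0):
    """Return IDs of spectra whose precursor mass matches the peptide.

    spectra: iterable of dicts with hypothetical keys
    'id', 'precursor_mz', and 'charge'.
    """
    target = peptide_mass(sequence)
    matches = []
    for spectrum in spectra:
        # Convert the observed m/z back to a neutral mass.
        neutral = (spectrum["precursor_mz"] - PROTON) * spectrum["charge"]
        if abs(neutral - target) / target * 1e6 <= tol_ppm:
            matches.append(spectrum["id"])
    return matches


# Hypothetical query peptide and two made-up spectra.
query = "SAMPLER"
mz_2plus = (peptide_mass(query) + 2 * PROTON) / 2
demo_spectra = [
    {"id": "scan_1", "precursor_mz": mz_2plus, "charge": 2},
    {"id": "scan_2", "precursor_mz": mz_2plus + 0.5, "charge": 2},
]
demo_matches = candidate_spectra(query, demo_spectra)
```

Starting from the peptide rather than from every spectrum keeps the search small, since only the matching spectra need to be examined further.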

In addition to promoting the reuse of public MS spectral data, we also make processed cancer multi-omics data from large consortium studies readily available and useful to the broad research community (22). The LinkedOmics web portal (23) provides a unique platform for biologists and clinicians to access and analyze the vast amount of cancer multi-omics data generated by TCGA and CPTAC. LinkedOmics allows users to analyze and visualize associations between billions of molecular and clinical feature pairs for each tumor cohort, to compare the association results across omics modalities and cancer types, and to interpret association results using WebGestalt, a user-friendly pathway and network analysis tool that we developed. LinkedOmicsKB (24) further streamlines pan-cancer and multi-omic analysis by introducing innovative visualization techniques.

Collectively, these tools grant scientists easy access to intricate omics data via user-friendly web interfaces. This accessibility significantly amplifies the potential impact of these invaluable cancer datasets.

Collaborations: Translating Discoveries into Clinical Impact

Our work in these three domains has yielded data-driven hypotheses that shed light on new biological mechanisms, biomarkers, and therapeutic targets. These insights are especially relevant to targeted therapies and immunotherapies. Through close partnerships with cancer biologists and clinicians, we aim to translate these discoveries into tangible benefits for cancer patients.


  1. Li K, Jain A, Malovannaya A, Wen B, Zhang B. DeepRescore: Leveraging Deep Learning to Improve Peptide Identification in Immunopeptidomics. Proteomics. 2020;20(21-22):e1900334.
  2. Yi X, Wen B, Ji S, et al. Deep learning prediction boosts phosphoproteomics-based discoveries through improved phosphopeptide identification. Preprint. bioRxiv. 2023;2023.01.11.523329.
  3. Wang X, Slebos RJ, Wang D, et al. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res. 2012;11(2):1009-1017.
  4. Wang X, Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics. 2013;29(24):3235-3237.
  5. Wen B, Li K, Zhang Y, Zhang B. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nat Commun. 2020;11(1):1759.
  6. Zhang B, Chambers MC, Tabb DL. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J Proteome Res. 2007;6(9):3549-3557.
  7. Dou Y, Liu Y, Yi X, et al. SEPepQuant enhances the detection of possible isoform regulations in shotgun proteomics. Nat Commun. 2023;14(1):5809.
  8. Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005;33(Web Server issue):W741-W748.
  9. Liao Y, Wang J, Jaehnig EJ, Shi Z, Zhang B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019;47(W1):W199-W205.
  10. Savage SR, Zhang Y, Jaehnig EJ, et al. IDPpub: Illuminating the Dark Phosphoproteome Through PubMed Mining. Mol Cell Proteomics. Published online November 20, 2023. doi:10.1016/j.mcpro.2023.100682
  11. Zhang B, Whiteaker JR, Hoofnagle AN, Baird GS, Rodland KD, Paulovich AG. Clinical potential of mass spectrometry-based proteogenomics. Nat Rev Clin Oncol. 2019;16(4):256-268.
  12. Zhang B, Wang J, Wang X, et al. Proteogenomic characterization of human colon and rectal cancer. Nature. 2014;513(7518):382-387.
  13. Vasaikar S, Huang C, Wang X, et al. Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities. Cell. 2019;177(4):1035-1049.
  14. Dou Y, Kawaler EA, Cui Zhou D, et al. Proteogenomic Characterization of Endometrial Carcinoma. Cell. 2020;180(4):729-748.
  15. Krug K, Jaehnig EJ, Satpathy S, et al. Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy. Cell. 2020;183(5):1436-1456.
  16. Huang C, Chen L, Savage SR, et al. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma. Cancer Cell. 2021;39(3):361-379.
  17. Cao L, Huang C, Cui Zhou D, et al. Proteogenomic characterization of pancreatic ductal adenocarcinoma. Cell. 2021;184(19):5031-5052.
  18. Satpathy S, Krug K, Jean Beltran PM, et al. A proteogenomic portrait of lung squamous cell carcinoma. Cell. 2021;184(16):4348-4371.
  19. Dou Y, Katsnelson L, Gritsenko MA, et al. Proteogenomic insights suggest druggable pathways in endometrial carcinoma. Cancer Cell. 2023;41(9):1586-1605.
  20. Wen B, Wang X, Zhang B. PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Res. 2019;29(3):485-493.
  21. Wen B, Zhang B. PepQuery2 democratizes public MS proteomics data for rapid peptide searching. Nat Commun. 2023;14(1):2213.
  22. Li Y, Dou Y, Da Veiga Leprevost F, et al. Proteogenomic data and resources for pan-cancer analysis. Cancer Cell. 2023;41(8):1397-1406.
  23. Vasaikar SV, Straub P, Wang J, Zhang B. LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Res. 2018;46(D1):D956-D963.
  24. Liao Y, Savage SR, Dou Y, et al. A proteogenomics data-driven knowledge base of human cancer. Cell Syst. 2023;14(9):777-787.