Computational Biology • Bioinformatics • Genomics
Developing integrative approaches to analyze gene lists from high-throughput screens, expression profiling, and genome-wide association studies — placing genes into biological context through enrichment analysis, protein networks, and process mapping.
A next-generation web tool for gene set enrichment analysis — supporting both standard over-representation analysis and a novel iterative mode.
Gene lists from siRNA screens, microarray experiments, GWAS, and proteomics can be analyzed using two complementary approaches — placing genes into known biology, or discovering new biology through networks.
A gene-centric database framework that supports enrichment analysis and gene-gene interaction discovery. It integrates annotation from multiple sources: protein interactions, microarray expression profiles, co-citation data, transcription factor binding sites, and high-throughput screen results. Enrichment p-values are calculated using the hypergeometric distribution.
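As a rough illustration of the hypergeometric calculation mentioned above, the sketch below computes a one-sided over-representation p-value with the Python standard library. The function name and its parameters (k, n, K, N) are illustrative assumptions, not the framework's actual interface, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Return P(X >= k) for a hypergeometric draw.

    k: genes in the list that carry the annotation
    n: size of the gene list
    K: genes in the background that carry the annotation
    N: size of the background
    (All names are hypothetical, chosen for this sketch.)
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    # Sum the probability mass of seeing k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

A small p-value means the annotation appears in the list more often than a random draw of the same size from the background would be expected to produce.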
Construct a biological process map of connected keywords and genes before the screen. Then place screen hits into the map using enrichment analysis and protein interactions. This approach works best when the biological pathway is well characterized.
Example: Mapping 160+ genes into the HIV life cycle — from receptor binding through nuclear import to budding — using enrichment of Gene Ontology terms, protein interactions, and text mining.
When the biological process is not well characterized, expand the gene list with enriched interacting proteins, then iteratively extract the most significant network cluster. Remove it and repeat to discover successively deeper biology.
Example: In the HIF-1 screen, this strategy — biased with keywords like "hypoxia" and "oxygen" — revealed clusters for hypoxic response, cell proliferation, and metabolic reprogramming. This iterative concept is now implemented in iGEA.
An siRNA screen identifying genes required for HIV infection — published in Science, 2008. Over 160 host dependency factors were placed into a comprehensive map of the viral life cycle.
Construct a map of the viral life cycle before the screen. This captures the current state of knowledge — from receptor binding through nuclear import, integration, transcription, and budding.
Add genes previously known to play a role — manually (e.g., CD4, CXCR4, CCR5 for viral entry) or automatically via a text mining engine. Each text-mined gene is supported by at least 3 independent publications.
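The publication-support filter can be sketched as follows. This is a minimal illustration, assuming that "independent publications" means distinct publication identifiers and that text-mining output arrives as (gene, publication ID) pairs; both the input format and the function name are assumptions, not a description of the actual text mining engine.

```python
from collections import defaultdict


def supported_genes(mentions, min_publications=3):
    """Keep genes mentioned in at least `min_publications` distinct publications.

    mentions: iterable of (gene_symbol, publication_id) pairs (hypothetical format).
    """
    publications = defaultdict(set)
    for gene, pub_id in mentions:
        # A set ignores repeated mentions within the same publication.
        publications[gene].add(pub_id)
    return sorted(g for g, ids in publications.items() if len(ids) >= min_publications)
```

Counting distinct identifiers rather than raw mentions prevents a single paper that names a gene many times from satisfying the threshold on its own.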
Screen hits are placed into the map using the integrative database: enrichment analysis identifies relevant categories (e.g., "mRNA transport" — 6 genes, the most significant term), and protein interactions link individual genes (e.g., RANBP2 and TAOK1 interact with XPO1 in nuclear export).
The process continues until every gene that can be placed has been placed. The final manuscript figure shows the complete HIV life cycle with host dependency factors (HDFs) mapped to each step.
Genes that could not be placed on the map fall into three categories:
siRNA screen identifying genes required for Hepatitis C replication — published in Cell Host & Microbe.
The analysis revealed that the main process on which viral replication depends is the formation of the membranous web — a reorganization of intracellular membranes that creates the replication compartment. This fundamental mechanism became the central insight of the study.
Membranous web — manuscript figure
Luciferase-based screen for genes that activate HIF-1, combined with a strategy to predict HIF-1 downstream targets. Published in Nucleic Acids Research.
Since many genes are poorly annotated, the gene list is first expanded with significantly enriched interacting proteins. The expanded list is then tested for enrichment, and the largest significant cluster is extracted. This cluster is removed, and the process iterates — exactly the strategy now available in iGEA's iterative mode.
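The extract, remove, and repeat loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not iGEA's implementation: a "cluster" is approximated as the set of list genes sharing one enriched annotation term, significance is an uncorrected hypergeometric p-value below a hypothetical alpha, and the expansion step with interacting proteins is assumed to have already happened. All function and parameter names are invented for this sketch.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k)."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def iterative_clusters(genes, term_to_genes, background_size, alpha=0.05, max_rounds=10):
    """Repeatedly extract the largest significant cluster and remove it.

    genes: the (already expanded) gene list
    term_to_genes: dict mapping an annotation term to its set of background genes
    Returns (clusters, remaining), where each cluster is (term, genes, p_value).
    """
    remaining = set(genes)
    clusters = []
    for _ in range(max_rounds):
        best = None
        for term, members in term_to_genes.items():
            overlap = remaining & members
            if not overlap:
                continue
            p = enrichment_p(len(overlap), len(remaining), len(members), background_size)
            if p >= alpha:
                continue
            # Prefer the largest cluster; break ties by the smaller p-value.
            if best is None or (len(overlap), -p) > (len(best[1]), -best[2]):
                best = (term, overlap, p)
        if best is None:
            break  # nothing significant is left
        clusters.append(best)
        remaining -= best[1]  # remove the cluster before the next round
    return clusters, remaining
```

Because each extracted cluster is removed before the next round, later rounds test a smaller list, which is what lets weaker signals surface once the dominant one is gone.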
Iterative strategy overview
The first and strongest cluster captures the expected hypoxic response — biased with keywords like "hypoxia" and "oxygen." Removing this cluster allows the algorithm to discover deeper, less obvious biology in subsequent iterations.
Cluster 1: Hypoxia / Oxygen (biased)
After removing the hypoxia cluster, the next most significant cluster reveals cell proliferation — a known but less obvious connection to HIF-1 signaling, demonstrating how iterative extraction uncovers layered biology.
Cluster 2: Cell Proliferation
Using integrative analysis to identify causal genes across 30 GWAS loci.
The 2007 GWAS meta-analysis of Crohn's disease identified 30 loci with over 200 candidate causal genes. The strategy was to find shared attributes linking genes across multiple loci — and to look for disease-relevant responses to IFNγ (the hallmark cytokine of Crohn's disease) and NFκB activation.
Multi-locus linking strategy
By intersecting genes that respond to IFNγ stimulation with those harboring conserved IRF binding sites in their proximal promoters, the field of 200+ candidates was narrowed to just 9 genes — a dramatic reduction pointing to the most likely causal factors.
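The intersection step amounts to a set operation applied per locus. The sketch below assumes three hypothetical inputs: a mapping from locus to candidate genes, a set of IFNγ-responsive genes, and a set of genes with conserved IRF binding sites in their proximal promoters. The function name and data layout are illustrative only.

```python
def shortlist_by_locus(locus_candidates, ifng_responsive, irf_site_genes):
    """Keep, for each locus, the candidates supported by both lines of evidence.

    locus_candidates: dict mapping a locus label to its candidate genes (hypothetical format)
    ifng_responsive: genes that respond to IFN-gamma stimulation
    irf_site_genes: genes with a conserved IRF site in the proximal promoter
    """
    supported = set(ifng_responsive) & set(irf_site_genes)
    shortlist = {}
    for locus, genes in locus_candidates.items():
        kept = sorted(set(genes) & supported)
        if kept:  # loci with no supported candidate are omitted
            shortlist[locus] = kept
    return shortlist
```

Keeping the result keyed by locus makes it easy to see which loci are left with a single supported candidate and which still have several.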
IFNγ response + IRF binding site intersection
The 9 genes include NOD2, CYLD, and LRRK2. LRRK2 was previously known only for its role in Parkinson's disease, but it was subsequently validated as IRF-responsive in macrophages, confirming the prediction and revealing an unexpected connection between the two diseases.
LRRK2 validated as IRF-responsive in macrophages
Tools that provide functional clues for genes without relying on existing annotations — using expression profiling, EST counting, and promoter analysis.
Gene expression profiles across 126 primary cells and tissues with a novel enrichment scoring system. Unlike expression levels, this tissue-specificity score is comparable between genes, enabling gene ranking within each tissue.
Key finding: The top 4 scoring genes in embryonic stem cells — POU5F1 (OCT4), SOX2, NANOG, and LIN28 — were independently identified as sufficient to reprogram human somatic cells to pluripotent stem cells.
Compiled from over 7 million human Expressed Sequence Tags to identify genes enriched in the immune system. Provides tissue-level expression evidence independent of microarray platforms.
Identifies transcription factor binding sites in proximal promoters, providing clues to the stimuli that regulate a gene. Conservation across multiple species is used as evidence, with p-values calculated relative to flanking regions.
Validated predictions: ANKRD37 confirmed as a HIF-1 target; LRRK2 confirmed as an IRF target in the GWAS analysis.
This thesis describes the complete framework: the integration database, the supervised process-mapping approach, the unsupervised self-organizing network strategy, and annotation-independent tools — applied to HIV, HCV, and HIF-1 screens.