Differential Gene Expression

Differential Expression and Pathway Analysis

What changes between conditions

Once cells are clustered and annotated, the most common next question is which genes are expressed differently across conditions, time points, treatments, or disease states. Differential expression (DE) analysis answers that question systematically, producing ranked lists of genes that change in a specific cell type and direction, with statistical confidence estimates that account for the structure of single-cell data.

The results feed directly into biological interpretation: which pathways are activated or suppressed, which transcription factors are driving the changes, and which findings are robust enough to follow up experimentally.

The pseudoreplication problem

The most consequential methodological choice in single-cell DE analysis is how biological replication is handled. Treating individual cells as independent replicates — as many early single-cell DE workflows did — produces severely inflated false positive rates, because cells from the same sample are correlated with each other and do not constitute independent observations.

The standard remedy is pseudobulk analysis: raw counts from all cells of a given type within each biological sample are summed before testing, so that each sample, rather than each cell, is the unit of replication. DESeq2 and edgeR applied to pseudobulk counts are the current standard and have performed well in published benchmarks. Both use negative binomial models originally developed for bulk RNA-seq and perform well even with small sample sizes. limma-voom is a third option, often chosen when sample sizes are very small or the variance structure is unusual.
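As a minimal sketch of the aggregation step, the following pure-Python example sums raw counts over all cells that share a sample and cell type. The cell records, gene names, and the `pseudobulk` helper are hypothetical and for illustration only. A real workflow would aggregate a sparse count matrix and pass the result to DESeq2, edgeR, or limma-voom.

```python
from collections import defaultdict

# Hypothetical toy data: one record per cell with raw (unnormalized) counts.
cells = [
    {"sample": "donor1", "cell_type": "T", "counts": {"GENE_A": 3, "GENE_B": 0}},
    {"sample": "donor1", "cell_type": "T", "counts": {"GENE_A": 1, "GENE_B": 2}},
    {"sample": "donor1", "cell_type": "B", "counts": {"GENE_A": 0, "GENE_B": 5}},
    {"sample": "donor2", "cell_type": "T", "counts": {"GENE_A": 7, "GENE_B": 1}},
    {"sample": "donor2", "cell_type": "T", "counts": {"GENE_A": 4, "GENE_B": 0}},
    {"sample": "donor2", "cell_type": "T", "counts": {"GENE_A": 2, "GENE_B": 3}},
]


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cell_type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        n_cells[key] += 1
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {key: dict(genes) for key, genes in totals.items()}, dict(n_cells)


totals, n_cells = pseudobulk(cells)
for key in sorted(totals):
    print(key, totals[key], "cells:", n_cells[key])
```

After aggregation, each (sample, cell type) profile is one observation. The toy data are far too small for a real test and only illustrate the change in the unit of replication from cells to samples.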

For complex experimental designs with repeated measures, matched donors, or nested random effects, mixed model approaches via muscat or dream (from the variancePartition package) are appropriate. When fitted at the cell level, these treat donor as a random effect rather than aggregating counts, preserving cell-level information while accounting for within-donor correlation.

Marker gene identification

Marker gene identification, meaning the search for genes that distinguish one cluster from others, is a related but distinct task from condition-level DE. For marker identification within a single dataset, Wilcoxon rank-sum tests (the default in Seurat’s FindMarkers and available in Scanpy’s rank_genes_groups with method="wilcoxon") perform well and are computationally efficient. The pseudoreplication concern is less acute here because the comparison is between cell types within the same samples rather than between experimental conditions across samples.
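To make the test concrete, here is a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test for one gene, comparing cells inside a cluster against all other cells. It uses a normal approximation without tie or continuity correction, so it is an illustration rather than a reproduction of what Seurat or Scanpy compute. The expression values and function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(in_cluster, rest):
    """Return (U, z, two-sided p) using a plain normal approximation."""
    n1, n2 = len(in_cluster), len(rest)
    ranks = average_ranks(list(in_cluster) + list(rest))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, z, p


# Hypothetical normalized expression of one gene.
in_cluster = [5.1, 4.8, 6.0, 5.5, 4.9]
rest = [0.0, 0.2, 1.1, 0.0, 0.5, 0.3]
u_stat, z, p = rank_sum_test(in_cluster, rest)
print(f"U={u_stat}, z={z:.2f}, p={p:.4f}")
```

Because every cell is treated as an observation, the p-value here is best read as a ranking aid for candidate markers, not as evidence of reproducibility across donors.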

For datasets with multiple samples per condition, marker gene identification should still account for sample structure where possible, using pseudobulk or mixed model approaches to ensure markers are reproducible across biological replicates rather than driven by a single sample.

Gene set enrichment analysis

Individual DE gene lists are difficult to interpret biologically without a framework for summarizing which pathways or biological processes are affected. Gene set enrichment analysis (GSEA) addresses this by testing whether genes from predefined biological pathways are systematically enriched at the top or bottom of a ranked gene list, rather than applying an arbitrary significance cutoff.
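The running-sum statistic at the core of GSEA can be sketched as follows. This is a simplified illustration of the enrichment score only, assuming a weighted Kolmogorov-Smirnov-style walk down the ranked list. It omits the permutation-based significance estimate and the normalization used by real implementations. The gene names, scores, and the `enrichment_score` helper are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Maximum signed deviation of the running sum over a ranked gene list.

    ranked_genes must already be sorted by descending score.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up at set members (scaled by score), step down otherwise.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list, for example sorted by a DE test statistic.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
stats = [3.0, 2.5, 2.0, 1.0, -0.5, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, stats, {"g1", "g2", "g3"})
es_bottom = enrichment_score(genes, stats, {"g6", "g7", "g8"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a positive score and a set concentrated at the bottom yields a negative one, which is why no significance cutoff on individual genes is needed.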

fgsea (fast pre-ranked GSEA) is a widely used implementation, applied to gene lists ranked by log fold change, test statistic, or signed -log10 p-value. Gene sets are drawn from MSigDB (the Molecular Signatures Database), which curates collections including hallmark pathways, GO biological processes, and KEGG pathways, among others. clusterProfiler provides over-representation analysis (ORA) as an alternative for discrete gene lists, and integrates with MSigDB and GO databases for human and mouse.
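Over-representation analysis reduces to a hypergeometric upper-tail test: given a universe of tested genes, how surprising is the overlap between a discrete DE gene list and a gene set? The sketch below is a minimal illustration with hypothetical numbers and a hypothetical `ora_p_value` helper. It is not clusterProfiler's implementation and omits multiple-testing correction across gene sets.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 tested genes, a 100-gene pathway,
# 200 DE genes, 10 of which fall in the pathway (about 1 expected by chance).
p_enriched = ora_p_value(20000, 100, 200, 10)
p_trivial = ora_p_value(20000, 100, 200, 0)  # an overlap of 0 or more is certain
print(p_enriched, p_trivial)
```

The choice of universe matters: it should contain only genes that could have been called DE (for example, those passing expression filters), otherwise the p-values are biased.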

Spatial differential expression

When spatial transcriptomics data are available, differential expression analysis gains a location dimension. Spatially variable gene analysis identifies genes whose expression varies systematically across tissue space, independent of cell type identity. This reveals genes that are regulated by the tissue microenvironment (proximity to a vessel, a tumor boundary, or an anatomical structure) rather than by cell-intrinsic programs.

Spatial DE analysis can be performed at two levels of resolution. When cell segmentation is available, individual cells are the unit of analysis, and expression profiles can be compared directly to single-cell reference data using the same DE frameworks applied to dissociated cells. This is the preferred approach when achievable, because it puts spatial and single-cell analyses on a common footing, enables cell-type-specific DE within the spatial context, and avoids the compositional confounding that affects spot-level analysis. For sequencing-based platforms such as Visium HD, where segmentation is approximate, or where spot-level analysis is preferred for its simplicity, spot-level DE remains informative, but each observation reflects a mixture of cell types rather than a single cell and results should be interpreted accordingly.

Spatially variable genes are identified using tools such as SpatialDE or the spatially variable gene functions in Squidpy, which model gene expression as a function of spatial coordinates. Results complement cell-type-level DE by identifying microenvironmental signals that cut across cell type boundaries.
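One simple way to quantify spatial structure is Moran's I, a spatial autocorrelation statistic. The sketch below is a minimal, quadratic-time illustration with binary neighbor weights on a toy grid. It is an assumption-laden stand-in for the idea, not the model used by SpatialDE, and the grid, expression patterns, and `morans_i` helper are hypothetical.

```python
import math


def morans_i(coords, values, radius):
    """Moran's I with binary weights: w_ij = 1 if i and j lie within radius."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num, w_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if w_sum == 0 or denom == 0:
        raise ValueError("need at least one neighbor pair and non-constant values")
    return (n / w_sum) * (num / denom)


# Hypothetical 4 x 4 grid of spots with unit spacing.
coords = [(x, y) for y in range(4) for x in range(4)]
gradient = [float(x) for x, _ in coords]            # smooth left-to-right trend
checker = [float((x + y) % 2) for x, y in coords]   # alternating pattern

i_gradient = morans_i(coords, gradient, radius=1.0)
i_checker = morans_i(coords, checker, radius=1.0)
print(i_gradient, i_checker)
```

Values near +1 indicate that neighboring spots have similar expression, values near 0 indicate no spatial structure, and negative values indicate alternation between neighbors.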

Visualization

DE results are reported as volcano plots showing effect size against statistical significance, MA plots, and ranked gene tables. Pathway analysis results are summarized as dot plots or enrichment bar charts showing normalized enrichment scores (NES) and adjusted p-values. For spatial data, spatially variable genes are overlaid on tissue images to show where in the tissue each gene’s spatial variation occurs.
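The quantities behind a volcano plot can be sketched in a few lines: the x-axis is the log2 fold change, the y-axis is -log10 of the p-value, and points are usually colored by adjusted p-value and effect size. The example below computes Benjamini-Hochberg adjusted p-values and plot-ready rows. The gene names, results, cutoffs, and helper functions are hypothetical, and the sketch assumes no p-value is exactly zero.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_points(results, lfc_cutoff=1.0, alpha=0.05):
    """results: list of (gene, log2 fold change, p-value) tuples."""
    padj = benjamini_hochberg([p for _, _, p in results])
    rows = []
    for (gene, lfc, p), q in zip(results, padj):
        if q < alpha and lfc >= lfc_cutoff:
            label = "up"
        elif q < alpha and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"
        # Assumes p > 0; a zero p-value would need flooring before the log.
        rows.append({"gene": gene, "x": lfc, "y": -math.log10(p),
                     "padj": q, "label": label})
    return rows


# Hypothetical DE results for four genes.
results = [("GENE_A", 2.1, 0.0001), ("GENE_B", -1.8, 0.002),
           ("GENE_C", 0.3, 0.04), ("GENE_D", 1.5, 0.6)]
rows = volcano_points(results)
for row in rows:
    print(row["gene"], row["label"], round(row["y"], 2), round(row["padj"], 4))
```

The resulting rows can be handed to any plotting library. The cutoffs shown are common conventions rather than fixed rules.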
