Computational Methods and Analytical Infrastructure
The analytical foundation
Single-cell and spatial transcriptomics generate large, complex datasets that require careful computational handling at every stage, from raw data processing through final biological interpretation. The tools, infrastructure, and workflow choices made at each step have material effects on the conclusions drawn from the data. This page describes the computational infrastructure and analytical choices that underpin all of 3DG's analyses.
Core analysis frameworks
Seurat (R) and Scanpy (Python, AnnData format) are the two dominant frameworks for single-cell analysis, and 3DG works in both depending on client needs and downstream analysis requirements. Seurat v5 is preferred for analyses involving CITE-seq, Multiome, or reference mapping via Azimuth. Scanpy/AnnData is preferred for large datasets, Python-native tool compatibility, and integration with the broader scverse ecosystem.
Both frameworks implement standard preprocessing, normalization, dimensionality reduction (PCA, UMAP), clustering (Leiden algorithm), and differential expression testing. Differences in default parameters and normalization approaches can produce different results on the same data, so reporting which framework was used and with which settings is part of reproducible single-cell analysis.
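Because defaults differ between frameworks, recording exact versions and parameters alongside results is cheap insurance. One way to do this is a small provenance record saved next to the analysis outputs; the sketch below is illustrative (the `analysis_provenance` helper and its field names are our own convention, not part of Scanpy or Seurat):

```python
import json
import platform
import sys
from datetime import datetime, timezone

def analysis_provenance(framework, framework_version, params):
    """Assemble a provenance record to save alongside analysis outputs.

    `framework`, `framework_version`, and `params` are supplied by the
    analyst; nothing here is specific to Seurat or Scanpy.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "framework": framework,
        "framework_version": framework_version,
        # e.g. normalization target, number of PCs, clustering resolution
        "parameters": params,
    }

record = analysis_provenance(
    "scanpy", "1.10.0",
    {"normalize_total": 1e4, "n_pcs": 50, "leiden_resolution": 1.0},
)
print(json.dumps(record, indent=2))
```

A record like this, written once per run, makes it possible to state exactly which framework and settings produced a given result.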
Foundation models for single-cell biology
scGPT, published in Nature Methods in 2024 and pretrained on 33 million cells, achieves strong performance on cell type annotation, gene expression imputation, and perturbation response prediction via transfer learning. scGPT-spatial (2025 preprint) extends the architecture to spatial transcriptomics, incorporating neighborhood context and protocol-aware decoding for integration across imaging- and sequencing-based platforms.
Geneformer (which encodes each cell as a rank-ordered list of genes rather than raw expression values) and Nicheformer (pretrained jointly on dissociated and spatial data) represent alternative architectural approaches. CellTypist provides pretrained logistic-regression classifiers for a broad range of tissue types with documented performance metrics and remains the most practical option for routine annotation tasks.
Foundation model performance is dataset-dependent. These models perform best when the query data resembles the training distribution. For specialized tissue types, rare cell populations, or non-human species, classical supervised annotation with curated marker lists or reference atlases remains more reliable. 3DG selects annotation strategies based on the specific dataset and biological context rather than defaulting to any single approach.
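The classical alternative is simple enough to sketch in a few lines: score each cluster against curated marker lists and assign the best-matching label. The toy example below (gene names are real markers; the expression values and scoring rule are illustrative, not a production method) shows the core logic:

```python
def annotate_clusters(cluster_means, marker_sets):
    """Assign each cluster the label whose marker genes have the highest
    mean expression in that cluster.

    `cluster_means` maps cluster id -> {gene: mean expression};
    `marker_sets` maps label -> list of marker genes. Both are
    illustrative stand-ins for real reference marker lists.
    """
    labels = {}
    for cluster, means in cluster_means.items():
        def score(label):
            genes = marker_sets[label]
            return sum(means.get(g, 0.0) for g in genes) / len(genes)
        labels[cluster] = max(marker_sets, key=score)
    return labels

# Toy per-cluster mean expression for two clusters (values are made up).
cluster_means = {
    "c0": {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "CD79A": 0.0},
    "c1": {"CD3D": 0.1, "CD3E": 0.2, "MS4A1": 2.4, "CD79A": 1.9},
}
marker_sets = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
}
print(annotate_clusters(cluster_means, marker_sets))
# → {'c0': 'T cell', 'c1': 'B cell'}
```

Real marker-based annotation adds normalization, statistical testing, and manual review, but the transparency of this approach is exactly why it remains reliable for specialized tissues and rare populations.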
Scalable computing for large datasets
Modern single-cell experiments routinely profile hundreds of thousands to millions of cells. Out-of-core analysis tools (TileDB-SOMA, backed by cloud object storage) and GPU-accelerated implementations (RAPIDS-singlecell) enable analysis of datasets that exceed available memory and reduce wall-clock time for computationally intensive steps such as nearest-neighbor graph construction and UMAP embedding.
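The core idea behind out-of-core analysis is streaming: statistics are accumulated chunk by chunk so the full matrix never needs to fit in memory. A stdlib-only sketch of per-gene mean computation over cell chunks (the chunk iterator is a stand-in for a TileDB-SOMA or backed-AnnData reader; real implementations use sparse arrays):

```python
def chunked_gene_means(chunks, n_genes):
    """Accumulate per-gene means over an iterator of cell chunks.

    Each chunk is a list of cells; each cell is a list of n_genes
    expression values. Only one chunk is held in memory at a time,
    which is the essence of out-of-core processing.
    """
    totals = [0.0] * n_genes
    n_cells = 0
    for chunk in chunks:
        for cell in chunk:
            for g, value in enumerate(cell):
                totals[g] += value
        n_cells += len(chunk)
    return [t / n_cells for t in totals]

# Two chunks of three cells each, three "genes" (toy values).
chunks = [
    [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [2.0, 1.0, 1.0]],
    [[0.0, 2.0, 1.0], [1.0, 0.0, 0.0], [5.0, 2.0, 2.0]],
]
print(chunked_gene_means(iter(chunks), n_genes=3))
# → [2.0, 1.0, 1.0]
```

GPU acceleration attacks the complementary problem: once data fits (or streams), steps like neighbor-graph construction and UMAP dominate wall-clock time, and RAPIDS-singlecell moves those onto the GPU.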
3DG maintains dedicated workstation infrastructure for large-scale analysis, including systems with up to 1.3 TB RAM and 48 GB GPU memory, enabling in-house processing of very large datasets without reliance on cloud compute for routine work.
Reproducibility and workflow management
Reproducibility is a persistent challenge in single-cell analysis, where pipelines involve many steps, each with parameter choices that can materially affect results. Workflow management systems (Nextflow and Snakemake) enable version-controlled, containerized pipelines that can be re-run exactly on different compute infrastructure. Combined with explicit logging of software versions and parameter choices, workflow management is increasingly a prerequisite for publication-quality single-cell analysis.
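As a flavor of what this looks like in practice, here is a minimal Snakemake rule that pins its software environment to a container, so the step runs identically on any infrastructure (the rule name, file paths, and container tag are hypothetical):

```
rule cluster:
    input: "results/normalized.h5ad"
    output: "results/clustered.h5ad"
    container: "docker://example/scanpy:1.10.0"  # hypothetical image tag
    shell: "python scripts/cluster.py {input} {output}"
```

Because inputs, outputs, and the execution environment are declared explicitly, the workflow engine can rebuild exactly the affected steps when data or parameters change, with the container guaranteeing identical software versions across machines.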
3DG documents software versions, parameter choices, and analysis decisions for all projects. Analysis code is available for client review and can be adapted for internal replication or publication submission.