What is the relationship between GEO, SRA and BioProject?

Question

Accepted Answer

GEO, SRA, and BioProject are complementary resources at the National Center for Biotechnology Information (NCBI), arranged in a loose hierarchy. Together they form the core public data infrastructure for genomics and functional genomics research. BioProject is the top-level organizational umbrella: a record that groups the biological data related to a single research initiative, grant, or study. A single BioProject accession, such as PRJNAxxxxxx, provides a stable identifier and descriptive metadata for the whole project, which may span multiple organisms, data types, and experimental aims. BioProject itself holds no data files. Beneath it, the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO) are the specialized repositories that hold the raw and processed data generated by the project, and their scopes are defined by data type rather than by scientific question.

The SRA is the archive for raw high-throughput sequencing data. It stores nucleotide reads and their quality scores from technologies such as Illumina, PacBio, and Oxford Nanopore. Its records are organized into studies (SRP), samples (SRS), experiments (SRX), and runs (SRR), and these are linked to the parent BioProject. GEO, in contrast, is a curated repository for functional genomics data: assays that measure gene expression, chromatin accessibility, or protein-DNA interactions, such as microarray, RNA-seq (for expression quantification), ChIP-seq, and ATAC-seq studies. Although RNA-seq involves sequencing, its processed expression matrices and normalized counts are held in GEO, while the underlying raw sequence reads are held in the SRA, and the two archives are cross-linked. A single BioProject may therefore have data in both SRA and GEO, with the records tied together through shared metadata and accession numbers.
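Because each archive uses its own accession prefixes, the prefix alone usually tells you which database and which record level an identifier belongs to. The following is a minimal sketch of that idea, not an official NCBI tool. The function name `classify_accession` and the pattern table are assumptions made for illustration, and the table is not exhaustive (GSE and GSM are the GEO Series and GEO Sample prefixes).

```python
import re

# Illustrative, non-exhaustive table of accession prefixes.
# The E and D variants are the equivalents issued by ENA and DDBJ,
# which exchange data with NCBI.
ACCESSION_PATTERNS = [
    (re.compile(r"^PRJ(NA|EB|DB)\d+$"), "BioProject", "project"),
    (re.compile(r"^[SED]RP\d+$"), "SRA", "study"),
    (re.compile(r"^[SED]RS\d+$"), "SRA", "sample"),
    (re.compile(r"^[SED]RX\d+$"), "SRA", "experiment"),
    (re.compile(r"^[SED]RR\d+$"), "SRA", "run"),
    (re.compile(r"^GSE\d+$"), "GEO", "series"),
    (re.compile(r"^GSM\d+$"), "GEO", "sample"),
]


def classify_accession(accession):
    """Return (database, record level) for an accession, or None if unrecognized."""
    normalized = accession.strip().upper()
    for pattern, database, level in ACCESSION_PATTERNS:
        if pattern.match(normalized):
            return database, level
    return None
```

For example, `classify_accession("SRR1234567")` identifies the string as an SRA run, while an identifier with an unknown prefix yields `None` rather than a guess.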

In practice, the relationship is established through submission workflows and metadata linkage. For a direct SRA submission, the researcher typically registers a BioProject first to obtain its accession, then cites that accession in the SRA metadata, which creates the formal parent-child link. For sequencing data submitted through GEO, the submitter usually uploads both raw and processed files to GEO, and GEO passes the raw reads on to SRA and arranges the associated BioProject record, so these do not need to be registered separately. The system allows one-to-many relationships: one BioProject can reference multiple SRA datasets (e.g., genomic sequencing, metagenomic runs) and can be linked to GEO Series (GSE), the top-level units in GEO that organize a set of related samples. For a given dataset such as an RNA-seq study, the raw FASTQ files end up in the SRA, while the processed gene-level count matrix, sample metadata, and analysis-ready files are held in GEO as a Series. The databases maintain reciprocal links, so users can navigate from a GEO entry to the underlying raw reads in SRA and back, and both trace back to the unifying BioProject.
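The one-to-many links described above can be sketched as a small in-memory model. This is only an illustration of the relationships, not NCBI's actual schema or API. The class names, method names, and all accessions below are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeoSeries:
    """A GEO Series (GSE) and the SRA runs holding its raw reads."""
    accession: str
    run_accessions: List[str] = field(default_factory=list)


@dataclass
class BioProject:
    """Umbrella record: references SRA runs and GEO Series, holds no data files."""
    accession: str
    run_accessions: List[str] = field(default_factory=list)
    series: List[GeoSeries] = field(default_factory=list)

    def runs_for_series(self, gse: str) -> List[str]:
        """Navigate GEO -> SRA: raw-read runs behind a given Series."""
        for entry in self.series:
            if entry.accession == gse:
                return list(entry.run_accessions)
        return []

    def series_for_run(self, srr: str) -> Optional[str]:
        """Navigate SRA -> GEO: the Series a run belongs to, if any."""
        for entry in self.series:
            if srr in entry.run_accessions:
                return entry.accession
        return None


# Placeholder accessions, invented for this example.
project = BioProject(
    accession="PRJNA000001",
    run_accessions=["SRR0000001", "SRR0000002", "SRR0000003"],
    series=[GeoSeries("GSE000001", ["SRR0000001", "SRR0000002"])],
)
```

In this toy project, two runs back the RNA-seq Series, while the third run (say, a genomic library with no GEO entry) is linked only to the BioProject, which is why `series_for_run` can return `None`.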

This tripartite structure is fundamental to data provenance, reproducibility, and integrative analysis. BioProject provides the big-picture context, SRA ensures the preservation of foundational sequencing data, and GEO offers a curated environment for the functional genomic interpretations derived from that data. The relationship underscores a data management philosophy where projects are organized by scientific aim, but archival is optimized by data type and utility. For the research community, understanding this relationship is key to effectively depositing, locating, and reusing the vast amounts of public genomic data, as it maps the pathway from a broad research question (BioProject) to the specific experimental data files (in SRA and GEO) needed for replication or meta-analysis.

