How to upload transcriptome sequencing data to GEO?
Uploading transcriptome sequencing data to the Gene Expression Omnibus (GEO) is a structured process that requires careful preparation of metadata, raw sequence files, and processed data files to meet the repository's standards. The core requirement is a metadata spreadsheet, built from GEO's Microsoft Excel template, that describes the experimental design, sample characteristics, protocols, and data processing steps. This metadata is what makes the submitted data interpretable and reusable. The raw sequencing files, usually in FASTQ format, and the processed files, such as count matrices, are transferred by FTP to a private upload directory assigned to the submitter. The submission itself is managed through GEO's web-based submission portal, which provides the metadata template, the FTP transfer instructions, and the form used to submit the completed metadata spreadsheet.
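As a rough illustration of the file-preparation step, the sketch below builds a manifest of file names and MD5 checksums for a local submission folder and defines an upload helper based on Python's standard ftplib module. The function names, the suffix filter, and the idea of keeping a checksum manifest are assumptions made for illustration, not part of GEO's tooling; the host, credentials, and remote directory are placeholders to be replaced with the values GEO gives you for your account.

```python
import hashlib
from ftplib import FTP
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder, suffixes=(".fastq.gz", ".fq.gz", ".txt", ".tsv", ".csv")):
    """Map each file in `folder` whose name ends with one of `suffixes` to its MD5.

    The suffix list is a hypothetical default; adjust it to your own file types.
    """
    manifest = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.endswith(suffixes):
            manifest[path.name] = md5_of_file(path)
    return manifest


def upload_folder(folder, host, user, password, remote_dir):
    """Upload every regular file in `folder` to `remote_dir` on an FTP server.

    All connection values are placeholders supplied by the caller; nothing
    here is specific to GEO beyond the use of plain FTP.
    """
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                with open(path, "rb") as handle:
                    ftp.storbinary(f"STOR {path.name}", handle)
```

Recording checksums before the transfer gives you something to compare against if a curator later reports a truncated or corrupted file.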
The most common point of failure is inadequate or inconsistent metadata, not the file transfer itself. Investigators must define each sample's attributes, such as genotype, treatment, and time point, in the spreadsheet, and the file names listed for each sample must match the uploaded files exactly. GEO organizes data into a "Series," which represents the overall study and contains "Samples" (one record per sequenced library, typically an individual biological replicate), each associated with a "Platform" (the sequencing instrument or array used). For RNA-seq studies, the submission falls under GEO's high-throughput sequencing category, so the submitter records the instrument model in the metadata and must provide detailed library construction, alignment, and quantification parameters. The submission is not complete until a GEO curator has reviewed the package for consistency and compliance; this review often leads to an exchange with the submitter to resolve ambiguities in the sample annotations or methods descriptions before public release.
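Many metadata inconsistencies can be caught before a curator sees them. The following is a minimal sketch of such a pre-submission check, assuming the sample rows have already been read from the spreadsheet into dictionaries. The field names ("title", "genotype", "treatment", "time point", "raw_files") are hypothetical stand-ins and do not necessarily match the column headers of GEO's template.

```python
def check_sample_metadata(
    samples,
    available_files,
    required_fields=("title", "genotype", "treatment", "time point"),
):
    """Return a list of human-readable problems found in the sample rows.

    `samples` is a list of dicts, one per sample; `available_files` is the
    collection of file names present in the upload folder. Field names are
    illustrative assumptions, not GEO's official column names.
    """
    problems = []
    seen_titles = set()
    available = set(available_files)

    for index, sample in enumerate(samples, start=1):
        title = str(sample.get("title", "")).strip()
        label = title or f"row {index}"

        # Every required attribute must be present and non-empty.
        for field in required_fields:
            if not str(sample.get(field, "")).strip():
                problems.append(f"{label}: missing '{field}'")

        # Sample titles should be unique within the series.
        if title:
            if title in seen_titles:
                problems.append(f"{label}: duplicate title")
            seen_titles.add(title)

        # Each referenced raw file must exist under exactly that name.
        for name in sample.get("raw_files", []):
            if name not in available:
                problems.append(f"{label}: file '{name}' not found in upload folder")

    return problems
```

An empty list means the rows passed these particular checks; it does not guarantee that the submission satisfies GEO's own validation.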
Successful submission hinges on understanding the distinction between raw archival data and processed, analysis-ready data; GEO accepts both but has different expectations for each. The raw FASTQ files are the primary archival requirement and ensure the long-term reproducibility of the study. Processed files, such as normalized expression matrices, are the files most users will work with directly and should be provided in a format that clearly links gene identifiers to quantification values. It is advisable to start the submission well before manuscript submission or publication deadlines, as curation and processing can take several weeks. The final outcome is a stable GEO Series accession number (e.g., GSEXXXXX), which serves as the permanent, citable identifier for the dataset and integrates it into the broader ecosystem of NCBI search tools, linking it to related publications and genomic resources.
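To make the requirement that processed files link gene identifiers to quantification values more concrete, here is a small sketch that checks a tab-delimited matrix whose first column holds gene identifiers and whose remaining columns are samples. That layout, the function name, and the messages are assumptions for illustration; GEO accepts other processed-data layouts as well.

```python
import csv
import io


def check_count_matrix(text, expected_samples, delimiter="\t"):
    """Return a list of problems found in a gene-by-sample matrix given as text.

    Assumes the first row is a header and the first column holds gene IDs.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["matrix is empty"]

    problems = []
    header, body = rows[0], rows[1:]
    columns = header[1:]

    # Sample columns should match the sample names used in the metadata.
    for name in expected_samples:
        if name not in columns:
            problems.append(f"sample column missing: {name}")
    for name in columns:
        if name not in expected_samples:
            problems.append(f"unexpected column: {name}")

    seen_genes = set()
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        gene_id = row[0].strip()
        if not gene_id:
            problems.append(f"line {line_no}: empty gene identifier")
        elif gene_id in seen_genes:
            problems.append(f"line {line_no}: duplicate gene identifier {gene_id}")
        seen_genes.add(gene_id)

        # Report at most one non-numeric value per line to keep output short.
        for value in row[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value '{value}'")
                break

    return problems
```

For a file on disk, pass its contents, for example `check_count_matrix(Path("counts.tsv").read_text(), sample_names)`, where the file name and `sample_names` are placeholders.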