Where can I see the length and CDS length of a gene found in NCBI?

Question

Accepted Answer

To see the length and CDS length of a gene in NCBI, start at the Gene record and then follow its links to the associated mRNA and protein records; the Gene summary page does not report both figures in one place. Search the NCBI Gene database for your gene of interest and select the entry for the correct organism. The "Genomic context" section gives the gene's location on the chromosome as start and end coordinates. The gene length is the end coordinate minus the start coordinate plus one, and it covers the whole locus, including introns and untranslated regions. This genomic span is not the coding sequence (CDS) length. The CDS is the part of the transcript that is actually translated into protein, and its length comes from the curated Reference Sequence (RefSeq) mRNA records linked from the Gene page.

To find the CDS, go to the "Reference Sequences" section of the Gene record, typically under the subheading "mRNA and Protein(s)." It links to RefSeq transcript accessions, usually with the prefix NM_ (curated mRNA) or XM_ (computationally predicted model). Click one of these mRNA accessions, preferably a reviewed NM_ record, to open its Nucleotide database entry. On the mRNA record, the "Features" table lists the exact coordinates of the "CDS" feature in the form start..end. The CDS length is the end coordinate minus the start coordinate plus one; for example, a CDS annotated at 101..1300 is 1,200 nucleotides long. You can also use the "CDS" link to view that part of the sequence, where the nucleotide count is often displayed.
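The coordinate arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming the CDS location has been copied from the Features table as a GenBank-style string such as `101..1300` or `join(1..100,201..300)`. The function name `cds_length` and the coordinates used are hypothetical, not taken from any particular record.

```python
import re


def cds_length(location: str) -> int:
    """Return the length in nucleotides of a GenBank-style location string.

    Handles a single range ("start..end") and join(...) of several ranges.
    Partial-end markers (< and >) are ignored, so for a partial CDS the
    result only covers the annotated portion.
    """
    # Collect every "start..end" pair, tolerating optional < or > markers.
    ranges = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    if not ranges:
        raise ValueError(f"no start..end range found in {location!r}")
    # Coordinates are 1-based and inclusive, hence the +1 for each segment.
    return sum(int(end) - int(start) + 1 for start, end in ranges)
```

Called as `cds_length("101..1300")`, this returns 1200. On a spliced mRNA record the CDS is normally a single range; the `join(...)` case matters mainly when the CDS is annotated on a genomic record.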

If you prefer a number that needs no coordinate arithmetic, the corresponding protein record is often more straightforward. The mRNA page links directly to the protein product (accession starting with NP_ or XP_), and the top of that Protein database entry lists the number of amino acids (aa) in the sequence. Each amino acid is encoded by three nucleotides, but the annotated CDS of a complete transcript also includes the stop codon, which has no counterpart in the protein. The CDS length in nucleotides is therefore (amino acid count + 1) × 3; multiplying the amino acid count by three alone gives the coding length without the stop codon. For a reviewed RefSeq record, this is a curated figure. Note that a single gene can have multiple splice variants, each with its own CDS length, so for standard reporting use the primary or canonical transcript indicated in the Gene record and state which accession you used.
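The conversion between protein length and CDS length can be written out as a small sketch. It assumes a complete CDS whose annotated range includes the stop codon, which is the usual case on GenBank and RefSeq mRNA records; the stop codon is not represented in the protein sequence, so one extra codon is added. The function names and the `includes_stop` parameter are hypothetical.

```python
def cds_length_from_protein(aa_count: int, includes_stop: bool = True) -> int:
    """Convert a protein length in amino acids to a CDS length in nucleotides.

    With includes_stop=True (an assumption that fits complete CDS features),
    one extra codon is counted for the stop codon.
    """
    if aa_count < 0:
        raise ValueError("aa_count must be non-negative")
    codons = aa_count + (1 if includes_stop else 0)
    return codons * 3


def protein_length_from_cds(cds_nt: int, includes_stop: bool = True) -> int:
    """Convert a CDS length in nucleotides back to a protein length."""
    if cds_nt < 0 or cds_nt % 3 != 0:
        raise ValueError("a complete CDS length must be a non-negative multiple of 3")
    codons = cds_nt // 3
    if includes_stop:
        if codons == 0:
            raise ValueError("CDS is too short to contain a stop codon")
        codons -= 1
    return codons
```

For a 393-residue protein, `cds_length_from_protein(393)` gives 1182 nucleotides, while `cds_length_from_protein(393, includes_stop=False)` gives 1179. Comparing either value with the coordinates in the Features table is a quick consistency check.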

Keeping these two lengths apart matters for accuracy. The genomic span from the Gene record includes non-coding regions, so using it in any analysis of coding potential would be misleading. The data are split across linked records (Gene, Nucleotide, and Protein), which mirrors the biological hierarchy of gene, transcript, and protein, and each figure has to be read from the right record. For programmatic access, NCBI's E-utilities API can retrieve these features. For manual queries, navigating from the Gene record through a RefSeq mRNA to its protein product remains the most reliable way to obtain both the gene's total locus length and its exact CDS length.
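For the programmatic route, a minimal sketch using only the Python standard library might look like the following. It assumes the E-utilities `efetch` endpoint with `db=nuccore`, `rettype=gb`, and `retmode=text`, which returns a GenBank flat file for an mRNA accession. The helper names are hypothetical, and the parser is deliberately simple: it reads only the first line of each CDS location, which is enough for the single-range CDS found on most mRNA records but not for long `join(...)` locations that wrap onto several lines.

```python
import re
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str) -> str:
    """Build an efetch URL that returns a GenBank flat file as plain text."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def cds_locations(genbank_text: str) -> List[str]:
    """Return the location string of every CDS feature in a GenBank flat file."""
    # Feature keys are indented by five spaces in the FEATURES table.
    return re.findall(r"^ {5}CDS +(\S+)", genbank_text, flags=re.MULTILINE)


def cds_lengths(genbank_text: str) -> List[int]:
    """Return the length in nucleotides of each CDS feature found."""
    lengths = []
    for location in cds_locations(genbank_text):
        ranges = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
        # Coordinates are 1-based and inclusive, hence the +1 per segment.
        lengths.append(sum(int(end) - int(start) + 1 for start, end in ranges))
    return lengths


def fetch_cds_lengths(accession: str) -> List[int]:
    """Download one record from NCBI and return its CDS lengths (needs network)."""
    with urlopen(efetch_url(accession)) as response:
        return cds_lengths(response.read().decode("utf-8"))
```

A call such as `fetch_cds_lengths("NM_000546")` would download the record and return a list with one length per CDS feature. For more than occasional use, NCBI asks clients to limit their request rate and recommends registering an API key.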
