Working with the RefSeq database

1. GeneID

Given a gene symbol/name obtained from the ‘Gene’ database, the refseqR library enables us to retrieve the associated mRNA and protein accessions.

GeneID <- c("LOC101512347")
transcript <- refseq_fromGene(GeneID, sequence = "transcript", retries = 4)
protein <- refseq_fromGene(GeneID, sequence = "protein", retries = 4)

The mRNA transcript identifier (id) for LOC101512347 = XM_004487701.

The protein identifier (id) for LOC101512347 = XP_004487758.

Similarly, the function is effective when utilizing gene symbols that encode for multiple transcripts.

GeneID <- c("LOC105852298")
transcript <- refseq_fromGene(GeneID, sequence = "transcript", retries = 4)
protein    <- refseq_fromGene(GeneID, sequence = "protein", retries = 4)

The mRNA transcript id(s) for LOC105852298 = XM_027335934, XM_012717197.
The protein id(s) for LOC105852298 = XP_027191735, XP_012572651.

The function refseq_description returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier.

id <- c("LOC101512347")
refseq_description(id)
#> [1] "aldehyde dehydrogenase 22A1"

2. mRNA

Using the rentrez package, we can fetch data from NCBI. Here, the first 30 lines for accession “XM_004487701” :

mrna_gb <- rentrez::entrez_fetch(db= 'nuccore', id = "XM_004487701", rettype = 'gp') 
strsplit(mrna_gb, "\n")[[1]][1:30]
#>  [1] "LOCUS       XM_004487701            2927 bp    mRNA    linear   PLN 14-DEC-2018"
#>  [2] "DEFINITION  PREDICTED: Cicer arietinum aldehyde dehydrogenase 22A1"             
#>  [3] "            (LOC101512347), mRNA."                                              
#>  [4] "ACCESSION   XM_004487701"                                                       
#>  [5] "VERSION     XM_004487701.3"                                                     
#>  [6] "DBLINK      BioProject: PRJNA190909"                                            
#>  [7] "KEYWORDS    RefSeq."                                                            
#>  [8] "SOURCE      Cicer arietinum (chickpea)"                                         
#>  [9] "  ORGANISM  Cicer arietinum"                                                    
#> [10] "            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;" 
#> [11] "            Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;"          
#> [12] "            Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae; 50"
#> [13] "            kb inversion clade; NPAAA clade; Hologalegina; IRL clade; Cicereae;"
#> [14] "            Cicer."                                                             
#> [15] "COMMENT     MODEL REFSEQ:  This record is predicted by automated computational" 
#> [16] "            analysis. This record is derived from a genomic sequence"           
#> [17] "            (NC_021160.1) annotated using gene prediction method: Gnomon,"      
#> [18] "            supported by EST evidence."                                         
#> [19] "            Also see:"                                                          
#> [20] "                Documentation of NCBI's Annotation Process"                     
#> [21] "            "                                                                   
#> [22] "            On Dec 14, 2018 this sequence version replaced XM_004487701.2."     
#> [23] "            "                                                                   
#> [24] "            ##Genome-Annotation-Data-START##"                                   
#> [25] "            Annotation Provider         :: NCBI"                                
#> [26] "            Annotation Status           :: Full annotation"                     
#> [27] "            Annotation Name             :: Cicer arietinum Annotation Release"  
#> [28] "                                           102"                                 
#> [29] "            Annotation Version          :: 102"                                 
#> [30] "            Annotation Pipeline         :: NCBI eukaryotic genome annotation"

The refseq_mRNAfeat function serves as a wrapper built on top of entrez_summary from the rentrez package, designed to extract specific features from the obtained data. Typically, my focus lies on key features like id, accession, title, update, or sequence length (bp). However, you have the flexibility to tailor the function to extract additional features of interest from the esummary_list object.

transcript = c("XM_004487701", "XM_004488493", "XM_004501904")
feat = c("caption", "moltype", "sourcedb", "slen", "title")
refseq_mRNAfeat(transcript ,feat)

Another interesting function is refseq_mRNA2protein, which retrieves the protein accession associated with the provided mRNA.

transcript <- "XM_004487701"
refseq_mRNA2protein(transcript)
#> [1] "XP_004487758"

The CDS coordinates come in handy when we want to get the fasta sequence. We sometimes do not want the 5’UTR or 3’UTR contained in the mRNA sequence and are interested just in the CDS.

The function refseq_CDScoords creates an IRanges object with the CDS coordinates from an mRNA accession. The output object is the basis for refseq_CDSseq, which fetches the NCBI data, uses that coordinates and returns a DNAString object with the CDS nucleotide sequence.

refseq_CDScoords(transcript)
#> IRanges object with 1 range and 0 metadata columns:
#>                      start       end     width
#>                  <integer> <integer> <integer>
#>   XM_004487701.3       670      2457      1788
refseq_CDSseq(transcript)
#> DNAStringSet object of length 1:
#>     width seq                                               names               
#> [1]  1788 ATGGCGTTTTGGTGGTCTTTGCT...GTGGCAATAAGAAGGAGAAATGA XM_004487701.3

Here, the first 500 nucleotides of the mRNA ‘XM_004487701’:

mrna_fasta = rentrez::entrez_fetch(db="nuccore", id=transcript, rettype="fasta")
# take a look at the first 500 chars. 
cat(strwrap(substr(mrna_fasta, 1, 500)), sep="\n")
#> >XM_004487701.3 PREDICTED: Cicer arietinum aldehyde dehydrogenase 22A1
#> (LOC101512347), mRNA
#> GTTACCATGTCAACAAAAACTCTCAAGTCACTTTCTATTTGAAGCCGAGAAACCTATTATCTTTATGTCA
#> TGACAATTCCAAAATACATAACCCACATCTTTGCATGAATAGCATCACAATTCCCTAATTTTTTTATAAT
#> ACCCCTTAATCCATTTGTGGTCTACATATCGAAGTAAACCACTACACCCCCACTTTCTCTATAGATCTGT
#> GAGCTCGATCGCAATTTAGTTTGATTGTTACTTTATTTATTTATTAATCTCATTTTATATGTTTTCATTT
#> TCTTCTTGGAACCGATAAAGTCGTAGTTTATTCCTTTCTCAATTTGATGAAAAGTGCAAACTTGGAAAAG
#> AAAACAGGTTCACCTTTGAACTCAAATAAACAAGTACTACAATATCAAAACCC

Here, the first 60 nucleotides of the CDS from the mRNA ‘XM_004487701’:

substr(toString(refseq_CDSseq(transcript)), 1, 60)
#> [1] "ATGGCGTTTTGGTGGTCTTTGCTCGTTCTAGCATTCGCTTTCGCTATCTGCAAGTTCCTT"

As previously said, the function refseq_description returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier.

id <- "XM_004487701"
refseq_description(id)
#> [1] "aldehyde dehydrogenase 22A1"

3. Protein

Similarly to nucleotide sequences, refseq_protein2mRNA, retrieves the mRNA associated with the provided protein accession.

protein <- "XP_020244413"
refseq_protein2mRNA(protein)
#> [1] "XM_020388824"

Two specific functions prove useful for managing protein accessions: refseq_AA_length offers the amino acid length of the sequence, while refseq_mol.wt provides the molecular weight in Daltons.

refseq_AA_length(protein, retries = 4)
#> [1] 917
refseq_AA_mol_wt(protein)
#> [1] 104178

The refseq_AAseq function, fetches the NCBI data, and returns a DNAString object with the amino acid sequence.

refseq_AAseq(protein)
#> AAStringSet object of length 1:
#>     width seq                                               names               
#> [1]   917 MSTRRVRKTKGKIPKKKISVEKL...SMGPDWHKVEHIPSLMIDPTTGE XP_020244413

As previously mentioned, the refseq_description function ultimately provides the sequence description associated with a given accession. The identifier (id) can take the form of either a transcript, protein, or Gene identifier.

id <- "XP_020244413"
refseq_description(id)
#> [1] "probable disease resistance protein At1g58602"

4. Concluding Remarks

The package refseqR contains a number of functions to programmatically automatize some common operations.

Functions to apply on GeneID accessions

refseq_description
refseq_fromGene

Functions to apply on transcript id. accessions

refseq_GeneID
refseq_CDScoords
refseq_CDSseq
refseq_mRNA2protein

Functions to apply on protein id. accessions

refseq_GeneID
refseq_AAseq
refseq_AA_mol_wt
refseq_AA_length
refseq_protein2mRNA

I’d really appreciate your feedback. The whole code used in this tutorial is available from my Github repository. You can contact me by email or visit my website.

Córdoba, (Spain), 2024-07-10.

Working with the RefSeq database

0. Introduction

1. GeneID

2. mRNA

3. Protein

4. Concluding Remarks