BLAST is a very powerful and widely used method for querying large sequence databases for sequence proximity with a sequence of interest. This difficult problem requires a fast algorithm designed for nucleic and protein sequences only since their alphabet is limited (5 bases for nucleotides, 21 amino acids for protein), but sequence length can be very long (more than 40000 amino). An example of BLAST installation and usage is provided below.
Basic Local Alignment Search Tool executables can be found online from the NCBI server : https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ but if you are using a Linux-based system, it is often as easy as
sudo apt install ncbi-blast+
Once downloaded and installed, you’ll get a list of software for performing your sequence queries.
- blastn : searching in a nucleotide database from a nucleotide query
- blastp : searching in a protein database from an amino acids query
- blastx : searching in a protein database from a translated nucleic sequence
- tblastn : searing in a nucleotide database from a translated amino acids query
- When possible, I use protein queries on protein databases, since this is more efficient :
- three nucleotides define a codon, and thus one amino acid
- some mutation are synonymous, i.e. they do not change the amino acid
All SARS-Cov-2 sequences are regrouped on the COVID-19 NCBI dedicated portal, and more specifically, the initial SARS-Cov-2 genome deposited on January, 13th, 2020 is accessible under the reference NC_045512.
This sequence is coming from the discovery of SARS-Cov-2 published in Nature in February : Wu, F., Zhao, S., Yu, B. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020). https://doi.org/10.1038/s41586-020-2008-3
As of writing the genome has been revised on July, 18th, 2020. The reference SARS-Cov-2 reference Genome is thus NC_045512.2. The core sequence has not been modified, only the virus name.
The complete nucleotide sequence of the virus is provided below in zip format for convenience :
Since we want to search for the presence (or absence) of and Hemalutinin-Esterase gene, as found in other SARS-Cov genomes (or in other viral genomes), we take the reference protein sequence of this protein from the Human coronavirus OC43 (HCoV-OC43) strain, entry A0A482K8Y4.
The sequence (in fasta format) of the protein is easily accessible under the "Sequence" tab in Uniprot, but we also provide the link here for convenience (again as a zipped file).
Hemagglutinin-esterase of Human coronavirus OC43
To execute the search in a reasonable time, BLAST needs to have the database sequence mostly in fasta format (text) pre-processed into 3 binary files :
- nsq : sequence data
- pin : sequence indices
- phr : complete identifiers
The command to execute is :
makeblastdb -in sars-cov-2-NC_045512.2.fasta -out sars2Genome -dbtype nucl
The result of this command is :
Building a new DB, current time: 11/20/2020 13:41:28 New DB name: /home/stephane/Nextcloud/Enseignement/Bioinformatique/BLAST/sars2Genmoe New DB title: sars-cov-2-NC_045512.2.fasta Sequence type: Nucleotide Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 1 sequences in 0.000279903 seconds.
Since we took the raw nucleotide genome, there is only one (big !) sequence found and indexed.
Please note that we have very restricted use of the makeblastdb command, more options are available in the online documentation.
To search for protein results from a protein query into a nucleotide query we use tblastn (as indicated above) :
blast -query A0A482K8Y4.fasta -db sars2Genome
The BLAST output by default is quite verbose. The complete log is available on the following file :
BLAST result on SARS-Cov-2, query from Hemaglutinin-Esterase of HCoV-OH43
An extract of the results are presented below :
> NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome Length=29903 Score = 21.6 bits (44), Expect = 1.3, Method: Compositional matrix adjust. Identities = 9/24 (38%), Positives = 14/24 (58%), Gaps = 6/24 (25%) Frame = -2 Query 27 VSHVNGDWFL------FGDSRSDC 44 +SH++ W+L G +RSDC Sbjct 24274 ISHLHSKWYL*CSTCTKGPTRSDC 24203
The first line indicates the identifier found for the identified positions. Since we did not process the genome, we only have the entire genome and not the annotated one (containing ORFs, transcripts, and proteins).
The line starting with "Score" contains BLAST information concerning the reliability of the search, If you are not familiar with these scores, you can have a look at the NCBI YouTube channel.
Important information to consider is the Expect value (1.3, which is not good), but also the Frame, -2, indicative that an offset is required from the initial genome translation site to find the open reading frame.
The lines starting with Query and Sbject (Subject) indicate where a match between the query sequence (the Hemaglutinin-Esterase from HCov-OC43) and the reference genome is found.
This shows that in the HE sequence, from position 27 to position 44, a match is found in the SARS-CoV-2 genome at positions between 24274 and 24203. Please remark that the numbering is reversed in the output, this is indicative of the putative presence os this sequence on the reverse-complement strand of the RNA.
The rest of the BLAST output is shown below.
Score = 21.6 bits (44), Expect = 1.4, Method: Compositional matrix adjust. Identities = 7/18 (39%), Positives = 11/18 (61%), Gaps = 0/18 (0%) Frame = -2 Query 271 TKAICFNKRKDFTPVQVV 288 +K IC N +DF V ++ Sbjct 23452 SKLICMNSNRDFCAVNIL 23399 Score = 21.2 bits (43), Expect = 1.9, Method: Compositional matrix adjust. Identities = 12/46 (26%), Positives = 22/46 (48%), Gaps = 3/46 (7%) Frame = -3 Query 140 YRSLTFVNVPYVYNGSAQ---ATAFCKSGSLVLNNPVYIAPQANSG 182 Y +L + V +YN + CKS V++ +++ PQA + Sbjct 19686 YFTLLSIKVSLIYNIKSHIF*TLKSCKSVPEVISIKLFVNPQAKAS 19549 Score = 21.2 bits (43), Expect = 1.9, Method: Compositional matrix adjust. Identities = 10/27 (37%), Positives = 14/27 (52%), Gaps = 6/27 (22%) Frame = +1 Query 13 IIGSLGFHNPPTNVVSHVNGDWFLFGD 39 I+G L N +NG+W+ FGD Sbjct 14041 IVGVLTLDN------QDLNGNWYDFGD 14103 Score = 19.2 bits (38), Expect = 7.1, Method: Compositional matrix adjust. Identities = 8/27 (30%), Positives = 13/27 (48%), Gaps = 0/27 (0%) Frame = +2 Query 288 VDSRWNNARQSDNMTAVACQPPYCYFR 314 +D RWN R + Q +C+F+ Sbjct 26744 LDHRWNCYRNGLSCRLDVAQLLHCFFQ 26824
In short, it is possible to find very tiny portions of the HE protein at the genomic level of SARS-Cov-2 (although barely significantly). If you still want to go deeper into BLAST interpretation, here is a nice video on it !
If we glue up the results, this means that we can find some matches in the SARS-Cov-2 genome, respectively on 24 amino acids + 18 + 46 +27 + 27, for a total of 142, or about one-third of the total protein length. But this difficult to imagine. This would require the virus polymerase to do multiple back-and-forth movements between frames (-2, then -3, then +1 and +2).
This seems very unlikely...
On a side note, there have been examples of Quality Control RNA Replication or Transcription "slippery" of coronaviruses, but if you want to know more about the story, please see the wonderful video of Pr. Britt Glaunsinger from Berkeley about that and coronaviruses, in general.
To sum up, we discovered how to install and use BLAST, then applied it to an existing question and answered it. There is no Hemaglutinin-Esterase in SARS-CoV-2, contrary to some other beta coronaviruses.
If you still want to know more about BLAST but also about SARS-CoV-2 ongoing research efforts, you can visit the following links :
- 23 new ORFs detected by ribosome profiling. Finkel, Y., Mizrahi, O., Nachshon, A. et al. The coding capacity of SARS-CoV-2. Nature (2020).
- Mutations : D614G (and more ?)
- Other mutations and large virus follow-up
- Deep docking protocol for large screening of potential inhibitors (1.3 billion compounds !)
- Become a JEDI and participate in the 1 billion COVID challenge
- Join the ongoing fight against COVID-19 research efforts in Europe https://www.exscalate4cov.eu/ or worldwide https://covid.postera.ai/
- Structure information listed by the NCBI : https://www.ncbi.nlm.nih.gov/Structure/SARS-CoV-2.html
- All SARS-CoV-2 proteins with an atomic structure (and more) : https://www.rcsb.org/news?year=2020&article=5e74d55d2d410731e9944f52&feature=true
- The dedicated Uniprot Portal for COVID-19 : https://covid-19.uniprot.org/