Making use of BLAST, example on the COVID-19 genome

, par Stéphane

BLAST is a very powerful and widely used method for querying large sequence databases for sequence proximity with a sequence of interest. This difficult problem requires a fast algorithm designed for nucleic and protein sequences only since their alphabet is limited (5 bases for nucleotides, 21 amino acids for protein), but sequence length can be very long (more than 40000 amino). An example of BLAST installation and usage is provided below.

BLAST Installation

Basic Local Alignment Search Tool executables can be found online from the NCBI server : but if you are using a Linux-based system, it is often as easy as

sudo apt install ncbi-blast+

Once downloaded and installed, you’ll get a list of software for performing your sequence queries.

  • blastn : searching in a nucleotide database from a nucleotide query
  • blastp : searching in a protein database from an amino acids query
  • blastx : searching in a protein database from a translated nucleic sequence
  • tblastn : searing in a nucleotide database from a translated amino acids query
  • When possible, I use protein queries on protein databases, since this is more efficient :
  • three nucleotides define a codon, and thus one amino acid
  • some mutation are synonymous, i.e. they do not change the amino acid

Data retrieval : Sars-Cov-2 genome, Sars-Cov-1 Hemaglutinin-Esterase

All SARS-Cov-2 sequences are regrouped on the COVID-19 NCBI dedicated portal, and more specifically, the initial SARS-Cov-2 genome deposited on January, 13th, 2020 is accessible under the reference NC_045512.

This sequence is coming from the discovery of SARS-Cov-2 published in Nature in February : Wu, F., Zhao, S., Yu, B. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020).

As of writing the genome has been revised on July, 18th, 2020. The reference SARS-Cov-2 reference Genome is thus NC_045512.2. The core sequence has not been modified, only the virus name.
The complete nucleotide sequence of the virus is provided below in zip format for convenience :

SARS-Cov-2 complete genome

Since we want to search for the presence (or absence) of and Hemalutinin-Esterase gene, as found in other SARS-Cov genomes (or in other viral genomes), we take the reference protein sequence of this protein from the Human coronavirus OC43 (HCoV-OC43) strain, entry A0A482K8Y4.
The sequence (in fasta format) of the protein is easily accessible under the "Sequence" tab in Uniprot, but we also provide the link here for convenience (again as a zipped file).

Hemagglutinin-esterase of Human coronavirus OC43

Preparing the genome for BLAST

To execute the search in a reasonable time, BLAST needs to have the database sequence mostly in fasta format (text) pre-processed into 3 binary files :

  • nsq : sequence data
  • pin : sequence indices
  • phr : complete identifiers

The command to execute is :

makeblastdb -in sars-cov-2-NC_045512.2.fasta -out sars2Genome -dbtype nucl

The result of this command is :

Building a new DB, current time: 11/20/2020 13:41:28
New DB name:   /home/stephane/Nextcloud/Enseignement/Bioinformatique/BLAST/sars2Genmoe
New DB title:  sars-cov-2-NC_045512.2.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.000279903 seconds.

Since we took the raw nucleotide genome, there is only one (big !) sequence found and indexed.
Please note that we have very restricted use of the makeblastdb command, more options are available in the online documentation.
To search for protein results from a protein query into a nucleotide query we use tblastn (as indicated above) :

blast -query A0A482K8Y4.fasta -db sars2Genome

Analyzing the Results

The BLAST output by default is quite verbose. The complete log is available on the following file :

BLAST result on SARS-Cov-2, query from Hemaglutinin-Esterase of HCoV-OH43

An extract of the results are presented below :

> NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate 
Wuhan-Hu-1, complete genome

Score = 21.6 bits (44),  Expect = 1.3, Method: Compositional matrix adjust.
Identities = 9/24 (38%), Positives = 14/24 (58%), Gaps = 6/24 (25%)
Frame = -2

Query  27     VSHVNGDWFL------FGDSRSDC  44
             +SH++  W+L       G +RSDC

The first line indicates the identifier found for the identified positions. Since we did not process the genome, we only have the entire genome and not the annotated one (containing ORFs, transcripts, and proteins).

The line starting with "Score" contains BLAST information concerning the reliability of the search, If you are not familiar with these scores, you can have a look at the NCBI YouTube channel.

Important information to consider is the Expect value (1.3, which is not good), but also the Frame, -2, indicative that an offset is required from the initial genome translation site to find the open reading frame.
The lines starting with Query and Sbject (Subject) indicate where a match between the query sequence (the Hemaglutinin-Esterase from HCov-OC43) and the reference genome is found.

This shows that in the HE sequence, from position 27 to position 44, a match is found in the SARS-CoV-2 genome at positions between 24274 and 24203. Please remark that the numbering is reversed in the output, this is indicative of the putative presence os this sequence on the reverse-complement strand of the RNA.

The rest of the BLAST output is shown below.

Score = 21.6 bits (44),  Expect = 1.4, Method: Compositional matrix adjust.
Identities = 7/18 (39%), Positives = 11/18 (61%), Gaps = 0/18 (0%)
Frame = -2

             +K IC N  +DF  V ++
Sbjct  23452  SKLICMNSNRDFCAVNIL  23399

Score = 21.2 bits (43),  Expect = 1.9, Method: Compositional matrix adjust.
Identities = 12/46 (26%), Positives = 22/46 (48%), Gaps = 3/46 (7%)
Frame = -3

             Y +L  + V  +YN  +        CKS   V++  +++ PQA + 

Score = 21.2 bits (43),  Expect = 1.9, Method: Compositional matrix adjust.
Identities = 10/27 (37%), Positives = 14/27 (52%), Gaps = 6/27 (22%)
Frame = +1

             I+G L   N        +NG+W+ FGD
Sbjct  14041  IVGVLTLDN------QDLNGNWYDFGD  14103

Score = 19.2 bits (38),  Expect = 7.1, Method: Compositional matrix adjust.
Identities = 8/27 (30%), Positives = 13/27 (48%), Gaps = 0/27 (0%)
Frame = +2

             +D RWN  R   +      Q  +C+F+

In short, it is possible to find very tiny portions of the HE protein at the genomic level of SARS-Cov-2 (although barely significantly). If you still want to go deeper into BLAST interpretation, here is a nice video on it !

What happens ?

If we glue up the results, this means that we can find some matches in the SARS-Cov-2 genome, respectively on 24 amino acids + 18 + 46 +27 + 27, for a total of 142, or about one-third of the total protein length. But this difficult to imagine. This would require the virus polymerase to do multiple back-and-forth movements between frames (-2, then -3, then +1 and +2).

This seems very unlikely...

On a side note, there have been examples of Quality Control RNA Replication or Transcription "slippery" of coronaviruses, but if you want to know more about the story, please see the wonderful video of Pr. Britt Glaunsinger from Berkeley about that and coronaviruses, in general.

Going Further

To sum up, we discovered how to install and use BLAST, then applied it to an existing question and answered it. There is no Hemaglutinin-Esterase in SARS-CoV-2, contrary to some other beta coronaviruses.
If you still want to know more about BLAST but also about SARS-CoV-2 ongoing research efforts, you can visit the following links :