Making use of BLAST, example on the COVID-19 genome

mercredi 28 juin 2023, par S. Téletchéa

Sommaire

BLAST Installation
Data retrieval : Sars-Cov-2 genome, Sars-Cov-1 Hemaglutinin-Esterase
SARS-Cov-2 complete genome
Preparing the genome for BLAST
Analyzing the Results
What happens ?
Going Further

BLAST is a very powerful and widely used method for querying large sequence databases for sequence proximity with a sequence of interest. This difficult problem requires a fast algorithm designed for nucleic and protein sequences only since their alphabet is limited (5 bases for nucleotides, 21 amino acids for protein), but sequence length can be very long (more than 40000 amino). An example of BLAST installation and usage is provided below.

BLAST Installation

Basic Local Alignment Search Tool executables can be found online from the NCBI server : https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ but if you are using a Linux-based system, it is often as easy as

sudo apt install ncbi-blast+

Once downloaded and installed, you’ll get a list of software for performing your sequence queries.

blastn : searching in a nucleotide database from a nucleotide query
blastp : searching in a protein database from an amino acids query
blastx : searching in a protein database from a translated nucleic sequence
tblastn : searing in a nucleotide database from a translated amino acids query
When possible, I use protein queries on protein databases, since this is more efficient :

three nucleotides define a codon, and thus one amino acid
some mutation are synonymous, i.e. they do not change the amino acid

Data retrieval : Sars-Cov-2 genome, Sars-Cov-1 Hemaglutinin-Esterase

All SARS-Cov-2 sequences are regrouped on the COVID-19 NCBI dedicated portal, and more specifically, the initial SARS-Cov-2 genome deposited on January, 13th, 2020 is accessible under the reference NC_045512.

This sequence is coming from the discovery of SARS-Cov-2 published in Nature in February : Wu, F., Zhao, S., Yu, B. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020). https://doi.org/10.1038/s41586-020-2008-3

As of writing the genome has been revised on July, 18th, 2020. The reference SARS-Cov-2 reference Genome is thus NC_045512.2. The core sequence has not been modified, only the virus name.
The complete nucleotide sequence of the virus is provided below in zip format for convenience :

https://web.archive.org/web/20230325065144/http://www.steletch.org/IMG/zip/sars-cov-2-nc_045512.2.fasta.zip

SARS-Cov-2 complete genome

Since we want to search for the presence (or absence) of and Hemalutinin-Esterase gene, as found in other SARS-Cov genomes (or in other viral genomes), we take the reference protein sequence of this protein from the Human coronavirus OC43 (HCoV-OC43) strain, entry A0A482K8Y4.
The sequence (in fasta format) of the protein is easily accessible under the "Sequence" tab in Uniprot, but we also provide the link here for convenience (again as a zipped file).

https://web.archive.org/web/20230325065144/http://www.steletch.org/IMG/zip/a0a482k8y4.fasta.zip

Hemagglutinin-esterase of Human coronavirus OC43

Preparing the genome for BLAST

To execute the search in a reasonable time, BLAST needs to have the database sequence mostly in fasta format (text) pre-processed into 3 binary files :

nsq : sequence data
pin : sequence indices
phr : complete identifiers

The command to execute is :

makeblastdb -in sars-cov-2-NC_045512.2.fasta -out sars2Genome -dbtype nucl

The result of this command is :

Building a new DB, current time: 11/20/2020 13:41:28
New DB name: &nbsp; /home/stephane/Nextcloud/Enseignement/Bioinformatique/BLAST/sars2Genmoe
New DB title: &nbsp;sars-cov-2-NC_045512.2.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.000279903 seconds.

Since we took the raw nucleotide genome, there is only one (big !) sequence found and indexed.
Please note that we have very restricted use of the makeblastdb command, more options are available in the online documentation.
To search for protein results from a protein query into a nucleotide query we use tblastn (as indicated above) :

blast -query A0A482K8Y4.fasta -db sars2Genome

Analyzing the Results

The BLAST output by default is quite verbose. The complete log is available on the following file :

https://web.archive.org/web/20230325065144/http://www.steletch.org/IMG/zip/hequery.blast.zip

BLAST result on SARS-Cov-2, query from Hemaglutinin-Esterase of HCoV-OH43

An extract of the results are presented below :

&gt; NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate 
Wuhan-Hu-1, complete genome
Length=29903

Score = 21.6 bits (44), &nbsp;Expect = 1.3, Method: Compositional matrix adjust.
Identities = 9/24 (38%), Positives = 14/24 (58%), Gaps = 6/24 (25%)
Frame = -2

Query &nbsp;27 &nbsp; &nbsp; VSHVNGDWFL------FGDSRSDC &nbsp;44
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;+SH++ &nbsp;W+L &nbsp; &nbsp; &nbsp; G +RSDC
Sbjct &nbsp;24274 &nbsp;ISHLHSKWYL*CSTCTKGPTRSDC &nbsp;24203

The first line indicates the identifier found for the identified positions. Since we did not process the genome, we only have the entire genome and not the annotated one (containing ORFs, transcripts, and proteins).

The line starting with "Score" contains BLAST information concerning the reliability of the search, If you are not familiar with these scores, you can have a look at the NCBI YouTube channel.

Important information to consider is the Expect value (1.3, which is not good), but also the Frame, -2, indicative that an offset is required from the initial genome translation site to find the open reading frame.
The lines starting with Query and Sbject (Subject) indicate where a match between the query sequence (the Hemaglutinin-Esterase from HCov-OC43) and the reference genome is found.

This shows that in the HE sequence, from position 27 to position 44, a match is found in the SARS-CoV-2 genome at positions between 24274 and 24203. Please remark that the numbering is reversed in the output, this is indicative of the putative presence os this sequence on the reverse-complement strand of the RNA.

The rest of the BLAST output is shown below.

Score = 21.6 bits (44), &nbsp;Expect = 1.4, Method: Compositional matrix adjust.
Identities = 7/18 (39%), Positives = 11/18 (61%), Gaps = 0/18 (0%)
Frame = -2

Query &nbsp;271 &nbsp; &nbsp;TKAICFNKRKDFTPVQVV &nbsp;288
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;+K IC N &nbsp;+DF &nbsp;V ++
Sbjct &nbsp;23452 &nbsp;SKLICMNSNRDFCAVNIL &nbsp;23399


Score = 21.2 bits (43), &nbsp;Expect = 1.9, Method: Compositional matrix adjust.
Identities = 12/46 (26%), Positives = 22/46 (48%), Gaps = 3/46 (7%)
Frame = -3

Query &nbsp;140 &nbsp; &nbsp;YRSLTFVNVPYVYNGSAQ---ATAFCKSGSLVLNNPVYIAPQANSG &nbsp;182
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Y +L &nbsp;+ V &nbsp;+YN &nbsp;+ &nbsp; &nbsp; &nbsp; &nbsp;CKS &nbsp; V++ &nbsp;+++ PQA + 
Sbjct &nbsp;19686 &nbsp;YFTLLSIKVSLIYNIKSHIF*TLKSCKSVPEVISIKLFVNPQAKAS &nbsp;19549


Score = 21.2 bits (43), &nbsp;Expect = 1.9, Method: Compositional matrix adjust.
Identities = 10/27 (37%), Positives = 14/27 (52%), Gaps = 6/27 (22%)
Frame = +1

Query &nbsp;13 &nbsp; &nbsp; IIGSLGFHNPPTNVVSHVNGDWFLFGD &nbsp;39
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;I+G L &nbsp; N &nbsp; &nbsp; &nbsp; &nbsp;+NG+W+ FGD
Sbjct &nbsp;14041 &nbsp;IVGVLTLDN------QDLNGNWYDFGD &nbsp;14103


Score = 19.2 bits (38), &nbsp;Expect = 7.1, Method: Compositional matrix adjust.
Identities = 8/27 (30%), Positives = 13/27 (48%), Gaps = 0/27 (0%)
Frame = +2

Query &nbsp;288 &nbsp; &nbsp;VDSRWNNARQSDNMTAVACQPPYCYFR &nbsp;314
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;+D RWN &nbsp;R &nbsp; + &nbsp; &nbsp; &nbsp;Q &nbsp;+C+F+
Sbjct &nbsp;26744 &nbsp;LDHRWNCYRNGLSCRLDVAQLLHCFFQ &nbsp;26824

In short, it is possible to find very tiny portions of the HE protein at the genomic level of SARS-Cov-2 (although barely significantly). If you still want to go deeper into BLAST interpretation, here is a nice video on it !

What happens ?

If we glue up the results, this means that we can find some matches in the SARS-Cov-2 genome, respectively on 24 amino acids + 18 + 46 +27 + 27, for a total of 142, or about one-third of the total protein length. But this difficult to imagine. This would require the virus polymerase to do multiple back-and-forth movements between frames (-2, then -3, then +1 and +2).

This seems very unlikely...

On a side note, there have been examples of Quality Control RNA Replication or Transcription "slippery" of coronaviruses, but if you want to know more about the story, please see the wonderful video of Pr. Britt Glaunsinger from Berkeley about that and coronaviruses, in general.

Going Further

To sum up, we discovered how to install and use BLAST, then applied it to an existing question and answered it. There is no Hemaglutinin-Esterase in SARS-CoV-2, contrary to some other beta coronaviruses.
If you still want to know more about BLAST but also about SARS-CoV-2 ongoing research efforts, you can visit the following links :

23 new ORFs detected by ribosome profiling. Finkel, Y., Mizrahi, O., Nachshon, A. et al. The coding capacity of SARS-CoV-2. Nature (2020).
Mutations : D614G (and more ?)
Other mutations and large virus follow-up
Deep docking protocol for large screening of potential inhibitors (1.3 billion compounds !)
Become a JEDI and participate in the 1 billion COVID challenge
Join the ongoing fight against COVID-19 research efforts in Europe https://www.exscalate4cov.eu/ or worldwide https://covid.postera.ai/
Structure information listed by the NCBI : https://www.ncbi.nlm.nih.gov/Structure/SARS-CoV-2.html
All SARS-CoV-2 proteins with an atomic structure (and more) : https://www.rcsb.org/news?year=2020&article=5e74d55d2d410731e9944f52&feature=true
The dedicated Uniprot Portal for COVID-19 : https://covid-19.uniprot.org/

Molecular Modelling and GNU/Linux Where Penguins Meet Molecules