Disease Motifs
Investigating the Biomechanics of Disease

Techniques Used


BLAST (Basic Local Alignment Search Tool)

The method used to find these motifs is really a misuse, or abusive use, of an online program called BLAST (Basic Local Alignment Search Tool). This program finds proteins that are similar to a query sequence. If you have a protein sequence, or only part of one, and want to know what that protein is or which protein it most resembles, you enter the query sequence into BLAST, which searches a protein database (i.e. a long list of protein sequences) for the most similar proteins. The result of a protein BLAST search is a list of proteins ranked in order of similarity, with the proteins most similar to the query at the top and progressively more dissimilar proteins further down. The proteins at the top of the list are the most suitable candidates for identifying the original query sequence.

E-Values

A statistic called the E-value is used to rank the proteins in the list. The E-value indicates the statistical significance of each returned protein: the smaller the E-value, the more significant the match and the higher the protein appears in the list of results.
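The ranking described above amounts to sorting the hits in ascending order of E-value. A minimal sketch in Python follows; the hit names and E-values are invented purely for illustration and are not real BLAST output.

```python
# Hypothetical hits as (protein name, E-value) pairs.
# The names and values are invented for illustration only.
hits = [
    ("protein_C", 3.0),
    ("protein_A", 1e-70),
    ("protein_B", 2e-12),
]


def rank_hits(hits):
    """Return the hits sorted so the smallest E-value comes first."""
    return sorted(hits, key=lambda hit: hit[1])


ranked = rank_hits(hits)
for name, evalue in ranked:
    print(f"{name}\t{evalue:g}")
```

The hit with the smallest E-value ends up at the top of the list, which is the order in which a BLAST report presents its results.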

Typically it is desirable for the returned proteins to have very low E-values, such as 1 × 10^-70 (one times ten to the power of minus seventy), which is of course a very small number. The last thing you would want is an E-value greater than 1. The E-value is the number of matches of at least that quality that would be expected by chance alone, so E-values such as 200 or 500 mean that the match is no better than chance. For identification purposes those proteins are effectively unrelated to your query sequence and should be disregarded.
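To see why large E-values indicate chance matches, it can help to look at how an E-value relates to an alignment score. A commonly quoted approximation is E = m × n / 2^S, where S is the bit score of the alignment, m is the query length and n is the total length of the database. The sketch below applies that approximation; the lengths and scores are hypothetical, and real BLAST applies further corrections (such as effective lengths), so its reported values will differ.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n / 2**S.

    This is a simplification; BLAST itself uses corrected
    (effective) lengths, so real reports will not match exactly.
    """
    return query_len * db_len / 2 ** bit_score


# Hypothetical numbers: a 250-residue query searched against a
# database containing 1,000,000 residues in total.
for bits in (10, 20, 40):
    print(bits, expected_chance_hits(bits, 250, 1_000_000))
```

Under these assumptions a low-scoring alignment is expected to turn up many times by chance, while a high-scoring one is expected far less than once.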

Abusive use of the BLAST program

In this work, however, there is no need to identify the protein in question: the protein is already known, and its sequence was probably taken from the database in the first place, so it would be pointless to find it again with a BLAST search. In a BLAST search one can specify whether to search the whole database or only a bacterial or viral database (i.e. one containing only bacterial or viral proteins respectively). If one were to run a BLAST search for the Human Prion protein against a bacterial database with the E-value threshold set at 1 × 10^-2, no results would be returned, because few, if any, bacterial proteins show significant similarity to the Human Prion protein. This is a correct result, and for most people that would be the end of the matter.

The misuse or abusive use of BLAST starts with raising the E-value threshold so that hits with very high E-values are returned. Such hits are of no value in identifying the protein, but since the protein is already known this is not a problem. The real aim is to find microbial proteins that contain small isolated islands of amino acids (motifs) that are similar or identical to parts of the query sequence, while the rest of the protein may be totally different. To get any microbial results at all, the E-value setting must be increased to allow E-values of 200 or more. By default such hits are not displayed: only hits with E-values near 0.1 or 1 are shown, because the E-value acts as a discriminator and prevents the display of results with larger E-values. Changing the discriminator in this way, and searching for proteins that are in the main very dissimilar, is not the correct way to use BLAST, hence the term 'abusive' for the way the program is being used.
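The idea of relaxing the E-value cut-off and then looking for small islands of identical residues can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used on this site: it does not run BLAST, it assumes the aligned query and subject strings have already been extracted from a report, and the function names, hit name, sequences and E-value are all hypothetical.

```python
def identical_islands(query_aln, subject_aln, min_len=4):
    """Find runs of identical residues in a pairwise alignment.

    Both strings must have the same length; '-' marks a gap.
    Returns a list of (start_column, motif) tuples.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have equal length")
    islands = []
    run_start = None
    for i, (q, s) in enumerate(zip(query_aln, subject_aln)):
        if q == s and q != "-":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                islands.append((run_start, query_aln[run_start:i]))
            run_start = None
    # Close a run that reaches the end of the alignment.
    if run_start is not None and len(query_aln) - run_start >= min_len:
        islands.append((run_start, query_aln[run_start:]))
    return islands


def motif_candidates(hits, max_evalue, min_len=4):
    """Keep hits at or below the E-value cut-off that contain islands."""
    results = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue  # the discriminator hides this hit
        islands = identical_islands(hit["query_aln"], hit["subject_aln"], min_len)
        if islands:
            results[hit["name"]] = islands
    return results


# One invented hit with a very high E-value (not real data).
hits = [
    {
        "name": "microbial_protein_1",
        "evalue": 180.0,
        "query_aln": "MKTAYIAKQRQISFVK",
        "subject_aln": "LLTAYIAGGRQISFAA",
    },
]

strict = motif_candidates(hits, max_evalue=1.0)     # conventional cut-off
relaxed = motif_candidates(hits, max_evalue=200.0)  # 'abusive' cut-off
print(strict)
print(relaxed)
```

With the strict cut-off the hit is discarded before it is ever examined; with the relaxed cut-off it survives and its short identical stretches can be listed as candidate motifs, even though the two sequences differ elsewhere.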

Originally this work began by searching for a single motif in one particular protein, which happened to be the Human Prion protein. It was then realised that analyzing multiple proteins would give clearer results than looking at just one. Diseases in which several proteins may be implicated in the disease process would therefore be better targets for evaluation. Such an analysis would probably produce a series of motifs rather than a single one, similar in concept to a 'fingerprint' associating a particular pathogen with a disease through that series of motifs.

With this in mind it was decided to search the Swiss-Prot protein database, and several databases have been created from it for future work. There are 20,277 human proteins in a database of nearly half a million proteins (at the time of writing, though this number will change). From this human database several subsets have been derived, such as the cancer database with 1,689 proteins, which can be subdivided into cancer sets such as lung, breast, colon and prostate.

The problem here will be the number of false positives included in these data sets. In the Prion group of 8 proteins, for example, one protein that was pulled out was ENOX2_HUMAN. It was selected because its entry contains the statement 'Has several properties associated with prions including resistance to proteases', which includes the word 'prion', the keyword used in the search. Is it good enough to associate one protein with another in this almost subjective manner? How relevant this protein's prion-associated properties will be, in terms of yielding a disease motif that has some connection with a motif obtained from the Prion protein itself, has yet to be discovered. Regardless of the number of false positives and false negatives (entries that should have been found but were not), provided there are enough true positives there will hopefully be some chance of success.
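The way a keyword search pulls in an entry like ENOX2_HUMAN can be illustrated with a simple substring match on annotation text. This is a minimal sketch and an assumption about how such a search behaves, not the actual query run against Swiss-Prot. The ENOX2_HUMAN text is the statement quoted above; the other two annotations, and the entry name EXAMPLE_HUMAN, are placeholders invented for the example.

```python
def select_by_keyword(entries, keyword):
    """Return entry names whose annotation text contains the keyword."""
    keyword = keyword.lower()
    return [
        name
        for name, annotation in entries.items()
        if keyword in annotation.lower()
    ]


entries = {
    # Placeholder annotation for the prion protein entry (illustrative).
    "PRIO_HUMAN": "Major prion protein.",
    # Statement quoted in the text above.
    "ENOX2_HUMAN": (
        "Has several properties associated with prions "
        "including resistance to proteases"
    ),
    # Invented entry that should not be selected.
    "EXAMPLE_HUMAN": "Hypothetical annotation with no matching term.",
}

prion_group = select_by_keyword(entries, "prion")
print(prion_group)
```

The match is purely textual: ENOX2_HUMAN is selected because 'prions' contains 'prion', not because of any established biological link, which is exactly how false positives enter a keyword-derived data set.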