Mastering the Art of NCBI: It's a BLAST

From GcatWiki
Revision as of 06:20, 8 October 2009 by Clcarcelen (talk | contribs)

The National Center for Biotechnology Information (NCBI) is an organization founded in 1988 as a national resource that gives the public access to molecular biology information. NCBI creates numerous databases, online tools and research software programs to analyze genomes. The Basic Local Alignment Search Tool (BLAST) is an online tool designed to enable users to rapidly search through nucleotide and protein databases. While the website is designed for both novice and veteran users, mastering the tool can be daunting. This page provides a step-by-step guide to using BLAST and interpreting your results.


How to Use Basic BLAST - Nucleotide Search:

When you get to the main page, you may notice you have a number of options to choose from:

BLAST Home.png


To search for matching nucleotide sequences in the database, choose: Nucleotide blast.png

This link will take you to the page shown below.

Nucleotide BLAST Entry.png


In the entry box below Enter Query Sequence there are three possible methods of entry for your search. The first is bare sequence, which is simply the nucleotide sequence (ATCG, etc.) you wish to search for.

Enter Query Sequence.png

The second method uses FASTA format, shown below. This format requires the first line to be a descriptor beginning with the ">" character, followed by a return and the nucleotide sequence. The descriptor can be found on the website where the gene sequence was obtained.

FASTA Entry.png
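To make the FASTA layout concrete, here is a minimal Python sketch that splits FASTA text into descriptor and sequence pairs. The function name parse_fasta and the example record are hypothetical, made up for illustration; this is not code used by BLAST itself.

```python
def parse_fasta(text):
    """Split FASTA text into (descriptor, sequence) pairs."""
    records = []
    descriptor = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new descriptor line closes the previous record, if any.
            if descriptor is not None:
                records.append((descriptor, "".join(chunks)))
            descriptor = line[1:].strip()
            chunks = []
        elif descriptor is None:
            raise ValueError("sequence data found before a '>' descriptor line")
        else:
            chunks.append(line)  # sequence may span several lines
    if descriptor is not None:
        records.append((descriptor, "".join(chunks)))
    return records


# Hypothetical record, invented for this example.
example = """>example_gene hypothetical descriptor
ATGGCCATTG
TAATGGGCCG
"""
records = parse_fasta(example)
print(records)
```

The sequence lines are joined into a single string, so it does not matter how the sequence was wrapped when it was copied from its source.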

Finally, you may choose to use identifiers such as a gene's Accession Number as the query. Make sure the identifier contains no spaces between its letters and numbers; otherwise the pieces will be treated as separate queries, or BLAST will fail to read them.

Identifier Entry.png
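The three entry methods can be told apart with a rough check like the Python sketch below. The function classify_query and its rules are assumptions made for illustration (BLAST's own input detection is not described here): text starting with ">" is treated as FASTA, text made only of nucleotide letters as bare sequence, and anything else as an identifier.

```python
import re

# IUPAC nucleotide letters, including ambiguity codes.
NUCLEOTIDE_CHARS = set("ACGTUNRYKMSWBDHV")


def classify_query(text):
    """Guess which of the three entry methods a query uses (heuristic only)."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("empty query")
    if stripped.startswith(">"):
        return "fasta"
    # Remove all whitespace before checking the alphabet.
    compact = re.sub(r"\s+", "", stripped).upper()
    if set(compact) <= NUCLEOTIDE_CHARS:
        return "bare sequence"
    # Assumption: whatever is left is an identifier such as an accession number.
    return "identifier"


print(classify_query("ATCGGCTAAGCT"))
print(classify_query(">my gene\nATCGGCTA"))
print(classify_query("NM_000518.5"))
```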


Once you have entered your query, you must choose which database you wish to search.

Choose Database.png

The most widely used database is the Nucleotide Collection (nr/nt), since it encompasses a broad range of nucleotide sequences across all domains. You may choose to search another database, depending on your research.


You may wish to restrict your search hits to only those found in certain organisms, or to exclude those found in a certain organism. You may do so by entering the common name, the binomial name or the taxonomic identification. Clicking Exclude excludes hits found in this organism's genome. Furthermore, clicking the + allows you to include or exclude multiple organisms or taxa.

Choose Organism.png


You have the option to further narrow your search using Entrez Query, which limits the search to a subset of the selected BLAST database. This tool uses a specific syntax described on the NCBI website. It is a specialized way of narrowing search results and is entirely optional, since the methods already described provide good results.
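For readers who want to script the same choices, the sketch below assembles the parameters for a search submitted through NCBI's BLAST URL API instead of the web form. It only builds the request and does not send it. The helper names organism_term and build_put_params are hypothetical, the "all[filter] NOT ..." form used for exclusion is an assumption about a workable Entrez expression, and the query sequence is made up.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def organism_term(name, exclude=False):
    """Build an Entrez term that keeps or drops hits from one organism."""
    term = f'"{name}"[Organism]'
    # Assumption: exclusion is written as everything except the organism.
    return f"all[filter] NOT {term}" if exclude else term


def build_put_params(query, program="blastn", database="nt", entrez_query=None):
    """Collect the parameters for a search submission (CMD=Put)."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    return params


# Made-up query sequence, restricted to mouse sequences.
params = build_put_params(
    "ATGGCCATTGTAATGGGCCGCTG",
    entrez_query=organism_term("Mus musculus"),
)
print(BLAST_URL + "?" + urlencode(params))
```

Check the current NCBI documentation before sending requests like this, since database names and usage limits change over time.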


At this point, you need to choose the specificity of your search hits. You have three options: highly similar sequences (megablast), more dissimilar sequences (discontiguous megablast), and somewhat similar sequences (blastn). Megablast is the fastest and returns a small number of near-exact matches. Discontiguous megablast tolerates more mismatches and is suited to comparing related sequences from different species. Blastn is the slowest but most sensitive of the three, so it returns the largest number of more distantly related matches.

Choose Algorithm.png
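The same three choices exist in the standalone BLAST+ command-line program blastn, where they are selected with the -task option. The Python sketch below only assembles such a command and does not run it. The function blastn_command, the file name query.fasta, and the short labels in TASKS are hypothetical; the flags are those of the BLAST+ blastn program, which must be installed separately.

```python
import shlex

# Web-page option (shortened label) -> BLAST+ task name.
TASKS = {
    "highly similar": "megablast",
    "more dissimilar": "dc-megablast",
    "somewhat similar": "blastn",
}


def blastn_command(query_file, similarity, database="nt", evalue=0.001):
    """Return the argument list for a remote blastn search (not executed here)."""
    task = TASKS[similarity]
    return [
        "blastn",
        "-task", task,
        "-query", query_file,
        "-db", database,
        "-remote",              # search NCBI's servers instead of a local database
        "-evalue", str(evalue),
        "-outfmt", "6",         # tab-separated table of hits
    ]


cmd = blastn_command("query.fasta", "highly similar")
print(shlex.join(cmd))
```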


At this point, clicking BLAST will take you to some intermediate waiting pages, and then to a page similar to the one below.

BLAST Results.png

The color chart uses color coding to show how much of the query sequence each hit matched. The table below it provides statistical information about the results. The results can be sorted by clicking the heading of whichever column you wish to sort by. The key values you should look at when searching for a sequence match are Query Coverage, Max Identity, and E-Value. For the first two, you want a high percentage, which indicates a high level of matching. The E-Value, or Expect value, is the number of hits with a score at least this good that you would expect to find by chance in a database of this size, so the closer it is to zero, the less likely the match is due to chance. A good cutoff for a significant match is 0.001: anything smaller than that is generally treated as statistically significant. Examples of good E-values are 2e-98 and 3e-57.

Descriptions Table.png
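The screening described above can be sketched in Python as follows. The Hit class, the accession labels, and the coverage and identity numbers are hypothetical, invented for illustration; only the 0.001 cutoff and the example E-values 2e-98 and 3e-57 come from the text. The last function applies the standard relation P = 1 - exp(-E), which converts an E-value into the probability of seeing at least one such hit by chance; for small E-values the two numbers are nearly equal.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    query_coverage: float  # percent of the query covered by the hit
    max_identity: float    # percent identity of the best alignment
    e_value: float


def significant_hits(hits, max_e=0.001):
    """Keep hits below the E-value cutoff, best hits first."""
    keep = [h for h in hits if h.e_value < max_e]
    return sorted(keep, key=lambda h: (h.e_value, -h.query_coverage, -h.max_identity))


def chance_probability(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical hits, invented for this example.
hits = [
    Hit("hit_C", 40.0, 78.0, 0.5),
    Hit("hit_B", 95.0, 91.0, 3e-57),
    Hit("hit_A", 100.0, 99.0, 2e-98),
]
for hit in significant_hits(hits):
    print(hit.accession, hit.e_value, chance_probability(hit.e_value))
```

With these made-up numbers, hit_C is dropped because its E-value of 0.5 is above the cutoff, even though it would still appear in the results table.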

If you scroll down, NCBI provides detailed information on each hit that was returned, including information on what each hit encodes or what is encoded in that segment of DNA. These descriptions provide links to individual pages for each gene, which may be useful to your investigation.

Sample Hit.png


How to Use Basic BLAST - Protein Search:

To search protein sequences, return to the main page and click: Protein BLAST.png

This will take you to a page laid out like the one you encountered with the nucleotide search. In the Enter Query Sequence box, you may use the same three methods of entry previously described. However, the sequence must use only the single-letter amino acid code.
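A quick way to catch a mistyped protein query before submitting it is to check every character against the single-letter amino acid alphabet. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and it accepts only the 20 standard residues plus X for an unknown residue, which may be stricter than what the BLAST page itself accepts.

```python
# The 20 standard amino acids, plus X for an unknown residue (assumption).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def invalid_residues(sequence):
    """Return the characters that are not single-letter amino acid codes."""
    compact = "".join(sequence.split()).upper()
    return sorted(set(compact) - AMINO_ACIDS)


def looks_like_nucleotide(sequence):
    """Flag a sequence made only of A, C, G, T and N, which is probably DNA."""
    compact = "".join(sequence.split()).upper()
    return bool(compact) and set(compact) <= set("ACGTN")


print(invalid_residues("MKTAYIAKQR"))
print(invalid_residues("MKT1AY?"))
print(looks_like_nucleotide("ATGGCCATTG"))
```

The second check matters because A, C, G and T are also valid amino acid letters, so a nucleotide sequence pasted into the protein page would pass the first check.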


The search databases you will choose from are also different, because they are protein databases as opposed to nucleotide databases.

Choose Protein Database.png

Finally, the algorithm used is slightly different. The possibilities are blastp (protein-protein BLAST), PSI-BLAST (Position-Specific Iterated BLAST), or PHI-BLAST (Pattern Hit Initiated BLAST). The most commonly used is blastp, which simply matches protein sequences to protein sequences. PSI-BLAST lets the user build a PSSM (position-specific scoring matrix) using the results of the first blastp run. PHI-BLAST performs the search but limits alignments to those that match a pattern in the query sequence. The latter two algorithms are more sophisticated functions of this tool. PHI-BLAST in particular can prove very useful when studying families of proteins or conserved motifs.
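PHI-BLAST patterns are written in a PROSITE-style notation, for example [LIVM]-x(2)-G-{P}-x(1,3)-C. To give a feel for what such a pattern matches, the Python sketch below translates a subset of that notation into a regular expression and searches a query with it. The function prosite_to_regex, the pattern, and the sequence are all hypothetical, and the sketch handles only single residues, x, [..] sets, {..} exclusions and (n) or (n,m) repeats; it is not how PHI-BLAST itself is implemented.

```python
import re

# One pattern element: a residue, x, a [set] or an {exclusion}, plus optional repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue in ("x", "X"):
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # any residue except these
        else:
            core = residue                      # single residue or [set]
        if low and high:
            core += "{%s,%s}" % (low, high)     # between low and high repeats
        elif low:
            core += "{%s}" % low                # exactly low repeats
        parts.append(core)
    return "".join(parts)


# Hypothetical pattern and sequence, invented for this example.
regex = prosite_to_regex("[LIVM]-x(2)-G-{P}-x(1,3)-C")
hit = re.search(regex, "AAIKRGTSSCAA")
print(regex)
print(hit.group(), hit.span())
```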



Created by Claudia M. Carcelen, 2009.