Determining Unique and Conserved Proteins: How to Use Katie's Webpage

From GcatWiki

Before we begin, know that your two proteomes must be in FASTA format in order to use our Pairwise Genomic Comparison program. If your sequences are in GenBank format (or another format), visit Claudia's tutorial first to learn how to convert your sequences to FASTA.
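If you're unsure whether a file is already in FASTA format, a quick check along these lines can help. This is only a minimal sketch, not part of the Pairwise Genomic Comparison program: the function name `looks_like_fasta` is made up for illustration, and it tests only the basic FASTA layout (each record is a header line starting with ">" followed by at least one line of sequence).

```python
def looks_like_fasta(text):
    """Return True if the text has the basic FASTA layout.

    Every record must be a '>' header line followed by at least one
    sequence line. Blank lines are ignored.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False  # FASTA must begin with a header line

    previous_was_header = False
    for line in lines:
        is_header = line.startswith(">")
        if is_header and previous_was_header:
            return False  # a header with no sequence under it
        previous_was_header = is_header

    # The final record also needs at least one sequence line.
    return not previous_was_header
```

A GenBank file, which begins with a LOCUS line rather than ">", would fail this check and should be converted first.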

Once you have the two sequences you want to compare in FASTA format, head to the Pairwise Genomic Comparison page.

Comparing Proteomes Online: For Smaller Proteomes
If each of your sequences is fewer than (Note: Add character limit here) characters, you can use our webpage to perform your comparison. If your sequences are larger than our character limit, which is likely, you'll need to download and run the Perl script instead; I'll discuss how to do that later on in this tutorial. If they aren't too long, enter your desired Expect (E) value, also called the threshold. Keep in mind that lower E values are more restrictive: fewer matches will be reported, and fewer of those will be due to chance, so any matches found will be more statistically significant. For our purposes, an E value around 0.0001 or 0.001 should be sufficient to obtain accurate data.
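To see how the E value acts as a cutoff, here is a small illustrative sketch. It assumes a match is kept when its E value is at or below the threshold, which is the usual convention for sequence-similarity searches; how our program applies the threshold internally is not shown here. The function name and the example hits are made up.

```python
def hits_passing_threshold(hits, e_threshold):
    """Keep hits whose E value is at or below the threshold.

    Each hit is a (query_id, subject_id, e_value) tuple.
    """
    return [hit for hit in hits if hit[2] <= e_threshold]


# Made-up hits for illustration only.
example_hits = [
    ("protA", "subj1", 1e-30),  # very strong match
    ("protB", "subj7", 5e-4),   # borderline match
    ("protC", "subj3", 0.02),   # weak match, plausibly due to chance
]

strict = hits_passing_threshold(example_hits, 0.0001)
lenient = hits_passing_threshold(example_hits, 0.001)
```

With these made-up numbers, the stricter threshold of 0.0001 keeps only protA, while 0.001 keeps both protA and protB. Under either threshold protC has no accepted match.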

Next, input the comparison (subject) sequence into the first box. In the second box, paste the proteome for which you want unique and conserved proteins (still in FASTA format). This is your query proteome - the proteome that you are comparing against one or more other proteomes. For example, in the image below, I'm comparing Halorhabdus utahensis to Halomicrobium mukohataei. That is, I want to know the unique proteins in Halomicrobium mukohataei, and I'm using the proteome of Halorhabdus utahensis to determine what they are. Once your page looks something like this, you're ready to click "submit":


The program will provide two Excel files: one with a list of conserved proteins and their gene locations within the proteomes, and one with a list of proteins unique to the query proteome. (Note: Where does the website send the Excel files? Or does it produce a webpage with the data?) It is helpful to compare your proteome of interest with several related proteomes in order to identify which genes are actually conserved and which are potentially unique to your species.
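One way to combine several comparisons is to keep only the proteins that show up as unique every time. The sketch below assumes you have already pulled the protein identifiers out of each 'unique' output file into a plain list; the layout of the Excel files isn't described here, so that step is left to you, and the function name is made up for illustration.

```python
def consistently_unique(unique_lists):
    """Return the protein IDs that appear in every list, sorted.

    unique_lists holds one list of protein IDs per comparison, each
    taken from the 'unique proteins' output for the same query proteome.
    """
    if not unique_lists:
        return []
    shared = set(unique_lists[0])
    for ids in unique_lists[1:]:
        shared &= set(ids)  # keep only IDs that are unique in this run too
    return sorted(shared)
```

A protein that drops out of the result had a match in at least one of the related proteomes, so it is less likely to be truly unique to your species.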

Downloading and Using the Perl Script for Larger Proteomes

If your proteome has more than (Note: add number when Katie's webpage is completed) characters, then you'll have to download and run the program yourself. WARNING: Depending on the size of your proteomes, the program may take up to 48 hours to complete the comparison. At the bottom of the program webpage you'll find a link to the Perl script for the comparison program. Make sure you have access to SubEthaEdit, the text editor this tutorial uses to edit the Perl script and launch it. You can download SubEthaEdit for free if you don't already have it.
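To decide which route applies to a given file, you can count its characters. This is a rough sketch only: the webpage's character limit hasn't been set yet, so `character_limit` is a value you supply yourself, and I'm assuming the limit counts every character in the file (headers and line breaks included), which may not match how the webpage counts.

```python
from pathlib import Path


def comparison_route(fasta_path, character_limit):
    """Suggest 'webpage' or 'perl script' based on file length.

    character_limit is supplied by the caller; the real limit of the
    webpage is not stated in this tutorial.
    """
    character_count = len(Path(fasta_path).read_text())
    if character_count < character_limit:
        return "webpage"
    return "perl script"
```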

To download the Perl script, right-click on "Download the perl script to run on bigger files" either here or on the Proteome Comparison website and choose "Save Link As."

Next, you'll need to make sure you have your two proteomes in separate FASTA (.fasta) files. Once you have those, create a folder and add the Perl script and your two FASTA files to it. This lets the Proteome Comparison program know where to look for the proteome files. For example, I'm going to compare the complete proteomes of Halomicrobium mukohataei and Halorhabdus utahensis. (Both of these FASTA files are available for download on our wiki if you'd like to follow along with this tutorial or perform a test run.) Now, you should have a folder with three items: the Perl script and the two proteomes in FASTA format.
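Before starting a run that may take many hours, it can be worth confirming that all three items really are in the folder. Here is a minimal sketch; the function name is made up, and you pass in the actual filenames of your own script and proteomes.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the expected filenames that are not present in folder."""
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]
```

For instance, calling `missing_files("my_comparison", [script_name, query_fasta, subject_fasta])` with your own three filenames should return an empty list when the folder is set up correctly ("my_comparison" and the three variable names are placeholders).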

The next step involves changing the Perl code so that it calls your two proteomes of interest. Open the Perl script in SubEthaEdit, and the interface should look similar to this:


To start, find the first maroon filename within the code. It should be preset to read "Halomicrobium_mukohataei_DSM_12286.fasta". This is where you'll input the filename of your query proteome - the proteome that you want to obtain unique and conserved proteins for. Either keep the preexisting filename if you're following along with this tutorial, or change it to the filename of your own query proteome:


Now you're ready to input the filename of your comparison (subject) proteome - the proteome that the program uses to determine which proteins in your query proteome are unique and which are conserved between the two species. In the image below, you can see how I've tweaked the code and input the correct filename:


Almost done! The next two maroon filenames in the code are the names the program will give the output files. The first is the output filename for the query proteome's unique proteins, and the second is the output filename for its conserved proteins. You may change these names to whatever you'd like, though be sure to retain 'unique' and 'conserved' within the filenames so that you'll be able to distinguish between the output files after the program has completed. For instance:


One more thing - under Mode, click on chmod ug+x. This makes the script executable. You're finally ready to run the program! Go to Mode and click Run in Terminal. Leave the terminal window open until the program completes the comparison. When it finishes, the program will have added two Excel files to your folder under the filenames that you assigned in the Perl script.
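If you're curious what the chmod ug+x step does, the sketch below reproduces the effect of the standard Unix command of that name: it adds execute permission for the file's owner (u) and group (g) and leaves every other permission alone. I'm assuming the menu item simply runs that command on the open file; the function name is made up for illustration.

```python
import os
import stat


def add_user_group_execute(path):
    """Add execute permission for owner and group, like `chmod ug+x`."""
    current_mode = os.stat(path).st_mode
    # OR-ing in the two execute bits keeps all existing permissions.
    os.chmod(path, current_mode | stat.S_IXUSR | stat.S_IXGRP)
```

A script that starts out readable and writable by its owner and readable by everyone else (rw-r--r--) would end up as rwxr-xr-- after this step.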

If you're just looking to compare two proteomes, then you're finished. Naturally, the data in the Excel files (particularly in the 'unique proteins' file) will be more meaningful if you can compare your query proteome to multiple other species' proteomes. Just be sure to always use your species of interest as the query proteome. Next, go to Karen's tutorial, which will show you how to interpret the data in the Excel files.