4) Compare genes with an RBS upstream of the gene
For this problem, we would need to compare annotations from all three sites. The idea comes from the Bakke et al. paper that the last genomics class produced. In that study, Bakke et al. determined a consensus ribosome binding site (RBS) for their species, then searched for genes with this consensus RBS upstream, allowing up to 2 bases to differ from the consensus sequence. They found that just under 50% of the genes called from all three annotation sites had an RBS upstream of the gene. When all three sites were compared, "only 47.7% of the predicted protein-coding regions were identical in all three annotations" (Bakke et al. 2009). It is interesting to me that these two statistics are so similar. It makes me wonder whether the genes that are identical in all three annotations are the same genes that have an RBS upstream of them. It would not take very long to check whether the two statistics describe the same set of genes.
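To make the search described above concrete, here is a minimal Python sketch of scanning the region upstream of a start codon for a site within 2 mismatches of a consensus. It is not the method Bakke et al. used. The consensus "AGGAGG", the 20-base window, and the function names are all placeholders I am assuming for illustration, and the sketch only handles genes on the forward strand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_rbs(genome, gene_start, consensus="AGGAGG", window=20, max_mismatches=2):
    """Return True if the bases just upstream of gene_start contain a site
    within max_mismatches of the consensus.

    genome      -- forward-strand sequence as a string
    gene_start  -- 0-based index of the first base of the start codon
    consensus   -- placeholder consensus; substitute the one found for the species
    window      -- how many upstream bases to scan (an assumed value)
    """
    upstream = genome[max(0, gene_start - window):gene_start].upper()
    k = len(consensus)
    # Slide a k-base window across the upstream region and test each position.
    return any(
        hamming(upstream[i:i + k], consensus) <= max_mismatches
        for i in range(len(upstream) - k + 1)
    )
```

Calling has_rbs once per gene start in an annotation and counting the True results would give the percentage of genes with an RBS for that annotation. Genes on the reverse strand would need the reverse complement of the region on their other side, which this sketch leaves out.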
Since we have already determined the RBS (Shine-Dalgarno sequence), we simply need to determine what percentage of genes have this consensus sequence upstream of the start codon. I would hope that there are programs available that make this quick and easy. Once these percentages have been determined for each annotation service, we need to determine which genes are identical across all three annotations. The Bakke et al. group did this last year, so a protocol for this comparison should already exist.
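The comparison itself could be sketched with Python sets, as below. This assumes each gene call is reduced to a (start, stop, strand) tuple and that "identical" means all three coordinates match exactly. The site names, the toy coordinates, and the choice of the union of all gene calls as the denominator are invented for illustration and are not taken from the Bakke et al. protocol.

```python
def identical_in_all(*annotations):
    """Genes whose (start, stop, strand) coordinates appear in every annotation."""
    return set.intersection(*(set(a) for a in annotations))


def percent(part, whole):
    """Size of part as a percentage of whole (0.0 if whole is empty)."""
    return 100.0 * len(part) / len(whole) if whole else 0.0


def compare_sets(identical, with_rbs):
    """Split genes by whether they are identical, have an RBS, or both."""
    return {
        "both": identical & with_rbs,
        "identical_only": identical - with_rbs,
        "rbs_only": with_rbs - identical,
    }


# Invented toy data: three hypothetical annotations of the same genome.
site_a = {(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")}
site_b = {(100, 400, "+"), (500, 900, "-"), (1010, 1300, "+")}
site_c = {(100, 400, "+"), (520, 900, "-"), (1000, 1300, "+")}
all_genes = site_a | site_b | site_c  # assumed denominator: every distinct call

# Hypothetical result of an RBS search over the same gene calls.
with_rbs = {(100, 400, "+"), (1000, 1300, "+")}

identical = identical_in_all(site_a, site_b, site_c)
result = compare_sets(identical, with_rbs)
print(percent(identical, all_genes), len(result["both"]))
```

If the "both" set turned out to hold nearly all of the identical genes, that would suggest the two statistics describe largely the same genes. Large "identical_only" or "rbs_only" sets would suggest the similarity is a coincidence.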
I think this would be an interesting question to answer, but it will certainly not take all semester. Therefore, I do not recommend that this become our main focus, but it could be an interesting side project if time allows.