Beta-galactosidase (Olivia Ho-Shing)
I chose a well-known predicted gene involved in sugar metabolism for H. mukohataei: 644033004, beta-galactosidase/beta-glucuronidase (EC:126.96.36.199).
To verify this predicted protein, I used:
- A search for a Shine-Dalgarno sequence within 50 bp upstream
- GC Calculator
Usually JGI highlights the start and stop codons in red, and any upstream or downstream sequence in green. However, with this nucleotide sequence, there was no start codon highlighted. The first codon of the sequence was TTG.
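The start-codon check described above can be sketched in a few lines. This is only an illustration: the set of accepted start codons is my assumption (ATG plus the two alternatives most often used in prokaryotic gene prediction), and the function names are hypothetical rather than anything JGI provides.

```python
# Assumed set of start codons: ATG is the canonical one; GTG and TTG are
# alternative starts commonly accepted for bacterial and archaeal genes.
START_CODONS = {"ATG", "GTG", "TTG"}


def first_codon(nucleotides):
    """Return the first three bases of a sequence, upper-cased."""
    return nucleotides[:3].upper()


def is_possible_start(nucleotides):
    """True if the sequence begins with a canonical or alternative start codon."""
    return first_codon(nucleotides) in START_CODONS


# The first two codons of the predicted gene, as reported on this page.
print(first_codon("TTGAAC"), is_possible_start("TTGAAC"))
```

Under this assumption, a sequence starting with TTG would still count as having a plausible start, even though it is not the canonical ATG.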
Here is the distribution and the alignment of the BLAST hits:
The first relevant BLAST hit I got from the sequence was Synthetic construct beta-galactosidase (lacZnls12co) gene, complete cds
Query coverage = 44%; Score = 206 bits (228); E-value = 4e-49; Identities = 658/1008 (65%); Gaps = 73/1008 (7%); Strand = Plus/Plus
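As a quick sanity check on the numbers above, the identity and gap percentages can be recomputed from the raw counts. This is just the arithmetic, using the counts reported for this hit; the helper name is made up for illustration.

```python
def percent(part, total):
    """Return part/total expressed as a percentage."""
    return 100.0 * part / total


# Counts taken from the BLAST hit reported above.
identities, gaps, aligned_length = 658, 73, 1008

print(f"Identity: {percent(identities, aligned_length):.1f}%")
print(f"Gaps: {percent(gaps, aligned_length):.1f}%")
```

Both values agree with the whole-number percentages BLAST shows (65% and 7%).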
These BLAST hits weren't as well aligned as I thought they would be for this protein, and I was surprised that there didn't seem to be a definitive start codon. The beginning of the query sequence did not align with the beginning of the hit described above either, but this could just mean that the sequence is not well conserved at the 5' and 3' ends.
Although the nucleotide sequence given by JGI did not begin with a definitive start codon, the amino acid sequence still began with M, so JGI must assign M as the initiating amino acid regardless of the actual codon. (TTG is a known alternative start codon in prokaryotes and is read as methionine when it initiates translation, which would be consistent with this.) The second codon is AAC, which JGI calls N as expected. The BLASTp hits aligned with the query very well. The first amino acid in the hits did not match the M, though; at least the top 10 hits began with L (TTG).
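To illustrate why the same TTG codon can show up as M in one protein sequence and L in another, here is a minimal translation sketch. It assumes the standard codon table and an assumed rule that an initiating ATG, GTG, or TTG is written as M; the `translate` function and its `initiator` flag are hypothetical, not how JGI or BLAST actually implement translation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumption: these codons are written as M when they start a gene.
START_CODONS = {"ATG", "GTG", "TTG"}


def translate(nucleotides, initiator=True):
    """Translate a coding sequence; optionally write a start codon as M."""
    seq = nucleotides.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    protein = [CODON_TABLE[c] for c in codons]
    if initiator and codons and codons[0] in START_CODONS:
        protein[0] = "M"
    return "".join(protein)


print(translate("TTGAAC"))                   # first codon treated as a start
print(translate("TTGAAC", initiator=False))  # first codon read as an internal codon
```

With the initiator rule on, the first two codons of this gene come out as MN; with it off, TTG is read as an ordinary leucine codon and the result is LN, which matches the L seen at the start of the top hits.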
I think the BLASTp results are a strong indication that this gene is probably beta-galactosidase.
Shine-Dalgarno and GC content
The 50 bp upstream of the gene sequence are: AGCAATCTGCACGCCGGGACAGCGTGACTGCCTCGCCGTGGGTTCGGCGA
I could not identify anything that looked like a Shine-Dalgarno sequence. Overall, it is not a very purine-rich sequence. This may mean that the gene is part of an operon and does not have a Shine-Dalgarno sequence directly upstream.
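The search by eye above could be approximated with a simple scan. The sketch below slides a consensus motif along the 50 bp upstream region and reports the best-matching window, plus the overall purine fraction. The motif AGGAGG is an assumption (the textbook bacterial consensus; the actual motif for this organism may differ), and both function names are hypothetical.

```python
# The 50 bp upstream of the gene, as given above.
UPSTREAM = "AGCAATCTGCACGCCGGGACAGCGTGACTGCCTCGCCGTGGGTTCGGCGA"


def best_sd_match(upstream, motif="AGGAGG"):
    """Return (position, matches) for the window most similar to the motif."""
    upstream = upstream.upper()
    best_pos, best_score = -1, -1
    for i in range(len(upstream) - len(motif) + 1):
        window = upstream[i:i + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if score > best_score:  # keep the first window with the highest score
            best_pos, best_score = i, score
    return best_pos, best_score


def purine_fraction(seq):
    """Fraction of bases that are purines (A or G)."""
    seq = seq.upper()
    return sum(base in "AG" for base in seq) / len(seq)


pos, score = best_sd_match(UPSTREAM)
print(f"Best window starts at {pos} with {score}/6 matches")
print(f"Purine fraction: {purine_fraction(UPSTREAM):.2f}")
```

A perfect 6/6 match would be a strong candidate. No window in this upstream region reaches that, and about half of the bases are purines, which fits the observation that the region is not especially purine-rich.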
I used the GC calculator to check the GC content of this sequence: it is 67.7%. The average GC content of our genome is 65.64%, but I don't know what the typical GC content of a coding region in our genome is. So I can only conclude that the gene doesn't appear to be an alien gene in our genome.
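For reference, the GC calculation itself is simple enough to sketch. The full gene sequence is not reproduced on this page, so the example below uses the 50 bp upstream region as a stand-in input; the function name is hypothetical, and only the 65.64% genome average comes from the text above.

```python
# Genome-wide average GC content quoted above, in percent.
GENOME_AVG_GC = 65.64

# Stand-in input: the 50 bp upstream region (the gene itself is not shown here).
UPSTREAM = "AGCAATCTGCACGCCGGGACAGCGTGACTGCCTCGCCGTGGGTTCGGCGA"


def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


gc = gc_content(UPSTREAM)
print(f"GC content: {gc:.1f}% (difference from genome average: {gc - GENOME_AVG_GC:+.2f})")
```

Running the same function on the gene's nucleotide sequence should reproduce the value from the GC calculator, assuming the calculator counts G and C over all bases in the same way.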