Gene Annotation Template
Module Priority List
-Alternative ORFs
-Basic Info
-Sequence similarity
-Cell Localization
-Duplication and Degradation
-Structure-Based Evidence of Function
-Horizontal Gene Transfer
-Pathways
Gene Annotation Log - Template
Basic Information:
DNA Coordinates:
DNA Sequence (FASTA format):
Protein Sequence (FASTA format):
Isoelectric Point:
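For the two FASTA fields above, a minimal sketch of how a raw sequence could be wrapped into FASTA format. The function name, the 60-character line width, and the example header are assumptions for illustration, not JGI requirements.

```python
def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record: a '>' header line, then wrapped sequence lines."""
    # Remove any whitespace so pasted, line-broken sequences are handled too.
    sequence = "".join(sequence.split())
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines)


if __name__ == "__main__":
    # "example_gene" is a hypothetical identifier.
    print(to_fasta("example_gene", "ATGC" * 20))
```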
Similarity Data (Sequence-Based):
BLAST Data:
- Gene Product Name:
A protein BLAST is generally better than a nucleotide BLAST, since protein-level comparison detects more distant homologs.
- Top hit – organism:
- Length, Score, E-value, Identity, Positives and Gaps
NCBI Statistics
- Alignment of Top Hit and Query Sequence
Alignment Scoring
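The Identity and Gaps values reported by BLAST can be checked by hand from the alignment itself. The sketch below assumes the two aligned sequences are given as equal-length strings with "-" marking gaps; the function name is our own. Positives are not computed here because they depend on the scoring matrix used in the search.

```python
def alignment_stats(query_aln: str, hit_aln: str) -> dict:
    """Count identities and gap columns in a pairwise alignment of two gapped strings."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have the same length")
    length = len(query_aln)
    # An identity is a column where both residues match and neither is a gap.
    identities = sum(1 for q, h in zip(query_aln, hit_aln) if q == h and q != "-")
    # A gap column has a gap character in either sequence.
    gaps = sum(1 for q, h in zip(query_aln, hit_aln) if q == "-" or h == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "identity_pct": round(100 * identities / length, 1) if length else 0.0,
        "gap_pct": round(100 * gaps / length, 1) if length else 0.0,
    }


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(alignment_stats("MKT-LLV", "MKSALLV"))
```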
CDD: Conserved Domains Database
A protein sequence must be entered to get a result.
- Significant COG Hits:
Definition of COG
- Names of COGs:
- Score:
- E-value:
CDD website
PDB: Protein Data Bank
- Significant Structure Hits:
This database provides information about protein structures and also performs a BLAST alignment against them.
A protein sequence must be entered to get a result.
o Length
o Score
o E-value
o Identities
o Positives
o Gaps
o Alignment
PDB website
T-Coffee:
- Multi-Sequence Alignment
T-coffee Website
This is a useful tool, but it is confusing to use.
Cellular Localization Data:
TMHMM:
http://www.cbs.dtu.dk/services/TMHMM-2.0/
This tool predicts the number of transmembrane helices in a protein.
- Number of Predicted TMHs
- Transmembrane Topology graph and comment
SignalP:
http://www.cbs.dtu.dk/services/SignalP/
This tool predicts whether or not a protein contains a signal peptide, and where its cleavage site is.
- Signal Peptide Probability
- Signal Peptide Graph
PSORT:
http://psort.ims.u-tokyo.ac.jp/form.html
This tool predicts protein localization sites.
- Cytoplasmic Score:
- Cytoplasmic Membrane Score:
- Periplasmic Score:
- Outer Membrane Score:
- Extracellular Score:
- Final Prediction for Protein Location (of the above listed):
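One way to turn the five scores above into a final prediction is to take the highest-scoring site and report "Unknown" when no score is convincing. This is only a sketch: the function name is ours and the 7.5 cutoff is an assumption, so the final prediction printed in the report itself should take precedence.

```python
from typing import Mapping


def final_localization(scores: Mapping[str, float], cutoff: float = 7.5) -> str:
    """Return the highest-scoring site, or 'Unknown' if it falls below the cutoff."""
    best_site = max(scores, key=lambda site: scores[site])
    # The cutoff is an assumed value; adjust it to match the report being read.
    return best_site if scores[best_site] >= cutoff else "Unknown"


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = {
        "Cytoplasmic": 0.2,
        "Cytoplasmic Membrane": 9.5,
        "Periplasmic": 0.1,
        "Outer Membrane": 0.1,
        "Extracellular": 0.1,
    }
    print(final_localization(example))
```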
Phobius:
http://phobius.sbc.su.se/
This tool lists the locations of the predicted transmembrane helices and intervening loop regions.
Note: If the report labels the whole protein as non-cytoplasmic or cytoplasmic, this only means that no transmembrane helices are predicted. It should not be used as a predictor of location.
- Enter Graph:
Final Hypothesis: Where do you expect to find this protein?
Alternative Open Reading Frames:
The alternative ORFs page is available via the JGI gene homepage and can be accessed by clicking on “Gene details”. This function allows you to examine the codons around the proposed start codon, possibly locating a more appropriate or likely place for the gene to start.
Proposed DNA Coordinates (as opposed to original/computer-allocated reading frame):
Reasoning:
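The idea of examining the codons around the proposed start can be sketched in code. The sketch assumes the DNA is given as the coding strand, that called_start is the 0-based position of the proposed start codon, and that ATG, GTG and TTG are the start codons of interest; the function name and the 30-codon downstream window are our own choices, not the JGI tool's behavior.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def alternative_starts(dna: str, called_start: int, window: int = 30) -> list:
    """Return 0-based positions of in-frame start codons near the called start."""
    dna = dna.upper()
    candidates = []
    # Walk upstream one codon at a time; an in-frame stop codon ends the search.
    pos = called_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    # Walk downstream for up to `window` codons, stopping at a stop codon or the sequence end.
    for i in range(1, window + 1):
        pos = called_start + 3 * i
        codon = dna[pos:pos + 3]
        if len(codon) < 3 or codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
    return sorted(candidates)


if __name__ == "__main__":
    # Made-up sequence; the called start is the ATG at position 9.
    print(alternative_starts("TAAGTGCCCATGAAATTGCCCTAA", 9))
```

Each position returned is a candidate for the Proposed DNA Coordinates field; the Reasoning field should still explain why one candidate is preferred.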
Structure-Based Evidence of Function:
Pfam-A:
The Pfam database is a collection of protein domain families. A protein domain is a functional region of a protein; a protein's domains largely define what that protein is and what it does in the cell. Each Pfam is a collection of multiple sequences that, within a certain range of variation, includes domain sequences and the whole protein sequences defined by those domains. The Hidden Markov Model (HMM) that Pfam uses to group these sequences can recognize variations in the amino acid sequences in question, which allows for a much wider recognition of possible proteins.
- Significant Matches:
- Pfam Name:
- Pairwise Alignment:
- HMM logo:
- Key Functional Residues:
Resources on JGI:
The JGI "Course Materials" page contains a list of Pfams for transcription factors and transporters.
PDB:
The Protein Data Bank contains detailed information about, and illustrations of, the physical 3-D structures of proteins. Searches can be based on protein identifiers from multiple databases, structure, articles in which a protein is mentioned, DNA sequence, and other criteria.
This module is similar to the Sequence-Based Similarity module.
- Significant Structure Hits:
o Length
o Score
o E-value
o Identities
o Positives
o Gaps
- Alignment:
Pathways:
KEGG:
This website has two tools:
- KEGG Pathway is a collection of pathway maps representing the molecular interaction and reaction networks for:
o Metabolism
o Genetic Information Processing
o Environmental Information Processing
o Cellular Processes
o Human Diseases
- KEGG Module is a collection of pathway modules, molecular complexes, and other functional units.
JGI wants us to use the website to enter a (pathway) map for each gene.
EcoCyc:
This is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. It can be used as a reference source that we can relate our findings to.
- In the search bar you can enter:
- The name of a compound, gene, protein or pathway.
Examples: pyruvate, trpA
- Any substring of one of the above names that is 3 or more characters in length.
Examples: kinase, pyr
- An EC number (full or partial).
Examples: 1.2.3.3, 1.3.99
- An identifier from some external database to which they maintain links, e.g. UniProt.
(JGI didn't seem to be one of these databases)
Examples: P00561, NP_414543, C00047
- The internal object identifier for any compound, gene, protein, pathway, reaction, etc.
Examples: CPLX0-3661, HEMN-RXN
JGI wants us to use the website to enter the pathway for each gene and provide the E.C. number.
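Before recording an E.C. number in the log, it can help to check that it has the expected dotted-number shape. The sketch below treats four numeric levels as a full number and two or three as a partial one, matching the examples above; the function name is ours, and special forms such as dashes or preliminary "n" numbers are deliberately not handled.

```python
import re

# Two to four dot-separated groups of digits, e.g. "1.3.99" or "1.2.3.3".
EC_PATTERN = re.compile(r"^\d+(\.\d+){1,3}$")


def classify_ec(query: str):
    """Return 'full', 'partial', or None if the text does not look like an EC number."""
    query = query.strip()
    if not EC_PATTERN.match(query):
        return None
    levels = query.count(".") + 1
    return "full" if levels == 4 else "partial"


if __name__ == "__main__":
    for text in ["1.2.3.3", "1.3.99", "P00561", "pyr"]:
        print(text, classify_ec(text))
```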
Duplication and Degradation:
Duplication:
Paralogs are homologous genes within a single species that arose by gene duplication. Through analysis of paralogs, we can determine which genes may have duplicates in our genome.
You can search for paralogs of an individual gene:
Scroll to the bottom of the Gene Detail page.
Under "Homolog Display", you will find a "Homolog selection" drop-down box.
Select "Paralogs / Orthologs."
JGI requests certain information about the top paralog hit:
- Gene Object ID
- Length (bp)
- Score
- E-value
- Identity
- Positives
- Gaps
- Alignment of Top Hit and Query Sequence:
***Alignment Instructions***
Other possible information:
- Number of paralogs above a certain Bit Score.
- How could we measure Degradation?
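Counting paralogs above a chosen bit score is simple once the hit table has been copied out. This sketch assumes each hit is a dictionary with a "bit_score" key; that layout, the gene IDs and the threshold are all hypothetical and would need to match however the JGI results are exported.

```python
def count_paralogs_above(hits: list, min_bit_score: float) -> int:
    """Count hits whose bit score is at or above the chosen threshold."""
    return sum(1 for hit in hits if hit["bit_score"] >= min_bit_score)


if __name__ == "__main__":
    # Hypothetical paralog hits, for illustration only.
    example_hits = [
        {"gene_id": "gene_A", "bit_score": 250.0},
        {"gene_id": "gene_B", "bit_score": 88.5},
        {"gene_id": "gene_C", "bit_score": 41.2},
    ]
    print(count_paralogs_above(example_hits, 50.0))
```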
Evidence of Horizontal Gene Transfer:
Evidence of Horizontal Gene Transfer (Module Eight) Instructions
Phylogenetic Tree Diagram:
Gene Context:
- Ortholog Neighborhood Region of Organism:
- Examples of similarities or Differences:
- Comment:
Chromosome Viewer GC Heat Map:
- Characteristic GC% of genome:
- Average GC% of gene:
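Both GC% fields can be cross-checked directly from sequence. This is a minimal sketch, assuming the sequences are plain strings of bases; the function name is ours and ambiguous bases such as N are left out of the total. A gene whose GC% differs clearly from the genome's characteristic value can be one hint of horizontal transfer, though it is not proof on its own.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G and C among the A, C, G and T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Avoid dividing by zero for empty or fully ambiguous sequences.
    return 100.0 * gc / total if total else 0.0


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    gene = "ATGCGCGGCTAA"
    genome = "ATATATGCGCATATTTAAGC"
    print(round(gc_percent(gene), 1), round(gc_percent(genome), 1))
```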
RNA (Rfam):
RNA Family:
Bits Score: