Gene Annotation Template
Gene Annotation Log - Template
Basic Information:
DNA Coordinates:
DNA Sequence (FASTA format):
Protein Sequence (FASTA format):
Isoelectric Point:
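The two FASTA fields above expect a ">" header line followed by the sequence. A minimal sketch of producing that layout is below. The function name `to_fasta`, the 60-column wrap, and the example header and sequence are all assumptions for illustration, not part of the template.

```python
def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record: a '>' header line, then the sequence
    wrapped to `width` columns (60 is an assumed, commonly used width)."""
    # Remove spaces and line breaks pasted in with the sequence, then upper-case.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + header + "\n" + "\n".join(lines)


# Hypothetical header and sequence, for illustration only.
record = to_fasta("example_gene hypothetical protein", "atggctaaa ggtacc")
print(record)
```

The same function works for the protein field, since it only wraps and upper-cases the letters it is given.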
Similarity Data (Sequence-Based):
BLAST Data:
- Gene Product Name:
Note: BLAST returned "No Significant Similarity" for a couple of the proteins with predicted functions that I tried.
- Top hit – organism:
- Length, Score, E-value, Identity, Positives and Gaps
NCBI Statistics
- Alignment of Top Hit and Query Sequence
Alignment Scoring
CDD: Conserved Domains Database
- Significant COG Hits:
Definition of COG
- Names of COGs:
- Score:
- E-value:
CDD website
PDB: Protein Data Bank
- Significant Structure Hits:
This database provides information about protein structures and also performs a BLAST alignment against them. You must enter the protein sequence to get a result.
o Length
o Score
o E-value
o Identities
o Positives
o Gaps
o Alignment
PDB website
T-Coffee:
- Multi-Sequence Alignment
T-coffee Website
This is a useful tool, but it is confusing to use.
Cellular Localization Data:
TMHMM:
http://www.cbs.dtu.dk/services/TMHMM-2.0/
This tool predicts the number of transmembrane helices (TMHs) in a protein.
- Number of Predicted TMHs
- Transmembrane Topology graph and comment
SignalP:
http://www.cbs.dtu.dk/services/SignalP/
This tool predicts whether a protein has an N-terminal signal peptide and, if so, where its cleavage site is located.
- Signal Peptide Probability
- Signal Peptide Graph
PSORT:
http://psort.ims.u-tokyo.ac.jp/form.html
This tool predicts protein localization sites.
- Cytoplasmic Score:
- Cytoplasmic Membrane Score:
- Periplasmic Score:
- Outer Membrane Score:
- Extracellular Score:
- Final Prediction for Protein Location (of the above listed):
Phobius:
http://phobius.sbc.su.se/
This tool lists the locations of the predicted transmembrane helices and intervening loop regions.
Note: If the report labels the protein as cytoplasmic or non-cytoplasmic, this only means that no transmembrane helices were predicted. It should not be used as a predictor of location.
- Enter Graph:
Final Hypothesis: Where do you expect to find this protein?
Alternative Open Reading Frames:
Proposed DNA Coordinates:
Reasoning:
Structure-Based Evidence of Function:
Pfam-A:
- Significant Matches:
- Pfam Name:
- Pairwise Alignment:
- HMM logo:
- Key Functional Residues:
PDB:
- Significant Structure Hits:
o Length
o Score
o E-value
o Identities
o Positives
o Gaps
- Alignment:
Pathways:
KEGG:
This website has two tools:
- KEGG Pathway is a collection of pathway maps that represent the molecular interaction and reaction networks for:
1. Metabolism
2. Genetic Information Processing
3. Environmental Information Processing
4. Cellular Processes
5. Human Diseases
- KEGG Module is a collection of pathway modules, molecular complexes, and other functional units.
EcoCyc:
This is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. It can serve as a reference against which we compare our findings.
E.C. Number:
Duplication and Degradation:
Duplication:
Paralogs are homologous genes within a single species that arose by gene duplication. Through analysis of paralogs, we can determine which genes may have been duplicated.
You can search for paralogs of an individual gene:
Scroll to the bottom of the Gene Detail page.
Under "Homolog Display", you will find a "Homolog selection" drop-down box.
Select "Paralogs / Orthologs."
JGI requests certain information about the top paralog hit:
- Gene Object ID
- Length (bp)
- Score
- E-value
- Identity
- Positives
- Gaps
- Alignment of Top Hit and Query Sequence:
Alignment Instructions
Other possible information:
- Number of paralogs above a certain Bit Score.
- How could we measure Degradation?
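Counting the paralogs above a chosen bit score can be sketched as follows. The function name, the cutoff of 100, and the listed scores are hypothetical values for illustration; the real scores would be copied from the paralog hit table.

```python
def count_hits_above(bit_scores, threshold):
    """Count hits whose bit score is strictly greater than `threshold`."""
    return sum(1 for score in bit_scores if score > threshold)


# Hypothetical bit scores for the paralog hits of one gene.
paralog_bit_scores = [312.0, 145.5, 88.2, 41.0]

# The cutoff is an assumption; choose whatever value the annotation calls for.
n_above = count_hits_above(paralog_bit_scores, 100.0)
print(n_above)
```

A hit exactly at the cutoff is not counted here; change `>` to `>=` if the cutoff should be inclusive.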
Evidence of Horizontal Gene Transfer:
Phylogenetic Tree Diagram:
Gene Context:
- Ortholog Neighborhood Region of Organism:
- Examples of Similarities or Differences:
- Comment:
Chromosome Viewer GC Heat Map:
- Characteristic GC% of genome:
- Average GC% of gene:
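If the gene's GC% needs to be checked by hand, it is the share of G and C bases in the DNA sequence. The sketch below assumes a plain DNA string as input; the function name, the example sequence, and the genome value are made up for illustration. Ambiguous bases such as N are counted in the length but not as G or C, which is an assumption of this sketch.

```python
def gc_percent(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    # Remove spaces and line breaks, then upper-case so 'g' and 'G' both count.
    seq = "".join(dna.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# Hypothetical gene sequence and genome-wide GC%, for illustration only.
gene_gc = gc_percent("ATGGCGCCGTTAGC")
genome_gc = 50.0
print(round(gene_gc, 1), round(gene_gc - genome_gc, 1))
```

A gene whose GC% differs noticeably from the genome's characteristic value may be worth a closer look as a horizontal gene transfer candidate, though this comparison alone is not conclusive.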
RNA (Rfam):
RNA Family:
Bit Score: