JP Jan 21 16


Julia Preziosi

Looking at reports downloaded 1-19-16

File names: split_1no_i.fastq, etc. no = not fed; i = intestine.

Left: green check = good; orange ! = suspect; red X = something wrong.


Per base sequence quality: 40 = top score for each base. Unsure bases get lower scores. >= 20 is good.
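To make the quality scale concrete, here is a minimal sketch of how a Phred score maps to the chance that a base call is wrong (Q = -10 log10 P). It assumes the fastq files use the Phred+33 character encoding; that offset is an assumption and should be checked against the actual files. The function names are made up for illustration.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def ascii_to_phred(ch, offset=33):
    """Convert one fastq quality character to its Phred score.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return ord(ch) - offset


# Show what the cutoffs mentioned in these notes mean as error rates.
for q in (15, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

On this scale a score of 20 means about 1 wrong call in 100, and 40 means about 1 in 10,000.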


Per tile sequence quality: The cDNA was all sequenced at once. Tiles are the positions on the chip where sequences were read.


Per sequence quality scores: A few reads scored below 30, but most sequences had quality around 38.


Per base sequence content: The first few bases are the bar codes. Each set of sequences has its own bar code (ex 1_i = AGG, 2_i = CGG). The bar codes aren't part of the RNA sequence; they need to be removed before any analysis. Trim off the first 4 bases; any bases scored below 15 will be thrown away; sequences with fewer than 30 bp left are thrown out of the analysis.
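The trimming rule above can be sketched roughly as follows. This is only one reading of the rule, not the pipeline actually used: it assumes Phred+33 quality encoding and assumes that "thrown away" means low-quality bases are trimmed from the 3' end of the read. The function name and parameter names are hypothetical.

```python
def trim_read(seq, qual, barcode_len=4, min_q=15, min_len=30, offset=33):
    """Return (seq, qual) after trimming, or None if the read is too short.

    Assumptions: Phred+33 encoding (offset=33) and 3'-end quality trimming.
    """
    # Drop the bar code bases at the start of the read.
    seq, qual = seq[barcode_len:], qual[barcode_len:]

    # Walk back from the 3' end while bases score below min_q.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    seq, qual = seq[:end], qual[:end]

    # Discard reads that are too short after trimming.
    if len(seq) < min_len:
        return None
    return seq, qual


# Toy read: 4 bar code bases, 40 more bases, the last two with low quality.
read_seq = "AGGT" + "ACGT" * 10
read_qual = "I" * 42 + "##"
print(trim_read(read_seq, read_qual))
```

A real run would more likely use an existing trimming tool; the sketch is only meant to show the order of the three steps.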


Per sequence GC content: The intestine samples all closely match the theoretical distribution. Liver has multiple peaks in its distribution. Maybe this is biological and not a data error.
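If we want to look at the liver peaks ourselves, a per-read GC fraction is easy to compute. A minimal sketch, with a made-up function name; ignoring N calls in the denominator is a choice made here, not necessarily how the report computes it.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) in a read that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read with no called bases
    return (seq.count("G") + seq.count("C")) / called


print(gc_fraction("ATGC"))
```

Binning these values over all reads in a sample would give a distribution to compare between the intestine and liver files.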


Per base N content: Almost 0 N calls; the program was able to determine the bases.


Sequence Length Distribution: About 76 bp


Sequence Duplication Levels: Unclear how to interpret this. Deduplicated sequence? Almost all the reads are single copy (1 - 85). Might change when we delete the bar codes.


Overrepresented sequences: Some samples have more than others.

  • We want to BLAST them ourselves


Kmer Content: Repeat units show up at very early positions; these could go to zero when the bar codes are eliminated.


The Burmese python genome reveals the molecular basis for extreme adaptation in snakes

Figure 1

A: how the wet mass increases/decreases over days after feeding.

B: Clusters of genes whose transcription levels followed the same general expression pattern (ex 3,393 genes in the heart increased expression by 1 day post feeding, then went back to basal levels by 4 days).

C: Every row is a different gene, followed at 0, 1, and 4 days post feeding. r1, r2, r3 indicate replicates (n = 3; a different snake per column). High expression levels are a darker color. The 300 genes on this list are those that are significantly differentially expressed. Based on color, they're clustered by expression change across all the samples.

  • We would love to know the names of these genes.
  • What's the gene that turned on first and then turned on all the other genes? A catalyst gene.

F: gene expression in categories determined by GO. Liver had more differentially expressed genes than the intestine. DNA replication genes differentially expressed; not just matching apoptosis?

  • Look for the known transcription factors which lead to expression of the genes.


Figure 2: A: Liver again has the most differentially expressed genes; 7024 in the liver by itself. Small intestine and liver have overlapping genes. B: gives us some gene names!!! The ones differentially expressed in all four tissues.

  • is chromatin in DNA replication?


Figure 3: C: GO, MKO (mouse knockout: genes deleted to see the effects). We are not so interested in evolutionary trends.


"The annotated genome assembly is available under the National Center for Biotechnology Information Bioproject PRJNA61234 (GenBank accession no. AEQU00000000)" (6).


Supplementary Information


"The final assembly resulted in 759,403 contigs, with a contig N50 size of 10,203 bp, and a total contig length of 1.4440 Gbp. The scaffold N50 for this assembly was 201,400 bp" (2)

'The N50 length is defined as the length for which the collection of all contigs of that length or longer contains at least half of the sum of the lengths of all contigs, and for which the collection of all contigs of that length or shorter also contains at least half of the sum of the lengths of all contigs.' (Google search)
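A small sketch of the N50 idea, using the common shortcut: sort the contigs from longest to shortest and add them up until the running total reaches half of the total length. The function name and the toy lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to half of the total.
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy example: total length is 32, and the two 8s already cover half of it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

So a contig N50 of 10,203 bp means half of the assembled bases sit in contigs at least that long.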

Scaffold - contigs put together in order, with gaps of known length between them.

"25,385 genes" (3).

"Total RNA was extracted using Trizol Reagent (Invitrogen), following the manufacturer’s protocol. Illumina mRNAseq barcoded libraries were constructed with the Illumina TruSeq RNAseq kit and protocol. Total RNA and mRNA was quality-checked using a BioAnalyzer RNA 6000 pico chip (Agilent). Completed libraries were quantified and checked for appropriate size distribution using the DNA 7500 nano chip on a BioAnalyzer (Agilent)" (9).

"an R-script was used to filter and count the number of genes that were significantly differentially expressed in individual tissues or in all tissues in multi-tissue comparisons" (10).

Sup Table S10: it would be interesting to see how our numbers of differentially expressed genes shared between samples compare to theirs. Since we have a shorter time frame (0-0.5 hrs instead of 0-24 hrs), we will probably have fewer genes.
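Once we have our own lists of differentially expressed genes, counting the ones shared between samples is a set intersection. A minimal sketch; the sample names and gene names below are placeholders, not real results.

```python
def shared_genes(de_by_sample):
    """Genes that appear in every sample's differentially expressed list."""
    sets = [set(genes) for genes in de_by_sample.values()]
    return set.intersection(*sets) if sets else set()


# Placeholder data only; these are not real gene lists.
example = {
    "intestine": ["geneA", "geneB", "geneC"],
    "liver": ["geneB", "geneC", "geneD"],
}
print(len(shared_genes(example)))
```

The same function works for any number of samples, which matches the multi-tissue comparisons in the paper.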