FastQC Reports: how to read them

Illumina sequencing

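Per base sequence quality: x-axis is position in the read (bp); y-axis is the Phred quality score, roughly 0 to 40. Higher is better: Q20 means a 1-in-100 chance that the base call is wrong and Q40 means 1 in 10,000, so 40 is the top of the scale rather than a literally perfect score. 20 and higher is considered good. In the box plot, the yellow box is the interquartile range (25th to 75th percentile), the red line is the median, the whiskers mark the 10th and 90th percentiles, and the blue line is the mean.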
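To make the quality scale concrete, here is a minimal sketch of how a Phred score maps to an error probability, using the standard definition Q = -10 * log10(P). The function names are made up for illustration, and the ASCII offset of 33 is an assumption (it is the usual encoding for recent Illumina FASTQ files, but check your own data).

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ style encoding).
    """
    return [ord(ch) - offset for ch in qual]


# Print the error probability for a few landmark scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

So the "20 or higher is good" rule of thumb corresponds to at most a 1% chance of a wrong call at that base.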

Per tile sequence quality: all of the samples were read simultaneously. The cDNA goes into a device about the size of a palm (presumably the flow cell; tiles are regions of it). Since everything was run through the same device, it makes sense that everyone in the class has the same panel. Not sure whether the red flag is a deal breaker.

Per sequence quality scores: most of the reads have a high mean quality (peak at 39). A mean of 20 or higher is considered good, but a few reads fall below 20.

Per base sequence content: the first few bases are the barcodes. Barcodes for intestine: (1) AGG, (2) CGG, (3) AAC, (4) AGC, (5) ACG, (6) AGA. The wiki page has a file listing the reagents. These barcodes are all correct, so they let us make sense of the barcode reports that we have gotten. Remember, barcodes aren't part of the RNA sequence. Dr. Heyer and Dr. Campbell are attempting to trim the reads: keep only nucleotides with a quality score > 15, and then we should be able to match what is left against the genome to see which transcripts are present.
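A rough sketch of what that trimming could look like for a single read. This is not the actual procedure being used by Dr. Heyer and Dr. Campbell; it assumes the barcode is exactly the first 3 bases, that low-quality bases are trimmed from the 3' end until a base scoring above 15 is reached, and that qualities use an ASCII offset of 33. The function name is hypothetical.

```python
# Intestine barcodes from the notes, mapped to sample number.
BARCODES = {"AGG": 1, "CGG": 2, "AAC": 3, "AGC": 4, "ACG": 5, "AGA": 6}


def trim_read(seq, qual, barcode_len=3, min_q=15, offset=33):
    """Strip the barcode, then trim low-quality bases from the 3' end.

    Returns (sample_number_or_None, trimmed_seq, trimmed_qual).
    Assumptions: barcode is the first `barcode_len` bases; bases with
    Phred score <= min_q are removed from the end only; offset=33.
    """
    sample = BARCODES.get(seq[:barcode_len])
    seq, qual = seq[barcode_len:], qual[barcode_len:]
    end = len(seq)
    # Walk back from the 3' end while the score is not above min_q.
    while end > 0 and ord(qual[end - 1]) - offset <= min_q:
        end -= 1
    return sample, seq[:end], qual[:end]


# Made-up read: barcode AGG, then 8 bases, the last two of low quality.
print(trim_read("AGGACGTACGT", "IIIIIIIII##"))
```

Trimming only from the 3' end is one design choice; a real trimmer might use a sliding window or drop reads that end up too short.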

Per sequence GC content: the intestine samples have a consistent trace that matches the blue theoretical distribution. The liver samples have very different slopes.
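The plot is built from the GC percentage of each read. A minimal sketch of that per-read calculation follows; the function name is made up, and ignoring N when counting bases is an assumption of this sketch, not necessarily what FastQC does.

```python
def gc_percent(seq):
    """Percentage of called bases (A, C, G, T) in a read that are G or C.

    Assumption: N and any other non-ACGT characters are ignored.
    Returns 0.0 if the read has no called bases.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


# Made-up reads for illustration.
for read in ("ATGC", "GGCC", "ATATNN"):
    print(read, gc_percent(read))
```

A histogram of these values over all reads gives the red trace that gets compared with the blue theoretical curve.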

Per base N content: almost no Ns. An N means the base caller could not confidently call a base at that position.
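For reference, the quantity on this plot can be sketched as the percentage of reads with an N at each position. The function name is hypothetical and the reads are made up.

```python
def n_percent_by_position(reads):
    """Percentage of reads with an N at each read position.

    Reads shorter than a given position are left out of that
    position's denominator.
    """
    if not reads:
        return []
    length = max(len(r) for r in reads)
    result = []
    for pos in range(length):
        bases = [r[pos] for r in reads if len(r) > pos]
        n_count = sum(b in "Nn" for b in bases)
        result.append(100.0 * n_count / len(bases))
    return result


print(n_percent_by_position(["ACN", "ANN"]))
```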

Sequence Length Distribution: every read is 76 bp long.

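Sequence Duplication Levels: is the y-axis the percentage of sequences that would remain if duplicates were removed? (In recent FastQC versions that number is given in the plot title, and the y-axis is the percentage of sequences at each duplication level.) The plot shows that most of the reads are single copy. Follow-up question: if we remove the barcodes, will this graph become cleaner?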
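To pin down the two quantities in question, here is a simplified sketch that counts exact duplicates and reports both the duplication levels and the percentage of reads that would remain after deduplication. It is not FastQC's actual algorithm (FastQC works from a sample of the reads); the function name is made up.

```python
from collections import Counter


def duplication_summary(reads):
    """Return (levels, percent_remaining) for a list of read sequences.

    levels maps a duplication level (number of copies) to how many
    distinct sequences occur that many times. percent_remaining is the
    share of reads left if every sequence were kept exactly once.
    """
    if not reads:
        return Counter(), 0.0
    copies = Counter(reads)            # sequence -> number of copies
    levels = Counter(copies.values())  # copies -> number of distinct sequences
    percent_remaining = 100.0 * len(copies) / len(reads)
    return levels, percent_remaining


# Made-up reads: one sequence seen twice, two seen once.
levels, remaining = duplication_summary(["AAA", "AAA", "CCC", "GGG"])
print(dict(levels), remaining)
```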

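Overrepresented sequences: each sequence listed makes up an unusually large share of the reads (FastQC lists any sequence above 0.1% of the total). Possible source: FastQC checks each sequence against its own list of common contaminants rather than BLASTing it, and "No Hit" means nothing on that list matched.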

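Kmer Content: a k-mer is a short sequence of k bases, where 'k' is a number (a 5-mer is 5 bases long). The table lists k-mers that are enriched at particular positions in the reads. It will probably empty out when the barcodes are stripped off of the reads.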
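A small sketch of why barcodes would show up here: counting k-mers by their start position makes a barcode stand out as a k-mer that piles up at position 0. The function name and reads are made up, and k=3 is chosen only to match the 3-base barcodes; FastQC uses its own k and enrichment statistics.

```python
from collections import Counter


def kmers_by_position(reads, k=3):
    """Count every k-mer together with the position where it starts.

    Returns a Counter keyed by (start_position, kmer).
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[(i, read[i:i + k])] += 1
    return counts


# Made-up reads that all begin with the barcode AGG.
counts = kmers_by_position(["AGGTTC", "AGGCAT", "AGGGCA"], k=3)
print(counts.most_common(1))
```

Running the same count on reads with the first 3 bases removed would show whether the position-0 pile-up disappears.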

