#acl CscGroup:read

= RNA-seq Data Analysis =

<>

== Next Generation Sequencing (NGS) ==

 * Platforms:
  * Illumina/Solexa's Genome Analyzer, HiSeq systems, MiSeq etc.
  * Applied Biosystems' SOLiD
  * Roche's 454 Life Sciences
  * Helicos BioSciences' HeliScope
 * Terminology
  * Sequencing depth or coverage: total number of reads mapped to the genome/transcriptome, also known as library size.
  * Transcript/gene length: number of bases in a gene.
  * Read counts: number of reads mapping to a given gene/transcript (the expression measurement).
 * Illumina's sequencing technology
  * One flow cell: 8 lanes
  * One lane is often used for the control sample.
  * Multiplexing:
   * a way to save money by sequencing multiple samples on a single unit (an Illumina flow cell)
   * offers the flexibility to construct balanced blocked designs for the purpose of testing differential expression.
  * Barcoding:
   * used to separate inputs; a single unit can hold many barcodes
   * 12 different samples can be indexed with unique subsequences and loaded onto each lane. In total, 96 samples (12 samples x 8 lanes) can be sequenced per run.
   * the output can be deconvoluted to individual samples.
 * Variations
  * Different genes have different variances and are potentially subject to different errors and biases.
  * Sources of variation affecting only a minority of genes should be integrated into the design as well (PCR-based GC bias).
  * Complexity of the library.
  * Technical variability (experimental errors and biases): two main sources of variation may contribute to confounding effects:
   1. Batch effects: errors that occur between the random fragmentation of the RNA and its input to the flow cell (PCR, reverse transcription).
   1. Lane effects: errors that occur between loading the flow cell and obtaining the data from the sequencing machine (bad sequencing cycles, base-calling).
  * Biological variability

== RNA Sequencing Pipeline ==

== RNA Sequencing: Experimental Design ==

 * Aims
  * estimate the biological variation (using biological replicates)
  * avoid or reduce the technical variation (using experimental design or sequencing design)
 * Sequencing design (reads, depth, variability)
  * Sampling: subject sampling, RNA sampling, and fragment sampling
  * Randomization: assigning individuals at random to groups (to reduce sample variability)
  * Replication: biological replicates allow estimation of the within-treatment-group (biological) variability, which is necessary for making inferences between treatment groups.
  * Blocking: experimental units are grouped into homogeneous clusters
 * Modes of Sequencing
  * Single-end read: one read sequenced from one end of each cDNA insert
  * Paired-end read: two reads sequenced from each cDNA insert (one from each end)
  * Reads are typically 30 to 400 bp long, depending on the DNA-sequencing technology used.
  * Paired-end sequencing costs more than single-end sequencing.
 * Balanced Block Designs
  * Barcoding: DNA fragments can be labeled (barcoded) with sample-specific sequences that allow multiple samples to be included in the same sequencing lane.
  * Multiplexing
  * Pooling: all the RNA samples are pooled into the same batch and then sequenced in one lane of a flow cell.
  * Balanced incomplete block designs (BIBD): used when the samples cannot all fit in one lane (treatments > barcodes)

== RNA Sequencing: Statistical Analysis ==
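The balanced block design and the barcoding capacity noted under Experimental Design can be illustrated with a small layout sketch. This is only a simplified illustration, not a prescribed protocol: it assumes one lane per biological replicate and one barcode per treatment within each lane. The function names `max_samples_per_run` and `balanced_block_design` are hypothetical, and the constants are the figures quoted on this page (8 lanes per flow cell, 12 barcodes per lane).

```python
from collections import Counter

LANES_PER_FLOW_CELL = 8   # figure quoted on this page
BARCODES_PER_LANE = 12    # figure quoted on this page


def max_samples_per_run(lanes=LANES_PER_FLOW_CELL, barcodes=BARCODES_PER_LANE):
    """Upper bound on multiplexed samples in one run (lanes x barcodes)."""
    return lanes * barcodes


def balanced_block_design(treatments, replicates,
                          lanes=LANES_PER_FLOW_CELL,
                          barcodes=BARCODES_PER_LANE):
    """Lay out samples so that every lane (block) holds each treatment once.

    Simplifying assumption for this sketch: biological replicate r of every
    treatment is sequenced in lane r, and each treatment keeps the same
    barcode index in every lane.
    """
    if len(treatments) > barcodes:
        # More treatments than barcodes: a complete block does not fit in
        # one lane, which is the situation the BIBD bullet refers to.
        raise ValueError("more treatments than barcodes; consider a BIBD")
    if replicates > lanes:
        raise ValueError("more replicates than lanes on one flow cell")

    design = []
    for lane in range(1, replicates + 1):
        for barcode, treatment in enumerate(treatments, start=1):
            design.append({
                "lane": lane,
                "barcode": barcode,
                "treatment": treatment,
                "replicate": lane,
            })
    return design


if __name__ == "__main__":
    layout = balanced_block_design(["control", "treated_A", "treated_B"],
                                   replicates=4)
    # Count how often each treatment appears in each lane; a balanced
    # complete block design has exactly one of each per lane.
    per_lane = Counter((row["lane"], row["treatment"]) for row in layout)
    for row in layout:
        print(row)
    print("balanced:", set(per_lane.values()) == {1})
    print("capacity per run:", max_samples_per_run())
```

Because every lane contains all treatments, a lane effect hits each treatment equally and is not confounded with the treatment comparison. In practice the barcode order within a lane would normally be randomized as well, which this sketch leaves out.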