#acl CscGroup:read

= RNA-seq Data Analysis =

<<TableOfContents(2)>> 

== Next Generation Sequencing (NGS) ==

 * Platforms:
   * Illumina/Solexa's Genome Analyzer, HiSeq systems, MiSeq etc.
   * Applied Biosystems' SOLiD
   * Roche's 454 Life Sciences
   * Helicos BioSciences' HeliScope
 * Terminology
   * Sequencing Depth or Coverage: Total number of reads mapped to the genome/transcriptome, also known as library size.
   * Transcript/gene length: Number of bases in a gene.
   * Read counts: Number of reads mapping to that gene/transcript (expression measurement).
 * Illumina's sequencing technology
   * One flow cell: 8 lanes
   * One lane is often used for the control sample.
   * Multiplexing:
     * a way to save money by sequencing multiple samples on a single unit (an Illumina flow cell)
     * offers the flexibility to construct balanced blocked designs for the purpose of testing differential expression.
   * Barcoding:
     * separates the inputs; many barcodes can be used in a single unit
     * 12 different samples can be indexed with unique subsequences and loaded onto each lane. With 8 lanes per flow cell, 96 samples can be sequenced per run.
     * the output can be deconvoluted into individual samples.
 * Variations
   * Different genes have different variances and are potentially subject to different errors and biases.
   * Sources of variation that affect only a minority of genes, such as PCR-based GC bias and the complexity of the library, should be integrated into the design as well.
   * Technical variability (experimental errors and biases): Two main sources of variation that may contribute to confounding effects:
     1. Batch effects: errors that occur after random fragmentation of the RNA until it is input to the flow cell (PCR, reverse transcription).
     1. Lane effects: errors that occur from the flow cell until obtaining the data from the sequencing machine (bad sequencing cycles, base-calling).
   * Biological variability
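
The multiplexing and barcoding points above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the barcode is treated as an inline prefix of each read and matched exactly, and the function names, barcodes, and sample names are hypothetical. The 12 samples per lane and 8 lanes per flow cell are the figures quoted in the list above.

```python
from collections import defaultdict

SAMPLES_PER_LANE = 12      # figure quoted in the notes above
LANES_PER_FLOW_CELL = 8    # figure quoted in the notes above


def samples_per_run(samples_per_lane=SAMPLES_PER_LANE, lanes=LANES_PER_FLOW_CELL):
    """Number of barcoded samples that fit in one run."""
    return samples_per_lane * lanes


def demultiplex(reads, barcode_to_sample):
    """Split pooled reads into per-sample lists using their barcode prefix.

    Assumption: every barcode has the same length and sits at the start
    of the read. Reads with an unknown prefix go to "undetermined".
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("barcodes must be non-empty and of equal length")
    n = lengths.pop()

    by_sample = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:n])
        if sample is None:
            by_sample["undetermined"].append(read)
        else:
            # Strip the barcode so only the insert sequence is kept.
            by_sample[sample].append(read[n:])
    return dict(by_sample)


if __name__ == "__main__":
    pooled = ["ACGTTTGA", "TGCAGGCC", "NNNNAAAA"]
    barcodes = {"ACGT": "sampleA", "TGCA": "sampleB"}
    print(samples_per_run())
    print(demultiplex(pooled, barcodes))
```

In practice, demultiplexing is done by the sequencing vendor's software and usually tolerates a small number of barcode mismatches; the exact-match rule here is chosen only to keep the sketch short.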

== RNA Sequencing Pipeline ==

== RNA Sequencing: Experimental Design ==

 * Aims
   * estimate the biological variation (using biological replicates)
   * avoid or reduce the technical variation (using experimental design or sequencing design)
 * Sequencing design (reads, depth, variability)
   * Sampling: Subject sampling, RNA sampling, and fragment sampling
   * Randomization: individuals are assigned to groups at random, so that uncontrolled sample-to-sample variation is not confounded with the treatment effect.
   * Replication: Biological replicates allow the estimation of within-treatment-group (biological) variability and provide the information necessary for making inferences between treatment groups.
   * Blocking: Experimental units are grouped into homogeneous clusters
 * Modes of Sequencing
   * Single-end Read: One read sequenced from one end of each cDNA insert
   * Paired-end Read: Two reads sequenced from each cDNA insert (one from each end)
     * Reads are typically 30 to 400 bp long, depending on the DNA-sequencing technology used.
     * Paired-end sequencing costs more than single-end sequencing.
 * Balanced Block Designs
   * Barcoding: DNA fragments can be labeled (barcoded) with sample-specific sequences that allow multiple samples to be included in the same sequencing lane.
   * Multiplexing
   * Pooling: All the samples of RNA are pooled into the same batch and then sequenced in one lane of a flow cell.
 * Balanced incomplete block designs (BIBD): used when not all samples can be placed in one lane, i.e. when the number of treatments exceeds the number of available barcodes.
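
As a rough illustration of randomization and a balanced block design as outlined above, the following Python sketch assigns subjects to treatment groups at random, barcodes every sample, and loads the same pool onto each lane, so that lane effects are shared equally by all treatments. The function names, subject labels, and barcode labels are hypothetical, and the sketch assumes equal group sizes and at least as many barcodes as samples; it does not attempt the incomplete (BIBD) case.

```python
import random


def randomize_to_groups(subjects, treatments, seed=0):
    """Randomly split subjects into equally sized treatment groups."""
    if len(subjects) % len(treatments) != 0:
        raise ValueError("subjects must divide evenly among treatments")
    rng = random.Random(seed)          # fixed seed keeps the layout reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)
    return {
        treatment: sorted(shuffled[i * per_group:(i + 1) * per_group])
        for i, treatment in enumerate(treatments)
    }


def balanced_block_lanes(groups, barcodes, n_lanes):
    """Barcode every sample, pool them, and load the same pool on each lane.

    Each lane is a complete block: it contains every sample of every
    treatment, so a bad lane affects all treatments alike.
    """
    samples = [(t, s) for t, members in groups.items() for s in members]
    if len(samples) > len(barcodes):
        raise ValueError("more samples than barcodes: a complete block is not possible")
    pool = [(barcode, t, s) for barcode, (t, s) in zip(barcodes, samples)]
    return {lane: list(pool) for lane in range(1, n_lanes + 1)}


if __name__ == "__main__":
    subjects = ["s1", "s2", "s3", "s4", "s5", "s6"]
    groups = randomize_to_groups(subjects, ["control", "treated"], seed=1)
    lanes = balanced_block_lanes(groups, ["BC1", "BC2", "BC3", "BC4", "BC5", "BC6"], n_lanes=2)
    for lane, pool in lanes.items():
        print(lane, pool)
```

The `ValueError` raised when samples outnumber barcodes marks the situation in which an incomplete block design would be needed instead.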

== RNA Sequencing: Statistical Analysis ==

== References ==