#acl CscGroup:read

= RNA-seq Data Analysis =

<<TableOfContents(2)>> 

== Next Generation Sequencing (NGS) ==

 * Platforms:
   * Illumina's Genome Analyzer
   * Applied Biosystems' SOLiD
   * Roche's 454 Life Sciences
 * Terminology
   * Sequencing Depth or Coverage: Total number of reads mapped to the genome/transcriptome, also known as library size.
   * Transcript/gene length: Number of bases in a gene.
   * Read counts: Number of reads mapping to a given gene/transcript (expression measurement). The reads are typically 30-400 bp long, depending on the DNA-sequencing technology used.
 * Illumina's sequencing technology
   * One flow cell: 8 lanes
   * One lane is often used for the control sample.
   * Multiplexing:
     * a way to save money by sequencing multiple samples on a single unit (an Illumina flow cell)
     * makes it feasible to construct balanced block designs for testing differential expression.
   * Barcoding: 
     * used to separate inputs.
     * The output can be deconvoluted to individual samples.
     * Many barcodes in a single unit: 12 different samples can be indexed with unique subsequences and loaded onto each lane. In total, 96 samples can be sequenced per run.
   * Quantitative standards (spike-ins, see ENCODE Consortium 2011)
     * It is highly desirable to include a ladder of RNA spike-ins to calibrate quantification, sensitivity, coverage and linearity.
   * Sequencing modes
     * Single-end Read: One read sequenced from one end of each cDNA insert
     * Paired-end Read: two reads sequenced from each cDNA sample insert (one from each end)
       * Paired-end sequencing costs more than single-end sequencing.
   * Questions about the sequencing cost
     * Is the cost calculated by the number of lanes or flow cells used?
     * About $2,000 per lane?
     * How many samples can be in one lane? 
     * How many reads can be obtained from one lane?
     * How many reads per sample can be obtained from one lane or one flow cell?
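
The multiplexing arithmetic above (8 lanes per flow cell, 12 barcoded samples per lane, 96 samples per run) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this page, and the reads-per-lane figure in the usage lines is a placeholder assumption, not a platform specification. It also assumes reads are split evenly across the samples in a lane, which real runs only approximate.

```python
def samples_per_run(lanes_per_flow_cell=8, barcodes_per_lane=12, control_lanes=0):
    """Number of samples that fit on one flow cell with multiplexing."""
    usable_lanes = lanes_per_flow_cell - control_lanes
    if usable_lanes < 0:
        raise ValueError("control_lanes cannot exceed lanes_per_flow_cell")
    return usable_lanes * barcodes_per_lane


def reads_per_sample(reads_per_lane, samples_in_lane):
    """Reads per sample, assuming an even split across the samples in a lane."""
    if samples_in_lane < 1:
        raise ValueError("samples_in_lane must be at least 1")
    return reads_per_lane // samples_in_lane


# Figures from the notes above: 8 lanes x 12 barcodes.
print(samples_per_run())                  # 96
# If one lane is reserved for the control sample.
print(samples_per_run(control_lanes=1))   # 84
# Hypothetical lane yield of 120 million reads (assumption, not a spec).
print(reads_per_sample(120_000_000, 12))  # 10000000
```

Once the per-lane price and the actual lane yield are known, the same two functions give a rough answer to the cost questions listed above.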
   
== RNA Sequencing Pipeline ==

== RNA Sequencing: Experimental Design ==

 * Sequencing variations
   * Different genes have different variances and are potentially subject to different errors and biases.
   * Sources of variation affecting only a minority of genes should be integrated into the design as well (PCR-based GC bias). 
   * Technical variability (experimental errors and biases): Repeated measurements of the same biological sample in multiple lanes or flow cells. For example, sequencing the same biological sample in different lanes provides information about lane-to-lane variability. Two main sources of variation may contribute to confounding effects:
     1. Batch effects: errors that occur after random fragmentation of the RNA until it is input to the flow cell (PCR, reverse transcription).
     1. Lane effects: errors that occur from the flow cell until obtaining the data from the sequencing machine (bad sequencing cycles, base-calling)
   * Biological variability (see ENCODE Consortium 2011)
     1. A biological replicate is defined as an independent growth of cells/tissue and subsequent analysis. 
     1. Experiments should be performed with two or more biological replicates, unless there is a compelling reason why this is impractical or wasteful (e.g. overlapping time points with high temporal resolution). 
     1. A typical R2 (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 and 0.98. Experiments with biological correlations that fall below 0.9 should either be repeated or explained.
 * Sequencing depth (see ENCODE Consortium 2011)
   * The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.
   * Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M paired-end reads of length > 30 nt, of which 20-25M are mappable to the genome or known transcriptome).
   * Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing. The ability to reliably detect low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library.
   * For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended.
 * Purposes of the experimental design
   * avoid or eliminate the technical variation (possible confounding factors)
   * estimate the biological variation 
 * Sequencing design 
   * Sampling: subject sampling, RNA sampling, and fragment sampling
   * Randomization: assigning individuals at random to groups (reduce the sample variability or variation) 
   * Replication: biological replicates allow for estimation of within-treatment group (biological) variability, which is needed for making inferences between treatment groups.
   * Blocking: experimental units are grouped into homogeneous clusters (blocks)
 * Balanced block designs
   * Barcoding: DNA fragments can be labeled or barcoded with sample-specific sequences that allow multiple samples to be included in the same sequencing lane.
   * Pooling: All the samples of RNA are pooled into the same batch and then sequenced in one lane of a flow cell.
   * Any batch and lane effects are the same for all the samples.
 * Balanced incomplete block designs (BIBD) 
   * Technical constraints and the scientific hypotheses:
     1. the number of treatments (I)
     1. the number of biological replicates per treatment (J)
     1. the number of unique barcodes (s) that can be included in a single lane (block)
     1. the number of lanes available for sequencing (L)
   * When s < I, i.e., the number of unique bar codes in one lane is less than the number of treatments or samples,
     * a complete block design is not possible.
     * In these cases, a BIBD is suggested.
   * Balanced incomplete block:
     * Incomplete: cannot fit all treatments (samples) in each block
     * Balanced: each pair of treatments occurs together the same number of times, so the variance of the difference between two treatments is constant.
   * BIBD:
     * The total number of possible technical replicates per biological replicate is T = sL/(JI).
     * The number of times each pair of treatments occurs together is k = J(s-1)/(I-1), which must be an integer.
     * Extensive lists of BIBDs can be found in Fisher and Yates (1963) and Cochran and Cox (1957).
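
The BIBD quantities above can be checked quickly before committing to a layout. The sketch below evaluates T = sL/(JI) and k = J(s-1)/(I-1) exactly as they are written in these notes, using exact fractions so that the integer condition on k is not blurred by floating point. The function name and the example values are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def bibd_parameters(I, J, s, L):
    """Evaluate the design quantities from the notes.

    I: number of treatments
    J: biological replicates per treatment
    s: unique barcodes per lane (block size)
    L: lanes available for sequencing
    """
    if I < 2 or min(J, s, L) < 1:
        raise ValueError("need I >= 2 and J, s, L >= 1")
    T = Fraction(s * L, J * I)        # technical replicates per biological replicate
    k = Fraction(J * (s - 1), I - 1)  # times each pair of treatments shares a lane
    return {
        "T": T,
        "k": k,
        "complete_block_possible": s >= I,  # s < I rules out a complete block design
        "k_is_integer": k.denominator == 1,
    }


# Hypothetical case: 3 treatments, 2 replicates each, 2 barcodes per lane, 3 lanes.
design = bibd_parameters(I=3, J=2, s=2, L=3)
print(design["T"], design["k"], design["k_is_integer"])  # 1 1 True

# Hypothetical case where k is not an integer, so the balance condition fails.
print(bibd_parameters(I=4, J=2, s=2, L=4)["k_is_integer"])  # False
```

In the first case each lane holds one pair of treatments and every pair shares exactly one lane, which matches k = 1.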

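The replicate-concordance guideline under biological variability (R2 typically 0.92 to 0.98, with values below 0.9 needing a repeat or an explanation) can be screened with a small helper. This is a minimal sketch under two assumptions: expression values are given as plain lists of RPKM in the same gene order, and "detected in both samples" is taken to mean RPKM > 0 in both. The function name is hypothetical, and the notes do not say whether values should be log-transformed first, so no transformation is applied here.

```python
import math


def replicate_r_squared(rpkm_rep1, rpkm_rep2):
    """Squared Pearson correlation over genes detected in both replicates."""
    if len(rpkm_rep1) != len(rpkm_rep2):
        raise ValueError("replicates must list the same genes in the same order")
    # Keep only genes with non-zero expression in both replicates (assumption).
    pairs = [(x, y) for x, y in zip(rpkm_rep1, rpkm_rep2) if x > 0 and y > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two genes detected in both replicates")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("expression values are constant in one replicate")
    r = sxy / math.sqrt(sxx * syy)
    return r * r


# Made-up RPKM values for five genes; the last gene is undetected in replicate 1.
rep1 = [10.0, 25.0, 3.0, 80.0, 0.0]
rep2 = [12.0, 22.0, 4.0, 75.0, 1.5]
r2 = replicate_r_squared(rep1, rep2)
print("flag for review" if r2 < 0.9 else "acceptable", round(r2, 3))
```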

== RNA Sequencing: Statistical Analysis ==


'''References'''

[[http://www.genetics.org/content/185/2/405.abstract | Auer and Doerge (2010) Statistical Design and Analysis of RNA Sequencing Data, Genetics]] <<BR>>
[[http://res.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf | Illumina's technical note: Estimating Sequencing Coverage]] <<BR>>
[[attachment:slides_RNAseq_Shaheena.pdf | Some Issues of Statistical Design & Analysis in RNA-seq Experiment, Shaheena Bashir (2012)]] <<BR>>
[[attachment:RNAseq_Standards_V1.0-1.pdf | Standards, Guidelines and Best Practices for RNA-Seq V1.0. The ENCODE Consortium (June 2011)]]