Skip to content

Instantly share code, notes, and snippets.

@jdblischak
Last active January 29, 2025 23:15

Revisions

  1. jdblischak revised this gist Sep 22, 2017. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -12,3 +12,9 @@ Past versions:
    * [Tuesday, April 29, 2014](https://gist.github.com/jdblischak/11384914/9e6363755a0760f307b5011b66022675fd73afa2)

    For all the example code to work, you will need to install an R version greater than 3.0.

    **License:** This lesson is available under the [CC-BY license][cc-by] ([legalcode][cc-by-legalcode]).
    Essentially you can do whatever you want with it (teach it verbatim or a modified version) as long as you attribute me as the original source and provide a link to this Gist.

    [cc-by]: https://creativecommons.org/licenses/by/2.0/
    [cc-by-legalcode]: https://creativecommons.org/licenses/by/2.0/legalcode
  2. jdblischak revised this gist Sep 22, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -208,7 +208,7 @@ How many genes are differentially expressed at an FDR of 10%?

    ```{r count-de-edgeR}
    sum(results_edgeR$table$FDR < .1)
    plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .01])
    plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1])
    abline(h = c(-2, 2), col = "blue")
    ```

  3. jdblischak revised this gist Apr 28, 2016. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -7,6 +7,7 @@ The specific example is a differential expression analysis with edgeR starting w

    Past versions:

    * [Tuesday, April 26, 2016](https://gist.github.com/jdblischak/11384914/a4b57e05fd77a3cd1012977662d7b0b31158dc8f)
    * [Tuesday, April 21, 2015](https://gist.github.com/jdblischak/11384914/fd4563da540b02852808cb59c3c95024dc358f36)
    * [Tuesday, April 29, 2014](https://gist.github.com/jdblischak/11384914/9e6363755a0760f307b5011b66022675fd73afa2)

  4. jdblischak revised this gist Apr 28, 2016. 2 changed files with 5 additions and 3 deletions.
    4 changes: 3 additions & 1 deletion README.md
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,9 @@
    # Differential expression analysis with edgeR

    This is a tutorial I have presented for the class Genomics and Systems Biology at the University of Chicago.
    In this course the students learn about study design, normalization, and statistical testing for genomic studies.
    This is meant to introduce them to how these ideas are implemented in practice.
    The specific example is a differential expression analysis starting with a table of counts and ending with a list of differentially expressed genes.
    The specific example is a differential expression analysis with edgeR starting with a table of counts and ending with a list of differentially expressed genes.

    Past versions:

    4 changes: 2 additions & 2 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -41,7 +41,7 @@ Next, we need to download and install the package we will use to perform the ana

    ```{r install-packages, eval = FALSE}
    source("http://bioconductor.org/biocLite.R")
    biocLite("edgeR")
    biocLite("edgeR", dependencies = TRUE)
    ```

    ## Introduction
    @@ -63,7 +63,7 @@ edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10

    [nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution

    The SEQC data set will we analyze contains two different groups for comparison.
    The SEQC data set we will analyze contains two different groups for comparison.
    Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
    It is an equal mixture of RNA from ten different human cell lines.
    Group B is five technical replicates of [Ambion’s Human Brain Reference RNA][b].
  5. jdblischak revised this gist Apr 21, 2016. 1 changed file with 22 additions and 15 deletions.
    37 changes: 22 additions & 15 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,7 @@
    ---
    title: Differential expression analysis
    title: Differential expression analysis with edgeR
    author: John Blischak
    date: 2015-04-21
    date: 2016-04-26
    output:
    html_document:
    toc: true
    @@ -66,7 +66,7 @@ edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10
    The SEQC data set will we analyze contains two different groups for comparison.
    Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
    It is an equal mixture of RNA from ten different human cell lines.
    Group B is five techincal replicates of [Ambion’s Human Brain Reference RNA][b].
    Group B is five technical replicates of [Ambion’s Human Brain Reference RNA][b].
    It is RNA that was pooled from several donors from several different regions of the brain.

    [a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf
    @@ -155,7 +155,7 @@ summary(pca)

    Now we start preparing the data for the the test of differential expression.
    We create a vector called, `group`, that labels each of the columns as belonging to group A or B.
    We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment.
    We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differential expression experiment.

    ```{r make-groups-edgeR}
    group <- substr(colnames(data_clean), 1, 1)
    @@ -179,13 +179,13 @@ y$samples

    The next step is to model the variance of the read counts per gene.
    A natural method for modeling gene counts is the Poisson distribution.
    However, the Poisson assumes the mean and variance are identical, but it has been found emipirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion").
    However, the Poisson assumes the mean and variance are identical, but it has been found empirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion").
    So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process.
    edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation."
    edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biological coefficient of variation."
    Though it should be mentioned that any technical biases are also included in this estimate.
    edgeR shares information across genes to determine a common dispersion.
    It extends this to a trended dispersion to model the mean-variance relationship (lowly expressed genes are typically more noisy).
    Lastly it calculates a dispersion esitmate per gene and shrinks it towards the trended dispersion.
    Lastly it calculates a dispersion estimate per gene and shrinks it towards the trended dispersion.
    The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression.

    ```{r model-edgeR}
    @@ -222,21 +222,22 @@ The blue lines represent a four-fold change in expression.
    ## Adding covariates

    The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework.
    Let's say we processed the samples in two batches, the samples were collected from males and females, and from a range of ages.
    Let's say we processed the samples in two batches and also recorded the [RIN scores][rin] to control for differences in RNA quality.

    [rin]: http://www.genomics.agilent.com/article.jsp?pageId=2181&_requestid=506775

    ```{r}
    set.seed(123)
    batch <- sample(c("one", "two"), 10, replace = TRUE)
    sex <- sample(c("M", "F"), 10, replace = TRUE)
    age <- sample(18:50, 10, replace = TRUE)
    rin <- sample(6:10, 10, replace = TRUE)
    ```

    We need to create a design matrix to describe the statistical model.

    ```{r}
    y <- DGEList(data_clean)
    y <- calcNormFactors(y)
    design <- model.matrix(~group + batch + sex + age)
    design <- model.matrix(~group + batch + rin)
    design
    ```

    @@ -256,16 +257,22 @@ And here is an example gene.
    boxplot(as.numeric(data_clean["HBB", ]) ~ group)
    ```


    ## To learn more

    Check out the manual for [edgeR][erman] and search the Bioconductor [support site][bioc-support].
    For a rigorous comparison of RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.
    The [source code][gist] for this lesson is availabe as a GitHub Gist.
    * Check out the manual for [edgeR][erman] and search the Bioconductor [support site][bioc-support].
    * For a rigorous comparison of RNA-seq methods, see `r citet(c(Rapaport2013="10.1186/gb-2013-14-9-r95"))`
    and `r citet(c(Soneson2013="10.1186/1471-2105-14-91"))`.
    * The [source code][gist] for this lesson is available as a GitHub Gist.
    * In addition to [edgeR][], there are many statistical tools available for performing differential expression analysis.
    Other commonly used count-based methods are [DESeq2][] and [limma+voom][limma].
    For a different approach from the traditional count-based methods, check out [Cufflinks][].

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [bioc-support]: https://support.bioconductor.org/
    [gist]: https://gist.github.com/jdblischak/11384914
    [DESeq2]: http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
    [limma]: http://www.bioconductor.org/packages/release/bioc/html/limma.html
    [Cufflinks]: http://cole-trapnell-lab.github.io/cufflinks/

    ## References

  6. jdblischak revised this gist Apr 21, 2016. 1 changed file with 6 additions and 11 deletions.
    17 changes: 6 additions & 11 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -184,23 +184,19 @@ So instead the negative binomial distribution is used, which has a dispersion pa
    edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation."
    Though it should be mentioned that any technical biases are also included in this estimate.
    edgeR shares information across genes to determine a common dispersion.
    It then calculates a dispersion esitmate per gene and shrinks it towards the common dispersion.
    It extends this to a trended dispersion to model the mean-variance relationship (lowly expressed genes are typically more noisy).
    Lastly it calculates a dispersion esitmate per gene and shrinks it towards the trended dispersion.
    The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression.

    **Note:** edgeR also has a trended dispersion to use in place of the common dispersion for more complicated experimental designs.

    ```{r model-edgeR}
    y <- estimateCommonDisp(y, verbose = TRUE)
    y <- estimateTagwiseDisp(y)
    y <- estimateDisp(y)
    sqrt(y$common.dispersion) # biological coefficient of variation
    plotBCV(y)
    ```

    The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.

    edgeR tests for differential expression between two classes using a modified version of the Fisher's Exact Test.

    **Note:** It also provides a generalized linear model frameworks for more complicated experimental designs.
    edgeR tests for differential expression between two classes using a method similar in idea to the Fisher's Exact Test.

    ```{r test-edgeR}
    et <- exactTest(y)
    @@ -245,11 +241,10 @@ design
    ```

    And now we test.
    The argument `coef = 2` corresponds to testing the second column of the design matrix, which in this case is whether the sample is from group A or B.

    ```{r}
    y <- estimateGLMCommonDisp(y, design)
    y <- estimateGLMTrendedDisp(y, design)
    y <- estimateGLMTagwiseDisp(y, design)
    y <- estimateDisp(y, design)
    fit <- glmFit(y, design)
    lrt <- glmLRT(fit, coef = 2)
    topTags(lrt)
  7. jdblischak revised this gist Apr 21, 2016. 1 changed file with 18 additions and 17 deletions.
    35 changes: 18 additions & 17 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -5,6 +5,7 @@ date: 2015-04-21
    output:
    html_document:
    toc: true
    toc_float: true
    ---

    ```{r settings, include = FALSE}
    @@ -14,9 +15,11 @@ opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
    cache = FALSE)
    ```

    In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor.
    In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using [R/Bioconductor][bioc].
    Before class, please download the data set and install the software as explained in the following section.

    [bioc]: http://www.bioconductor.org

    ## Download data and install software

    For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project.
    @@ -32,9 +35,9 @@ fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&fil
    download.file(fname, destfile = "GSE49712_HTSeq.txt.gz")
    ```

    Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis.
    Next, we need to download and install the package we will use to perform the analysis: [edgeR][].

    [bioc]: http://www.bioconductor.org
    [edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

    ```{r install-packages, eval = FALSE}
    source("http://bioconductor.org/biocLite.R")
    @@ -43,22 +46,21 @@ biocLite("edgeR")

    ## Introduction

    As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Oshlack2010="10.1038/nrg2484"))`).
    As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Wang2009="10.1038/nrg2484"))`).
    The number of reads that map to a gene is the proxy for its expression level.

    ![RNA-seq experiment](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/bin/nihms229948f1.jpg)

    There are many steps between receiving the raw reads from the sequencer and gaining biological insight.
    In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Wang2009="10.1186/gb-2010-11-12-220"))`).
    In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Oshlack2010="10.1186/gb-2010-11-12-220"))`).

    ![RNA-seq analysis pipeline](http://genomebiology.com/content/figures/gb-2010-11-12-220-1.jpg)
    ![RNA-seq analysis pipeline](https://static-content.springer.com/image/art%3A10.1186%2Fgb-2010-11-12-220/MediaObjects/13059_2010_Article_2518_Fig1_HTML.jpg)

    Unlike the continuous data that is generated by microarrays, RNA-seq generates counts.
    While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`).
    We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][].
    edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb].

    [edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
    [nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution

    The SEQC data set will we analyze contains two different groups for comparison.
    @@ -89,7 +91,7 @@ data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)
    ## Quality control

    Let's quickly explore the data.
    Each column corresponds to a sample, and each row corresponds to a gene.
    Each each row corresponds to a gene, and each column corresponds to a sample.

    ```{r view}
    dim(data_raw)
    @@ -104,8 +106,8 @@ data_clean <- data_raw[1:(nrow(data_raw) - 5), ]
    ```

    We should also remove genes that are unexpressed or very lowly expressed in the samples.
    One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm).
    edgeR provides the function, `cpm`, to compute the counts-per-million.
    One simple method to do this is to choose a cutoff based on the median log~2~-transformed counts per gene per million mapped reads (cpm).
    edgeR provides the function, `cpm`, to compute the counts per million.

    ```{r expr-cutoff}
    cpm_log <- cpm(data_clean, log = TRUE)
    @@ -117,15 +119,15 @@ sum(median_log2_cpm > expr_cutoff)
    data_clean <- data_clean[median_log2_cpm > expr_cutoff, ]
    ```

    After removing all genes with a median log2 cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining.
    After removing all genes with a median log~2~ cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining.
    A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes.
    In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed.

    **Question:**
    What are some ways that we could improve this simple cutoff?
    What information are we ignoring?

    After filtering lowly expressed genes, we recalculate the log cpm.
    After filtering lowly expressed genes, we recalculate the log~2~ cpm.

    ```{r}
    cpm_log <- cpm(data_clean, log = TRUE)
    @@ -215,7 +217,7 @@ abline(h = c(-2, 2), col = "blue")
    ```

    As expected from the description of the samples and the heatmap, there are many differentially expressed genes.
    The [MA plot][ma] above plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis.
    The [MA plot][ma] above plots the log~2~ fold change on the y-axis versus the average log~2~ counts-per-million on the x-axis.
    The red dots are genes with an FDR less than 10%.
    The blue lines represent a four-fold change in expression.

    @@ -262,13 +264,12 @@ boxplot(as.numeric(data_clean["HBB", ]) ~ group)

    ## To learn more

    Check out the manual for [edgeR][erman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
    For a rigorous comparison RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.
    Check out the manual for [edgeR][erman] and search the Bioconductor [support site][bioc-support].
    For a rigorous comparison of RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.
    The [source code][gist] for this lesson is availabe as a GitHub Gist.

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
    [list]: http://www.bioconductor.org/help/mailing-list/
    [bioc-support]: https://support.bioconductor.org/
    [gist]: https://gist.github.com/jdblischak/11384914

    ## References
  8. jdblischak revised this gist Apr 22, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    Original file line number Diff line number Diff line change
    @@ -5,7 +5,7 @@ The specific example is a differential expression analysis starting with a table

    Past versions:

    * [Tuesday, April 21, 2015](https://gist.github.com/jdblischak/11384914/cc13133bb5ce319342936be16c7549d922ffc218)
    * [Tuesday, April 21, 2015](https://gist.github.com/jdblischak/11384914/fd4563da540b02852808cb59c3c95024dc358f36)
    * [Tuesday, April 29, 2014](https://gist.github.com/jdblischak/11384914/9e6363755a0760f307b5011b66022675fd73afa2)

    For all the example code to work, you will need to install an R version greater than 3.0.
  9. jdblischak revised this gist Apr 22, 2015. 1 changed file with 8 additions and 11 deletions.
    19 changes: 8 additions & 11 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -1,14 +1,11 @@
    This was a tutorial I presented for the class Genomics and Systems Biology at the University of Chicago on Tuesday, April 29, 2014.
    The students had been learning about study design, normalization, and statistical testing for genomic studies.
    This was meant to introduce them to how these ideas are implemented in practice.
    The specific example is a differential expression analysis with RNA-seq data for a two-class comparison.
    This is a tutorial I have presented for the class Genomics and Systems Biology at the University of Chicago.
    In this course the students learn about study design, normalization, and statistical testing for genomic studies.
    This is meant to introduce them to how these ideas are implemented in practice.
    The specific example is a differential expression analysis starting with a table of counts and ending with a list of differentially expressed genes.

    To render this lesson, you'll need to first install the R package knitr and the R/Bioconductor packages edgeR and DESeq2.
    You can then create the lesson by running the following from the R console:
    Past versions:

    ```r
    library("knitr")
    knit2html("rnaseq-de-tutorial.Rmd", envir = new.env())
    ```
    * [Tuesday, April 21, 2015](https://gist.github.com/jdblischak/11384914/cc13133bb5ce319342936be16c7549d922ffc218)
    * [Tuesday, April 29, 2014](https://gist.github.com/jdblischak/11384914/9e6363755a0760f307b5011b66022675fd73afa2)

    One known issue is that if you do not have the latest version of DESeq2 because you have an older version of R, the function `rlog` may not be available.
    For all the example code to work, you will need to install an R version greater than 3.0.
  10. jdblischak revised this gist Apr 21, 2015. 1 changed file with 8 additions and 2 deletions.
    10 changes: 8 additions & 2 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -125,6 +125,12 @@ In this case, we have many more because the samples include RNA collected from m
    What are some ways that we could improve this simple cutoff?
    What information are we ignoring?

    After filtering lowly expressed genes, we recalculate the log cpm.

    ```{r}
    cpm_log <- cpm(data_clean, log = TRUE)
    ```

    We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix.
    The heatmap result corresponds to what we know about the data set.
    First, the samples in group A and B come from very different cell populations, so the two groups are very different.
    @@ -137,9 +143,9 @@ heatmap(cor(cpm_log))
    Another method to view the relationships between samples is principal components analysis (PCA).

    ```{r pca}
    pca <- prcomp(t(data_clean), scale. = TRUE)
    pca <- prcomp(t(cpm_log), scale. = TRUE)
    plot(pca$x[, 1], pca$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2")
    text(pca$x[, 1], pca$x[, 2], labels = colnames(data_clean))
    text(pca$x[, 1], pca$x[, 2], labels = colnames(cpm_log))
    summary(pca)
    ```

  11. jdblischak revised this gist Apr 21, 2015. 1 changed file with 17 additions and 3 deletions.
    20 changes: 17 additions & 3 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -218,27 +218,41 @@ The blue lines represent a four-fold change in expression.
    ## Adding covariates

    The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework.
    Let's say we processed the samples in two batches, the samples were collected from males and females, and from a range of ages.

    ```{r}
    set.seed(123)
    sex <- sample(c("M", "F"), 10, replace = TRUE)
    batch <- sample(c("one", "two"), 10, replace = TRUE)
    sex <- sample(c("M", "F"), 10, replace = TRUE)
    age <- sample(18:50, 10, replace = TRUE)
    ```

    We need to create a design matrix to describe the statistical model.

    ```{r}
    design <- model.matrix(~group + sex + batch + age)
    y <- DGEList(data_clean)
    y <- calcNormFactors(y)
    design <- model.matrix(~group + batch + sex + age)
    design
    ```

    And now we test.

    ```{r}
    y <- estimateGLMCommonDisp(data_clean, design)
    y <- estimateGLMCommonDisp(y, design)
    y <- estimateGLMTrendedDisp(y, design)
    y <- estimateGLMTagwiseDisp(y, design)
    fit <- glmFit(y, design)
    lrt <- glmLRT(fit, coef = 2)
    topTags(lrt)
    ```

    And here is an example gene.

    ```{r}
    boxplot(as.numeric(data_clean["HBB", ]) ~ group)
    ```


    ## To learn more

  12. jdblischak revised this gist Apr 21, 2015. 1 changed file with 25 additions and 0 deletions.
    25 changes: 25 additions & 0 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -215,6 +215,31 @@ The blue lines represent a four-fold change in expression.

    [ma]: https://en.wikipedia.org/wiki/MA_plot

    ## Adding covariates

    The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework.

    ```{r}
    set.seed(123)
    sex <- sample(c("M", "F"), 10, replace = TRUE)
    batch <- sample(c("one", "two"), 10, replace = TRUE)
    age <- sample(18:50, 10, replace = TRUE)
    ```

    ```{r}
    design <- model.matrix(~group + sex + batch + age)
    ```

    ```{r}
    y <- estimateGLMCommonDisp(data_clean, design)
    y <- estimateGLMTrendedDisp(y, design)
    y <- estimateGLMTagwiseDisp(y, design)
    fit <- glmFit(y, design)
    lrt <- glmLRT(fit, coef = 2)
    topTags(lrt)
    ```


    ## To learn more

    Check out the manual for [edgeR][erman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
  13. jdblischak revised this gist Apr 21, 2015. 1 changed file with 11 additions and 0 deletions.
    11 changes: 11 additions & 0 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -134,6 +134,17 @@ Second, since the samples in each group are technical replicates, the within gro
    heatmap(cor(cpm_log))
    ```

    Another method to view the relationships between samples is principal components analysis (PCA).

    ```{r pca}
    pca <- prcomp(t(data_clean), scale. = TRUE)
    plot(pca$x[, 1], pca$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2")
    text(pca$x[, 1], pca$x[, 2], labels = colnames(data_clean))
    summary(pca)
    ```

    ## Two group comparison

    Now we start preparing the data for the the test of differential expression.
    We create a vector called, `group`, that labels each of the columns as belonging to group A or B.
    We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment.
  14. jdblischak revised this gist Apr 21, 2015. 1 changed file with 9 additions and 8 deletions.
    17 changes: 9 additions & 8 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -11,7 +11,7 @@ output:
    library("knitcitations")
    library("knitr")
    opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
    cache = TRUE)
    cache = FALSE)
    ```

    In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor.
    @@ -56,8 +56,10 @@ In this tutorial, we will start with a "Table of counts" and end with a "List of
    Unlike the continuous data that is generated by microarrays, RNA-seq generates counts.
    While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`).
    We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][].
    edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb].

    [edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
    [nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution

    The SEQC data set will we analyze contains two different groups for comparison.
    Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
    @@ -68,11 +70,7 @@ It is RNA that was pooled from several donors from several different regions of
    [a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf
    [b]: http://www.lifetechnologies.com/order/catalog/product/AM6050

    ## edgeR

    edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb].

    [nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution
    ## Setup

    First, we need to load the package into R.

    @@ -88,6 +86,8 @@ We can check our current working directory with the command `getwd()`, and we ca
    data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)
    ```

    ## Quality control

    Let's quickly explore the data.
    Each column corresponds to a sample, and each row corresponds to a gene.

    @@ -108,7 +108,8 @@ One simple method to do this is to choose a cutoff based on the median log2-tran
    edgeR provides the function, `cpm`, to compute the counts-per-million.

    ```{r expr-cutoff}
    median_log2_cpm <- apply(cpm(data_clean, log = TRUE), 1, median)
    cpm_log <- cpm(data_clean, log = TRUE)
    median_log2_cpm <- apply(cpm_log, 1, median)
    hist(median_log2_cpm)
    expr_cutoff <- -1
    abline(v = expr_cutoff, col = "red", lwd = 3)
    @@ -130,7 +131,7 @@ First, the samples in group A and B come from very different cell populations, s
    Second, since the samples in each group are technical replicates, the within group variance is very low.

    ```{r heatmap}
    heatmap(cor(cpm(data_clean, log = TRUE)))
    heatmap(cor(cpm_log))
    ```

    Now we start preparing the data for the the test of differential expression.
  15. jdblischak revised this gist Apr 21, 2015. 1 changed file with 4 additions and 83 deletions.
    87 changes: 4 additions & 83 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -38,7 +38,7 @@ Next, we need to download and install the two [Bioconductor][bioc] packages we w

    ```{r install-packages, eval = FALSE}
    source("http://bioconductor.org/biocLite.R")
    biocLite(c("edgeR", "DESeq2"))
    biocLite("edgeR")
    ```

    ## Introduction
    @@ -55,10 +55,9 @@ In this tutorial, we will start with a "Table of counts" and end with a "List of

    Unlike the continuous data that is generated by microarrays, RNA-seq generates counts.
    While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`).
    We will perform a simple analysis using two very popular Bioconductor packages for differential expression analysis, [edgeR][] and [DESeq2][].
    We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][].

    [edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
    [DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html

    The SEQC data set will we analyze contains two different groups for comparison.
    Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
    @@ -204,91 +203,13 @@ The blue lines represent a four-fold change in expression.

    [ma]: https://en.wikipedia.org/wiki/MA_plot

    ## DESeq2

    DESeq2 (`r citet(c(Love2014="10.1186/s13059-014-0550-8"))`) also models the gene counts as arising from a [negative binomial distribution][nb].

    ```{r load-DESeq2}
    library("DESeq2")
    dds <- DESeqDataSetFromMatrix(data_clean, as.data.frame(group), ~ group)
    dds
    ```

    For data exploration, the DESeq2 package provides a more sophisticated version of edgeR's `cpm` function which shrinks the dispersion for lowly expressed genes.
    The function is called `rlog`, for **r**egularlized **log** transformation.

    **Note:** `rlog` is a newer feature and may not be available for older versions of R.

    ```{r heatmap-DESeq2}
    rld <- rlog(dds)
    rlogMat <- assay(rld)
    colnames(rlogMat) <- group
    heatmap(cor(rlogMat))
    ```

    Similar to edgeR, DESeq2 assumes that most genes are not differentially expressed.
    Each sample is assigned a scaling factor which is computed by comparing the gene counts of the sample to the median gene counts of all the samples combined (`r citet(c(Anders2010="10.1186/gb-2010-11-10-r106"))`).

    ```{r normalize-DESeq2}
    dds <- estimateSizeFactors(dds)
    ```

    For the dispersion, DESeq2 fits gene-wise dispersion estimates and then shrinks them towards a trended dispersion estimated by sharing information across genes.
    We can see in the figure below that the dispersion estimates are shrunk towards the red trend line.

    ```{r model-DESeq2}
    dds <- estimateDispersions(dds)
    plotDispEsts(dds)
    ```

    For testing for differentially expressed genes, DESeq2 uses a negative binomial generalized linear model (GLM).

    **Note:** This GLM allows for more complicated experimental designs.

    ```{r test-DESeq2}
    dds <- nbinomWaldTest(dds)
    results_DESeq2 <- results(dds)
    results_DESeq2
    ```

    How many genes are differentially expressed at an FDR of 10%?

    ```{r count-de-DESeq2}
    sum(results_DESeq2@listData$padj < .1, na.rm = TRUE)
    plotMA(results_DESeq2)
    ```

    Similar to the edgeR results, many genes are differntially expressed.
    The red dots are differentially expressed genes.

    DESeq2 also provides a convenience function that will perform the normalization, modeling, and testing steps.

    ```{r quick-DESeq2}
    dds <- DESeq(dds)
    res <- results(dds)
    res
    ```

    ## Comparison

    How similar are the FDR values obtained with edgeR and DESeq2?

    ```{r compare-fdr}
    fdr_edgeR <- results_edgeR$table$FDR
    fdr_DESeq2 <- results_DESeq2@listData$padj
    plot(fdr_edgeR, fdr_DESeq2)
    abline(lm(fdr_DESeq2 ~ fdr_edgeR), col = "red")
    ```

    For a more rigorous comparison of these and other methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.

    ## To learn more

    Check out the manuals for [edgeR][erman] and [DESeq2][dsman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
    Check out the manual for [edgeR][erman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
    For a rigorous comparison RNA-seq methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.
    The [source code][gist] for this lesson is availabe as a GitHub Gist.

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [dsman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
    [archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
    [list]: http://www.bioconductor.org/help/mailing-list/
    [gist]: https://gist.github.com/jdblischak/11384914
  16. jdblischak revised this gist Apr 21, 2015. 1 changed file with 8 additions and 8 deletions.
    16 changes: 8 additions & 8 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,11 @@
    Differential expression analysis
    ========================================================
    ---
    title: Differential expression analysis
    author: John Blischak
    date: 2015-04-21
    output:
    html_document:
    toc: true
    ---

    ```{r settings, include = FALSE}
    library("knitcitations")
    @@ -8,12 +14,6 @@ opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
    cache = TRUE)
    ```

    John Blischak

    Genomics and Sytems Biology, University of Chicago

    Date: `r as.character(Sys.Date())`

    In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor.
    Before class, please download the data set and install the software as explained in the following section.

  17. jdblischak revised this gist Apr 21, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -28,8 +28,8 @@ And as a last resort, you can download it by clicking on this [link][geo].
    [geo]: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz

    ```{r download-data}
    fname <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz"
    download.file(fname, destfile = "GSE49712_HTSeq.txt.gz", method = "curl")
    fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz"
    download.file(fname, destfile = "GSE49712_HTSeq.txt.gz")
    ```

    Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis.
  18. jdblischak revised this gist Mar 10, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -206,7 +206,7 @@ The blue lines represent a four-fold change in expression.

    ## DESeq2

    DESeq2 (`r citet(c(Love2014="10.1101/002832"))`) also models the gene counts as arising from a [negative binomial distribution][nb].
    DESeq2 (`r citet(c(Love2014="10.1186/s13059-014-0550-8"))`) also models the gene counts as arising from a [negative binomial distribution][nb].

    ```{r load-DESeq2}
    library("DESeq2")
    @@ -296,7 +296,7 @@ The [source code][gist] for this lesson is availabe as a GitHub Gist.
    ## References

    ```{r refs, results = 'asis', echo = FALSE}
    bibliography("html")
    bibliography()
    ```

    ## Session information
  19. jdblischak revised this gist Apr 29, 2014. 1 changed file with 14 additions and 0 deletions.
    14 changes: 14 additions & 0 deletions README.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,14 @@
    This was a tutorial I presented for the class Genomics and Systems Biology at the University of Chicago on Tuesday, April 29, 2014.
    The students had been learning about study design, normalization, and statistical testing for genomic studies.
    This was meant to introduce them to how these ideas are implemented in practice.
    The specific example is a differential expression analysis with RNA-seq data for a two-class comparison.

    To render this lesson, you'll need to first install the R package knitr and the R/Bioconductor packages edgeR and DESeq2.
    You can then create the lesson by running the following from the R console:

    ```r
    library("knitr")
    knit2html("rnaseq-de-tutorial.Rmd", envir = new.env())
    ```

    One known issue is that if you do not have the latest version of DESeq2 because you have an older version of R, the function `rlog` may not be available.
  20. jdblischak revised this gist Apr 29, 2014. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -288,7 +288,7 @@ Check out the manuals for [edgeR][erman] and [DESeq2][dsman], search the Biocond
    The [source code][gist] for this lesson is availabe as a GitHub Gist.

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [deman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
    [dsman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
    [archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
    [list]: http://www.bioconductor.org/help/mailing-list/
    [gist]: https://gist.github.com/jdblischak/11384914
  21. jdblischak revised this gist Apr 29, 2014. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -1,7 +1,7 @@
    Differential expression analysis
    ========================================================

    ```{r include = FALSE}
    ```{r settings, include = FALSE}
    library("knitcitations")
    library("knitr")
    opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
    @@ -108,7 +108,7 @@ We should also remove genes that are unexpressed or very lowly expressed in the
    One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm).
    edgeR provides the function, `cpm`, to compute the counts-per-million.

    ```{r}
    ```{r expr-cutoff}
    median_log2_cpm <- apply(cpm(data_clean, log = TRUE), 1, median)
    hist(median_log2_cpm)
    expr_cutoff <- -1
  22. jdblischak revised this gist Apr 29, 2014. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -285,11 +285,13 @@ For a more rigorous comparison of these and other methods, see `r citep(c(Rapapo
    ## To learn more

    Check out the manuals for [edgeR][erman] and [DESeq2][dsman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].
    The [source code][gist] for this lesson is availabe as a GitHub Gist.

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [deman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
    [archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
    [list]: http://www.bioconductor.org/help/mailing-list/
    [gist]: https://gist.github.com/jdblischak/11384914

    ## References

  23. jdblischak revised this gist Apr 29, 2014. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -217,6 +217,8 @@ dds
    For data exploration, the DESeq2 package provides a more sophisticated version of edgeR's `cpm` function which shrinks the dispersion for lowly expressed genes.
    The function is called `rlog`, for **r**egularlized **log** transformation.

    **Note:** `rlog` is a newer feature and may not be available for older versions of R.

    ```{r heatmap-DESeq2}
    rld <- rlog(dds)
    rlogMat <- assay(rld)
  24. jdblischak revised this gist Apr 29, 2014. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -21,7 +21,7 @@ Before class, please download the data set and install the software as explained

    For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project.
    Please download the data by running the line of R code below.
    If this fails, you can try changing the `method` parameter, e.g. `method = "wget"` or `method = "curl"`.
    If this fails, you can try changing the `method` parameter, e.g. `method = "auto"`, `method = "wget"` or `method = "curl"`.
    And as a last resort, you can download it by clicking on this [link][geo].

    [SEQC]: http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/
  25. jdblischak revised this gist Apr 29, 2014. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -186,7 +186,7 @@ edgeR tests for differential expression between two classes using a modified ver
    ```{r test-edgeR}
    et <- exactTest(y)
    results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none")
    head(results_edgeR)
    head(results_edgeR$table)
    ```

    How many genes are differentially expressed at an FDR of 10%?
  26. jdblischak created this gist Apr 28, 2014.
    302 changes: 302 additions & 0 deletions rnaseq-de-tutorial.Rmd
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,302 @@
    Differential expression analysis
    ========================================================

    ```{r include = FALSE}
    library("knitcitations")
    library("knitr")
    opts_chunk$set(tidy = FALSE, fig.width = 8, fig.height = 8, fig.pos = "center",
    cache = TRUE)
    ```

    John Blischak

    Genomics and Sytems Biology, University of Chicago

    Date: `r as.character(Sys.Date())`

    In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor.
    Before class, please download the data set and install the software as explained in the following section.

    ## Download data and install software

    For this tutorial, we will use the data set generated by the Sequencing Quality Control ([SEQC][]) project.
    Please download the data by running the line of R code below.
    If this fails, you can try changing the `method` parameter, e.g. `method = "wget"` or `method = "curl"`.
    And as a last resort, you can download it by clicking on this [link][geo].

    [SEQC]: http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/
    [geo]: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz

    ```{r download-data}
    fname <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz"
    download.file(fname, destfile = "GSE49712_HTSeq.txt.gz", method = "curl")
    ```

    Next, we need to download and install the two [Bioconductor][bioc] packages we will use to perform the analysis.

    [bioc]: http://www.bioconductor.org

    ```{r install-packages, eval = FALSE}
    source("http://bioconductor.org/biocLite.R")
    biocLite(c("edgeR", "DESeq2"))
    ```

    ## Introduction

    As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from `r citet(c(Oshlack2010="10.1038/nrg2484"))`).
    The number of reads that map to a gene is the proxy for its expression level.

    ![RNA-seq experiment](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949280/bin/nihms229948f1.jpg)

    There are many steps between receiving the raw reads from the sequencer and gaining biological insight.
    In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes", as diagrammed in the RNA-seq analysis pipeline below (from `r citet(c(Wang2009="10.1186/gb-2010-11-12-220"))`).

    ![RNA-seq analysis pipeline](http://genomebiology.com/content/figures/gb-2010-11-12-220-1.jpg)

    Unlike the continuous data that is generated by microarrays, RNA-seq generates counts.
    While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (`r citet(c(Marioni2008="10.1101/gr.079558.108"))`).
    We will perform a simple analysis using two very popular Bioconductor packages for differential expression analysis, [edgeR][] and [DESeq2][].

    [edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
    [DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html

    The SEQC data set will we analyze contains two different groups for comparison.
    Group A is five technical replicates of the [Stratagene Universal Human Reference RNA][a].
    It is an equal mixture of RNA from ten different human cell lines.
    Group B is five techincal replicates of [Ambion’s Human Brain Reference RNA][b].
    It is RNA that was pooled from several donors from several different regions of the brain.

    [a]: http://www.chem.agilent.com/Library/usermanuals/Public/740000.pdf
    [b]: http://www.lifetechnologies.com/order/catalog/product/AM6050

    ## edgeR

    edgeR (`r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))`) models the count data with a [negative binomial model][nb].

    [nb]: http://en.wikipedia.org/wiki/Negative_binomial_distribution

    First, we need to load the package into R.

    ```{r load-package}
    library("edgeR")
    ```

    Second, we need to load the data into R.
    In order to use the code below, we need to ensure the data file is in R's working directory.
    We can check our current working directory with the command `getwd()`, and we can change it using `setwd("path/to/file")` or using RStudio (type Ctrl-Shift-K).

    ```{r import-data}
    data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)
    ```

    Let's quickly explore the data.
    Each column corresponds to a sample, and each row corresponds to a gene.

    ```{r view}
    dim(data_raw)
    head(data_raw)
    tail(data_raw)
    ```

    Notice that the last five lines contain summary statistics and thus need to be removed prior to testing.

    ```{r remove-sum-lines}
    data_clean <- data_raw[1:(nrow(data_raw) - 5), ]
    ```

    We should also remove genes that are unexpressed or very lowly expressed in the samples.
    One simple method to do this is to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm).
    edgeR provides the function, `cpm`, to compute the counts-per-million.

    ```{r}
    median_log2_cpm <- apply(cpm(data_clean, log = TRUE), 1, median)
    hist(median_log2_cpm)
    expr_cutoff <- -1
    abline(v = expr_cutoff, col = "red", lwd = 3)
    sum(median_log2_cpm > expr_cutoff)
    data_clean <- data_clean[median_log2_cpm > expr_cutoff, ]
    ```

    After removing all genes with a median log2 cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining.
    A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes.
    In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed.

    **Question:**
    What are some ways that we could improve this simple cutoff?
    What information are we ignoring?

    We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix.
    The heatmap result corresponds to what we know about the data set.
    First, the samples in group A and B come from very different cell populations, so the two groups are very different.
    Second, since the samples in each group are technical replicates, the within group variance is very low.

    ```{r heatmap}
    heatmap(cor(cpm(data_clean, log = TRUE)))
    ```

    Now we start preparing the data for the the test of differential expression.
    We create a vector called, `group`, that labels each of the columns as belonging to group A or B.
    We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differntial expression experiment.

    ```{r make-groups-edgeR}
    group <- substr(colnames(data_clean), 1, 1)
    group
    y <- DGEList(counts = data_clean, group = group)
    y
    ```


    edgeR normalizes the genes counts using the method TMM (trimmed means of m values) developed by `r citet(c(Robinson2010="10.1186/gb-2010-11-3-r25"))`.
    Recall from lecture that the read counts for moderately to lowly expressed genes can be strongly influenced by small fluctuations in the expression level of highly expressed genes.
    In other words, small differences in expression of highly expressed genes between samples can give the appearance that many lower expressed genes are differentially expressed between conditions.
    TMM adjusts for this by removing the extremely lowly and highly expressed genes and also those genes that are very different across samples.
    It then compares the total counts for this subset of genes between the two samples to get the scaling factor (this is a simplification).
    Similar to normalization methods for microarray data, this method assumes the majority of genes are not differentially expressed between any two samples.

    ```{r normalize-edgeR}
    y <- calcNormFactors(y)
    y$samples
    ```

    The next step is to model the variance of the read counts per gene.
    A natural method for modeling gene counts is the Poisson distribution.
    However, the Poisson assumes the mean and variance are identical, but it has been found emipirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed "overdispersion").
    So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process.
    edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the "biolocial coefficient of variation."
    Though it should be mentioned that any technical biases are also included in this estimate.
    edgeR shares information across genes to determine a common dispersion.
    It then calculates a dispersion esitmate per gene and shrinks it towards the common dispersion.
    The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression.

    **Note:** edgeR also has a trended dispersion to use in place of the common dispersion for more complicated experimental designs.

    ```{r model-edgeR}
    y <- estimateCommonDisp(y, verbose = TRUE)
    y <- estimateTagwiseDisp(y)
    sqrt(y$common.dispersion) # biological coefficient of variation
    plotBCV(y)
    ```

    The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.

    edgeR tests for differential expression between two classes using a modified version of the Fisher's Exact Test.

    **Note:** It also provides a generalized linear model frameworks for more complicated experimental designs.

    ```{r test-edgeR}
    et <- exactTest(y)
    results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none")
    head(results_edgeR)
    ```

    How many genes are differentially expressed at an FDR of 10%?

    ```{r count-de-edgeR}
    sum(results_edgeR$table$FDR < .1)
    plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .01])
    abline(h = c(-2, 2), col = "blue")
    ```

    As expected from the description of the samples and the heatmap, there are many differentially expressed genes.
    The [MA plot][ma] above plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis.
    The red dots are genes with an FDR less than 10%.
    The blue lines represent a four-fold change in expression.

    [ma]: https://en.wikipedia.org/wiki/MA_plot

    ## DESeq2

    DESeq2 (`r citet(c(Love2014="10.1101/002832"))`) also models the gene counts as arising from a [negative binomial distribution][nb].

    ```{r load-DESeq2}
    library("DESeq2")
    dds <- DESeqDataSetFromMatrix(data_clean, as.data.frame(group), ~ group)
    dds
    ```

    For data exploration, the DESeq2 package provides a more sophisticated version of edgeR's `cpm` function which shrinks the dispersion for lowly expressed genes.
    The function is called `rlog`, for **r**egularlized **log** transformation.

    ```{r heatmap-DESeq2}
    rld <- rlog(dds)
    rlogMat <- assay(rld)
    colnames(rlogMat) <- group
    heatmap(cor(rlogMat))
    ```

    Similar to edgeR, DESeq2 assumes that most genes are not differentially expressed.
    Each sample is assigned a scaling factor which is computed by comparing the gene counts of the sample to the median gene counts of all the samples combined (`r citet(c(Anders2010="10.1186/gb-2010-11-10-r106"))`).

    ```{r normalize-DESeq2}
    dds <- estimateSizeFactors(dds)
    ```

    For the dispersion, DESeq2 fits gene-wise dispersion estimates and then shrinks them towards a trended dispersion estimated by sharing information across genes.
    We can see in the figure below that the dispersion estimates are shrunk towards the red trend line.

    ```{r model-DESeq2}
    dds <- estimateDispersions(dds)
    plotDispEsts(dds)
    ```

    For testing for differentially expressed genes, DESeq2 uses a negative binomial generalized linear model (GLM).

    **Note:** This GLM allows for more complicated experimental designs.

    ```{r test-DESeq2}
    dds <- nbinomWaldTest(dds)
    results_DESeq2 <- results(dds)
    results_DESeq2
    ```

    How many genes are differentially expressed at an FDR of 10%?

    ```{r count-de-DESeq2}
    sum(results_DESeq2@listData$padj < .1, na.rm = TRUE)
    plotMA(results_DESeq2)
    ```

    Similar to the edgeR results, many genes are differntially expressed.
    The red dots are differentially expressed genes.

    DESeq2 also provides a convenience function that will perform the normalization, modeling, and testing steps.

    ```{r quick-DESeq2}
    dds <- DESeq(dds)
    res <- results(dds)
    res
    ```

    ## Comparison

    How similar are the FDR values obtained with edgeR and DESeq2?

    ```{r compare-fdr}
    fdr_edgeR <- results_edgeR$table$FDR
    fdr_DESeq2 <- results_DESeq2@listData$padj
    plot(fdr_edgeR, fdr_DESeq2)
    abline(lm(fdr_DESeq2 ~ fdr_edgeR), col = "red")
    ```

    For a more rigorous comparison of these and other methods, see `r citep(c(Rapaport2013="10.1186/gb-2013-14-9-r95", Soneson2013="10.1186/1471-2105-14-91"))`.

    ## To learn more

    Check out the manuals for [edgeR][erman] and [DESeq2][dsman], search the Bioconductor mailing list [archives][], and consider joining the Bioconductor mailing [list][].

    [erman]: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
    [deman]: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
    [archives]: http://dir.gmane.org/gmane.science.biology.informatics.conductor
    [list]: http://www.bioconductor.org/help/mailing-list/

    ## References

    ```{r refs, results = 'asis', echo = FALSE}
    bibliography("html")
    ```

    ## Session information

    ```{r info}
    sessionInfo()
    ```