Created
September 9, 2015 22:54
R preso for WARG
```{r, echo = FALSE}
library(knitr)
library(plyr)
library(data.table)
library(dplyr)
gears <- mtcars$gear
mtcars <- mtcars[,1:6]
mtcars$gear <- gears
```

Welcome to dplyr
========================================================
author: Tommy M O'Dell ([email protected])
date: September 10th, 2015
transition: rotate
transition-speed: fast
width: 1440
height: 900

dplyr?
========================================================

Frictionless manipulation of data frames in R

***

Package goals:

1. simple interface
2. good performance
3. same interface for many 'backends'

<!-- Notes:
- the ordering of the points is relevant (Hadley will trade off performance for a clean interface)
- familiar with plyr? Think of dplyr as plyr specialised for data frames
-->

Data manipulation in base R
========================================================
type: section

Filter the rows of a data frame
========================================================

Keep rows where mpg > 30 and cyl >= 4

```{r}
mtcars[mtcars$mpg > 30 & mtcars$cyl >= 4,]
```

Add or modify columns
========================================================

Create a new column for displacement per cylinder

```{r}
mtcars$disp_cyl <- mtcars$disp / mtcars$cyl
head(mtcars)
```

Sort a data frame by its values
========================================================

```{r}
mt <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(mt)
```

Select columns from a data frame
========================================================

Method 1 - Numerical indices

```{r}
head(mtcars[,1:4])
```

Select columns from a data frame
========================================================
title: false

Method 2 - Named indices

```{r}
head(mtcars[,c('mpg','cyl','disp','gear')])
```

Summarise (aggregate) a data frame
========================================================

```{r}
aggregate(mpg ~ cyl + gear, data = mtcars, mean)
```

What about CRAN?
========================================================
type: section

========================================================

Using `ddply` from **plyr** to get the mean mpg per cylinder and gear

```{r}
ddply(
  mtcars,          # data frame
  .(cyl, gear),    # grouping columns
  summarise,       # function applied to each group
  mpg = mean(mpg)  # aggregation to compute
)
```

========================================================

Using `data.table` to get the mean mpg per cylinder and gear

```{r}
dtcars <- as.data.table(mtcars)
dtcars[, mean(mpg), by = list(cyl,gear)]
```

If it ain't broke?
========================================================
type: section

Are any of these both **readable** and **fast**?

The dplyr promise
========================================================

99% of data manipulation can be described with 6 key operations ('verbs')

1. **filter**: filter the rows of a data frame
2. **mutate**: modify or create new columns
3. **group by**: set grouping variables
4. **summarise**: aggregate a data frame
5. **arrange**: sort the rows of a data frame
6. **select**: select a set of columns

```{r, echo = FALSE}
mtcars <- tbl_df(mtcars)
```

filter
========================================================

```{r}
filter(mtcars, mpg > 30, cyl >= 4)
```

mutate
========================================================

(modify or create columns)

Same as the earlier base R example

```{r}
mutate(mtcars, disp_cyl = disp/cyl)
```

========================================================

Multiple columns in one

```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  kw = hp * 0.746  # 1 hp is roughly 0.746 kW
)
```

========================================================

Can even refer to newly created columns immediately...

```{r}
mutate(
  mtcars,
  disp_cyl = disp/cyl,
  k_watt = hp * 0.746,  # 1 hp is roughly 0.746 kW
  watts = k_watt*1000
)
```
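
For contrast, base R's `transform()` evaluates every new column against the original data frame, so it cannot see a column created in the same call. A minimal sketch, assuming no object called `k_watt` exists in your workspace (`res` is just an illustrative name):

```{r}
library(dplyr)

# Base R: transform() cannot see k_watt while computing watts,
# so we catch the error and keep its message instead of stopping.
res <- tryCatch(
  transform(mtcars, k_watt = hp * 0.746, watts = k_watt * 1000),
  error = function(e) conditionMessage(e)
)
res

# dplyr: mutate() builds columns in order, so watts can use k_watt.
head(mutate(mtcars, k_watt = hp * 0.746, watts = k_watt * 1000))
```
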
Group by and Summarise
========================================================

```{r}
mtcars <- group_by(mtcars, cyl, gear)
mtcars
```

========================================================

```{r}
summarise(mtcars, mpg = mean(mpg)) # uses the grouping set previously
```

========================================================

Multiple aggregations in one

```{r}
summarise(
  mtcars,
  mpg = mean(mpg),
  hp = mean(hp)
)
```
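
One thing to keep in mind with grouped data: each call to `summarise()` drops the last grouping variable from its result. A small sketch (the names `by_cyl_gear` and `avg_mpg` are only for illustration, and `group_vars()` assumes a reasonably recent dplyr):

```{r}
library(dplyr)

by_cyl_gear <- group_by(mtcars, cyl, gear)
avg_mpg <- summarise(by_cyl_gear, mpg = mean(mpg))

group_vars(by_cyl_gear)  # grouping before summarise()
group_vars(avg_mpg)      # grouping left over after summarise()
```

Summarising `avg_mpg` again would therefore give one row per `cyl`; `ungroup()` clears the grouping entirely.
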
arrange
========================================================

```{r}
arrange(mtcars, mpg, cyl)
```

========================================================

```{r}
arrange(mtcars, mpg, desc(cyl)) # descending!
```

select
========================================================

```{r, echo = FALSE}
mtcars <- ungroup(mtcars)
```

```{r}
select(mtcars, 1:4)
```

========================================================

```{r}
select(mtcars, mpg, cyl, disp, gear)
```

========================================================

```{r}
select(mtcars, mpg:hp)
```

========================================================

```{r}
select(mtcars, contains('a'))
```

========================================================

```{r}
select(mtcars, starts_with('d'))
```

========================================================

```{r}
select(mtcars, -starts_with('d'))
```

Putting it all together
========================================================
type: section

========================================================

Let's say during our exploratory analysis we want to create a new column, filter on that new column,
then get the mean and minimum of that new column for each cylinder and gear combination

```{r}
mt <- mutate(mtcars, disp_cyl = disp/cyl)
mt <- filter(mt, disp_cyl > 30, mpg > 23)
mt <- group_by(mt, cyl, gear)
```

```{r}
summarise(mt, avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
```

That's a lot of repetition, and we now have an extra variable **mt** sitting around taking up space...

========================================================

If we just want to print the answer without intermediary variables...

```{r}
summarise(group_by(filter(mutate(mtcars, disp_cyl = disp/cyl), disp_cyl > 30, mpg > 23), cyl, gear), avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
```

(Say what???!)

========================================================

We can make that a bit easier to read... but it's still not great

```{r}
summarise(
  group_by(
    filter(
      mutate(mtcars, disp_cyl = disp/cyl),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```

(Butt ugly!)

Ceci n'est pas une pipe
=======================================================
type: section

=======================================================

Our last example was barely readable. What can we do? **Pipes** to the rescue!

* Introduced through the **magrittr** package and **dplyr** package around the same time
* Inspired by Unix pipes and F# pipes

A pipe (`%>%`) takes the left-hand side and passes it to the right-hand side as the first argument.
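
A minimal sketch of that rule, using `head()` purely as a stand-in function (`nested` and `piped` are illustrative names): the two calls below should build the same data frame.

```{r}
library(dplyr)  # dplyr re-exports %>% from magrittr

nested <- head(mtcars, 3)     # ordinary call: data is the first argument
piped  <- mtcars %>% head(3)  # piped call: the left-hand side fills that slot

identical(nested, piped)
```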

=======================================================

```{r}
summarise(
  group_by(
    filter(
      mutate(
        mtcars,
        disp_cyl = disp/cyl
      ),
      disp_cyl > 30,
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
)
```

***

```{r}
mtcars %>%
  mutate(disp_cyl = disp/cyl) %>%
  filter(disp_cyl>30, mpg>23) %>%
  group_by(cyl, gear) %>%
  summarise(
    avg_d_cyl = mean(disp_cyl),
    min_d_cyl = min(disp_cyl)
  )
```

That's not where it ends...
==========================
type: section

==========================

Let's load up a bigger data set

```{r}
library(hflights) # provides the hflights data set
tbl_df(hflights)
```

==========================

```{r, eval = FALSE}
hflights %>%
  mutate(ArrEarly = ArrDelay < 0) %>%
  filter(DepDelay > 60, Distance > 200) %>%
  mutate()
```

***

```{r}
```

What is this black magic??
==========================
type: section

Questions?
========================================================
type: prompt