datalove · September 9, 2015 22:54
diff --git a/preso.R b/preso.R
 ```{r, echo = FALSE}
 library(knitr)
 library(plyr)
 library(data.table)
 library(dplyr)

 gears <- mtcars$gear 
 mtcars <- mtcars[,1:6]
 mtcars$gear <- gears

 ```
 Welcome to dplyr
 ========================================================
 author: Tommy M O'Dell
 date: September 10th, 2015
 transition: rotate
 transition-speed: fast
 width: 1440
 height: 900
 
 dplyr?
 ========================================================
 Friction-less manipulation of data frames in R
 ***
 Package goals:
 
 1. simple interface
 2. good performance
 3. same interface for many 'backends'

 <!-- Notes:
 - the ordering of the points is relevant (Hadley will trade off performance for a clean interface)
 - familiar with plyr? Think of dplyr as plyr specialised for data frames
 -->
 
 Data manipulation in base R
 ========================================================
 type: section

 
 Filter the rows of a data frame
 ========================================================
 
 Keep rows where mpg > 30 and cyl >= 4
 ```{r}
 mtcars[mtcars$mpg > 30 & mtcars$cyl >= 4,]
 ```
 
 Add or modify columns
 ========================================================
 
 Create a new column for displacement per cylinder
 ```{r}
 mtcars$disp_cyl <- mtcars$disp / mtcars$cyl
 head(mtcars)
 ```
 
 Sort a data frame by its values
 ========================================================
 
 ```{r}
 mt <- mtcars[order(mtcars$mpg, mtcars$cyl),]
 head(mt)
 ```
 
 Select columns from a data frame
 ========================================================
 
 Method 1 - Numerical indices
 ```{r}
 head(mtcars[,1:4])
 ```

 Select columns from a data frame
 ========================================================
 title:false
 
 Method 2 - Named indices
 ```{r}
 head(mtcars[,c('mpg','cyl','disp','gear')])
 ```

 Summarise (aggregate) a data frame
 ========================================================

 ```{r}
 aggregate(mpg ~ cyl + gear, data = mtcars, mean)
 ``` 

 What about CRAN?
 ========================================================
 type: section


 ========================================================

 Using `ddply` from **plyr** to get the mean mpg per cylinder and gear
 ```{r}
 ddply(
  mtcars,         # data frame
  .(cyl, gear),   # grouping columns 
  summarise,      # type of
  mpg = mean(mpg) # aggregations
 )
 ``` 

 ========================================================

 Using `data.table` to get the mean mpg per cylinder and gear
 ```{r}
 dtcars <- as.data.table(mtcars)
 dtcars[, mean(mpg), by = list(cyl,gear)]
 ``` 

 If it ain't broke?
 ========================================================
 type: section

 Are any of these both **readable** and **fast**?


 
 The dplyr promise
 ========================================================
 99% of data manipulation can be described with 6 key operations ('verbs')

 1. **filter**: filter the rows of a data frame
 2. **mutate**: modify or create new columns
 3. **group by**: set grouping variables
 4. **summarise**: aggregate a data frame
 5. **arrange**: sort the rows of a data frame
 6. **select**: select a set of columns

 ```{r, echo = FALSE}
 mtcars <- tbl_df(mtcars)
 ```


 filter
 ========================================================
 ```{r}
 filter(mtcars, mpg > 30, cyl >= 4)
 ```

 mutate 
 ========================================================
 (modify or create columns)

 Same calculation as the base R example (displacement per cylinder)
 ```{r}
 mutate(mtcars, disp_cyl = disp/cyl)
 ```

 ========================================================
 Multiple columns in one
 ```{r}
 mutate(
  mtcars, 
  disp_cyl = disp/cyl,
  kw = hp * 0.746 # 1 hp is roughly 0.746 kW
 )
 ```

 ========================================================
 Can even refer to newly created columns immediately...
 ```{r}
 mutate(
  mtcars, 
  disp_cyl = disp/cyl,
  k_watt = hp * 0.746, # 1 hp is roughly 0.746 kW
  watts  = k_watt*1000
 )
 ```

 Group by and Summarise
 ========================================================
 ```{r}
 mtcars <- group_by(mtcars, cyl, gear)
 mtcars
 ```

 ========================================================
 ```{r}
 summarise(mtcars, mpg = mean(mpg)) # uses the grouping set previously
 ```

 ========================================================
 Multiple aggregations in one
 ```{r}
 summarise(
  mtcars, 
  mpg = mean(mpg), 
  hp = mean(hp)
 )
 ```

 arrange 
 ========================================================
 ```{r}
 arrange(mtcars, mpg, cyl)
 ```

 ========================================================
 ```{r}
 arrange(mtcars, mpg, desc(cyl)) # descending!
 ```

 select
 ========================================================
 ```{r, echo = FALSE}
 mtcars <- ungroup(mtcars)
 ```

 ```{r}
 select(mtcars, 1:4)
 ```

 ========================================================
 ```{r}
 select(mtcars, mpg, cyl, disp, gear)
 ```

 ========================================================
 ```{r}
 select(mtcars, mpg:hp)
 ```

 ========================================================
 ```{r}
 select(mtcars, contains('a'))
 ```

 ========================================================
 ```{r}
 select(mtcars, starts_with('d'))
 ```

 ========================================================
 ```{r}
 select(mtcars, -starts_with('d'))
 ```

 Putting it all together
 ========================================================
 type: section


 ========================================================
 Let's say that during our exploratory analysis we want to create a new column, filter on that new column,
 then get the mean of that new column for each cylinder and gear

 ```{r}
 mt <- mutate(mtcars, disp_cyl = disp/cyl)
 mt <- filter(mt, disp_cyl > 30, mpg > 23)
 mt <- group_by(mt, cyl, gear)
 ```

 ```{r}
 summarise(mt, avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))

 ```


 That's a lot of repetition, and we now have an extra variable **mt** sitting around taking up space...

 ========================================================

 If we just want to print the answer without intermediary variables...
 ```{r}
 summarise(group_by(filter(mutate(mtcars, disp_cyl = disp/cyl), disp_cyl > 30,  mpg > 23), cyl, gear), avg_d_cyl = mean(disp_cyl), min_d_cyl = min(disp_cyl))
 ```
 (Say what???!)

 ========================================================

 We can make that a bit easier to read... but not great
 ```{r}
 summarise(
  group_by(
    filter(
      mutate(mtcars, disp_cyl = disp/cyl),
      disp_cyl > 30, 
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
 )
 ```
 (Butt ugly!)


 Ceci n'est pas une pipe
 =======================================================
 type: section


 =======================================================
 Our last example was barely readable. What can we do? **Pipes** to the rescue!

  * Introduced to R by the **magrittr** and **dplyr** packages at around the same time
  * Inspired by Unix pipes and F# pipes

 A pipe ('`%>%`') takes the left-hand side and passes it to the right-hand side as the first argument.
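
 To make the "first argument" rule concrete, here is a minimal sketch using only **magrittr** and base R functions. The vector `x` and the names `piped`, `nested` and `rounded` are made up for illustration and are not part of the deck's data.

 ```{r}
 library(magrittr)

 x <- c(1, 4, 9)  # illustrative data, not from the deck

 # x %>% f(y) is evaluated as f(x, y)
 piped  <- x %>% sqrt() %>% sum()
 nested <- sum(sqrt(x))

 identical(piped, nested)  # both spellings compute the same value

 # extra arguments go after the piped-in first argument:
 # this is evaluated as round(pi, 2)
 rounded <- pi %>% round(2)
 rounded
 ```

 Reading left to right, each step receives the result of the step before it, which is why the piped version follows the order in which the operations actually happen.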

 =======================================================
 ```{r}
 summarise(
  group_by(
    filter(
      mutate(
        mtcars, 
        disp_cyl = disp/cyl
      ),
      disp_cyl > 30, 
      mpg > 23
    ),
    cyl,
    gear
  ),
  avg_d_cyl = mean(disp_cyl),
  min_d_cyl = min(disp_cyl)
 )
 ```
 ***
 ```{r}
 mtcars %>% 
  mutate(disp_cyl = disp/cyl) %>% 
  filter(disp_cyl>30, mpg>23) %>% 
  group_by(cyl, gear) %>% 
  summarise(
    avg_d_cyl = mean(disp_cyl),
    min_d_cyl = min(disp_cyl)
  )
 ```

 That's not where it ends...
 ==========================
 type: section


 ==========================
 Let's load up a bigger data set
 ```{r}
 library(hflights) # to load the flights data set
 tbl_df(hflights)
 ```


 ==========================
 ```{r, eval = FALSE}
 hflights %>% 
  mutate(ArrEarly = ArrDelay < 0) %>% 
  filter(DepDelay > 60, Distance > 200) %>% 
  mutate()

 ```
 ***
 ```{r}

 ```
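
 The pipeline on this slide is left unfinished. One way it might continue is sketched below; this is an assumption for illustration, not the author's intended ending. The grouping by `UniqueCarrier` and the names `by_carrier`, `n_flights` and `pct_early` are hypothetical choices, and the sketch assumes the **hflights** package is installed.

 ```{r}
 library(dplyr)
 library(hflights)

 by_carrier <- hflights %>%
   filter(DepDelay > 60, Distance > 200) %>%   # badly delayed, longer flights
   mutate(ArrEarly = ArrDelay < 0) %>%         # TRUE when the flight still arrived early
   group_by(UniqueCarrier) %>%                 # hypothetical grouping choice
   summarise(
     n_flights = n(),                          # flights per carrier after filtering
     pct_early = mean(ArrEarly, na.rm = TRUE)  # share that arrived early, ignoring NA delays
   ) %>%
   arrange(desc(n_flights))                    # busiest carriers first

 by_carrier
 ```

 The same six verbs used on `mtcars` carry over unchanged; only the column names differ.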




 What is this black magic??
 ==========================
 type: section

 Questions?
 ========================================================
 type: prompt