Skip to content

Instantly share code, notes, and snippets.

@PeteHaitch
Last active August 11, 2017 13:12
Show Gist options
  • Save PeteHaitch/9c973f2b05908f5a065bbaa3de0bbe35 to your computer and use it in GitHub Desktop.
Save PeteHaitch/9c973f2b05908f5a065bbaa3de0bbe35 to your computer and use it in GitHub Desktop.
Compare strategies for computing colCounts (number of elements per column equal to a threshold) for a sparse matrix
library(Matrix)
library(pryr)
library(profmem)
library(matrixStats)
library(microbenchmark)
nrow <- 20000
ncol <- 600
threshold <- 0
matrix <- matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol)
zero_idx <- sample(length(matrix), 0.6 * length(matrix))
matrix[zero_idx] <- 0
# 60% zero elements
sparse_matrix <- Matrix::Matrix(matrix, sparse = TRUE)
all.equal(matrix, as.matrix(sparse_matrix), check.attributes = FALSE)
#> [1] TRUE
object_size(matrix)
#> 96 MB
object_size(sparse_matrix)
#> 57.6 MB
# Measure total memory allocations (not peak memory usage)
# (1) Just computing colSums of a matrix/Matrix
total(profmem(colSums(matrix)))
#> [1] 4840
total(profmem(colSums(sparse_matrix)))
#> [1] 4840
# (2) Just finding elements of matrix equal to threshold
total(profmem(matrix == threshold))
#> [1] 48000040
total(profmem(sparse_matrix == threshold))
#> [1] 163537608
# (3) Computing number of elements in each column equal to threshold;
# conceptually (1) + (2)
# (3) == (1) + (2)
total(profmem(colSums(matrix == threshold)))
#> [1] 48004880
# (3) > (1) + (2)
# There's some additional coercion from logical to numeric (I think)
total(profmem(colSums(sparse_matrix == threshold)))
#> [1] 259260616
# With a matrix we can use the very efficient matrixStats::colCounts(). No matrix
# allocation is performed (i.e. (2) is implicit)
total(profmem(matrixStats::colCounts(matrix, value = threshold)))
#> [1] 37264
# Unfortunately, this doesn't work for Matrix (don't try this, it 'runs' but
# with unininted behaviour).
# In this example, coercing the Matrix to an ordinary matrix and then using
# matrixStats::colCounts() is more efficient (in terms of total
# memory allocation) and comparabile in terms of speed to the original approach
all.equal(colSums(sparse_matrix == threshold),
matrixStats::colCounts(as.matrix(sparse_matrix),
value = threshold))
#> [1] TRUE
total(profmem(colSums(sparse_matrix == threshold)))
#> [1] 259206240
total(profmem(matrixStats::colCounts(as.matrix(sparse_matrix),
value = threshold)))
#> [1] 96002480
microbenchmark(colSums(sparse_matrix == threshold),
matrixStats::colCounts(as.matrix(sparse_matrix),
value = threshold),
times = 100)
#> Unit: milliseconds
#> expr
#> colSums(sparse_matrix == threshold)
#> matrixStats::colCounts(as.matrix(sparse_matrix), value = threshold)
#> min lq mean median uq max neval
#> 153.9220 226.9063 270.3231 252.1848 289.6850 610.2432 100
#> 161.3339 202.5328 252.2943 230.5502 272.9558 583.8499 100
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment