# data from http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat
# Originally seen at http://spatial.ly/2014/08/population-lines/
# So, this blew up on both Reddit and Twitter. Two bugs fixed (southern Spain was a mess,
# and some countries were missing -- measure twice, submit once, damnit), and two silly superfluous lines removed after
# @hadleywickham pointed that out. Also, switched from geom_segment to geom_line.
# The result of the code below can be seen at http://imgur.com/ob8c8ph
library(tidyverse)
read_csv('../data/geostat-2011/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>%
  # Append the second grid file, adding an empty TOT_P_CON_DT so the columns line up for rbind
  rbind(read_csv('../data/geostat-2011/JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>%
          mutate(TOT_P_CON_DT='')) %>%
  # Pull the N and E/W numbers out of GRD_ID and scale them by 1/100
  mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100,
         lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) * ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>%
  filter(lng > 25, lng < 60) %>%
  # Sum population within cells rounded to one decimal
  group_by(lat=round(lat, 1), lng=round(lng, 1)) %>%
  summarize(value = sum(TOT_P, na.rm=TRUE)) %>%
  ungroup() %>%
  # Fill in missing lat/lng combinations with NA
  complete(lat, lng) %>%
  # One line per lat, raised by the population relative to the maximum
  ggplot(aes(lng, lat + 5*(value/max(value, na.rm=TRUE)))) +
  geom_line(size=0.4, alpha=0.8, color='#5A3E37', aes(group=lat), na.rm=TRUE) +
  ggthemes::theme_map() +
  coord_equal(0.9)
ggsave('/tmp/europe.png', width=10, height=10)
it appears ggthemes is not loaded via tidyverse; one has to load it, correct?
IIRC, yes. Prefixing with ggthemes:: is probably in my fingers for a reason (which by all means may be that I simply started doing it).
I've never seen such a continuous usage of %>% before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)?
Oh, no, on the contrary. I write 10-30 line %>% flows for most of my analyses. Getting into the pipe way of thinking is super convenient. My hurdle was getting over the vectorization mindset (which is kinda' orthogonal to this anyways). From a naive standpoint, %>% simply replaces intermediate variables or nested functions. If you ever catch yourself doing either, %>% is quite likely a better choice.
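A minimal sketch of that point, using made-up numbers rather than the population data: the same computation written with nested calls, with intermediate variables, and with %>%. It assumes the magrittr package, which supplies %>% (library(tidyverse) makes it available too).

```r
library(magrittr)  # supplies %>%

x <- c(1, 4, 9, 16)  # toy input, not from the gist

# Nested calls: has to be read inside-out
nested <- round(sum(sqrt(x)), 1)

# Intermediate variables: readable, but leaves throwaway names around
roots <- sqrt(x)
total <- sum(roots)
stepwise <- round(total, 1)

# Pipe: same steps, read top to bottom, no throwaway names
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)
```

All three variables hold the same value; only the way the steps are written differs.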
I use group_by() quite a bit and have never passed it some var = fun(var) argument before. I take it you're grouping by rounded lat and lon to sort of "cluster" your summed populations (some set of lat/lon combination will have the same summed population since they were in the same group, but they retain their individual values for plotting)?
group_by(var = fun(x)) creates a new variable within the data frame named var with fun(x) as its value. It's a convenience over mutate(var=fun(x)) %>% group_by(var).
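To make the equivalence concrete, here is a small sketch on a made-up tibble (the pts data and its pop column are hypothetical, not the GEOSTAT grid). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: three points, two of which round to the same lat
pts <- tibble(lat = c(10.11, 10.14, 10.26), pop = c(5, 7, 11))

# Shorthand: create the rounded variable inside group_by()
a <- pts %>%
  group_by(lat = round(lat, 1)) %>%
  summarize(value = sum(pop))

# Long form: mutate first, then group by the new variable
b <- pts %>%
  mutate(lat = round(lat, 1)) %>%
  group_by(lat) %>%
  summarize(value = sum(pop))

# TRUE when both forms produce the same summary
same <- identical(as.data.frame(a), as.data.frame(b))
```

Note that in both forms the original lat values are overwritten by the rounded ones, so after summarize() there is one row per rounded value.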
Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse and keep not getting around to it. I learned the separate() function as a result of this, and I'm at least more familiar with filter, ungroup, complete, and select, so thanks for posting the code; it was an indirect motivation for me!
http://r4ds.had.co.nz/ . Buy it, read it, practice. Tidyverse is a gift from heaven.
May I ask the meaning of the strings, like '1kmN2689E4337'? Can these represent real geographic coordinates?
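As a rough illustration of what the script does with such a string, the sketch below applies the gist's own two regular expressions to that one ID. It only shows what gets extracted; the division by 100 is the script's scaling, and my understanding (an assumption, not something stated in this thread) is that the numbers are kilometre offsets in the grid's projected coordinate system rather than degrees of latitude and longitude.

```r
id <- '1kmN2689E4337'  # the example ID from the question

# Digits between 'N' and the following 'E' or 'W'
north <- as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', id))

# Digits after the 'E' or 'W'
east <- as.numeric(gsub('.*[EW]([0-9]+)', '\\1', id))

# The script then divides both by 100 and calls them lat and lng
lat <- north / 100
lng <- east / 100
```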
Neat! I've been playing around with this tonight and it's quite interesting. Some questions:
- It appears ggthemes is not loaded via tidyverse; one has to load it, correct?
- I've never seen such a continuous usage of %>% before! Is the format you used primarily to reproduce as a one-liner? If so, am I correct to believe one's workflow wouldn't typically do this until the final code was known (otherwise you repeat the reading in of data every time you tweak the plot)?
- I use group_by() quite a bit and have never passed it some var = fun(var) argument before. I take it you're grouping by rounded lat and lon to sort of "cluster" your summed populations (some set of lat/lon combinations will have the same summed population since they were in the same group, but they retain their individual values for plotting)?
After investigating the data itself, I think it can be improved speed-wise quite a bit since it's so big.
- Only TOT_P and GRD_ID are used; subset those early and there is no need to set TOT_P_CON_DT to "".
- In the GRD_ID column, you'll find that there are no W values! There's no need for the ifelse() looking for west longitudes to make negative.
- gsub() repeated twice
I was trying to figure out a clever strsplit() method to strip off the 1kmN chunk, keep the digits, and split again on E, but couldn't figure out how.
Thanks for making this. I've been meaning to dive deeper into Hadley's magical land of tidyverse and keep not getting around to it. I learned the separate() function as a result of this, and I'm at least more familiar with filter, ungroup, complete, and select, so thanks for posting the code; it was an indirect motivation for me!
@p0bs: I don't think rounding to 2 decimals does anything. When as.numeric(POP_T)/100 is run, everything is of the form xx.xx. Basically, you're not grouping but summarizing for every unique [raw] lat/lon combination. Or at least that's my interpretation.
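On the strsplit() idea raised earlier in the thread, one possible base R sketch is below. It assumes every ID has the form seen in this thread (a prefix ending in N, digits, an E, digits) with no W values, as the comment above reports; the second ID is made up for illustration.

```r
ids <- c('1kmN2689E4337', '1kmN2690E4338')  # second ID is hypothetical

# Drop everything up to and including the 'N', leaving e.g. '2689E4337'
stripped <- sub('^.*N', '', ids)

# Split each remaining string on the literal 'E'
parts <- strsplit(stripped, 'E', fixed = TRUE)

# Flatten into a two-column numeric matrix, one row per ID
coords <- matrix(as.numeric(unlist(parts)), ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c('north', 'east')))
```

This does one sub() and one strsplit() per column of IDs instead of a separate gsub() per coordinate; whether that is noticeably faster on the full grid is untested here.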