shawngraham · October 9, 2016 02:23
diff --git a/wordvectors-in-mobilizing-the-past.rmd b/wordvectors-in-mobilizing-the-past.rmd
 ---
 title: "Word Vectors & Mobilizing the Past"
 subtitle: "Distantly Reading Digital Archaeology, Part II"
 author: "Shawn Graham"
 date: "`r Sys.Date()`"
 output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
 bibliography: skeleton.bib
 link-citations: yes
 ---

 ```{r setup, include=FALSE}
 library(tufte)
 # invalidate cache when the tufte version changes
 knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
 options(htmltools.dir.version = FALSE)
 library(magrittr)
 library(wordVectors)
 library(tsne)
 library(dplyr)
 library(ggplot2)
 setwd("/Users/shawngraham/experiments/mob-past/fulltext-500lines")
 ```

 # Top down, bottom up

 In the first exploration of the text of _Mobilizing the Past_^[See the press blurb and download link [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/).] I generated a quick 'n' dirty topic model exploring some of the larger trends in the discourses within that volume. In a future exploration, I will compare it with an earlier volume on digital archaeology from 2011.^[Kansa, Kansa, & Watrall, _Archaeology 2.0_. It's available [here](http://escholarship.org/uc/item/1r6137tb).]

 If topic models give us a top-down perspective on what is happening in a corpus, word vectors reverse the view to let us see what is happening at the level of individual words. One thing that I find interesting to do is to define binary pairs, and see what words stretch along the continuum between them. Going further, as Ben Schmidt showed us,^[For more on word vectors, see [this post of his](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html).] we can explore the ways _pairs_ of binaries intersect. So, let's start exploring.^[You can get the package [here](https://github.com/bmschmidt/wordVectors).]

 Firstly, I have the text of the volume, all in lowercase, which I feed into the `train_word2vec` function:

 ```{r, include=FALSE}
 blogmodel = train_word2vec("mobpast.csv",output="mobpast_vectors.bin",threads = 4,vectors = 1500,window=12,force=TRUE)
 ```

 ## Digital v. Analog

 That done, what words are nearest, in the model, to 'digital'?

 ```{r}
 nearest_to(blogmodel,blogmodel[["digital"]])
 ```
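
 'Nearest' here is a matter of cosine similarity: two words are close when their vectors point in roughly the same direction. A minimal base-R sketch of that idea follows; the three toy words and their two-dimensional vectors are invented for illustration and have nothing to do with the real model.

 ```r
 # Toy 2-dimensional "embeddings"; words and numbers are invented for illustration.
 toy <- rbind(
   digital   = c(1.0, 0.1),
   paperless = c(0.9, 0.2),
   trowel    = c(0.1, 1.0)
 )

 # Cosine similarity: dot product divided by the product of the vector lengths.
 cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

 # Similarity of every toy word to 'digital', sorted from most to least similar.
 sims <- apply(toy, 1, cosine_sim, b = toy["digital", ])
 sort(sims, decreasing = TRUE)
 ```

 In this toy space 'paperless' lands right next to 'digital' while 'trowel' sits far away, which is the kind of ordering the real query reports for the whole vocabulary.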

 It's quite clear that in _Mobilizing the Past_, 'paperless' archaeology is very much hand-in-glove with the idea of going digital. So, let's create a vector from the idea of 'paperless, digital' archaeology:

 ```{r}
 digital_words = blogmodel %>% nearest_to(blogmodel[[c("digital","paperless")]],100) %>% names
 sample(digital_words,50)
 ```
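
 As I understand the package, asking the model for several words at once (as in `blogmodel[[c("digital","paperless")]]`) hands back the average of their vectors, so the query above is really 'what is near the midpoint of digital and paperless?'. Here is a small base-R sketch of that averaging, again with invented toy vectors rather than the real model:

 ```r
 # Invented 3-dimensional toy vectors, one row per word.
 toy <- rbind(
   digital   = c(1, 0, 0),
   paperless = c(0, 1, 0),
   trowel    = c(0, 0, 1)
 )

 # The composite vector is the element-wise mean of the chosen rows.
 combined <- colMeans(toy[c("digital", "paperless"), ])

 # Cosine similarity of every toy word to the composite vector.
 cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
 sims <- apply(toy, 1, cosine_sim, b = combined)
 sims
 ```

 The composite sits equally close to both of its parents and no closer to 'trowel' than either parent was.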

 I particularly like the idea that digital archaeology is 'reflective', 'entangled', and 'exciting'. I certainly feel this way, and am glad to see it emerge in this volume (which, mark you, I still haven't read yet). Let's take a look at how these words relate to one another, using a dendrogram:

 ```{r fig-fullwidth1, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 g1 = blogmodel[rownames(blogmodel) %in% digital_words [1:50],]

 group_distances1 = cosineDist(g1,g1) %>% as.dist
 plot(as.dendrogram(hclust(group_distances1)),cex=1, main="Cluster dendrogram of the fifty words closest to a 'digital' vector\nin Mobilizing the Past")
 ```
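
 The dendrogram comes from hierarchical clustering on cosine distances (one minus cosine similarity). For anyone who wants to see the mechanics without the package, here is a base-R sketch on four invented toy words; `hclust`, `as.dist` and `cutree` are standard R, everything else is made up for the example.

 ```r
 # Invented toy vectors: two 'screen' words and two 'field' words.
 toy <- rbind(
   digital   = c(1.0, 0.1),
   paperless = c(0.9, 0.2),
   trowel    = c(0.1, 1.0),
   notebook  = c(0.2, 0.9)
 )

 # Scale each row to unit length so the cross-product gives cosine similarity.
 unit <- toy / sqrt(rowSums(toy^2))
 cos_dist <- 1 - unit %*% t(unit)

 # Cluster on the distances and cut the tree into two groups.
 tree <- hclust(as.dist(cos_dist))
 clusters <- cutree(tree, k = 2)
 clusters

 # plot(as.dendrogram(tree)) draws the tree itself.
 ```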

 We don't know, yet, how this 'digital' vector plays out in value-space: is 'digital' good? bad? Is it gendered? But before we ask those questions, let's look at an antonym for 'digital': 'analog'.

 ```{r}
 nearest_to(blogmodel,blogmodel[["analog"]])
 ```

 A great deal more ambivalence. My initial impression is that this is not opposition to the digital so much as unease about the digital supplanting tried-and-true analog methods, and about whether that is a wise idea. Let's explore it a bit more:

 ```{r}
 analog_words = blogmodel %>% nearest_to(blogmodel[[c("analog")]],100) %>% names
 sample(analog_words,50)
 ```
 ```{r fig-fullwidth2, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 g2 = blogmodel[rownames(blogmodel) %in% analog_words [1:50],]

 group_distances2 = cosineDist(g2,g2) %>% as.dist
 plot(as.dendrogram(hclust(group_distances2)),cex=1, main="Cluster dendrogram of the fifty words closest to an 'analog' vector\nin Mobilizing the Past")
 ```

 A definite ambivalence there. Let's see how the words run when we define a vector from digital through to analog.

 ```{r, include=FALSE}
 mode_vector = blogmodel[["analog"]] - blogmodel[["digital"]]
 word_scores = data.frame(word=rownames(blogmodel))
 word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector
 ```
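
 What that chunk does, in miniature: subtract one pole from the other to get an axis, then score every word by its cosine similarity to that axis. Positive scores lean toward the first pole ('analog'), negative scores toward the second ('digital'). A base-R sketch with four invented toy words:

 ```r
 # Invented toy vectors; only the two poles share names with the real model.
 toy <- rbind(
   analog  = c(0.1, 1.0),
   digital = c(1.0, 0.1),
   paper   = c(0.2, 0.8),
   tablet  = c(0.9, 0.3)
 )

 # The axis runs from 'digital' (negative end) to 'analog' (positive end).
 axis <- toy["analog", ] - toy["digital", ]

 # Score each word by its cosine similarity to the axis.
 cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
 scores <- apply(toy, 1, cosine_sim, b = axis)
 round(scores, 2)
 ```

 In the toy data 'paper' comes out positive and 'tablet' negative; the threshold in the plot below just keeps the words with the largest absolute scores.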

 ```{r fig-fullwidth3, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 ggplot(word_scores %>% filter(abs(mode_score)>.425)) +
   geom_bar(aes(y=mode_score,x=reorder(word,mode_score),fill=mode_score<0),stat="identity") +
   coord_flip() +
   scale_fill_discrete("Indicative of mode",labels=c("analog","digital")) +
   labs(title="The words showing the strongest skew along the analog-digital binary")
 ```

 What this suggests, I think, is a fear of what we might be losing in the march of the digital, or perhaps a list of things that we need to be aware of as we explore this area. It is interesting that the words that are most 'digital' in this vector seem to be connected with publishing (or at least, that's how I read 'quarterly' and 'saa' and 'literary').

 ## Value Judgements?

 Archaeologists are human. What things are 'good' in this model, and what things are 'bad'? Turns out, the word 'bad' is not in the model at all. The closest term would seem to be 'problematic':

 ```{r, include=FALSE}
 value_vector = blogmodel[["good"]] - blogmodel[["problematic"]]
 word_scores = data.frame(word=rownames(blogmodel))
 word_scores$value_score = blogmodel %>% cosineSimilarity(value_vector) %>% as.vector
 ```
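
 A quick way to check whether a word made it into a model at all is to look for it among the row names, since `rownames(blogmodel)` is the vocabulary. Words can go missing simply because they are too rare in the corpus, as word2vec implementations generally drop words below a minimum count. The sketch below uses an invented toy matrix as a stand-in for the trained model:

 ```r
 # Toy stand-in for a trained model: rows are vocabulary words (invented values).
 toy <- rbind(
   good        = c(0.9, 0.1),
   problematic = c(0.1, 0.8),
   useful      = c(0.8, 0.3)
 )

 # Which candidate pole words are actually in the vocabulary?
 candidates <- c("good", "bad", "problematic")
 in_vocab <- candidates %in% rownames(toy)
 names(in_vocab) <- candidates
 in_vocab

 # The ones we would have to replace with a near-synonym.
 candidates[!in_vocab]
 ```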

 ```{r fig-fullwidth4, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 ggplot(word_scores %>% filter(abs(value_score)>.45)) +
   geom_bar(aes(y=value_score,x=reorder(word,value_score),fill=value_score<0),stat="identity") +
   coord_flip() +
   scale_fill_discrete("Indicative of value",labels=c("good","problematic")) +
   labs(title="The words showing the strongest skew along the value binary")
 ```

 That one doesn't tell us much, other than to say perhaps these archaeologists are an optimistic bunch.

 ## Gender?

 Next, let's do the same again, by defining a 'gender' vector using pronouns.

 ```{r, include=FALSE}
 gender_vector = blogmodel[[c("he","his","him")]] - blogmodel[[c("she","hers","her")]]
 word_scores = data.frame(word=rownames(blogmodel))
 word_scores$gender_score = blogmodel %>% cosineSimilarity(gender_vector) %>% as.vector
 ```

 ```{r fig-fullwidth5, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 ggplot(word_scores %>% filter(abs(gender_score)>.35)) +
   geom_bar(aes(y=gender_score,x=reorder(word,gender_score),fill=gender_score<0),stat="identity") +
   coord_flip() +
   scale_fill_discrete("Indicative of gender",labels=c("he","she")) +
   labs(title="The words showing the strongest skew along the gender binary")
 ```

 ## Is 'Digital' Gendered? What is the value of 'Digital'?

 Finally, let us take these vectors and combine them in interesting ways. In the digital to analog vector, which words are gendered male, and which are gendered female? This involves crossing our 'mode_vector' against the 'gender_vector'.

 ```{r, include=FALSE}
 word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector

 word_scores$gender_score = cosineSimilarity(blogmodel,gender_vector) %>% as.vector

 groups = c("mode_score","gender_score")
 ```

 ```{r fig-fullwidth6, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 word_scores %>%
   mutate(modeedness=ifelse(mode_score>0,"analog","digital"),
          gender=ifelse(gender_score>0,"male","female")) %>%
   group_by(gender,modeedness) %>%
   # keep the 36 words in each gender/mode group with the largest combined skew
   filter(rank(-(abs(mode_score*gender_score)))<=36) %>%
   # order the words within each group, then lay them out in columns of 12
   mutate(eval=-1+rank(abs(gender_score)/abs(mode_score))) %>%
   ggplot() +
   geom_text(aes(x=eval %/% 12,y=eval%%12,label=word,fontface=ifelse(modeedness=="analog",2,3),color=gender),hjust=0) +
   facet_grid(gender~modeedness) +
   theme_minimal() +
   scale_x_continuous("",lim=c(0,3)) +
   scale_y_continuous("") +
   labs(title="The top words gendered female (red) and male (blue)\n used with 'digital' (italics) and 'analog' (bold) words") +
   theme(legend.position="none")
 ```
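
 Stripped of the plotting, crossing two vectors amounts to giving each word two scores, sorting it into one of four quadrants by the signs of those scores, and ranking words by how strongly they skew on both axes at once. A base-R sketch on an invented four-word table (the words and scores are placeholders, not model output):

 ```r
 # Invented scores for four placeholder words.
 scores <- data.frame(
   word         = c("w1", "w2", "w3", "w4"),
   mode_score   = c(0.5, -0.4, 0.3, -0.6),
   gender_score = c(0.2, 0.3, -0.5, -0.1)
 )

 # The sign of each score decides the quadrant.
 scores$mode   <- ifelse(scores$mode_score > 0, "analog", "digital")
 scores$gender <- ifelse(scores$gender_score > 0, "male", "female")

 # The product of the absolute scores measures skew on both axes together.
 scores$strength <- abs(scores$mode_score * scores$gender_score)

 # Count words per quadrant, then list them from strongest to weakest.
 table(scores$gender, scores$mode)
 scores[order(-scores$strength), c("word", "mode", "gender", "strength")]
 ```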

 Science is male. It's large-scale, if it's digital. It's important, it's an improvement, though it has limitations. If it's digital, it's open and reproducible. When it's female and digital, it seems to be a footnote (or at least, that's how I'm interpreting those fragments). When it's female and analog, it's supervisors who communicate...^[This perhaps ought to be unpacked a bit more, which is what you do when you read distantly: you spot patterns, then dive into the text forewarned and forearmed, and come back again to rerun your distant reading. I'm not entirely happy with what's going on in this diagram, which makes me think that the vectors have to be defined more carefully.]

 Let's do the same again, this time comparing the 'mode_vector' against the 'value_vector':

 ```{r, include=FALSE}
 word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector

 word_scores$value_score = cosineSimilarity(blogmodel,value_vector) %>% as.vector

 groups = c("mode_score","value_score")
 ```

 ```{r fig-fullwidth7, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
 word_scores %>%
   mutate(modeedness=ifelse(mode_score>0,"analog","digital"),
          value=ifelse(value_score>0,"positive","negative")) %>%
   group_by(value,modeedness) %>%
   # keep the 36 words in each value/mode group with the largest combined skew
   filter(rank(-(abs(mode_score*value_score)))<=36) %>%
   # order the words within each group, then lay them out in columns of 12
   mutate(eval=-1+rank(abs(value_score)/abs(mode_score))) %>%
   ggplot() +
   geom_text(aes(x=eval %/% 12,y=eval%%12,label=word,fontface=ifelse(modeedness=="analog",2,3),color=value),hjust=0) +
   facet_grid(value~modeedness) +
   theme_minimal() +
   scale_x_continuous("",lim=c(0,3)) +
   scale_y_continuous("") +
   labs(title="The top negative (red) and positive (blue) words \n used with 'digital' (italics) and 'analog' (bold) words") +
   theme(legend.position="none")
 ```

 In this one, I think the fact that the 'value_vector' runs from `good` to `problematic` is probably making things a bit squiffy. But there certainly seems to be a sense that the punk qualities of digital archaeology, the introspection and the craft of it, are broadly positive. The negative qualities of the digital, if I can unpick one thread, seem to connect with teaching (or the lack thereof) in the classroom.^[Again, vectors constructed more carefully than what I have done this evening would be clearer and make more sense.]

 # To wind up this quick distant read

 In this quick glance from a distance at _Mobilizing the Past_, we see a digital archaeology that in some respects is a continuation of the analog archaeology in which it is entwined. There's a clear sense that digital is transforming the practice of archaeology, but also that this is freighted with anxiety. Some of the greatest work in digital archaeology is explicitly feminist archaeology (thinking of Tringham, Morgan, etc.), and I don't get that sense from this volume _read at a distance_.

 You will draw different conclusions, of course. When you get your hands on the book, save all the text from the pdf into a txt file, make it all lowercase, and then grab my code and feed it the text yourself. Look for other interesting words or binaries. Correct the flaws in what I've done. 
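
 The lowercasing step needs nothing beyond base R. Here is a minimal sketch; the two file paths are temporary placeholders and the sample text is invented, so swap in the file you saved from the pdf.

 ```r
 # Placeholder paths; in practice infile is the text you saved from the pdf.
 infile  <- tempfile(fileext = ".txt")
 outfile <- tempfile(fileext = ".txt")

 # Invented sample text so the sketch runs on its own.
 writeLines(c("Mobilizing the Past", "Digital ARCHAEOLOGY"), infile)

 # Read the text, lowercase every line, and write it back out.
 text_lower <- tolower(readLines(infile, warn = FALSE))
 writeLines(text_lower, outfile)

 readLines(outfile)
 ```

 The lowercased file is then what you would hand to `train_word2vec` in place of my `mobpast.csv`.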

 _to come: a comparison of topics with Kansa, Kansa, & Watrall's 2011 Archaeology 2.0; also, a similar exploration of word vectors therein. How far have we come over the last five years, if these two volumes are used as bookends?_

 ```{r bib, include=FALSE}
 # create a bib file for the R packages used in this document
 knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
 ```