Created
October 9, 2016 02:23
---
title: "Word Vectors & Mobilizing the Past"
subtitle: "Distantly Reading Digital Archaeology, Part II"
author: "Shawn Graham"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
bibliography: skeleton.bib
link-citations: yes
---
```{r setup, include=FALSE}
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
library(magrittr)
library(wordVectors)
library(tsne)
library(dplyr)
library(ggplot2)
setwd("/Users/shawngraham/experiments/mob-past/fulltext-500lines")
```
# Top down, bottom up

In the first exploration of the text of _Mobilizing the Past_^[See the press blurb and download [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/).] I generated a quick 'n' dirty topic model exploring some of the larger trends in the discourses within that volume. In a future exploration, I will compare it with an earlier volume on digital archaeology from 2011.^[Kansa, Kansa, & Watrall, _Archaeology 2.0_. It's available [here](http://escholarship.org/uc/item/1r6137tb).]

If topic models give us a top-down perspective on what is happening in a corpus, word vectors reverse the view to let us see what is happening at the level of individual words. One thing that I find interesting to do is to define binary pairs, and see what words stretch along the continuum between them. Going further, as Ben Schmidt showed us,^[For more on word vectors, see [this post of his](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html).] we can explore the ways _pairs_ of binaries intersect. So, let's start exploring.^[You can get the package [here](https://github.com/bmschmidt/wordVectors).]

First, I have the text of the volume, all in lowercase, which I feed into the `train_word2vec` function:

```{r, include=FALSE}
# Train the model on the lowercased text; force = TRUE overwrites an existing output file
blogmodel = train_word2vec("mobpast.csv", output = "mobpast_vectors.bin", threads = 4, vectors = 1500, window = 12, force = TRUE)
```
## Digital v. Analog

That done, what words are nearest, in the model, to 'digital'?

```{r}
nearest_to(blogmodel, blogmodel[["digital"]])
```
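
If 'nearest' feels opaque: the usual measure of closeness between word vectors is cosine similarity, and ranking words by it gives a list of neighbours. Here is a minimal base-R sketch of that idea on made-up three-dimensional vectors. The words, the numbers, and the helper `toy_cosine` are all invented for illustration; none of them come from the trained model or from the `wordVectors` package.

```{r}
# Toy vectors, invented for illustration; not taken from the trained model
toy_vectors = rbind(
  digital   = c(0.9, 0.1, 0.0),
  paperless = c(0.8, 0.2, 0.1),
  trowel    = c(0.1, 0.9, 0.2),
  notebook  = c(0.2, 0.8, 0.3)
)

# Hypothetical helper: cosine similarity between every row of m and one vector v
toy_cosine = function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

toy_sims = toy_cosine(toy_vectors, toy_vectors["digital", ])
names(toy_sims) = rownames(toy_vectors)

# Most similar first; a word is trivially its own nearest neighbour
toy_nearest = sort(toy_sims, decreasing = TRUE)
toy_nearest
```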
It's quite clear that in _Mobilizing the Past_, 'paperless' archaeology is very much hand-in-glove with the idea of going digital. So, let's create a vector from the idea of 'paperless, digital' archaeology:

```{r}
digital_words = blogmodel %>% nearest_to(blogmodel[[c("digital", "paperless")]], 100) %>% names
sample(digital_words, 50)
```

I particularly like the idea that digital archaeology is 'reflective', 'entangled', and 'exciting'. I certainly feel this way, and am glad to see it emerge in this volume (which, mark you, I still haven't read). Let's take a look at how these words relate to one another, using a dendrogram:

```{r fig-fullwidth1, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
# Keep the rows of the model for the first fifty 'digital' words, then cluster on cosine distance
g1 = blogmodel[rownames(blogmodel) %in% digital_words[1:50], ]
group_distances1 = cosineDist(g1, g1) %>% as.dist
plot(as.dendrogram(hclust(group_distances1)), cex = 1, main = "Cluster dendrogram of the fifty words closest to a 'digital' vector\nin Mobilizing the Past")
```
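
To see what the clustering step is doing, here is a small base-R sketch on invented vectors. It assumes that cosine distance means one minus cosine similarity, and it uses `hclust` with its default (complete) linkage; the words and numbers are made up and have nothing to do with the volume.

```{r}
# Toy vectors, invented for illustration; not taken from the trained model
toy_m = rbind(
  digital   = c(0.9, 0.1, 0.0),
  paperless = c(0.8, 0.2, 0.1),
  trowel    = c(0.1, 0.9, 0.2),
  notebook  = c(0.2, 0.8, 0.3)
)

# Scale each row to unit length, so row-by-row dot products are cosine similarities
toy_unit = toy_m / sqrt(rowSums(toy_m^2))
toy_dist = as.dist(1 - toy_unit %*% t(toy_unit))

# Hierarchical clustering, then cut the tree into two groups
toy_tree = hclust(toy_dist)
toy_groups = cutree(toy_tree, k = 2)
toy_groups
```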
We don't know, yet, how this 'digital' vector plays out in value-space: is 'digital' good? Bad? Is it gendered? But before we ask those questions, let's look at an antonym for 'digital': 'analog'.

```{r}
nearest_to(blogmodel, blogmodel[["analog"]])
```

A great deal more ambivalence. My initial impression is that this is not opposition to the digital, but rather concern about the role of the digital in supplanting tried-and-true analog methods, and whether or not that is a wise idea. Let's explore it a bit more:

```{r}
analog_words = blogmodel %>% nearest_to(blogmodel[[c("analog")]], 100) %>% names
sample(analog_words, 50)
```

```{r fig-fullwidth2, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
# Same clustering as before, this time on the first fifty 'analog' words
g2 = blogmodel[rownames(blogmodel) %in% analog_words[1:50], ]
group_distances2 = cosineDist(g2, g2) %>% as.dist
plot(as.dendrogram(hclust(group_distances2)), cex = 1, main = "Cluster dendrogram of the fifty words closest to an 'analog' vector\nin Mobilizing the Past")
```
A definite ambivalence there. Let's see how the words run when we define a vector from digital through to analog.

```{r, include=FALSE}
# Positive scores lean 'analog', negative scores lean 'digital'
mode_vector = blogmodel[["analog"]] - blogmodel[["digital"]]
word_scores = data.frame(word = rownames(blogmodel))
word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector
```

```{r fig-fullwidth3, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
ggplot(word_scores %>% filter(abs(mode_score) > .425)) +
  geom_bar(aes(y = mode_score, x = reorder(word, mode_score), fill = mode_score < 0), stat = "identity") +
  coord_flip() +
  scale_fill_discrete("Indicative of mode", labels = c("analog", "digital")) +
  labs(title = "The words showing the strongest skew along the analog-digital binary")
```

I think, perhaps, what this suggests is a fear of what we might be losing in the march of the digital, or maybe, things that we need to be aware of as we explore this area. It is interesting that the words that are most 'digital' along this vector seem to be connected with publishing (or at least, that's how I read 'quarterly' and 'saa' and 'literary').
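
The arithmetic behind a binary like this can be sketched on toy data: subtract one pole from the other to get an axis, then score every word by its cosine similarity to that axis. Everything in this base-R sketch (the two-dimensional vectors, the words, the `toy_` names) is invented for illustration and is not drawn from the model.

```{r}
# Toy two-dimensional 'embedding', invented for illustration
toy_space = rbind(
  analog   = c(0.1, 0.9),
  digital  = c(0.9, 0.1),
  notebook = c(0.2, 0.8),
  tablet   = c(0.8, 0.3),
  site     = c(0.5, 0.5)
)

# The axis runs from 'digital' (negative end) to 'analog' (positive end)
toy_axis = toy_space["analog", ] - toy_space["digital", ]

# Cosine similarity of every word to the axis
toy_score = as.vector(toy_space %*% toy_axis) /
  (sqrt(rowSums(toy_space^2)) * sqrt(sum(toy_axis^2)))
names(toy_score) = rownames(toy_space)

# Words sorted from the most 'digital' to the most 'analog'
round(sort(toy_score), 2)
```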

## Value Judgements?

Archaeologists are human. What things are 'good' in this model, and what things are 'bad'? Turns out, the word 'bad' is not in the model at all. The closest term would seem to be 'problematic':

```{r, include=FALSE}
# Positive scores lean 'good', negative scores lean 'problematic'
value_vector = blogmodel[["good"]] - blogmodel[["problematic"]]
word_scores = data.frame(word = rownames(blogmodel))
word_scores$value_score = blogmodel %>% cosineSimilarity(value_vector) %>% as.vector
```

```{r fig-fullwidth4, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
ggplot(word_scores %>% filter(abs(value_score) > .45)) +
  geom_bar(aes(y = value_score, x = reorder(word, value_score), fill = value_score < 0), stat = "identity") +
  coord_flip() +
  scale_fill_discrete("Indicative of value", labels = c("good", "problematic")) +
  labs(title = "The words showing the strongest skew along the value binary")
```

That one doesn't tell us much, other than to say perhaps these archaeologists are an optimistic bunch.

## Gender?

Next, let's do the same again, by defining a 'gender' vector using pronouns.

```{r, include=FALSE}
# Positive scores lean male, negative scores lean female
gender_vector = blogmodel[[c("he", "his", "him")]] - blogmodel[[c("she", "hers", "her")]]
word_scores = data.frame(word = rownames(blogmodel))
word_scores$gender_score = blogmodel %>% cosineSimilarity(gender_vector) %>% as.vector
```

```{r fig-fullwidth5, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
ggplot(word_scores %>% filter(abs(gender_score) > .35)) +
  geom_bar(aes(y = gender_score, x = reorder(word, gender_score), fill = gender_score < 0), stat = "identity") +
  coord_flip() +
  scale_fill_discrete("Indicative of gender", labels = c("he", "she")) +
  labs(title = "The words showing the strongest skew along the gender binary")
```
## Is 'Digital' Gendered? What is the value of 'Digital'?

Finally, let us take these vectors and combine them in interesting ways. Along the digital-to-analog vector, which words are gendered male, and which are gendered female? This involves crossing our 'mode_vector' against the 'gender_vector'.

```{r, include=FALSE}
word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector
word_scores$gender_score = cosineSimilarity(blogmodel, gender_vector) %>% as.vector
groups = c("mode_score", "gender_score")
```

```{r fig-fullwidth6, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
# Sort each word into a quadrant, keep the 36 strongest per quadrant, and lay them out as text
word_scores %>%
  mutate(modeedness = ifelse(mode_score > 0, "analog", "digital"),
         gender = ifelse(gender_score > 0, "male", "female")) %>%
  group_by(gender, modeedness) %>%
  filter(rank(-(abs(mode_score * gender_score))) <= 36) %>%
  mutate(eval = -1 + rank(abs(gender_score) / abs(mode_score))) %>%
  ggplot() +
  geom_text(aes(x = eval %/% 12, y = eval %% 12, label = word,
                fontface = ifelse(modeedness == "analog", 2, 3), color = gender), hjust = 0) +
  facet_grid(gender ~ modeedness) +
  theme_minimal() +
  scale_x_continuous("", lim = c(0, 3)) +
  scale_y_continuous("") +
  labs(title = "The top words gendered female (red) and male (blue)\n used with 'digital' (italics) and 'analog' (bold) words") +
  theme(legend.position = "none")
```

Science is male. It's large-scale, if it's digital. It's important, it's an improvement, though it has limitations. If it's digital, it's open and reproducible. When it's female and digital, it seems to be a footnote (or at least, that's how I'm interpreting those fragments). When it's female and analog, it's supervisors who communicate...^[This perhaps ought to be unpacked a bit more, which is what you do when you read distantly. You spot patterns, then dive into the text forewarned and forearmed, and come back again to rerun your distant reading. I'm not entirely happy with what's going on in this diagram, which makes me think that the vectors have to be defined more carefully.]
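
The crossing itself is easier to follow on a handful of made-up scores. This base-R sketch sorts each word into a quadrant by the sign of its two scores, then keeps the word with the largest product of absolute scores in each quadrant (the chart above keeps 36 per quadrant; one is enough to show the idea). The placeholder words `w1` to `w6` and their scores are invented, not taken from the model.

```{r}
# Invented scores for six placeholder words; not results from the model
toy_scores = data.frame(
  word         = paste0("w", 1:6),
  mode_score   = c(-0.40, -0.30,  0.35, 0.25, -0.10,  0.05),
  gender_score = c( 0.30, -0.25, -0.20, 0.15,  0.02, -0.01)
)

# The sign of each score decides the quadrant
toy_scores$modeedness = ifelse(toy_scores$mode_score > 0, "analog", "digital")
toy_scores$gender     = ifelse(toy_scores$gender_score > 0, "male", "female")

# Strength of a word within its quadrant: product of the absolute scores
toy_scores$strength = abs(toy_scores$mode_score * toy_scores$gender_score)

# Rank within each quadrant (1 = strongest) and keep the top word per quadrant
toy_scores$rank_in_quadrant = ave(-toy_scores$strength, toy_scores$gender,
                                  toy_scores$modeedness, FUN = rank)
toy_top = toy_scores[toy_scores$rank_in_quadrant <= 1, ]
toy_top[order(toy_top$gender, toy_top$modeedness), c("word", "gender", "modeedness")]
```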
Let's do the same again, this time comparing the 'mode_vector' against the 'value_vector':

```{r, include=FALSE}
word_scores$mode_score = blogmodel %>% cosineSimilarity(mode_vector) %>% as.vector
word_scores$value_score = cosineSimilarity(blogmodel, value_vector) %>% as.vector
groups = c("mode_score", "value_score")
```

```{r fig-fullwidth7, fig.fullwidth = TRUE, warning = FALSE, cache=TRUE}
# Same layout as the gender chart, with value in place of gender
word_scores %>%
  mutate(modeedness = ifelse(mode_score > 0, "analog", "digital"),
         value = ifelse(value_score > 0, "positive", "negative")) %>%
  group_by(value, modeedness) %>%
  filter(rank(-(abs(mode_score * value_score))) <= 36) %>%
  mutate(eval = -1 + rank(abs(value_score) / abs(mode_score))) %>%
  ggplot() +
  geom_text(aes(x = eval %/% 12, y = eval %% 12, label = word,
                fontface = ifelse(modeedness == "analog", 2, 3), color = value), hjust = 0) +
  facet_grid(value ~ modeedness) +
  theme_minimal() +
  scale_x_continuous("", lim = c(0, 3)) +
  scale_y_continuous("") +
  labs(title = "The top negative (red) and positive (blue) words \n used with 'digital' (italics) and 'analog' (bold) words") +
  theme(legend.position = "none")
```

In this one, I think the fact that the 'value_vector' runs from `good` to `problematic` is probably making things a bit squiffy. But there certainly seems to be a sense that the punk qualities of digital archaeology, the introspection, the craft of it, are broadly positive. The negative qualities of the digital, if I can unpick one thread, seem to connect with the teaching (or lack thereof) in the classroom.^[Again, vectors constructed more carefully than what I have done this evening would be clearer and make more sense.]

# To wind up this quick distant read

In this quick glance from a distance at _Mobilizing the Past_, we see a digital archaeology that in some respects is a continuation of the analog archaeology in which it is entwined. There's a clear sense that digital is transforming the practice of archaeology, but also that this is freighted with anxiety. Some of the greatest work in digital archaeology is explicitly feminist archaeology (I'm thinking of Tringham, Morgan, etc.), and I don't get that sense from this volume _read at a distance_.

You will draw different conclusions, of course. When you get your hands on the book, save all the text from the pdf into a txt file, make it all lowercase, and then grab my code and feed it the text yourself. Look for other interesting words or binaries. Correct the flaws in what I've done.
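
The lowercasing step needs nothing beyond base R. A minimal sketch, in which a temporary file with two invented lines stands in for the text you saved from the pdf; the `toy_` names and file paths are hypothetical. Afterwards, the lowercased file is what you would hand to `train_word2vec` in place of `"mobpast.csv"`.

```{r}
# A temporary file with two invented lines stands in for text exported from the pdf
toy_raw = tempfile(fileext = ".txt")
writeLines(c("Mobilizing the Past", "For a DIGITAL Future"), toy_raw)

# Read the text, lowercase every line, and write it back out to a new file
toy_clean = tempfile(fileext = ".txt")
toy_lines = tolower(readLines(toy_raw, warn = FALSE))
writeLines(toy_lines, toy_clean)

readLines(toy_clean)
```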

_To come: a comparison of topics with Kansa, Kansa, & Watrall's 2011 Archaeology 2.0; also, a similar exploration of word vectors therein. How far have we come over the last five years, if these two volumes are used as bookends?_

```{r bib, include=FALSE}
# create a bib file for the R packages used in this document
knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
```