December 17, 2010

Fun with Google's New "Ngram Viewer"

News of Google Books' new Ngram Viewer, which allows one to graph the printed usage of any word over time, has been making the rounds on the internet for the last couple of days. The advanced technology that makes this massive database possible has even spawned an upcoming article in the journal Science which announces a new discipline: culturomics.

After playing around with the website for a few hours, I have to say: this is pretty amazing. The ability to quantify things that had once been subjective 'hunches' on the part of scholars ("did publications about witches decline during the Enlightenment?") is nothing short of revolutionary. Of course, this must come with the caveat that the Google Books database, large as it is, still amounts to only a small fraction of all printed materials (perhaps 4%), and there may well be significant errors both in the dating of books and in Google's text-recognition tools.

In short, I'm not yet prepared to use this stuff for my academic work, but I do think it has amazing promise, and graphing the histories of different words can be enormous fun to boot. Some examples:

Note the spike in references to the devil during the rise of the Puritans in the 1610-30 period! And the great decline circa 1740, at the advent of the so-called 'Age of Reason.'
Here I was trying to get a glimpse into the shift in discourses about the supernatural in printed English over the course of the long eighteenth century. Capitalized 'Witch' declines very sharply around 1710 -- a counterpoint to the vaguer 'prodigies' and 'apparitions,' which rise steadily throughout the period. This seems to accord nicely with the view that the supernatural became less easily explainable in this period.

I might post more in the next few days. Also: I invite readers to post their own in the comments section!

4 comments:

Sorcier said...

I'm suspicious of the scope of the data. The books collected must've been quite random and hence very unrepresentative of certain random times.

For example, the conspicuous valley before 1650 must've been the result of "oversampling" of books during those years, suppressing the % of "Witch" to be so low.

I believe the truth is much much more subtle and nuanced than sudden rises and drops as depicted by any graph from Ngram.

Yarg said...

Tools like this are absolutely fascinating, true. I wonder if we'll ever see them reach a level of sophistication where they could accurately chart such things. I remain dubious personally, but it's still a pretty cool thought.

Benjamin Breen said...

I agree with the cautious notes you're both sounding. I don't yet think this is a usable technique for historians because of errors in text-recognition technology and dating problems. But I do think the melding of computer science and historical studies that this represents has huge potential, and will probably be usable in the not-too-distant future.

As for the small/unrepresentative sample size - point well taken. But then again, most of the data sets that early modern historians have available to them are incomplete and unrepresentative (social historians frequently have to deal with years or even decades of document series that are missing due to fire, theft, misplacement, etc. - Lisbon pre-earthquake being a classic example). In other words, as long as we're fully aware of the limitations and integrate that awareness into our methods, I don't think quantitative data from Google Books need necessarily be discounted simply because it doesn't represent ALL books that have ever been printed, or even a large percentage of them.

For now though, I definitely agree that it's more of a parlor game than anything else - although a potentially revealing one. As an example, I like the clever exploration of typographical changes that I came across elsewhere on the web: do a search comparing "best" and "beft" between 1700 and 1900. The result is an x-shaped chart that pinpoints the years when printers decided to abandon the old-style 'long s' shape. Example: http://ngrams.googlelabs.com/graph?content=best%2Cbeft&year_start=1700&year_end=1900&corpus=0&smoothing=3
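For anyone who wants to build links like this for other word pairs, here is a minimal Python sketch. It simply assumes the query parameters behave the way they appear in the example link (content, year_start, year_end, corpus, smoothing); the function name ngram_url is hypothetical and not part of any Google API.

```python
from urllib.parse import urlencode

# Base address copied from the example link; treated here as an assumption.
BASE_URL = "http://ngrams.googlelabs.com/graph"


def ngram_url(terms, year_start, year_end, corpus=0, smoothing=3):
    """Build an Ngram Viewer link comparing the given terms (hypothetical helper)."""
    params = {
        "content": ",".join(terms),  # comma-separated terms; urlencode escapes the comma as %2C
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,            # defaults mirror the values in the example link
        "smoothing": smoothing,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Reproduce the "best" vs. "beft" comparison for 1700-1900.
url = ngram_url(["best", "beft"], 1700, 1900)
print(url)
```

Swapping in another long-s pair, such as ["case", "cafe"], should produce a link of the same form.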

Thanks to you both for your stimulating comments!

Meredith G said...

Love & Work rose and fell together.