DANIEL JACOBINO
How can one quantify something as volatile and multifaceted as culture? How can one find a common denominator that indicates a trend or change over time in fields as prone to sudden upheaval as grammar, literature, censorship and behavior?
This is the aim of the ambitious Culturomics program developed jointly by professors, researchers, and students from Harvard University and the Massachusetts Institute of Technology, in the United States. Part of the results have been condensed in the article “Quantitative analysis of culture using millions of digitized books,” the second one in the program, which was published last January in Science.
Signed by such heavyweights as psycholinguist Steven Pinker, the project pored over a corpus of 5,195,769 books scanned by Google Books. According to the coordinators, this number is equivalent to 4% of all the books ever printed. As a result, the California company has become “the biggest and most important source of funding of the project,” states Adrian Veres, one of the authors of the article.
Led by Jean-Baptiste Michel and Erez Lieberman Aiden, from Harvard University’s Evolutionary Dynamics Department, the article published in the American journal is the result of a study that the two researchers conducted to quantify the evolution of irregular verbs in English from secondary sources. “In a way,” says Veres, “this helped to consolidate the idea that important and significant results could be obtained, on a quantitative level, by means of data such as the repetition of a given word over time.”
Alcir Pécora, professor of literary theory at the State University of Campinas (Unicamp), is equally enthusiastic about the project: “I think this kind of research is very interesting, in that nowadays machines allow us to work with huge amounts of data. It is a fabulous mass of information.” However, he cautions: “When it comes to trying to understand what this data means, one must have a qualified interpreter rather than a database analyst. Nevertheless, this doesn’t mean that this kind of research is useless, or even offensive, as some traditional humanistic communities seem to believe.”
Within the broad cultural universe that Culturomics intends to map, language has turned out to be one of the most reliable items to measure, “a classic model of grammatical changes.” This is because, the authors state, “unlike the regular verbs in the English language, the past form of which is formed by adding the suffix ‘ed’ at the end, irregular verbs are conjugated idiosyncratically.”
Thus, while the regular forms of certain verbs spread in the United States (such as “burn”/“burned” and “spell”/“spelled”), the irregular forms of the same verbs (“burnt” and “spelt,” respectively) prevailed in Europe.
However, the quantitative study of English grammar showed a change in the cultural and geopolitical paradigm, due to the growing influence of American English upon speakers of British English. In time, the British also started to adopt the forms used by the speakers of the former British colony, as shown in the quantifications produced by Culturomics.
“The irregular verb forms that end in ‘t’ might be dying out in England. Every year, a population equivalent to that of the city of Cambridge adopts the form burned instead of burnt.”
Still, American speakers have also retrieved some irregular forms that were half-forgotten in the Mother Country and that were later re-incorporated by the British into their daily language.
These statistics led the study’s authors to refer to the United States as “the biggest exporter of both regular and irregular verbs.”
Not only language but also celebrity can be measured by means of tabulations. “One can measure how quickly someone becomes famous, how quickly someone’s celebrity status fades, the intensity of this fame and at which moment of his or her life a given person became famous or ceased to be famous,” explains Veres, whose line of research focuses on “the dynamics of fame.”
One of the most striking, and cruel, conclusions about contemporary society described in the Science article is that people are becoming famous at an earlier age and are being forgotten at an increasingly fast pace.
To reach this conclusion, the study drew on 740 thousand people with entries in Wikipedia; people with identical names were discarded. The researchers then tabulated the remaining names by date of birth and by the frequency with which each name was mentioned. Next, for the period from 1800 to 1950, they created a group with the 50 most famous people born in each of those years. For example, the year 1882 includes writer Virginia Woolf, and the year 1946 includes former United States president Bill Clinton and the film director Steven Spielberg.
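The selection steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual code: the function name `build_cohorts`, the tuple layout of the input, and the use of a single frequency value per person are all hypothetical.

```python
from collections import Counter, defaultdict


def build_cohorts(people, first_year=1800, last_year=1950, cohort_size=50):
    """Group people by birth year and keep the most-mentioned ones per year.

    `people` is an iterable of (name, birth_year, mention_frequency) tuples.
    This layout is an assumption made for the sketch.
    """
    people = list(people)

    # Count how often each name occurs so that ambiguous (identical)
    # names can be discarded, as the article describes.
    name_counts = Counter(name for name, _, _ in people)

    by_year = defaultdict(list)
    for name, birth_year, frequency in people:
        if name_counts[name] > 1:
            continue  # identical names are dropped entirely
        if first_year <= birth_year <= last_year:
            by_year[birth_year].append((name, frequency))

    # For every birth year, keep the `cohort_size` most frequently
    # mentioned people, most famous first.
    return {
        year: sorted(entries, key=lambda e: e[1], reverse=True)[:cohort_size]
        for year, entries in by_year.items()
    }
```

Calling `build_cohorts(records)` returns a dictionary keyed by birth year, so `cohorts[1882]` would hold the group in which the article places Virginia Woolf.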
The statistics showed that the age at which celebrities reached their peak remained stable, at roughly 75 years after birth. However, other parameters underwent drastic changes during the course of the analyzed period: “The most famous people in recent times are more famous than the famous people of previous generations. However, this fame is increasingly short-lived. The period that follows the peak of fame plunged from 120 to 71 years in the nineteenth century.”
This data “is particularly impressive because we are measuring fame based on published books, which of course are a much slower medium than newspapers, magazines or periodicals covering music,” says Veres.
Asked whether Culturomics validates the prophecy of artist Andy Warhol in 1968, that “in the future, everyone will be famous for 15 minutes,” Veres answers with humor: “If we consider current society, I think this time will soon drop to 7.5 minutes of fame.” He concludes by saying that “The pace is certainly picking up and society is moving faster and faster.”
The consequence is that the perception of what is old and what is new is also changing at the same speed, with much stronger emphasis on the present. Take a year such as “1880”: the number of mentions of it had dropped by 50% 32 years later, i.e., in 1912. A more recent year, such as “1973,” saw an equivalent drop in mentions in a considerably shorter time span: just 10 years later, in 1983.
“Each year that goes by shows that we are forgetting our past at a much faster pace,” say the authors.
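The 50% drop described above can be read off a frequency series with a short routine. The following Python sketch assumes the data is available as a mapping from publication year to the frequency of a term; the function name `half_peak_year` and that data layout are hypothetical, and the article does not spell out the exact procedure the authors used.

```python
def half_peak_year(series):
    """Return the first year after the peak in which the frequency has
    fallen to half of its peak value or below, or None if it never does.

    `series` maps publication year -> frequency of the term in that year.
    """
    years = sorted(series)
    # Year in which the term was mentioned most often
    # (the earliest such year if there is a tie).
    peak_year = max(years, key=lambda y: series[y])
    half = series[peak_year] / 2

    for year in years:
        if year > peak_year and series[year] <= half:
            return year
    return None
```

Subtracting the year being studied from the returned value gives a lag in years, the kind of figure the article quotes as 32 years for “1880” and 10 years for “1973.”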
However, aren’t certain conclusions of the article too obvious, such as the statement that “the year ‘1951’ was rarely discussed until the years that immediately preceded it”?
Veres agrees that “this is indeed a risk in research of this kind. For example, it is obvious that when a country changes its name (for example, from Rhodesia to Zimbabwe), the former name will decline within a short period of time and the new name will rise.” However, he notes, “the existence of such ‘obvious conclusions’ is often useful because it acts as a control over databases,” precisely because it draws researchers’ attention to that risk.
“What could turn out to be an unimportant conclusion becomes a major form of control.”
It is at this point that researchers risk falling into a trap: crossing the line between fact and interpretation. At the end of the article, they acknowledge as much: “the challenge of Culturomics resides in the interpretation of its evidence.”
Veres explains the group’s methodology to overcome this dichotomy. “The data is the frequency with which words come up over the course of time. Still in reference to data, perhaps some minor corrections have to be made, such as for notes taken down incorrectly or optical reading mistakes. Interpretation, on the other hand, is the process that seeks to explain what led the data to take on its form. Therefore, the challenge is to find the best home that fits the data.” By “home,” Veres means the different histories and views of the world that are available.
Indeed, there are many topics pointed out in the article that are still open and will be explored in the upcoming stages of the project. One such example is censorship of ideas and people. During the Nazi regime in Germany, the number of entries on members of the Nazi Party grew by 500%! In contrast, entries on leading artists who were qualified by the Nazi regime as “degenerate” – for example, Spanish painter Pablo Picasso or the Bauhaus architect Walter Gropius – plummeted wildly.
According to the authors, this data might lead to the creation of a “suppression index,” “formulating a swift strategy to identify possible victims of censorship.”
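This article does not give the formula behind the index, so the following Python sketch shows only one plausible reading: compare how often a name appears during a suspect period with how often it appears in surrounding reference years. The function name `suppression_index`, its parameters, and the ratio-of-means definition are assumptions made for illustration.

```python
from statistics import mean


def suppression_index(series, suspect_years, reference_years):
    """Ratio of mean frequency in the suspect period to mean frequency
    in the reference years (assumed definition, for illustration only).

    `series` maps year -> frequency of a name; missing years count as 0.
    Values well below 1 hint at suppression, values well above 1 at
    promotion. Returns None when the reference frequency is zero.
    """
    during = mean([series.get(year, 0.0) for year in suspect_years])
    baseline = mean([series.get(year, 0.0) for year in reference_years])
    if baseline == 0:
        return None  # no reference signal to compare against
    return during / baseline
```

Under this definition, a name whose mentions grew by 500% would score about 6, while a name that plummeted would score close to 0. Ranking many names by such a score is one way a list of possible censorship victims could be drawn up for human review.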
To take other examples, “Freud” seems to be more ingrained in people’s minds than “Galileo,” “Darwin” or “Einstein.” Likewise, “God” has not been overly popular recently. And, according to the quantification, one might suppose that the typical American diet consists of “steak,” “cold cuts,” “ice cream,” “hamburgers,” “pizzas,” “pasta” and “sushi.”
Finally, the feminist movement seems to have laid down its roots in France at first, yet it developed to a greater extent in the United States. In the battle between the genders, “woman” beats “man” – at least in terms of the number of entries.
Unfortunately, the Portuguese language has not yet been considered in the project. The reason is related not only to the Portuguese language’s insignificant cultural and geographical penetration, but also to the size and digitization of local libraries.
Veres argues that Portuguese was not part of the project because it did not meet the established criteria. “In the future, the idea is to include Portuguese and a number of other languages in the Culturomics database,” he adds.