
Statistics

The life of words

Physicists and a linguist examine the evolution of the vocabulary of online communities

LARISSA RIBEIRO

Nobody knows just how many words are being born at any given time. Language scholars are certain only that the number is large and that the vast majority of new words are rarely used and usually forgotten. After all, there are many more words than a single human being could learn in a lifetime. To give an idea, by 2006 the Google search engine had registered 13 million distinct English words used at least 200 times on Web pages, while researchers estimate that the vocabulary of a well-educated adult does not exceed 100,000 words.

The mystery surrounding the creation of words continues, but a study published in May in the journal PLoS ONE, conducted by Brazilian physicists Eduardo Altmann and Adilson Motter in partnership with an American linguist, helps us to better understand how the vocabulary of a community evolves over time. By statistically analyzing thousands of words used by almost 167,000 participants in two Internet discussion groups over a decade, the trio concluded that the chances of a word, old or new, remaining in use depend less on the frequency with which it is currently used than on the variety of subjects in which it is employed and, more importantly, on the number of people who use it. In the words of the first author of the study, Altmann, of the Max Planck Institute for Physics of Complex Systems in Dresden, Germany, to maintain the variety of words used in a community, “it is better for many people to say little than it is for a few people to talk a lot”.

This is not the first article on the evolution of vocabulary published by Altmann and Motter, of Northwestern University in Evanston (Illinois, USA). The exchange of messages between millions of people through electronic means leaves traces in the form of databases that physicists are increasingly interested in exploring, looking for patterns that reveal the social dynamics behind the digital interaction. “Physicists are very good at finding relationships between the underlying mechanisms and the observed patterns,” said another author of the study, linguist Janet Pierrehumbert, also of Northwestern University, about the collaboration. “They are also very good at making analogies between one and another kind of phenomenon.”

The researchers chose to analyze the activity, up to 2008, of two discussion groups in the public communication forum Usenet, which is currently hosted by Google but has existed since 1979, 10 years before the invention of web pages. One of the groups is comp.os.linux.misc, created in 1993 to discuss the Linux operating system, which logged 128,903 participants and 140,517 topics of conversation. The other is rec.music.hip-hop, a discussion group devoted to hip-hop, started in 1995, in which 37,779 people participated at least once in one of the 94,074 discussion topics. The total number of words written by the users of one of these groups during a six-month period ranged from almost 1 million to more than 5 million.

To quantify how each of the words used in these groups was disseminated among users and topics over time, it was not enough to simply count how many times each participant used a word and how many times the word appeared in each discussion topic every six months. The statistical analysis had to take into account the fact that the activity of users and the size of conversations in these groups varied widely. Very few participants write a lot, all the time, while many contribute only a little now and then. Likewise, a handful of topics received more than a thousand messages, with discussions lasting more than three years, while the average topic received five messages and lasted five days. In the end, the researchers were able to define a value that measures the degree of dissemination of a word among participants and conversations, regardless of the frequency with which that word is used. In this way, they were able to compare the spread of rarely used words with that of frequently used words.
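The idea of a dissemination value that does not depend on frequency can be illustrated with a minimal Python sketch. It assumes a random baseline in which every user draws words independently at the word's overall frequency f, so that a user who wrote m words is expected to use the word at least once with probability of about 1 - exp(-f * m). The observed number of users is then divided by the number expected under that baseline. The function name `dissemination`, the input layout `posts` and the baseline itself are assumptions made for illustration, not necessarily the exact definition used in the study.

```python
import math
from collections import Counter


def dissemination(posts, word):
    """Ratio of observed to expected number of users of `word`.

    `posts` is a list of (user, tokens) pairs, where tokens is a list of words.
    The expected count assumes (hypothetically) that each user picks words at
    random with the word's overall frequency, so heavy writers are expected to
    use the word more often than occasional ones.
    """
    words_per_user = Counter()   # total words written by each user
    users_with_word = set()      # users who used the word at least once
    occurrences = 0              # total occurrences of the word

    for user, tokens in posts:
        words_per_user[user] += len(tokens)
        count = tokens.count(word)
        if count:
            occurrences += count
            users_with_word.add(user)

    if occurrences == 0:
        raise ValueError(f"{word!r} does not occur in the posts")

    frequency = occurrences / sum(words_per_user.values())

    # Expected number of users under the random baseline.
    expected_users = sum(
        1.0 - math.exp(-frequency * m) for m in words_per_user.values()
    )
    return len(users_with_word) / expected_users
```

A value above 1 means the word reaches more users than chance would predict for its frequency; a value below 1 means its use is concentrated in fewer users. Replacing users with discussion topics in the same function gives the analogous measure of spread among conversations.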

Present and Future
The next step was to compare the number of times each word appeared in the discussions, and the extent of its dissemination, over a period of six months with the change in its frequency of use two years later. Computing the numbers, the researchers found that the frequency with which a word was used at a given time said little about how often it would be employed in the future. They also observed that the number of times a word was mentioned two years later seemed to be directly related to the degree of dissemination of the word in the past. They concluded that the probability that a word will continue to be used increases as more and more people use it. This means that even if a word is widely used today, it runs the risk of falling into disuse a few years later if the number of users and topics in which it appears today is low.
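A comparison of this kind can be sketched in a few lines of Python. The sketch assumes that the change in use is measured as the base-10 logarithm of the ratio between the later and the current frequency, and that the strength of each relationship is summarized by a Pearson correlation. Both choices, as well as the names `log_frequency_change`, `pearson` and `compare_predictors` and the dictionary keys, are assumptions made for illustration and are not taken from the study.

```python
import math


def log_frequency_change(freq_now, freq_later):
    """Change in use, as log10 of the ratio of later to current frequency."""
    if freq_now <= 0 or freq_later <= 0:
        raise ValueError("frequencies must be positive")
    return math.log10(freq_later / freq_now)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def compare_predictors(words):
    """Correlate later frequency change with each candidate predictor.

    `words` is a list of dicts with the (hypothetical) keys "freq_now",
    "freq_later" and "dissemination", one dict per word.
    """
    change = [
        log_frequency_change(w["freq_now"], w["freq_later"]) for w in words
    ]
    log_freq = [math.log10(w["freq_now"]) for w in words]
    log_diss = [math.log10(w["dissemination"]) for w in words]
    return {
        "frequency": pearson(log_freq, change),
        "dissemination": pearson(log_diss, change),
    }
```

Under these assumptions, a result in which the "dissemination" correlation is clearly larger than the "frequency" correlation would correspond to the pattern the researchers describe.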

According to the researchers, the situation appears to be a lot like that of living beings fighting for their survival. Each word can be thought of as a biological species. “Each use of the word can be compared to an individual of a species,” says Altmann. To survive, the word needs to be reproduced, which is analogous to what happens from the moment someone reads the word somewhere and stores it for future use. The spread of the word, again according to the researchers, can be thought of as the ecological niche (capacity for interaction) of the species in the environment. The narrower the niche of a species is, the greater its risk of extinction. Therefore, a population explosion does not guarantee the survival of a species if its niche is small. “The word needs to be distributed among a certain number of users, otherwise it dies,” says the physicist.

One of the results that the group obtained, that the current frequency with which a word is used does not influence the frequency of its use in the future, contradicts the conclusion of recent studies that examined the dynamics of words over much longer periods of time (centuries) and suggested the importance of the frequency of use. In the best known of these studies, published in Nature in 2007, a group led by Erez Lieberman, currently a visiting professor at Google, showed that, in English, little-used irregular verbs tend to turn into regular verbs, while only those most widely used by the population keep the irregular form. This would explain why the irregular verb ‘to be’, the most widely used in the English language, is still and will remain irregular, while the irregular verb ‘to slink’, which means to move smoothly and quietly with gliding steps or in a stealthy or sensuous manner and is hardly known to most people, is losing its irregular past tense form, ‘slunk’, in favor of the regular variant, ‘slinked’.

Pierrehumbert believes that these historical studies are looking at very specific cases in which two words compete for the same niche in a language. She explains that most words are not really in competition with each other, since true synonyms are extremely rare. For example, the words ‘yes’ and ‘yup’ mean the same thing, but because the latter is more colloquial than the former, each one is used in different situations, and their respective niches therefore remain mutually exclusive. “I predict that these factors [spread between users and topics] will also prove to be very important in explaining the fluctuations in the frequency of use over historical periods [in the order of centuries],” she says.

In this sense, Altmann suggests that the measures for the dissemination of words that they developed can be applied to any other similar database, like that of the more than 5 million books that have now been scanned by Google Books – the target of a study recently published in Science and led by Lieberman, which measured and compared the frequency of use of various keywords of historical and cultural interest (see Pesquisa FAPESP nº 183). In this case, the authors of the books assume the role of users and each book can be thought of as a separate post on a discussion topic.

Two other results that came from the analysis of the Usenet groups puzzled the researchers. One was that the spread of a word among users influences changes in the frequency with which the word is used more than its spread among discussion topics does. The other curious result is that words are often linked to particular users more strongly than they are to particular discussion threads. Taken together, these findings suggest that the idiosyncrasies of individuals or subgroups of individuals have a central role in maintaining the vocabulary of a community. “Anyone who reads the messages in the Hip-Hop forum, for example, will realize that people make a concerted effort to write in a different manner or style than other users as a way to position themselves socially,” says Altmann.

The dissemination of words is not the only factor that determines their success. Altmann and his colleagues found that words related to products such as ‘wireless’ and ‘Gnome’ (a desktop environment for Linux) or personalities like ‘Bush’ and ‘Eminem’ began their life in the discussion groups with a very low degree of spread, which, in principle, would have doomed them to extinction. But in these cases, forces originating outside the discussion groups, such as advertising and the news media, acted in such a way that these words were incorporated into the users' vocabulary.

Slang and jargon that are already well accepted among discussion group users followed the same statistical trend as other words, suggesting that acceptance of these words depended more on external than on internal factors. The linguist Eleonora Albano, of the State University of Campinas, says that slang and jargon adopted by a community contribute to building the identity of the social group.

Maria Helena Neves, a linguist at the State University of São Paulo and Mackenzie Presbyterian University, considers the quantitative studies of online conversations interesting, but is skeptical that their results can be safely generalized to explain the dynamics of spoken language. “The sample is restricted because of the chosen means of expression, the profile of users, and the purpose of interaction,” she says. She is, moreover, always suspicious of generalizations. “In language there is no ready-made recipe for anything; if there were, literature and poetry would not exist.”

Eduardo Altmann

Floating words
From 1998 to 2000, English words with a high degree of dissemination among members of a discussion group on the Linux operating system grew in popularity. In the accompanying graphic, they are represented by colors ranging from red to yellow. Words with a low degree of spread (purple to black) became less frequently used. Variation in popularity did not depend on the frequency of use.

Scientific article
ALTMANN, EG et al. Niche as a determinant of word fate in online groups. PLoS ONE. May 2011.
