Thursday, 1 April 2010

Cultural Variation and Social Networks

Children learn language from exposure to speakers in their social network. This learning influences the input that will be given to the next generation. The learning biases that an individual has will influence the way the language changes over generations (Kirby, Dowman & Griffiths, 2007). However, language also plays a part in constructing and maintaining social networks. Recent studies have suggested that the structure of the social network also has an effect on how a language evolves. Gong and Wang (2010) find that different network types influence the evolution of linguistic categories in an artificial categorisation game. Lupyan & Dale (2010) find that the amount of contact with other communities and a community's spatial dispersion influence the morphological complexity of a language.

I was wondering whether bilingual communities have different social network structures to monolingual communities. Real social networks are very difficult to construct, so I wanted to use some online social networking sites. Twitter seemed like an obvious choice because of its simple API, and also because 'following' someone has a genuine effect on a user's linguistic input.

I acquired some data for Twitter users. The data includes the number of followers (indegree) and people being followed (outdegree), the user's location, the number of status updates sent by a user and the amount of time since the last update. The last two features can be used to filter out people who are not active participants. The location information is optional and may be as specific as GPS coordinates or as general as a country, or even just a timezone. Following from this, communities were defined by country. Data mining techniques will be used to automatically assign users to countries.

Ideally, we would want the following statistics: average degree, clustering coefficient and average shortest path length. However, these require information on the specific links between users. Collecting that information takes more time and resources, so this data was not collected for this report. This is not a trivial point, because users can follow people in other countries.
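To make the three statistics concrete, here is a minimal sketch of how they could be computed once the specific links are available. It is not part of the analysis in this post. The toy network is hypothetical, and links are treated as undirected for simplicity, whereas Twitter 'follow' links are directed.

```python
from collections import deque


def average_degree(adj):
    """Mean number of neighbours per node in an undirected graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Mean local clustering coefficient.

    Nodes with fewer than two neighbours contribute 0 to the mean.
    """
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        ordered = list(nbrs)
        # Count the links that exist between pairs of this node's neighbours.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if ordered[j] in adj[ordered[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean breadth-first distance over all reachable pairs of distinct nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network: a triangle (a, b, c) with one extra node d linked to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(average_degree(toy))
print(clustering_coefficient(toy))
print(average_shortest_path_length(toy))
```

For a directed version, indegree and outdegree would be counted separately, and shortest paths would follow the direction of the 'follow' links.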

Next, data on the linguistic variance is needed. As I showed in a recent post, estimating the amount of bilingualism is difficult. The best source of information is Ethnologue, but numbers of speakers are underestimated, probably due to inadequate data for small linguistic communities. I decided to use two measures of bilingualism: the number of languages spoken in a country and the percentage of the population of a country that speak the majority language. A country's nominal per-capita GDP and number of internet users are also taken into account.

Data was collected for about 31,000 users, about 25,000 of which were usable (there have been databases of up to 2.7 million users with over a billion connections between them). Although Twitter allows very many followers, for practical purposes I filtered out users with over 1,000 followers or following over 1,000 other users. This left data for 17,444 users from 119 countries.
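The filtering steps can be sketched roughly as follows. The 1,000 cut-off comes from the description above. The `User` record, the activity thresholds (`MIN_STATUSES`, `MAX_IDLE_DAYS`) and the sample values are assumptions made up for illustration, not the ones used to produce the figures in this post.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DEGREE = 1000      # cut-off described in the post
MIN_STATUSES = 1       # assumption: at least one status update
MAX_IDLE_DAYS = 30.0   # assumption: last update within 30 days


@dataclass
class User:
    followers: int             # indegree
    following: int             # outdegree
    statuses: int              # number of status updates sent
    days_since_update: float   # time since the last update
    country: Optional[str]     # None if no country could be assigned


def is_usable(user):
    """Keep active users with a known country and at most MAX_DEGREE links each way."""
    return (
        user.country is not None
        and user.statuses >= MIN_STATUSES
        and user.days_since_update <= MAX_IDLE_DAYS
        and user.followers <= MAX_DEGREE
        and user.following <= MAX_DEGREE
    )


def filter_users(users):
    return [u for u in users if is_usable(u)]


# Hypothetical sample records.
sample = [
    User(120, 80, 450, 2.0, "GB"),      # kept
    User(5000, 80, 900, 1.0, "US"),     # dropped: over 1,000 followers
    User(10, 15, 0, 400.0, "FR"),       # dropped: not an active participant
    User(40, 60, 30, 3.0, None),        # dropped: no country assigned
    User(1000, 1000, 10, 30.0, "JP"),   # kept: exactly at the cut-offs
]

kept = filter_users(sample)
print([u.country for u in kept])
```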

Initial results suggest a negative correlation between the indegree for each country and the percentage of the country's population who speak the majority language (using log indegree, t = 1.88, df = 117, p = 0.06).
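As a rough sketch of this kind of test: average the log indegree within each country, correlate the country means with the language measure, and convert the correlation r into a t statistic with n - 2 degrees of freedom (119 countries gives df = 117). The function names and toy numbers are hypothetical, and using log(followers + 1) to cope with users who have zero followers is an assumption.

```python
import math
from collections import defaultdict


def country_mean_log_indegree(users):
    """users: iterable of (country, followers) pairs.

    Returns {country: mean of log(followers + 1)}. The +1 is an assumption
    made here so that users with zero followers can be included.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for country, followers in users:
        sums[country] += math.log1p(followers)
        counts[country] += 1
    return {c: sums[c] / counts[c] for c in sums}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n):
    """t statistic and degrees of freedom for a correlation r over n points."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r)), df


# Hypothetical toy values, one pair per country.
mean_log_indegree = [1.0, 2.0, 3.0, 4.0]
pct_majority_language = [1.0, 3.0, 2.0, 4.0]

r = pearson_r(mean_log_indegree, pct_majority_language)
t, df = correlation_t(r, len(mean_log_indegree))
print(r, t, df)
```

The p-value then comes from the t distribution with `df` degrees of freedom, which needs a statistics package or a table.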

Using a linear regression, the indegree and outdegree are significant predictors of linguistic variation, even when the effects of population size and access to the internet are partialled out (R-squared = 0.19, F(4,17439) = 1063, p < 0.01; t = -2.04, p = 0.04; t = 2.54, p = 0.01). This was based on data for 17,444 users from 119 countries. Statistics for countries were taken from the CIA World Factbook, 2010. The analysis revealed a negative correlation between linguistic variance and indegree, but a positive correlation between linguistic variance and outdegree. The same qualitative results were found by using the number of languages spoken in a country. However, there is a positive correlation between the number of languages and both indegree and outdegree. I'm not sure how to interpret this yet, or whether any of it makes any sense. In the meantime, here's a pretty uninformative map of the world, coloured by average number of Twitter friends. Darker countries have users with a higher average number of friends.
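For readers who want to see what 'partialling out' amounts to in the regression above, here is a minimal sketch of a multiple regression fitted by ordinary least squares. The control variables sit in the same model as the predictors of interest, so each coefficient is estimated with the others held fixed. This is a plain normal-equations solver written for illustration, not the software used for the reported figures, and the toy data are made up. With four predictors plus an intercept and 17,444 users, the residual degrees of freedom would be 17,439.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ intercept + columns of X.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        for i in range(p):
            if i != col:
                f = A[i][col] / A[col][col]
                A[i] = [a - f * b for a, b in zip(A[i], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def r_squared(X, y, beta):
    """Proportion of variance in y explained by the fitted model."""
    fitted = [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: two predictors, outcome built as 1 + 2*x1 - 0.5*x2.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X]

beta = ols(X, y)
print(beta, r_squared(X, y, beta))
```

The t and F statistics reported above additionally need the standard errors of the coefficients, which a statistics package would normally supply.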

Lupyan G, & Dale R (2010). Language structure is partly determined by social structure. PLoS ONE, 5 (1) PMID: 20098492
Kirby S, Dowman M, & Griffiths TL (2007). Innateness and culture in the evolution of language. Proceedings of the National Academy of Sciences of the United States of America, 104 (12), 5241-5 PMID: 17360393


  1. Yo Sean,

    That is really interesting stuff. Combining network literature with bilingualism seems like a great idea, especially given the cost-effective nature of publicly available data.

    I'd be interested to know what your descriptive stats are for your measures of linguistic variance.
    Since it is a cross-country regression and there is large diversity in countries (even with your controls), there may be problems of outliers in terms of countries with 10-15 dialects (if you use your second measure) etc. But we are not necessarily thinking about what the average observation is doing, so it might not be a problem for you.

    Your baseline results are interesting (perhaps many nodes out of you means that you are likely to be worldly, and many nodes into you means that you are less likely to know other languages, as your language is the one in town).
    But interestingly, you mentioned that there is a positive correlation between the number of languages and both indegree and outdegree. Does this just mean that a European country may be 'busier' than the US, for example? From your 117 countries this is interesting. Do you have the same number of observations from each country?
    Anyway sorry for the incoherent ramblings

  2. Interesting stuff.

    I wonder what are the sizes of your correlations..? Also, what are the partial effect sizes..?

    Your results seem to suggest that fewer people follow a person who inhabits a country with GREATER LING. DIVERSITY, relative to a person who lives in a country with LESSER LING. DIVERSITY. How are you defining linguistic diversity as opposed to number of languages? I assume you mean more dialects = greater ling. diversity...? In which case fewer people follow a person living in a country with many dialects, many variations upon a language, possibly meaning having geographic factors which promote isolation, hence dialects...

    On the other hand, people who live in a country with GREATER LING. DIVERSITY (with factors promoting geographic isolation) tend to follow more people than those who live in a country with LESSER LING. DIVERSITY. Here's where your confound lurks, and the obvious one is people following people outside their national borders. Check to see the percentage of dialects spoken across national borders.

    If significant, I think your findings are interesting. But check that linguistic diversity is not a proxy for some other factor such as geographic isolation.

  3. Thanks for your comments - both make good points.

    Of course, the data contains inter-country links and there's no way of filtering them. This is a problem I'm trying to fix by using Twitter's local search to find users within 20km of particular cities. I'm also collecting specific IDs of friends and filtering by location, instead of using the friend count. I'll post the data when I have it, but collecting this data is going much slower.

    As I say in the post, the two measures are the number of languages spoken in a country and the percentage of the population of a country that speak the majority language. I tried to partial out some of the effects of geographic isolation by including measures of GDP, internet prevalence and population size in the regression. The results are still qualitatively the same.

    For the number of languages, the correlation between mean outdegree and number of languages within a country is 0.06 (p < 0.0001) and the adjusted R-squared = 0.67, F(4,7786) = 3950, p < 0.0001.

    I extended the study to incorporate the Greenberg Diversity Index. The results are similar:
    Indegree and outdegree significantly improve the fit of the model: F(2,7253) = 9.09, p < 0.001. Adjusted R-squared: 0.1168, F(4,7253) = 241, p< 0.0001. Correlation between outdegree and GDI = 0.067, p < 0.0001.

    As I said, I'm not sure what to make of this yet, I'll post more when I am!