Sociolinguistic Analysis of Twitter in
Republic of Korea
Research Institute
University of London
Republic of Korea
In a multilingual society, language not only reflects culture and heritage, but also has implications for social status and the degree of integration in society. Different languages can be a barrier between monolingual communities, and the dynamics of language choice could explain the prosperity or demise of local languages in an international setting. We study this interplay of language and network structure in diverse, multi-lingual societies, using Twitter. In our analysis, we are particularly interested in the role of bilinguals. Concretely, we attempt to quantify the degree to which users are the "bridge-builders" between monolingual language groups, while monolingual users cluster together. Also, with the revalidation of English as a lingua franca on Twitter, we reveal that users of the native non-English language have higher influence than English users, and that the language convergence pattern is consistent across the regions. Furthermore, we explore for which topics these users prefer their native language rather than English. To the best of our knowledge, this is the largest sociolinguistic study in a network setting.

Categories and Subject Descriptors
J.4 [Social and Behavioral Sciences]: Sociology

Multilingualism; Sociolinguistics; Topic Modeling; Social Media

The language we speak is an integral part of our culture. We use it to communicate, to transmit facts and emotions, and to navigate the social environment surrounding us. In multilingual societies such as Canada or Switzerland, the spoken language can even be a political statement with wide-ranging implications. In some cases, governments subsidize programs to save a language from disappearing. At the individual level, people who are fortunate to be bilingual constantly make a choice in favor of or against one or the other language. Anecdotally, even after mastering a second language many people continue to count (and swear) in their mother tongue. At the societal level, the question arises to which degree bilinguals are the "glue" that keeps multilingual societies together.

We study the phenomena of multilingual societies and the role that bilinguals play in them by using large amounts of Twitter data. Social media data has the fascinating component of also containing a network. These social links allow us to investigate the interaction between a user's language and their social surroundings. Understanding this interaction has a number of potential implications:

• Preservation of a language. Assume that you are bilingual and that all of your friends understand English reasonably well, but not all understand your native language. Should you switch your language to maximize your audience size?
• Social capital and potential issues of segregation. Is it possible to build social ties across language barriers? Which role do bilinguals play in this "bridge-building"?
• Social status and language assimilation. Eliza Doolittle in George Bernard Shaw's "Pygmalion"/"My Fair Lady" underwent a huge change in social status by learning a new language, though just a "high class dialect" in this case. Generally, are there elite languages in multi-lingual societies?
• Language selection. How do bilinguals choose one language over the other for a given topic? Do they prefer their mother tongue for issues "close to the heart"? Correspondingly, is the same topic discussed differently in different languages?

We explore these questions with large-scale Twitter data from several multilingual societies. We analyze the Twitter following behavior to uncover whether monolingual users form tightly connected clusters, what bridging roles multilingual users play, and which language groups show higher social status. We apply language processing to analyze the amount of language usage depending on the surrounding network, and probabilistic topic modeling to discover the differences in topics in different languages by multilingual users. Methodologically, we propose metrics for quantifying language use and network diversity in multilingual Twitter networks, and we illustrate techniques from machine learning applied to multilingual tweets.

∗This work was done while the first author was at Qatar Computing Research Institute.

This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in HT'14.
http://dx.doi.org/10.1145/2631775.2631824.
There is now widespread recognition among linguists that social media such as Twitter are highly multilingual and provide an immense volume of real-world language data. Several studies in sociolinguistics have explored Twitter and other online language, and in particular, social scientists have examined the strategic use of multilingualism on Twitter in recent political movements. However, these studies are limited in scope; a systematic analysis of multilingualism requires computational tools that combine network analysis and language processing at a large scale.

More recently, sociolinguists, together with computer scientists, have tried to map out linguistic diversity through spatial and temporal analyses of multilingual Twitter. Prior studies have revealed the extensive use of a large number of different languages in Manchester, discovering that Twitter users in Manchester are connected globally and use languages other than those recorded in the local census. Similarly, Twitter has been used to detect the extent of multilingualism in London, which revealed that there are specific geographical concentrations of monolingual users of different languages. The processes of language shift, language attrition, language loss, language endangerment and language death have also been investigated. A related but different process is competition between different groups of language users. While groups with more socio-economic power often have a crucial impact on the spread of a particular language, the size of the speaker group also plays a significant role. It has been shown that a single monolingual speaker of a particular language may hold the key to the survival of the language in the bilingual community, as the bilingual speakers try to accommodate the monolingual speaker. It then follows that a relatively small number of language users could have a snowballing effect and prompt the majority to use a specific language on Twitter. A number of mathematical models for language competition have been proposed.

Social scientists, especially sociolinguists, have long been interested in the role language plays in the formation of social networks and in how structures of social networks impact on language practices. Relatively little is known about the role multilingualism plays in forming these networks and how the virtual networks impact on multilingual practices. While it is expected that speakers would identify themselves more easily with others who share the same languages, therefore forming language-specific clusters, it is not clear how monolinguals and bilinguals would pattern in relation to each other. Understanding the pattern of connections between monolingual and bilingual speakers would not only offer a new perspective on multilingualism on social media, but also provide new insights into the societal structures and human relations in multilingual societies.

Figure 1: Visualization of Qatar Twitter network. Each node represents a user and each edge a following relationship in the Twitter network. Each node is colored by the language usage in the corresponding user's tweets. AR-EN bilingual users are located between monolingual groups.

Figure 2: Language distribution of the Twitter users for each region. The upper bar illustrates the distribution of language usage, and the lower bar shows the distribution of mono-, bi-, and trilingual users. In all regions, fewer than 1% of users are trilingual.

DATA COLLECTION AND LANGUAGE

We collected Twitter data (recent tweets and friend/follower lists) from two countries (Qatar, Switzerland) and Quebec province in Canada. We identified Twitter accounts from two sources. First, we used Twitter streams provided by GNIP as part of a trial period. These streams comprise (i) about 28 hours of Firehose stream and (ii) two weeks of decahose stream, both around June to August 2013. Users with at least one geo-tagged tweet from Qatar and Switzerland are considered as candidates. Another source of Twitter accounts is the location information from the public user profile using Followerw. To capture users with only the city names, for each of the three countries we compiled lists of cities with more than 106 inhabitants, along with the names translated into multiple languages using Wikipedia entries. After identifying users from the regions of interest, we used the Twitter API to crawl all their friends (= followings) and followers, and up to 3,200 recent tweets for each user. Table 1 shows the statistics of the Twitter data. We ignored inactive users with fewer than 5 tweets.

Then, we classified the language used in each tweet, which is not a trivial task because a large number of tweets contain very little information for language classification and a single tweet can mix multiple languages. To optimize language classification accuracy against these challenges, we aggregated all tweets for each user into a document of tweets. After removing mentions, hashtags, and URLs, we used Compact Language Detector 2 and detected the top three languages with their approximate percentages of the text bytes in the document. We define users as speaking a language if they have ≥ 15% of text bytes written in the respective language. Figure 2 shows the distribution of language usage.
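The user labeling rule from the data collection step (strip mentions, hashtags, and URLs, then count a language as spoken when it covers at least 15% of a user's text bytes) can be illustrated with a minimal sketch. The function names `clean_tweet` and `lingua_label` and the toy percentages are hypothetical; the per-language percentages are assumed to come from a language detector such as Compact Language Detector 2, which is not called here.

```python
import re
from typing import Dict

# Mentions, hashtags, and URLs are removed before language detection.
_NOISE = re.compile(r"(?:@\w+|#\w+|https?://\S+)")


def clean_tweet(text: str) -> str:
    """Drop mentions, hashtags, and URLs, then collapse whitespace."""
    return " ".join(_NOISE.sub(" ", text).split())


def lingua_label(byte_share: Dict[str, float], threshold: float = 15.0) -> str:
    """Build a lingua label such as 'AR-EN' from per-language percentages of text bytes.

    byte_share maps a language code to its percentage of the user's text bytes
    (assumed to be produced by a language detector run on the user's aggregated tweets).
    """
    spoken = sorted(code.upper() for code, pct in byte_share.items() if pct >= threshold)
    return "-".join(spoken) if spoken else "UNKNOWN"


print(clean_tweet("Hello @bob check http://t.co/x #qatar now"))  # Hello check now
print(lingua_label({"ar": 62.0, "en": 31.0, "fr": 4.0}))         # AR-EN
```

Sorting the language codes gives every user with the same language mix the same label, regardless of which language dominates.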
Figure 3: Network measurements for lingua groups in each multilingual region. Diversity indices D1 and D2 are calculated based on the distribution of the edges toward the lingua groups. The Self-follow Index measures the probability that a new edge from a user would be directed to another user in the same lingua group. + marks the baseline for the Self-follow Index, based on purely random edges.

QUANTIFYING LANGUAGE USE AND NETWORK CHARACTERISTICS

In this section, we present a quantitative analysis for discovering social structure in the multilingual societies. After labeling each user by the language they tweet, we created a network for each region with nodes representing users and directed edges representing Twitter following relationships. Table 1 shows the statistics and network measurements for the five networks.

We first define a lingua as a mono- or multi-lingual combination of languages. We also define S as the set of all possible linguae, for instance, {EN, AR, AR-EN, DE-EN-FR, ...}. The "lingua group" for lingua i in region r is the set of users speaking lingua i in region r. We quantify the distribution of linguae for each region r using Shannon entropy:

$H(r) = -\sum_{i \in S} p^r_i \log p^r_i$

where $p^r_i$ indicates the fraction of users speaking lingua i over the Twitter population of our data in region r. Higher H(r) indicates that there is an even distribution of linguae over the population, and the entropy value for each region is shown in Table 1. As Figure 2 shows, Switzerland has the most diverse distribution of linguae, followed by Qatar and Quebec.

Table 1: Data Statistics and network measurements. Edges (directed) represent followings. ∗For H(r), a higher value indicates that the distribution of languages in the region is less skewed. ∗∗GCC, node coverage of the greatest connected component over the network, is calculated with undirected edges.

NETWORK MEASUREMENTS

Figure 1 shows the visualization of the networks. We used software for plotting with the Yifan Hu graph layout algorithm. For the purpose of illustration, we highlighted the bilingual users. This visualization shows that monolingual users cluster together, while bilinguals are located between the two monolingual groups. To quantify this observation, we measured the diversity of outlinks for each lingua group. We first labeled and grouped users by the lingua. Then, we counted the edges between the groups. To calculate this, we define $e^r_{ij}$ as the edge from user i to j in region r, pointing from Twitter follower i to the friend j, and $E^r_{ij}$ as the set of respective edges. Then we define $P^r_{ij}$ as the proportion of the number of edges from i to j over all outgoing edges from i.

Diversity Measures. We first define D1 as the Shannon entropy of the outgoing links of language subgroup i in region r. A lower D1 indicates that outgoing edges are more concentrated on a single lingua group. We define D2 using the Simpson index, a special case of inverse true diversity. D2 is also used as the participation coefficient, which measures how 'well-distributed' a node's connections are. By definition, a lingua group with lower D1 or D2 has outlinks that are more concentrated on a small set of lingua groups.

Since these metrics can only quantify the general shape of the distribution, whether it is diverse or concentrated, we also define the self-follow index: the probability of making a homogeneous connection when creating a new edge from each user in a group. More intra-group edges and fewer inter-group edges indicate that the average user tends to follow another user from the same language group. The self-follow index is defined for each user a in lingua group i of region r, and the self-follow index of a lingua group is the average of the self-follow indices of all users in the group.
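The diversity measures and the self-follow index defined in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names and the toy edge list are hypothetical, and the formulas are assumed to be $P_{ij} = |E_{ij}| / \sum_k |E_{ik}|$, $D_1(i) = -\sum_j P_{ij} \log P_{ij}$ (Shannon entropy), and $D_2(i) = 1 - \sum_j P_{ij}^2$ (one minus the Simpson index, the usual form of the participation coefficient). The self-follow index of a user is assumed to be the share of that user's followings that stay inside the user's own lingua group.

```python
import math
from collections import Counter, defaultdict


def group_outlink_distribution(edges, lingua_of):
    """P[i][j]: share of edges leaving lingua group i that point to lingua group j."""
    counts = defaultdict(Counter)
    for follower, friend in edges:
        counts[lingua_of[follower]][lingua_of[friend]] += 1
    return {
        i: {j: n / sum(row.values()) for j, n in row.items()}
        for i, row in counts.items()
    }


def d1(p_row):
    """Shannon entropy of one group's outlink distribution (assumed form of D1)."""
    return -sum(p * math.log(p) for p in p_row.values() if p > 0)


def d2(p_row):
    """One minus the Simpson index of the outlink distribution (assumed form of D2)."""
    return 1.0 - sum(p * p for p in p_row.values())


def self_follow_index(edges, lingua_of):
    """Per group: mean over users of the share of followings inside the user's own group.

    Users without any outgoing edge are skipped.
    """
    total, same = Counter(), Counter()
    for follower, friend in edges:
        total[follower] += 1
        same[follower] += int(lingua_of[follower] == lingua_of[friend])
    per_group = defaultdict(list)
    for user, n in total.items():
        per_group[lingua_of[user]].append(same[user] / n)
    return {g: sum(v) / len(v) for g, v in per_group.items()}


# Toy network: (follower, friend) pairs; labels are hypothetical.
lingua_of = {"a": "AR", "b": "AR", "c": "AR-EN", "d": "EN"}
edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("c", "d"), ("d", "c")]

P = group_outlink_distribution(edges, lingua_of)
for group in sorted(P):
    print(group, round(d1(P[group]), 3), round(d2(P[group]), 3))
print(self_follow_index(edges, lingua_of))
```

In this toy network the monolingual AR group has a lower D2 and a higher self-follow index than the AR-EN group, which is the pattern the measures are designed to expose.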
Monolinguals Cluster Together. For all three regions, we found that monolingual groups consistently show lower D1 scores than multilingual groups. As with D1, monolingual groups have lower D2 when compared to the multilingual groups in the same region, except for the English monolinguals. Figure 3 shows both D1 and D2 for each group. We also found that monolingual lingua groups have a higher self-follow index than any bilingual group in all regions. The results from the three diversity metrics and the intra-inter edge ratio suggest that users in monolingual subgroups have a strong tendency to follow users inside of the same subgroup, while bilinguals do not.

Users of Local Language have Higher Influence. We explore the question of language use and social status, which we estimate simply with the number of followers, as studies have shown that tweets from a user with a high in-degree are more likely to be retweeted. We first look at the mean and median of the number of followers and friends of users in each lingua group within the network. This is to approximate the user's intra-region social status by excluding the effect toward the outside of the network in our data. To minimize the effects of outliers we removed the top and bottom 10% of users for the number of followers and friends. The corresponding figure shows the average number of followers and friends for each lingua group in the three regions, and the medians closely followed the means. In all three regions and all lingua groups, we found that the number of friends is always larger than the number of followers. Also, for all three regions, users tweeting in the local language have more followers and friends, even when there are more English monolinguals in the dataset, such as in Switzerland and Quebec. This phenomenon shows that users tweeting in the local language exert higher influence within the regional network.

ANALYSIS OF BILINGUAL GROUPS

In this section, we analyze questions such as: Do bilinguals act as bridges between monolingual groups? Is there a pattern of language convergence where, say, when your audience contains a certain fraction of English-only speakers you switch to English?

Bilinguals and English Act as Bridges. The previous section showed that monolinguals form clusters. Now we analyze the mono- and bilingual bridges that glue multilingual societies together, as well as help to avoid language-ghettoization. We first visualize how monolingual and multilingual users follow each other. Figure 4 shows the ratio of follows among lingua groups in the three regions. A node represents a lingua group, and the size of a node corresponds to the relative number of users in that group. We only show nodes for the lingua groups that are represented in Figure 3. Only edges with weight higher than 10% are shown to avoid visual clutter. To calculate the numbers underlying the figure, we first get the follow distribution toward lingua groups for each user, then average the distributions for all users in the same lingua group.

We found that bilingual groups bridge monolingual groups. The key findings are (i) English acts as a hub language, meaning monolingual groups are connected through an X-EN bilingual group, or through the EN group, (ii) bilingual group X-Y bridges two monolingual groups X and Y, (iii) in-group following takes the largest proportion for monolingual groups, and (iv) monolingual users do not follow monolingual users of another language.

For all three regions, we found that English users communicate with bilingual groups. Specifically, monolingual groups are strongly connected with respective bilingual groups, which in turn are connected to the EN group. Such a connection property forms a star-shaped network with the EN group as a hub. Our observation that English acts as a hub language revalidates the prior finding that English is used as a lingua franca on Twitter.

Figure 4: Following patterns among lingua groups. The color of an edge corresponds to the color of its source node. The following distribution is normalized for each user and averaged over the group. The number on each edge is the ratio of those followings over all followings from the source group. Only edges with weight > 0.10 are shown.

Language Convergence Consistent Across Regions. Given that the tweets are broadcast to every follower, how should multilinguals choose their language? A game-theoretic approach with an objective of "maximizing the audience" might predict that users should switch to the language spoken by the largest fraction of their audience. This could then quickly lead to a global convergence to a single lingua franca and pose a threat to the preservation of language. We investigate this issue by looking at the language distribution of bilingual users on Twitter. A user who at least occasionally tweets in a different language has a choice and could use either language. How does their tweet mixing ratio, i.e., the fraction of tweets in English, depend on the mixing ratio of their followers, i.e., the average tweet mixing ratio of their followers, bilingual or not? If we were to observe a steep, threshold-like shape of the curve for English, where bilinguals predominantly use English as soon as a small fraction of their followers use only English, then this would spell trouble for the "native" languages.
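The follow ratios among lingua groups (each user's follow distribution over lingua groups, averaged over the users of a group, with edges at or below 10% hidden) can be sketched as follows. The function name `group_follow_ratios` and the toy data are hypothetical, and the sketch assumes that users with no followings are simply left out of the average.

```python
from collections import Counter, defaultdict


def group_follow_ratios(edges, lingua_of, min_weight=0.10):
    """Average per-user follow distributions within each lingua group.

    edges: iterable of (follower, friend) pairs.
    lingua_of: maps each user to a lingua label.
    Returns {source_group: {target_group: weight}} keeping weights > min_weight.
    """
    per_user = defaultdict(Counter)
    for follower, friend in edges:
        per_user[follower][lingua_of[friend]] += 1

    sums = defaultdict(lambda: defaultdict(float))
    n_users = Counter()
    for user, counts in per_user.items():
        total = sum(counts.values())
        group = lingua_of[user]
        n_users[group] += 1
        for target, n in counts.items():
            sums[group][target] += n / total  # normalize per user first

    ratios = {}
    for group, row in sums.items():
        averaged = {t: s / n_users[group] for t, s in row.items()}
        ratios[group] = {t: w for t, w in averaged.items() if w > min_weight}
    return ratios


# Toy network with hypothetical labels.
lingua_of = {"a": "AR", "b": "AR", "c": "AR-EN", "d": "EN", "e": "EN"}
edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("d", "c"), ("e", "d")]

for source, row in sorted(group_follow_ratios(edges, lingua_of).items()):
    print(source, row)
```

Normalizing per user before averaging keeps a few very active accounts from dominating a group's edge weights.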
QATAR, AR-EN BILINGUAL
national doha development vision
government national qa foundation
hamad bin god sheikh tamim emir
gcc countries
doha egypt kuwait bahrain uae
day national doha gcc about today
doha volleyball football uae photo
instagram katara instagood love
job please send cv recruitment org
SWITZERLAND, EN-FR BILINGUAL
more country politics no federal
ch news tar info thank you radio
job manager head senior engineer
ski weather romandie snow rentals
lake sun sky sunset beautiful night
love like time good show morning
club dj party enjoy welcome house
fashion valais basel wine beautiful
Table 2: Part of the topics discovered from tweets that contain country hashtags and are posted by bilinguals. Topics are manually labeled from the top words. We translated the most frequent words into English. We do not display stopwords and region names. For all three regions, tweets containing a local language hashtag are mainly of informative/political/debatable topics, while tweets containing an English hashtag are event/tour/enjoyment topics.

Figure 5: Trendline for the relation between language usage of a bilingual user and their monolingual friends. The X-axis indicates the average percentage of monolingual followers for a user and the language of interest. The Y-axis indicates the average percentage of a bilingual user's tweets in that language.
Figure 5 plots the language distribution of a bilingual user's monolingual friends on the x-axis and the language distribution of bilingual users with the corresponding language distribution among followers on the y-axis. To draw a trend line, we divide the x-axis into ten bins, and average users' proportion of language usage for each bin. Users having < 5 followers in the induced network are filtered out. The observed pattern is very consistent among languages and geographical regions. Simply put, bilingual users "mimic" the language mix observed among their followers.⁴ Though the equality does not hold for users at either end of the spectrum, say, with 90% English among their followers, this is likely an artifact as we only consider bilinguals. While it is still possible that a large number of users have already been converted to monolingual users, the very smooth and consistent pattern suggests that the language conversion process is more gradual than one might have expected.

Bilinguals Post Different Stories in Different Language. To gain a deeper understanding of the role of bilingual users in a multilingual society, we analyze the contents of the tweets from bilingual users. To observe any systematic differences in language use, we use a parallel set of tweets containing the same hashtags in two languages, and train a topic model on those tweets to reveal the differences in the topics of the same hashtags. For this analysis, we train a latent Dirichlet allocation (LDA) model to discover the topics and analyze any differences in information depending on the language. We did not use the polylingual topic model as it requires a corpus of documents in different languages with similar topics. After running LDA for each language, we translated the top words into English. We avoid the topic alignment problem by using the set of translated hashtag pairs. For instance, we use (#suisse - #switzerland) for EN-FR in Switzerland, and respective country hashtag pairs for other regions. We set the number of topics for each set, k = 10, and after fitting the model we use the Google translate service to translate words into English. We set α = 50/k and β = 0.01 for LDA. Table 2 shows part of the topics from two regions. We found that in all bilingual groups in the three regions, bilingual users post informational and political tweets for the local audience in the local language. They, on the other hand, post events, tourism, photography, and other leisure-related tweets in English for the non-local audience. These results show that our methodology of identifying bilingual Twitter users and analyzing the topics of their tweets can reveal the semantics of communications among multiple language speakers in a multilingual society.

We presented a large-scale computational analysis of language use and network characteristics of the language-based groups in multilingual societies using Twitter data. Using the extensive set of tweets from monolingual and bilingual users from Qatar, Switzerland, and Quebec, we first discovered that monolingual users cluster together, while bilinguals do not. Then, we revealed that users speaking the local language have more influence than others. Additionally, we have shown that, surprisingly, the language-mixing ratio of bilingual users closely mirrors the mix of their followership. Then we showed that bilinguals bridge between monolinguals with English as a hub, while monolinguals tend not to directly follow each other. Finally, with the statistical topic model, we discovered that bilinguals express informative/political/debatable topics in a local language, while posting event/tour/enjoyment topics in English.

⁴We observed nearly identical plots when we also included bilingual followers for the x-axis and "bucketed" them according to their tweet mixing ratio.

This research was funded by the MSIP (Ministry of Science, ICT & Future Planning), Korea in the ICT R&D Program 2014.
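As an illustration of the trend-line construction used in the language convergence analysis (ten equal-width bins on the x-axis, a per-bin average of each user's language share, and a filter on users with fewer than 5 followers), a minimal sketch is given below. The function name `binned_trend`, the tuple layout, and the sample values are hypothetical and are not taken from the study's data.

```python
def binned_trend(users, n_bins=10, min_followers=5):
    """Average y per equal-width x bin.

    users: iterable of (x, y, n_followers), where x is the share of a language
    among a user's followers and y is the share of the user's own tweets in
    that language, both in [0, 1].
    Returns a list of (bin_center, mean_y) for non-empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y, n_followers in users:
        if n_followers < min_followers:
            continue  # too few followers in the induced network
        b = min(int(x * n_bins), n_bins - 1)  # x == 1.0 falls into the last bin
        sums[b] += y
        counts[b] += 1
    return [((b + 0.5) / n_bins, sums[b] / counts[b])
            for b in range(n_bins) if counts[b]]


# Hypothetical sample: the last user is dropped by the follower filter.
sample = [(0.05, 0.10, 12), (0.08, 0.20, 7), (0.55, 0.50, 30),
          (0.95, 0.80, 9), (0.50, 0.90, 3)]
trend = binned_trend(sample)
for center, mean_y in trend:
    print(f"{center:.2f} {mean_y:.2f}")
```

A trend line close to the diagonal would correspond to bilingual users mirroring the language mix of their followers.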
[1] D. M. Abrams and S. H. Strogatz. Linguistics: Modelling the dynamics of language death. Nature, 2003.
[2] J. Androutsopoulos. Language choice and code-switching in german-based diasporic web forums. The multilingual Internet, 2007.
[3] J. Anis. Neography: Unconventional spelling in french sms text messages. The multilingual Internet, 2007.
[4] A.-S. Axelsson, Å. Abelin, and R. Schroeder. Anyone speak swedish? tolerance for language shifting in graphical multi-user virtual environments. The multilingual Internet,
[5] G. Bailey, J. Goggins, and T. Ingham.
[6] S. Bergsma, P. McNamee, M. Bagdouri, C. Fink, and T. Wilson. Language identification for creating language-specific twitter collections. In LASM. ACL, 2012.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.
[8] P. Bourdieu. Distinction: A social critique of the judgement of taste. Harvard University Press, 1984.
[9] S. Carter, W. Weerkamp, and M. Tsagkias. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 2013.
[10] X. Castelló, V. M. Eguíluz, and M. San Miguel. Ordering dynamics with two non-excluding options: bilingualism in language competition. New Journal of Physics, 2006.
[11] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in twitter: The million follower fallacy. ICWSM, 2010.
[12] S. Climent, J. Moré, A. Oliver, M. Salvatierra, I. Sànchez, and M. Taulé. Can machine translation enhance the status of catalan versus spanish in online academic forums? B. Danet & S. Herring (Eds.), The multilingual Internet, 2007.
[13] X. Daming, W. Xiaomei, and L. Wei. 15 social network analysis. The Blackwell guide to research methods in bilingualism and multilingualism, 2008.
[14] J.-M. Dewaele. Blistering barnacles! what language do multilinguals swear in? Estudios de Sociolinguistica, 2004.
[15] N. C. Dorian. Language death: The life cycle of a Scottish Gaelic dialect. University of Pennsylvania Press Philadelphia, 1981.
[16] N. C. Dorian. Investigating obsolescence: Studies in language contraction and death. Cambridge University Press, 1992.
[17] M. Durham. Language choice on a swiss mailing list. JCMC,
[18] J. A. Fishman. Reversing language shift: Theoretical and empirical foundations of assistance to threatened languages. Multilingual matters, 1991.
[19] S. Gal. Language shift: Social determinants of linguistic change in bilingual Austria. Academic Press New York,
[20] L. A. Grenoble and L. J. Whaley. Endangered languages: Language loss and community response. Cambridge University Press, 1998.
[21] R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 2005.
[22] C. K.-M. Lee. Linguistic features of email and icq instant messaging in hong kong. The multilingual Internet, 2007.
[23] W. Li. Three generations, two languages, one family: Language choice and language shift in a Chinese community in Britain. Multilingual Matters, 1994.
[24] M. Mainguy, Y. Nakai, and M. Takayama.
[25] E. Manley.
[26] L. Milroy. Language and social networks. B. Blackwell,
[27] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. Polylingual topic models. In EMNLP, pages 880–889, 2009.
[28] J. W. Minett and W. S. Wang. Modelling endangered languages: The effects of bilingualism and social structure. Lingua, 2008.
[29] S. S. Mufwene. Language evolution: Contact, competition and change. Continuum International Publishing Group,
[30] T. Poell and K. Darmoni. Twitter as a multilingual space: The articulation of the tunisian revolution through #sidibouzid. NECSUS, 2012.
[31] M. S. Schmid. Language attrition. Cambridge University Press, Cambridge, UK, 2011.
[32] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In SocialCom. IEEE, 2010.
[33] R. Wardhaugh. Languages in competition: Dominance, diversity, and decline. B. Blackwell, 1987.
[34] M. Warschauer, G. R. E. Said, and A. G. Zohry. Language choice online: Globalization and identity in egypt. JCMC,
Source: http://uilab.kaist.ac.kr/research/HT14/twitter_languages.pdf