
Sociolinguistic Analysis of Twitter in
Republic of Korea Research Institute University of London Republic of Korea

In a multilingual society, language not only reflects culture and heritage, but also has implications for social status and the degree of integration in society. Different languages can be a barrier between monolingual communities, and the dynamics of language choice could explain the prosperity or demise of local languages in an international setting. We study this interplay of language and network structure in diverse, multilingual societies, using Twitter. In our analysis, we are particularly interested in the role of bilinguals. Concretely, we attempt to quantify the degree to which bilingual users are the "bridge-builders" between monolingual language groups, while monolingual users cluster together. Also, in addition to revalidating English as a lingua franca on Twitter, we reveal that users of the native non-English language have higher influence than English users, and that the language convergence pattern is consistent across the regions. Furthermore, we explore for which topics these users prefer their native language rather than English. To the best of our knowledge, this is the largest sociolinguistic study in a network setting.

Categories and Subject Descriptors
J.4 [Social and Behavioral Sciences]: Sociology

Multilingualism; Sociolinguistics; Topic Modeling; Social Media

The language we speak is an integral part of our culture. We use it to communicate, to transmit facts and emotions, and to navigate the social environment surrounding us. In multilingual societies such as Canada or Switzerland, the spoken language can even be a political statement with wide-ranging implications. In some cases, governments subsidize programs to save a language from disappearing. At the individual level, people who are fortunate to be bilingual constantly make a choice in favor of or against one or the other language. Anecdotally, even after mastering a second language many people continue to count (and swear) in their mother tongue. At the societal level, the question arises to which degree bilinguals are the "glue" that keeps multilingual societies together.

We study the phenomena of multilingual societies and the role that bilinguals play in them by using large amounts of Twitter data. Social media data has the fascinating component of also containing a network. These social links allow us to investigate the interaction between a user's language and their social surroundings. Understanding this interaction has a number of potential implications:
• Preservation of a language. Assume that you are bilingual and that all of your friends understand English reasonably well, but not all understand your native language. Should you switch your language to maximize your audience size?
• Social capital and potential issues of segregation. Is it possible to build social ties across language barriers? Which role do bilinguals play in this "bridge-building"?
• Social status and language assimilation. Eliza Doolittle in George Bernard Shaw's "Pygmalion"/"My Fair Lady" underwent a huge change in social status by learning a new language, though just a "high class dialect" in this case. Generally, are there elite languages in multilingual societies?
• Language selection. How do bilinguals choose one language over the other for a given topic? Do they prefer their mother tongue for issues "close to the heart"? Correspondingly, is the same topic discussed differently in different languages?

We explore these questions with large-scale Twitter data from several multilingual societies. We analyze the Twitter following behavior to uncover whether monolingual users form tightly connected clusters, what bridging roles multilingual users play, and which language groups show higher social status. We apply language processing to analyze the amount of language usage depending on the surrounding network, and probabilistic topic modeling to discover the differences in topics in different languages by multilingual users. Methodologically, we propose metrics for quantifying language use and network diversity in multilingual Twitter networks, and we illustrate techniques from machine learning applied to multilingual tweets.

∗This work was done while the first author was at Qatar Computing Research Institute.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in HT'14.
There is now widespread recognition among linguists that social media such as Twitter are highly multilingual and provide an immense volume of real-world language data. Several studies in sociolinguistics have explored Twitter and other online language, and in particular, social scientists have examined the strategic use of multilingualism on Twitter in recent political movements. However, these studies are limited in scope; a systematic analysis of multilingualism requires computational tools that combine network analysis and language processing at a large scale.

More recently, sociolinguists, together with computer scientists, have tried to map out linguistic diversity through spatial and temporal analyses of multilingual Twitter. Studies have revealed the extensive use of a large number of different languages in Manchester, discovering that Twitter users in Manchester are connected globally and use languages other than those recorded in the local census. Similarly, Twitter has been used to detect the extent of multilingualism in London, which revealed that there are specific geographical concentrations of monolingual users of different languages. The processes of language shift, language attrition, language loss, language endangerment and language death were also investigated. A related but different process is competition between different groups of language users. While groups with more socio-economic power often have a crucial impact on the spread of a particular language, the size of the speaker group also plays a significant role. It has been shown that a single monolingual speaker of a particular language may hold the key to the survival of the language in the bilingual community, as the bilingual speakers try to accommodate the monolingual speaker. It then follows that a relatively low number of language users could have a snowballing effect and prompt the majority to use a specific language on Twitter. A number of mathematical models for language competition have been proposed.

Social scientists, especially sociolinguists, have long been interested in the role language plays in the formation of social networks and in how structures of social networks impact on language practices. Relatively little is known about the role multilingualism plays in forming these networks and how the virtual networks impact on multilingual practices. While it is expected that speakers would identify themselves more easily with others who share the same languages, therefore forming language-specific clusters, it is not clear how monolinguals and bilinguals would pattern in relation to each other. Understanding the pattern of connections between monolingual and bilingual speakers would not only offer a new perspective on multilingualism on social media, but also provide new insights into the societal structures and human relations in multilingual societies.

Figure 1: Visualization of Qatar Twitter network. Each node and edge represents a user and a following in the Twitter network. Each node is colored by the language usage of the corresponding user's tweets. AR-EN bilingual users are located between monolingual groups.

Figure 2: Language distribution of the Twitter users for each region. The upper bar illustrates the distribution of language usage, and the lower bar shows the distribution of mono-, bi-, and trilingual users. For all regions there are < 1% of trilingual users.

DATA COLLECTION AND LANGUAGE

We collected Twitter data (recent tweets and friend/follower lists) from two countries (Qatar, Switzerland) and Quebec province in Canada. We identified Twitter accounts from two sources. First, we used Twitter streams provided by GNIP as part of a trial period. These streams comprise (i) about 28 hours of Firehose stream and (ii) two weeks of decahose stream, both around June to August 2013. Users with at least one geo-tagged tweet from Qatar and Switzerland are considered as candidates. Another source of Twitter accounts is the location information from the public user profile, obtained using Followerwonk. To capture users with only the city names, for each of the three countries we compiled lists of cities with more than 10^6 inhabitants, along with the names translated into multiple languages using Wikipedia entries. After identifying users from the regions of interest, we used the Twitter API to crawl all their friends (= followings) and followers, and up to 3,200 recent tweets for each user. Table 1 shows the statistics of the Twitter data.

We ignored inactive users with fewer than 5 tweets. Then, we classified the language used in each tweet, which is not a trivial task because a large number of tweets contain very little information for language classification and a single tweet can mix multiple languages. To optimize language classification accuracy against these challenges, we aggregated all tweets for each user into a document of tweets. After removing mentions, hashtags, and URLs, we used Compact Language Detector 2 and detected the top three languages with their approximate percentages of the text bytes in the document. We define users as speaking a language if they have ≥ 15% of text bytes written in the respective language. Figure 2 shows the distribution of language usage.

Figure 3: Network measurements for lingua groups in each multilingual region. Diversity indices D1 and D2 are calculated based on the distribution of the edges toward the lingua groups. Self-follow Index measures the probability that a new edge from a user would be directed to another user in the same lingua group. + marks the baseline for Self-follow Index, based on purely random edges.
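The language-labeling rule from the data collection step (a user speaks a language if at least 15% of their aggregated text bytes are in it) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name `assign_lingua`, the alphabetical ordering of codes, and the `UNKNOWN` fallback are hypothetical, and the per-language percentages are passed in directly instead of being produced by a language detector.

```python
from typing import Dict

# Threshold stated in the text: at least 15% of text bytes.
LANGUAGE_THRESHOLD = 15.0


def assign_lingua(byte_percentages: Dict[str, float],
                  threshold: float = LANGUAGE_THRESHOLD) -> str:
    """Build a lingua label such as 'AR-EN' from per-language byte shares.

    `byte_percentages` maps a language code to the share (0-100) of a
    user's aggregated tweet text detected in that language. The detector
    itself is outside this sketch; the shares are plain inputs here.
    """
    spoken = sorted(code.upper()
                    for code, share in byte_percentages.items()
                    if share >= threshold)
    # 'UNKNOWN' is an assumed fallback for users below the threshold everywhere.
    return "-".join(spoken) if spoken else "UNKNOWN"


if __name__ == "__main__":
    print(assign_lingua({"ar": 62.0, "en": 35.0, "fr": 3.0}))
    print(assign_lingua({"en": 97.0, "de": 3.0}))
```

Sorting the codes makes the label independent of detector output order, so the same pair of languages always maps to the same lingua group.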
QUANTIFYING LANGUAGE USE AND NETWORK CHARACTERISTICS

In this section, we present a quantitative analysis for discovering social structure in the multilingual societies. After labeling each user by the language they tweet, we created a network for each region with nodes representing users and directed edges representing Twitter following relationships. Table 1 shows the statistics and network measurements for the five networks.

We first define a lingua as a mono- or multilingual combination of languages. We also define S as the set of all possible linguae, for instance, {EN, AR, AR-EN, DE-EN-FR, ...}. The "lingua group" of lingua i in region r is the set of users speaking lingua i in region r. We quantify the distribution of linguae for each region r using Shannon entropy:

$$H(r) = -\sum_{i \in S} p^r_i \log p^r_i$$

where $p^r_i$ indicates the fraction of users speaking lingua i over the Twitter population of our data in region r. Higher H(r) indicates that there is an even distribution of linguae over the population, and the entropy value for each region is shown in Table 1. As Figure 2 shows, Switzerland has the most diverse distribution of linguae, followed by Qatar and Quebec.

Table 1: Data statistics and network measurements. Edges (directed) represent followings. ∗H(r) is the Shannon entropy of linguae, for which a higher value indicates that the distribution of languages in the region is less skewed. ∗∗GCC, node coverage of the greatest connected component over the network, is calculated with undirected edges.

NETWORK MEASUREMENTS

Figure 1 shows the visualization of the networks. We plotted the network with the Yifan Hu graph layout algorithm. For the purpose of illustration, we highlighted the bilingual users. This visualization shows that monolingual users cluster together, while bilinguals are located between the two monolingual groups. To quantify this observation, we measured the diversity of outlinks for each lingua group. We first labeled and grouped users by lingua. Then, we counted the edges between the groups. To calculate this, we define $e^r_{ij}$ as an edge from i to j in region r, pointing from the Twitter follower i to the friend j, and $E^r_{ij}$ as the set of respective edges. Then we define $P^r_{ij}$ as the proportion of the number of edges from i to j over all outgoing edges from i. Formally,

$$P^r_{ij} = \frac{|E^r_{ij}|}{\sum_{k \in S} |E^r_{ik}|}$$

Diversity Measures. We first define D1, based on the Shannon entropy of outgoing links for language subgroup i in region r, so that

$$D1^r_i = -\sum_{j \in S} P^r_{ij} \log P^r_{ij}$$

A lower D1 indicates that outgoing edges are more concentrated on a single lingua group. We define D2 using the Simpson index, a special case of inverse true diversity. D2 is used in [21] as the participation coefficient, to measure how "well-distributed" a node's connections are, and by definition a lingua group with lower D1 or D2 has outlinks that are more concentrated on a small set of lingua groups. Since these metrics can only quantify the general shape of the distribution, whether it is diverse or concentrated, we also define the self-follow index, the probability of making a homogeneous connection when creating a new edge from each user in a group. Higher intra-edges with lower inter-edges indicate that the average user tends to follow another user from the same language group. The self-follow index of user a in region r and lingua group i is the probability that a new edge from a is directed to another user in lingua group i, and the self-follow index of a lingua group is the average of the self-follow indices of all users in the group.

Monolinguals Cluster Together. For all three regions, we found that monolingual groups consistently show lower D1 scores than multilingual groups. As in D1, monolingual groups have higher D2 when compared to the multilingual groups in the same region except for the English monolinguals. Figure 3 shows both D1 and D2 for each group. We also found that monolingual lingua groups have a higher self-follow index than any bilingual group in all regions. The results from the three diversity metrics and the intra-inter edge ratio suggest that users in monolingual subgroups have a strong tendency to follow users inside the same subgroup, while bilinguals do not.
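To make the measures above concrete, here is a minimal Python sketch on a toy follow network. Several points are assumptions and not confirmed by the text: edge proportions are pooled over all users of a source group, natural logarithms are used for D1, D2 is taken as the Gini-Simpson form (one minus the sum of squared proportions, matching the participation coefficient reading), and the self-follow index of a user is approximated by the share of their followings that stay inside their own group. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy data (hypothetical): user -> lingua, and (follower, friend) edges.
LINGUA_OF = {"u1": "EN", "u2": "EN", "u3": "AR", "u4": "AR", "u5": "AR-EN"}
EDGES = [("u1", "u2"), ("u2", "u1"), ("u3", "u4"), ("u4", "u3"),
         ("u5", "u1"), ("u5", "u3"), ("u1", "u5"), ("u3", "u5")]


def group_edge_proportions(edges, lingua_of):
    """P[i][j]: share of edges leaving lingua group i that point to group j."""
    counts = defaultdict(Counter)
    for follower, friend in edges:
        counts[lingua_of[follower]][lingua_of[friend]] += 1
    return {source: {target: n / sum(row.values()) for target, n in row.items()}
            for source, row in counts.items()}


def shannon_diversity(proportions):
    """D1-style index: Shannon entropy of a group's outgoing-edge shares."""
    return -sum(p * math.log(p) for p in proportions.values() if p > 0)


def gini_simpson(proportions):
    """D2-style index under the assumed form 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in proportions.values())


def self_follow_index(edges, lingua_of):
    """Per group: mean over users of the share of followings in the own group."""
    same, total = Counter(), Counter()
    for follower, friend in edges:
        total[follower] += 1
        if lingua_of[follower] == lingua_of[friend]:
            same[follower] += 1
    per_group = defaultdict(list)
    for user, n in total.items():
        per_group[lingua_of[user]].append(same[user] / n)
    return {group: sum(v) / len(v) for group, v in per_group.items()}


if __name__ == "__main__":
    P = group_edge_proportions(EDGES, LINGUA_OF)
    for group, row in sorted(P.items()):
        print(group, round(shannon_diversity(row), 3), round(gini_simpson(row), 3))
    print(self_follow_index(EDGES, LINGUA_OF))
```

On this toy network the bilingual group spreads its followings evenly over both monolingual groups, so both diversity values come out higher for AR-EN than for EN or AR, while its self-follow value is the lowest.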
Users of Local Language have Higher Influence. We explore the question of language use and social status, which we estimate simply with the number of followers, as studies have shown that tweets from a user with a high in-degree are more likely to be retweeted. We first look at the mean and median of the number of followers and friends of users in each lingua group within the network. This approximates the user's intra-region social status by excluding effects from outside the network in our data. To minimize the effects of outliers, we removed the top and bottom 10% of users for the number of followers and friends.

Figure 3 shows the average number of followers and friends for each lingua group in the three regions; the medians closely followed the means. In all three regions and all lingua groups, we found that the number of friends is always larger than the number of followers. Also, for all three regions, users tweeting in the local language have more followers and friends, even when there are more English monolinguals in the dataset, such as in Switzerland and Quebec. This phenomenon shows that users tweeting in the local language exert higher influence within the regional network.

Figure 4: Following patterns among lingua groups. The color of an edge corresponds to the color of its source. The following distribution is normalized for each user and averaged over the group. The number on each edge corresponds to the ratio of the followings over all followings from the source group. Edges having weight > 0.10 are shown.
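The outlier handling described above (dropping the top and bottom 10% of users before taking the mean and median) can be sketched as follows. This is a hedged illustration: the helper name `trim`, the choice to trim each measure separately, and rounding the cut-off down to a whole number of users are assumptions, and the follower counts are made up.

```python
from statistics import mean, median


def trim(values, fraction=0.10):
    """Drop the lowest and highest `fraction` of values (rounded down)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


if __name__ == "__main__":
    # Hypothetical follower counts for one lingua group, with one outlier.
    followers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
    kept = trim(followers)
    print(kept, mean(kept), median(kept))
```

Without trimming, the single account with 1000 followers would dominate the mean; after trimming, mean and median agree, which is the behaviour the comparison of means and medians in the text relies on.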
ANALYSIS OF BILINGUAL GROUPS

In this section, we analyze questions such as: Do bilinguals act as bridges between monolingual groups? Is there a pattern of language convergence where, say, when your audience contains a certain fraction of English-only speakers you switch to English?

Bilinguals and English Act as Bridges. The previous section showed that monolinguals form clusters. Now we analyze the mono- and bilingual bridges that glue multilingual societies together, as well as help to avoid language-ghettoization. We first visualize how monolingual and multilingual users follow each other. Figure 4 shows the ratio of follows among lingua groups in the three regions. A node represents a lingua group, and the size of a node corresponds to the relative number of users in that group. We only show nodes for the lingua groups that are represented in Figure 3. Only edges with weight higher than 10% are shown to avoid visual clutter. To calculate the numbers underlying the figure, we first get the follow distribution toward lingua groups for each user, then average the distributions for all users in the same lingua group.

We found that bilingual groups bridge monolingual groups. The key findings are (i) English acts as a hub language, meaning monolingual groups are connected through an X-EN bilingual group, or through the EN group, (ii) bilingual group X-Y bridges two monolingual groups X and Y, (iii) in-group following takes the largest proportion for monolingual groups, and (iv) monolingual users do not follow monolingual users of another language.

For all three regions, we found that English users communicate with bilingual groups. Specifically, monolingual groups are strongly connected with their respective bilingual groups, which in turn are connected to the EN group. This connection property forms a star-shaped network with the EN group as a hub. Our observation that English acts as a hub language revalidates the prior finding that English is used as a lingua franca on Twitter.

Language Convergence Consistent Across Regions. Given that tweets are broadcast to every follower, how should multilinguals choose their language? A game-theoretic approach with an objective of "maximizing the audience" might predict that users switch to the language spoken by the largest fraction of their audience. This could then quickly lead to a global convergence to a single lingua franca and pose a threat to the preservation of language. We investigate this issue by looking at the language distribution of bilingual users on Twitter. A user who at least occasionally tweets in a different language has a choice and could use either language. How does their tweet mixing ratio, i.e., the fraction of tweets in English, depend on the mixing ratio of their followers, i.e., the average tweet mixing ratio of their followers, bilingual or not? If we were to observe a steep, threshold-like shape of the curve for English, where bilinguals predominantly use English as soon as a small fraction of their followers use only English, then this would spell trouble for the "native" languages.
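The two-step calculation behind the follow-pattern figure (normalize the follow distribution per user, then average over all users of the same lingua group, and keep only edges above 10%) can be sketched in Python. The function names, the toy users, and the decision to skip users with no followings are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): user -> lingua, and (follower, friend) edges.
LINGUA_OF = {"u1": "EN", "u2": "EN", "u3": "AR", "u4": "AR", "u5": "AR-EN"}
EDGES = [("u1", "u2"), ("u2", "u1"), ("u3", "u4"), ("u4", "u3"),
         ("u5", "u1"), ("u5", "u3"), ("u1", "u5"), ("u3", "u5")]


def average_follow_distribution(edges, lingua_of):
    """Per source group: mean of each user's normalized follow distribution."""
    per_user = defaultdict(Counter)
    for follower, friend in edges:
        per_user[follower][lingua_of[friend]] += 1

    sums = defaultdict(lambda: defaultdict(float))
    users_in_group = Counter()
    for user, row in per_user.items():
        group = lingua_of[user]
        users_in_group[group] += 1
        total = sum(row.values())
        for target, n in row.items():
            sums[group][target] += n / total  # normalize per user first

    return {group: {target: s / users_in_group[group] for target, s in row.items()}
            for group, row in sums.items()}


def visible_edges(distribution, min_weight=0.10):
    """Keep only group-to-group edges whose weight exceeds `min_weight`."""
    return {(source, target): weight
            for source, row in distribution.items()
            for target, weight in row.items()
            if weight > min_weight}


if __name__ == "__main__":
    dist = average_follow_distribution(EDGES, LINGUA_OF)
    for edge, weight in sorted(visible_edges(dist).items()):
        print(edge, round(weight, 2))
```

Normalizing per user before averaging keeps a few heavy followers from dominating a group's profile, which is the reason for the order of the two steps.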
QATAR, AR-EN BILINGUAL: national doha development vision government national qa foundation hamad bin god sheikh tamim emir gcc countries doha egypt kuwait bahrain uae day national doha gcc about today doha volleyball football uae photo instagram katara instagood love job please send cv recruitment org

SWITZERLAND, EN-FR BILINGUAL: more country politics no federal ch news tar info thank you radio job manager head senior engineer ski weather romandie snow rentals lake sun sky sunset beautiful night love like time good show morning club dj party enjoy welcome house fashion valais basel wine beautiful

Table 2: Part of the topics discovered from tweets that contain country hashtags and are posted by bilinguals. Topics are manually labeled from the top words. We translated the most frequent words into English. We do not display stopwords and region names. For all three regions, tweets containing a local-language hashtag are mainly of informative/political/debatable topics, while tweets containing an English hashtag are event/tour/enjoyment topics.

Figure 5: Trendline for the relation between language usage of a bilingual user and their monolingual friends. The X-axis indicates the average percentage of monolingual followers for a user and the language of interest. The Y-axis indicates the average percentage of a bilingual user's tweets in that language.
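The trend line in Figure 5 is obtained by dividing the x-axis into ten bins and averaging the users' language-usage proportion in each bin. A minimal Python sketch of that binning follows; the function name `binned_trend`, the equal-width bins on [0, 1], the use of bin centers as x-coordinates, and the toy points are assumptions. Filtering out users with fewer than 5 followers in the induced network is assumed to have happened before the points are passed in.

```python
def binned_trend(points, n_bins=10):
    """Average y per equal-width x-bin on [0, 1].

    `points` is an iterable of (x, y) pairs, where x is the share of a
    user's monolingual followers using a language and y is the share of
    the user's own tweets in that language. Returns (bin_center, mean_y)
    for every non-empty bin.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in points:
        index = min(int(x * n_bins), n_bins - 1)  # x == 1.0 goes to the last bin
        sums[index] += y
        counts[index] += 1
    return [((b + 0.5) / n_bins, sums[b] / counts[b])
            for b in range(n_bins) if counts[b]]


if __name__ == "__main__":
    toy_points = [(0.05, 0.1), (0.07, 0.2), (0.55, 0.5), (1.0, 0.9)]
    for center, value in binned_trend(toy_points):
        print(round(center, 2), round(value, 2))
```

A trend that stays close to the diagonal in such a plot corresponds to bilingual users matching the language mix of their followers, whereas a threshold-like jump would indicate abrupt convergence to one language.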
Figure 5 plots the language distribution of a bilingual user's monolingual friends on the x-axis and the language distribution of bilingual users with the corresponding language distribution among followers on the y-axis. To draw a trend line, we divide the x-axis into ten bins, and average users' proportion of language usage for each bin. Users having < 5 followers in the induced network are filtered out. The observed pattern is very consistent among languages and geographical regions. Simply put, bilingual users "mimic" the language mix observed among their followers (see footnote 4). Though the equality does not hold for users at either end of the spectrum, say, with 90% English among their followers, this is likely an artifact as we only consider bilinguals. While it is still possible that a large number of users have already been converted to monolingual users, the very smooth and consistent pattern suggests that the language conversion process is more gradual than one might have expected.

Bilinguals Post Different Stories in Different Language. To gain a deeper understanding of the role of bilingual users in a multilingual society, we analyze the contents of the tweets from bilingual users. To observe any systematic differences in language use, we use a parallel set of tweets containing the same hashtags in two languages, and train a topic model on those tweets to reveal the differences in the topics of the same hashtags. For this analysis, we train a latent Dirichlet allocation (LDA) model [7] to discover the topics and analyze any differences in information depending on the language. We did not use the Polylingual topic model [27] as it requires a corpus of documents in different languages with similar topics. After running LDA for each language, we translated the top words into English. We avoid the topic alignment problem by using the set of translated hashtag pairs. For instance, we use (#suisse - #switzerland) for EN-FR in Switzerland, and respective country hashtag pairs for other regions. We set the number of topics for each set to k = 10, and after fitting the model we use the Google translate service to translate words into English. We set α = 50/k and β = 0.01 for LDA. Table 2 shows part of the topics from two regions. We found that in all bilingual groups in the three regions, bilingual users post informational and political tweets for the local audience in the local language. They, on the other hand, post events, tourism, photography, and other leisure-related tweets in English for the non-local audience. These results show that our methodology of identifying bilingual Twitter users and analyzing the topics of their tweets can reveal the semantics of communications among multiple language speakers in a multilingual society.

We presented a large-scale computational analysis of language use and network characteristics of the language-based groups in multilingual societies using Twitter data. Using the extensive set of tweets from monolingual and bilingual users from Qatar, Switzerland, and Quebec, we first discovered that monolingual users cluster together, while bilinguals do not. Then, we revealed that users speaking the local language have more influence than others. Additionally, we have shown that, surprisingly, the language-mixing ratio of bilingual users closely mirrors the mix of their followership. Then we showed that bilinguals bridge between monolinguals with English as a hub, while monolinguals tend not to directly follow each other. Finally, with the statistical topic model, we discovered that bilinguals express informative/political/debatable topics in a local language, while posting event/tour/enjoyment topics in English.

Footnote 4: We observed nearly identical plots when we also included bilingual followers for the x-axis and "bucketed" them according to their tweet mixing ratio.

This research was funded by the MSIP (Ministry of Science, ICT & Future Planning), Korea in the ICT R&D Program 2014.
[1] D. M. Abrams and S. H. Strogatz. Linguistics: Modelling the dynamics of language death. Nature, 2003.
[2] J. Androutsopoulos. Language choice and code-switching in german-based diasporic web forums. The multilingual Internet, 2007.
[3] J. Anis. Neography: Unconventional spelling in french sms text messages. The multilingual Internet, 2007.
[4] A.-S. Axelsson, Å. Abelin, and R. Schroeder. Anyone speak swedish? tolerance for language shifting in graphical multi-user virtual environments. The multilingual Internet,
[5] G. Bailey, J. Goggins, and T. Ingham.
[6] S. Bergsma, P. McNamee, M. Bagdouri, C. Fink, and T. Wilson. Language identification for creating language-specific twitter collections. In LASM. ACL, 2012.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.
[8] P. Bourdieu. Distinction: A social critique of the judgement of taste. Harvard University Press, 1984.
[9] S. Carter, W. Weerkamp, and M. Tsagkias. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 2013.
[10] X. Castelló, V. M. Eguíluz, and M. San Miguel. Ordering dynamics with two non-excluding options: bilingualism in language competition. New Journal of Physics, 2006.
[11] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in twitter: The million follower fallacy. ICWSM, 2010.
[12] S. Climent, J. Moré, A. Oliver, M. Salvatierra, I. Sànchez, and M. Taulé. Can machine translation enhance the status of catalan versus spanish in online academic forums? B. Danet & S. Herring (Eds.), The multilingual Internet, 2007.
[13] X. Daming, W. Xiaomei, and L. Wei. 15 social network analysis. The Blackwell guide to research methods in bilingualism and multilingualism, 2008.
[14] J.-M. Dewaele. Blistering barnacles! what language do multilinguals swear in? Estudios de Sociolinguistica, 2004.
[15] N. C. Dorian. Language death: The life cycle of a Scottish Gaelic dialect. University of Pennsylvania Press Philadelphia, 1981.
[16] N. C. Dorian. Investigating obsolescence: Studies in language contraction and death. Cambridge University Press, 1992.
[17] M. Durham. Language choice on a swiss mailing list. JCMC,
[18] J. A. Fishman. Reversing language shift: Theoretical and empirical foundations of assistance to threatened languages. Multilingual matters, 1991.
[19] S. Gal. Language shift: Social determinants of linguistic change in bilingual Austria. Academic Press New York,
[20] L. A. Grenoble and L. J. Whaley. Endangered languages: Language loss and community response. Cambridge University Press, 1998.
[21] R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 2005.
[22] C. K.-M. Lee. Linguistic features of email and icq instant messaging in hong kong. The multilingual Internet, 2007.
[23] W. Li. Three generations, two languages, one family: Language choice and language shift in a Chinese community in Britain. Multilingual Matters, 1994.
[24] M. Mainguy, Y. Nakai, and M. Takayama.
[25] E. Manley.
[26] L. Milroy. Language and social networks. B. Blackwell,
[27] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. Polylingual topic models. In EMNLP, pages 880–889, 2009.
[28] J. W. Minett and W. S. Wang. Modelling endangered languages: The effects of bilingualism and social structure. Lingua, 2008.
[29] S. S. Mufwene. Language evolution: Contact, competition and change. Continuum International Publishing Group,
[30] T. Poell and K. Darmoni. Twitter as a multilingual space: The articulation of the tunisian revolution through #sidibouzid. NECSUS, 2012.
[31] M. S. Schmid. Language attrition. Cambridge University Press Cambridge, UK, 2011.
[32] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In SocialCom. IEEE, 2010.
[33] R. Wardhaugh. Languages in competition: Dominance, diversity, and decline. B. Blackwell, 1987.
[34] M. Warschauer, G. R. E. Said, and A. G. Zohry. Language choice online: Globalization and identity in egypt. JCMC,


