The many languages of digital infrastructures
PUTHIYA PURAYIL SNEHA and ANASUYA SENGUPTA
THE ongoing pandemic has compelled much needed reflection on questions of access and infrastructure in India, especially at a time that has rendered the internet and digital technologies essential, and in many ways the ‘new normal’. Even as we come to terms with how best to cope with myriad new regulations for public and private life, framed with the promise of a ‘digital India’ in mind, creating diverse, inclusive and equitable information societies has become the need of the hour. Linguistic barriers in particular, in reading, writing and speaking in multiple languages on digital interfaces (whether the internet or mobile phones), remain persistent across the world, especially for marginalized and non-dominant communities.
According to the Net.Lang Report published by the Maaya Network and UNESCO in 2011, about 97 per cent of the world’s population speaks about 4 per cent of the world’s languages. Conversely, about 96 per cent of the world’s languages are spoken by only about 3 per cent of people around the world.1 This disparity is reflected online as well, where out of more than 7,000 existing languages, only a small percentage is recognized as being in use on the internet.2
A recent initiative by Ethnologue (an online and print database and resource on languages) to translate Covid-19 safety guidelines across different languages also notes that speakers of many languages still do not have access to a single resource on the pandemic.3 More languages are becoming endangered and disappearing every day, even as there are language shifts, and transformations in the way that they are mediated through digital technologies. It is therefore more imperative than ever to address these knowledge and infrastructural gaps in order to make the web more multilingual, accessible and safe, particularly for marginalized and non-dominant communities.
While the problem of linguistic disparity on the internet may seem like a recent one, prompted by advancements in technology that languages have simply been unable to keep up with, its antecedents are much older, and intertwined in a complex history of colonial infrastructures of knowledge production. Perhaps the most seminal exploration of this relationship between language and colonization is in the work of Ngugi wa Thiong’o, theorising the role of language in perpetuating the objectives of imperialism, through its construction of culture, literary tradition and history.4 The ‘language debate’ since then, has transcended the realm of postcolonial studies, with several efforts in theory and practice across the world aimed towards linguistic decolonization.
Closer home, G.N. Devy, in describing a contemporary ‘crisis of knowledge’ in India, notes that it is the result of a ‘cultural amnesia’ of Indian knowledge traditions, as well as the fact that knowledge production in the Global South or colonized nations is not considered on par with that of the West. He further notes the long-term impact of such disparities, reflected in a lack of understanding of and attention to indigenous, tribal and minority languages in India, a result of both the colonial encounter and the caste hegemony deeply ingrained in knowledge traditions of the country.5
The resulting uneven development and use of languages, whether in education, governmental administration or governance in other fields like technology, continues therefore to be a norm rather than the exception. The advancements in the growth of the internet and digital technologies over the last couple of decades, while offering much promise to address some of these disparities (through initiatives such as Digital India for example) still face ongoing challenges, given a persistent and complex digital divide.6
As with other long-standing forms of systemic social exclusion, linguistic barriers are a result of asymmetries of power and knowledge, and the internet both reflects and sometimes deepens these forms of inequity. However, there are several efforts being undertaken across the world today, towards making the internet more multilingual, and thereby more accessible and equitable in its representations of multiple forms and systems of knowledge. This essay is a set of reflections on two recent projects on languages and the internet, and some observations and learnings on the scope and nature of the challenges, as well as potential strategies to address them.
As illustrated above, there is still much to learn about the inequities that persist in terms of the access and use of marginalized and non-dominant languages on the internet, not only in India but across the world. Our reflections here are informed by two recent efforts to map and address some of these issues. The first is a forthcoming report on the ‘State of the Internet’s Languages’ (STIL), led by Whose Knowledge?, in collaboration with the Oxford Internet Institute and the Centre for Internet and Society (CIS).7 The report brings together data and stories on how people read/write/speak online in multiple languages. Working with the premise that ‘language is a proxy for knowledge’, the report looks at how most human knowledge, especially that produced in non-dominant and marginalized languages, continues to remain under-represented on the web.
The initiative defines marginalized and non-dominant languages as those marginalized in both global and local contexts by different historical and ongoing dynamics of power and privilege. For instance, most languages that are not European colonial languages – English, Spanish, French, Portuguese, and Dutch – are marginalized in terms of published content, whether in print or digitally. At a more ‘local’ or ‘regional’ level, within highly multilingual contexts like South Asia and India, dominant (and sometimes considered ‘national’ or ‘official’) languages like Hindi, Urdu, Sinhala or Bangla can eclipse others in use and status.
Multiple forms of systemic power and privilege further reinforce and exacerbate existing inequities online, as a result of which consumers of knowledge are not part of the ‘design, architecture, substance, and experience of this information infrastructure.’8 Through a set of narratives drawn from data visualizations of the prevalence of languages on prominent web platforms, and stories about personal experiences of individuals and communities engaging with the internet and digital technologies in their many languages, the report attempts to map these extant knowledge gaps. It also attempts to highlight work being undertaken by people across the world to create a multilingual internet, which is essential to ultimately creating better access and accessibility to knowledge on the web.
The second is a series of collaborative and exploratory short-term research projects on Wikimedia platforms and communities in India, undertaken by team members of the Access to Knowledge programme at CIS.9 The projects work on a range of topics, from systemic issues such as the gender gap and bias in participation and content creation on Wikipedia in Indian languages, access and reuse of cultural content across languages, and challenges in using these platforms in multilingual classrooms, to experiences of content creation in Indian languages on diverse Wikimedia projects beyond Wikipedia (Wikidata, Wiktionary, Wikisource, and so on).10
While the research projects here are still in an exploratory phase, the process of working on them has been informative, in terms of highlighting practical challenges and potential for undertaking research on Indian language Wikimedia platforms. Both the above initiatives highlight several asymmetries in the development and access to content and (digital) infrastructures for languages other than English, as well as efforts being undertaken by communities to address them.
In the course of this research, it has become clear that a significant foundational reason for the inequities related to creation and use of content and digital infrastructure for non-dominant/marginalized languages is the effects of colonization. In fact, one of the key aims of the STIL report is to unpack precisely what is meant by ‘decolonization’ itself. As illustrated by many of the stories in the report, the inherent complexities of the discourse often leave our own locations as decolonial subjects in the present context very unclear. The presence of inter-language marginalization, such as between ‘classical and vernacular’ languages or dialects, further complicates the discourse. Some examples of these include the debates on classical language status accorded to several languages11 or the recent concerns over the National Education Policy (2020) and its purported imposition of Hindi in the Indian educational system.12
Stories from the STIL report also outline the complex relationships between languages that have been shaped by histories of colonization and conflict, such as that of Sinhalese and Tamil in Sri Lanka, French and Arabic in Tunisia, or the attention to ‘minority’ languages in Europe (such as Basque, Breton, Karelian and Sardinian). Similarly, the politics of classification of languages, such as colonial/postcolonial languages, dominant, marginalized, endangered and disappearing languages, are also closely informed by these colonial histories, and continue to affect their representation and usability on the internet.
Disparities in the development of multilingual content on the web are also closely related to their imbrication within, and perpetuation of, systemic inequalities of race, caste, class, gender, sexuality, bodily ability and beyond. The gender gap and bias on Wikimedia projects is a well-documented challenge across most languages, owing to a lack of content about and participation by contributors across a spectrum of gender and sexual identities and topics.13 This content and participation gap in Indian language communities has also been persistent, despite several efforts to address the challenge. It continues to be driven by socio-cultural factors such as the restricted access to public spaces and digital infrastructure by women, lack of training in technical and communication skills, limited leadership roles and concerns about community health and safety on online platforms.
The limited availability of good quality, informative and educational content on gender and sexuality in languages other than English is therefore a visible gap in digital information infrastructures, not only in India. As noted by stories in the STIL report, searching for terms or topics related to gender and sexuality (such as homosexuality) in local languages, for example, very often throws up irrelevant and derogatory content online, many a time in English. The problem is further compounded for persons with disabilities who may be looking for content on sexuality and disability in Indian languages, as there is very little available, in accessible online formats. The authors therefore also discuss the need to then rely on multiple platforms, including social media, for relevant information and ways to proactively and safely engage in discussions on gender, sexuality and disability in languages other than English on the internet.
The question of form and format is an important one to address in the context of digital information infrastructures, as the internet is still primarily a textual medium most easily accessed by visually-able people. This raises specific challenges for multilingual content. Many languages across the world, including in India, are oral and/or use signs and images, and therefore often do not have a script, or use borrowed scripts of a dominant language. As a result, it is important to ask what non-textual forms of multilingual content are, and where they are on the internet.
Santali, for example, which is spoken by close to 7.6 million people in India, Bangladesh, Bhutan and Nepal, and is the third most-spoken Austroasiatic language, was primarily oral until the introduction of the Ol-Chiki script in 1925.14 It was added to the Unicode Standard15 in 2008, which then facilitated the creation of online content in the language. Another example is the group of indigenous languages of Arrernte, spoken by Aboriginal communities of central Australia, which also has a highly developed sign language.16 The Indigemoji app, launched in 2019, is another recent effort in bringing the knowledge produced in these languages online, in the form of indigenous emojis.17
The majority of existing languages, therefore, contain a wealth of knowledge and histories that may not be expressed purely through writing, and it is essential to look for ways in which existing digital infrastructures may provide affordances for their usability and development online.
The gaps in digital infrastructure highlight several larger concerns with respect to technological development, access and skills that continue to persist in the creation and use of multilingual content online. Many Indic scripts are not fully Unicode compliant, or have not been added to the standard, as a result of which they are not available for use or rendered correctly in online formats, and therefore remain inaccessible. The lack of Optical Character Recognition (OCR)18 for many languages further makes digitized content inaccessible, and thereby not searchable or amenable for further use. This prevalent digital divide due to linguistic barriers in accessing devices, platforms, apps or software is further aggravated by a lack of sufficient digital or technological literacy in using these tools.
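To make the point about Unicode support concrete, the following is a minimal sketch in Python, using only the standard `unicodedata` module. It checks whether characters are known to the Unicode database bundled with a given Python build, and what share of a text falls within the Ol Chiki block (U+1C50 to U+1C7F) used for Santali. The helper names `unnamed_characters` and `share_in_block` and the sample string are illustrative assumptions, not part of any project discussed here, and the check says nothing about whether a font is available to render the text correctly.

```python
import unicodedata

# Ol Chiki block in the Unicode Standard: U+1C50 through U+1C7F.
OL_CHIKI_BLOCK = range(0x1C50, 0x1C80)


def unnamed_characters(text):
    """Return non-whitespace characters that have no Unicode name in this
    Python build's database (unassigned, private-use or control characters)."""
    return [
        ch for ch in text
        if not ch.isspace() and unicodedata.name(ch, None) is None
    ]


def share_in_block(text, block=OL_CHIKI_BLOCK):
    """Return the fraction of non-whitespace characters that fall inside
    the given code point range (0.0 for empty or whitespace-only input)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ord(ch) in block for ch in chars) / len(chars)


if __name__ == "__main__":
    # Hypothetical sample: a short word written in Ol Chiki code points.
    sample = "\u1c65\u1c5f\u1c71\u1c5b\u1c5f\u1c72\u1c64"
    for ch in sample:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
    print("Share in Ol Chiki block:", share_in_block(sample))
    print("Unnamed characters:", unnamed_characters(sample + "\ue000"))
```

A script that has not been added to the standard has no code points to test in this way at all, which is precisely the gap described above: text in such a script can circulate online only as images, in a borrowed script, or through private-use code points that other systems cannot interpret.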
There is also a paucity of efficient content management systems in non-dominant languages, in part due to these technical barriers in the process of identifying and sourcing content, translation, digitization and archiving. In addition, there are multiple technological as well as skill related challenges that prevent the effective preservation, access and use of multilingual content and tools. A simple example here would be to look at the language support offered by web browsers like Google Chrome, and the accuracy of their translations. Although the Google Translate API19 is a useful feature that keeps improving over time based on its training on language datasets, a persistent challenge in some Indian language Wikimedia projects has been that of poor quality content, based on machine translations.20
These issues therefore need to be addressed both at the level of the efficacy of the software and tools themselves, and in terms of training and building human capacities in working with digital technologies in multiple languages. Access, or rather ‘quality of access’, in terms of being able to use the internet and digital media optimally in our preferred languages, still remains a concern in different parts of the world.
These infrastructural gaps also disproportionately affect efforts in the preservation and use of endangered and disappearing languages, a large number of them spoken by marginalized or non-dominant communities. There are significant efforts all over the world to aid this, such as a project on using social media to support the revitalization of Indigenous languages in Turtle Island,21 or the efforts of several researchers and practitioners working to create awareness and use of the Zapotec languages on the web.22 There is a need to therefore understand the issue of language diversity and plurality in terms of local contexts and challenges, through numbers but also embodied personal experiences, in order to support and aid these efforts in creating a more equitable and decolonized internet.
Finally, an overarching problem when looking at infrastructural gaps in creating a multilingual internet is that of ownership and regulation. Policy reforms encouraging development of technological support for low/non-literate communities and non-dominant and marginalized languages are the need of the hour, and have often been impeded by the challenges mentioned above. Global regulatory bodies have laid out measures and guidelines, such as the W3C23 and ICANN24 practices on language use, or the UN recommendations on policies for protection of linguistic minorities.25 These, however, need to be adapted to local needs and challenges of specific language communities.
For example, many security vulnerabilities exist on social media platforms due to a lack of recognition of discriminatory or hateful regional language content (which often escapes filters built on algorithms trained on English datasets), and such content is often targeted at women and other marginalized, vulnerable communities. Mechanisms for maintaining safety and community health in online spaces predominantly engaging with regional languages are therefore an important area of focus.
Ownership of digital infrastructure and content also adds another layer of complexity to this discourse, where data privacy,26 regulation of content and liability of intermediaries27 (Internet Service Providers, messaging apps and Over-The-Top, digital and social media platforms for example) are heavily contested topics. Recent developments in laws related to online content have raised concerns about a chilling effect on free speech, especially with respect to content on social media and its impact on greater plurality on the web.
Another important regulatory challenge is the (lack of) awareness of intellectual property rights and possibilities for open licensing, which may help free up a large volume of content in diverse languages for wider public access on the internet. On Wikipedia, for instance, content donation efforts are often impeded by concerns of copyright violation, and lack of awareness about open licenses. The legal ‘language’ or vocabulary of open access and licensing also remains predominantly English. How, for example, do we translate terms such as open access, metadata or copyright into Indian languages in easily understandable ways?
The observations shared here outline just some of the challenges faced by several languages in India and across the world, in terms of their representation and use on the internet. As illustrated by many of the examples above, technological barriers are deeply embedded in different structures of power and privilege, just as knowledge gaps at different levels are. To return to the metaphor of the title, the ‘many languages’ of digital infrastructure – including its technology, regulation and access – are not easily translatable, and therefore need contextualization to be made more accessible to the people whose engagement with the web is most affected by these factors.
The concept of large-scale digital access across multiple languages also needs to be problematized and critiqued with more nuance, as illustrated time and again over the last year, whether in terms of challenges of moving to remote, online education, or now with access to public vaccination programmes facilitated through a single digital platform. Localization of resources, including data, tools and platforms, with respect to languages, is therefore an important aspect of facilitating large-scale access, especially in a country with the linguistic diversity and complexity of India. Importantly, the issue of multilinguality needs to be addressed in all its nuance and history, and by proactively centering the voices of individuals and communities it most affects.
We need to understand better the intersectional nature of the problem of access to the internet, and the impact of ownership and regulation of digital media. Most importantly, we need to acknowledge ongoing community-led efforts to address these challenges. These contextually rich ‘translations’ would help immeasurably in creating an equitable, safe and accessible internet in India and across the world.
1. Laurent Vannini, Hervé Le Crosnier, and Irina Bokova, ‘Net.lang: Towards the Multilingual Cyberspace’, in Net.lang: Towards the Multilingual Cyberspace. C&F éditions, Caen, 2012, p. 13.
2. Miguel Trancozo Trevino, ‘The Many Languages Missing from the Internet’, BBC Future, BBC, 15 April 2020, https://www.bbc.com/future/article/20200414-the-many-lanuages-still-missing-from-the-internet.
3. ‘Coronavirus and Local Languages: How Do You Say, "Wash Your Hands"?’, Ethnologue, 16 July 2020, https://www.ethnologue.com/guides/health.
4. Ngugi wa Thiong’o, Decolonising the Mind: The Politics of Language in African Literature. James Currey, Oxford, 2011.
5. G.N. Devy, ‘Post-Memory Education’ in The Crisis Within: on Knowledge and Education in India. Aleph Book Company, New Delhi, 2017, pp. 52-68.
6. Muntazir Abbas, ‘Unavailability of Local Language Content a Barrier to Digital India: Ajay Prakash Sawhney’, ET Telecom.Com, 31 July 2018, https://telecom.economictimes.indiatimes.com/news/unavailability-of-local-language-content-a-barrier-to-digital-india-ajay-prakash-sawhey/65212852.
7. ‘State of the Internet’s Languages’, Whose Knowledge?, accessed 12 May 2021, https://whoseknowledge.org/initiatives/state-of-the-internets-languages/.
9. Wikimedia Meta-Wiki, ‘CIS-A2K’, Meta, Wikimedia Foundation, Inc., 23 February 2021, https://meta.wikimedia.org/wiki/CIS-A2K.
10. Reports on some of the completed projects may be seen here: https://cis-india.org/@@search?Subject%3Alist=A2K%20Research
11. A.R. Venkatachalapathy, ‘The "Classical" Language Issue’, Economic and Political Weekly 44(2), 10 January 2009, https://www.epw.in/journal/2009/02/commentary/classical-language-issue.html.
12. Kumkum Roy, ‘Examining the Draft National Education Policy, 2019’, Economic and Political Weekly 54(25), 25 June 2019, https://www.epw.in/engage/article/examining-draft-national-education-policy-2019.
13. ‘Gender Bias on Wikipedia’, Wikipedia, Wikimedia Foundation, 12 May 2021, https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia.
14. ‘Santali Language’, Wikipedia, Wikimedia Foundation, 5 May 2021, https://en.wikipedia.org/wiki/Santali_language.
15. ‘About the Unicode® Standard’, Unicode Standard, accessed 12 May 2021, https://unicode.org/standard/standard.html.
16. Adam Kendon, Sign Languages of Aboriginal Australia: Cultural, Semiotic and Communicative Perspectives. Cambridge University Press, Cambridge, 2013.
17. ‘Indigemoji’, Indigemoji, accessed 12 May 2021, https://www.indigemoji.com.au/.
18. U. Pal and B.B. Chaudhuri, ‘Indian Script Character Recognition: A Survey’, Pattern Recognition 37(9), 2004, pp. 1887-1899, https://doi.org/10.1016/j.patcog.2004.02.003.
19. Runa Bhattacharjee and Pau Giner, ‘You Can Now Use Google Translate to Translate Articles on Wikipedia’, Wikimedia Foundation, 11 January 2019, https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/.
20. Kyle Wilson, ‘Wikipedia Has a Google Translate Problem’, The Verge, 8 May 2019, https://www.theverge.com/2019/5/8/18526739/wikipedia-translation-tool-machine-learning-ai-english.
21. Social Sciences and Humanities Research Council Government of Canada, ‘Using Twitter to Support Indigenous Cultural Revitalization and Youth Well-Being’, Social Sciences and Humanities Research Council, 29 November 2012, https://www.sshrc-crsh.gc.ca/society-societe/stories-histoires/story-histoire-eng.aspx?story_id=325.
22. See: https://www.facebook.com/yelnban/
23. ‘Internationalization’, W3C, accessed 12 May 2021, https://www.w3.org/standards/webdesign/i18n.
24. ‘Internationalized Domain Names’, ICANN, accessed 12 May 2021, https://www.icann.org/resources/pages/idn-2012-02-25-en.
25. ‘The Protection and Promotion of Linguistic Diversity Addressed by UNESCO’, UNESCO, 17 January 2019, https://en.unesco.org/news/protection-and-promotion-linguistic-diversity-addressed-unesco.
26. Amber Sinha and Arindrajit Basu, ‘The Politics of India’s Data Protection Ecosystem’, EPW Engage 54(49), 14 December 2019, https://www.epw.in/engage/article/politics-indias-data-protection-ecosystem.
27. Aman Taneja, ‘India Invokes IT Act to Regulate Digital Content but New Norms May Fail Legal Scrutiny’, Firstpost, 26 February 2021, https://www.firstpost.com/india/india-invokes-it-act-to-regulate-digital-content-but-new-norms-may-fail-legal-scrutiny-9351061.html.