Social science researchers who want to study the internet in India using data mining and analytic techniques are constrained by limited access to, and limited availability of, big data. Even when such data is available, it is often behind a paywall or organised in a manner that makes it difficult to interpret.
India is ranked at 127 among 201 countries in terms of internet penetration (Internet Live Stats 2016). Although internet penetration is only 13.5%, the country still has the second-largest number of users worldwide. According to the latest estimates, around 35% of the Indian population access the Internet using multiple devices (Internet Live Stats 2016). India is thus considered to be one of the fastest growing online markets, and thereby part of the strategic focus of many internet-based companies. As Indians browse, search, transact and interact online, one can observe how the internet is getting increasingly enmeshed in everyday lives.
But how do we study the influence and impact of the internet in India beyond anecdotes, journalistic articles and descriptive narratives? Conducting rigorous and critical studies on the economic, social and political aspects of the internet in India using data-driven analytic approaches is challenging in many ways. The challenges include access to data, the need for a certain level of technological proficiency in data acquisition and analysis, training in cross-disciplinary perspectives, and opportunities for collaborative effort, among other things.
This essay examines the first challenge, that is, the availability of and access to relevant data. In what ways might social science scholars approach online interactions and transactions as an empirical research field? What data sources are available to them, and what challenges do they face in accessing these sources? We examine both the traditional sources of data and the emerging social/big data sources that can be used for studying the internet in India.
Diffusion and Adoption of Internet
The first policy framework addressing the internet in India was “IT [Information Technology] for Masses” in 2001. Subsequent national level policies provided a common IT policy framework for the country. These policies were adopted by different states, and many interventions were implemented as per their perceived priorities and strategies. As a result, the penetration and access to the internet have been, in general, uneven across the country.
The starting point for any scholar wanting to do an in-depth study on the internet in India would probably be to examine the nature and extent of internet penetration in India. Estimated and actual figures for the same are available for the country as a whole. In fact, there are multiple values for each parameter from multiple sources, such as the census, indiastat.com, etc. It is thus imperative for the researcher to be clear about the source and the appropriateness of the source and data before using them.
The Organization for Economic Cooperation and Development (OECD) defines digital divide as “the gap between individuals, households, businesses, and geographic areas at different socioeconomic levels with regard both to their opportunities to access ICTs and to their use of the internet for a wide variety of activities”. In order to understand particular dimensions of the digital divide, the Census 2011 is, perhaps, the most comprehensive public data set. For the first time, the Census 2011 collected data that provides a picture of digital inclusion in India. It offers household data on the possession of digital assets from the village level. In separate tables, it also has data on village level assets and infrastructural facilities. This might be useful in assessing the extent of digital divide a village may be experiencing. However, the dataset does not record if a particular house has more than one digital asset.
Micro-level data about each household is available at select academic institutions. The Census 2011 data set is fairly voluminous, and relevant data is generally distributed among multiple tables, so multiple files have to be searched or combined to assemble it. A social science researcher would therefore need some technical skills to use the requisite software for automating the extraction, merging, analysis and visualisation of data. The main limitation of Census 2011 data is that it is slightly dated, especially if one considers the rapid proliferation of the mobile phone. But if mined carefully, the results can serve as a benchmark for all future studies.
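To give a sense of what combining tables involves, the following is a minimal sketch in Python using only the standard library. The two tables, their column names (such as village_code and households_with_computer) and all values are hypothetical placeholders and do not reproduce the actual Census 2011 file layout or figures. The sketch joins a household-asset table with a village-amenities table on a shared village code and computes the share of households with a computer.

```python
import csv
import io

# Hypothetical extracts of two census tables; column names and values are
# placeholders for illustration, not actual Census 2011 figures.
HOUSEHOLD_ASSETS_CSV = """village_code,households,households_with_computer
V001,120,6
V002,340,51
V003,80,0
"""

VILLAGE_AMENITIES_CSV = """village_code,has_internet_cafe,has_mobile_coverage
V001,no,yes
V002,yes,yes
V004,no,no
"""


def read_table(text, key):
    """Read CSV text into a dict keyed by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def merge_tables(left, right):
    """Inner-join two keyed tables, keeping only villages present in both."""
    merged = {}
    for code in sorted(left.keys() & right.keys()):
        merged[code] = {**left[code], **right[code]}
    return merged


def computer_share(row):
    """Share of households in a village that report owning a computer."""
    return int(row["households_with_computer"]) / int(row["households"])


assets = read_table(HOUSEHOLD_ASSETS_CSV, "village_code")
amenities = read_table(VILLAGE_AMENITIES_CSV, "village_code")
villages = merge_tables(assets, amenities)

for code, row in villages.items():
    print(code, f"{computer_share(row):.1%}", row["has_mobile_coverage"])
```

Villages that appear in only one of the two tables (V003 and V004 in this placeholder data) are dropped by the inner join. In practice, a researcher would want to inspect such unmatched records rather than discard them silently.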
Compared to computers and laptops, the diffusion of mobile phones in India has been far more rapid and widespread. The smartphone has become one of the most popular devices through which the internet is accessed in India. The Telecom Regulatory Authority of India’s (TRAI) website provides recent aggregated data on the growth of telecom services. However, data on this website is generally available in PDF format (for example, TRAI 2018). The PDF format is suitable for sharing documents, but for aggregation and analysis its contents often need to be copied and formatted manually. Making aggregate data available in machine-readable file formats like XML, RDF, JSON or even Excel would make it far easier to process by computer.
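The gap between a PDF table and a machine-readable file can be illustrated with a short Python sketch. It assumes that the text of a table has already been copied out of a PDF as whitespace-aligned rows. The operator names and numbers below are invented placeholders, not TRAI figures, and the row pattern is an assumption that would need adjusting for any real report.

```python
import json
import re

# Placeholder text imitating a table copied out of a PDF report. The operator
# names and numbers are invented for illustration and are not TRAI figures.
COPIED_TEXT = """
Service provider    Subscribers (million)   Share (%)
Operator A          1,250.5                 41.2
Operator B          980.0                   32.3
Operator C          804.25                  26.5
"""

# Assumed layout: a name followed by two numbers, separated by 2+ spaces.
ROW_PATTERN = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<subs>[\d,.]+)\s{2,}(?P<share>[\d.]+)$"
)


def parse_rows(text):
    """Turn whitespace-aligned table rows into a list of typed records."""
    records = []
    for line in text.strip().splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # skips the header row and any stray lines
        records.append({
            "provider": match["name"],
            "subscribers_million": float(match["subs"].replace(",", "")),
            "share_percent": float(match["share"]),
        })
    return records


records = parse_rows(COPIED_TEXT)
print(json.dumps(records, indent=2))
```

Even in this small case the researcher has to make decisions about thousands separators, header rows and column boundaries. These are exactly the steps that a published JSON or spreadsheet file would make unnecessary.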
Established in 2000, www.indiastat.com is an IT-enabled private limited company providing data aggregation services in the socio-economic information domain. It draws data from sources like the TRAI website, Lok Sabha and Rajya Sabha questions, ITU and UN reports, etc, and provides them online in a reusable format. The website has data that is generally not available from other sources. For example, it provides data about the uptake of financial services in India or the number of complaints associated with online transactions. However, the data can be accessed only in exchange for a fee or through institutional access. A simple search on this website yields a huge amount of data, and here too the researcher has to sift through individual files to find relevant research data. Moreover, care has to be taken to handle the duplication and comparability of data across different states and time periods. In many cases, although triangulation of data is not possible, the figures available on the website do give some idea about the phenomenon being investigated.
In addition to the data sets mentioned here for examining the nature and extent of the digital divide, the data from the 71st round of National Sample Survey Office (NSSO) on Social Consumption – Education Survey (2016) can be used for examining second order digital divide. The second order digital divide refers to the lack of skills or capabilities that can prevent people who already have supporting devices from accessing the digital sphere. Among the details it has about socio-economic characteristics of households, it also has data on whether the household has a computer or access to the internet. It also has particulars of information technology literacy for household members aged 14 and above.
In general, information and communication technologies are fundamental for a transition from industrial to information/knowledge/digital economy with the commodification of information, digital goods and services, and online transactions. The internet has enabled new, and at times disruptive business models. In addition, it also has implications on the operations and the productivity of traditional industries. Market structure and competitiveness, pricing, incentives and regulations, the impact of online exchange of goods and services, and online behaviour are some areas via which the internet can be studied from an economic perspective. However, the data to study these linkages between the internet and the Indian economy is limited. For example, the Annual Survey of Industries (ASI), conducted by the Ministry of Statistics and Programme Implementation collects only a single piece of data for firms in the manufacturing sector, that is, whether the firm has a computerised accounting system or not. Some structural aspects of the IT/ITES sector can be derived from the 63rd round of NSSO survey Service Sector 2006–07 (2012). It has a section on post and telecommunications, with survey results of all enterprises providing communication services like courier, ISD/STD/PCO booths, voicemail through computer networking, video/fax/phone, voiced and non-voiced leased circuits, email, video conferencing, Internet, and activity of cable operators. It basically covers all enterprises not owned by the government, public sector undertakings and local bodies. In another section on computer and related activities, the survey covers enterprises engaged in hardware consultancy, software publishing; software consultancy, supply and maintenance; data processing, maintenance and repair of office, accounting and computing machinery among others. 
State and national level data on the use of ICTs by unincorporated non-agricultural enterprises in manufacturing, trade and other service sectors are available in the data set of the 67th round of NSSO (2015).
As mobile phones and internet services proliferate, many private market research firms and industry associations like the Internet and Mobile Association of India (IAMAI) have begun to collect data related to internet usage, especially from a marketing point of view. However, such data is generally available to the public only as summary reports. Anybody who wants to carry out related research will have to pay to access the data or engage similar agencies to collect it. Similarly, although one can see the increasing popularity of e-commerce sites and of mobile-based applications, very little data is publicly available to study the various facets of e-commerce or m-commerce. Further, very little economic and financial data is available about internet-based companies themselves. The traditional databases, such as NSE, CMIE Prowess and Bloomberg, keep data only about listed companies. Crunchbase and Tracxn are databases that have data and information on internet-based companies and start-ups for angel investors and venture capitalists. However, beyond the minimal basic data, both are paid services.
Sociology of the Internet
The digital divide, or the gap between digital “haves” and “have-nots”, has always been one of the key concerns of sociologists interested in studying the internet society. The sociology of the internet and digital sociology are two emerging subfields focusing on how the internet plays a role in mediating and facilitating communication and interaction, and on how it affects and is affected by social life more broadly (Lisa 2017).
Surveillance, trolling, online abuse, stalking, and incitement of violence at the individual and community level are instances of emerging social phenomena that require the internet to be studied more actively. Research is also required to determine whether online, virtual communities reflect the diversity and the structural inequalities of the physical world, and to synthesise individual users’ online behaviour with macroscopic analyses of the institutional and political-economic factors that shape that behaviour. As of now, most studies of online interactions in the Indian context use manual techniques or offline surveys to understand online behaviour. The advent of web 2.0, with its interactive social media and networking platforms, brings forth possibilities of new approaches for studying campaigns and movements, the formation and structure of online communities, their identities and interactions, and various forms of mobilisation. There is much optimism about the possibility of automatically generating and collecting social network data on a scale that might not be possible using traditional data collection methods.
But for a social scientist, this is also an area where accessing data is more challenging and constrained. Most of the data collected by online social media platforms is proprietary data with limited access. Thus, if one wants to analyse conversations on Twitter, one will get only a very limited number of tweets per day, either through free software or by writing a program using the platform's API. Relatively larger datasets are available for a price from third-party data aggregators holding a licence from Twitter. The drawback of both extracted data and paid data is that one is not aware of the logic by which the tweets were sampled, and thus the results of such research cannot be considered reproducible, replicable or generalisable to a larger population. Some repositories containing India-specific data sets are either released by the companies as part of hackathons or collected by university-based research groups. Unfortunately, these kinds of repositories, even paid ones, are not available in India.
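The sampling problem becomes clearer when contrasted with a sample whose logic is fully documented. The Python sketch below is illustrative only. It does not describe how Twitter or any data aggregator samples tweets, and the list of post identifiers is a hypothetical stand-in for a complete collection held by the researcher. It shows that when the sampling frame, the sample size and the random seed are all known, another researcher can regenerate exactly the same sample.

```python
import random

# Hypothetical collection of post identifiers standing in for a full archive.
collected_ids = [f"post-{n:04d}" for n in range(1, 1001)]


def draw_sample(ids, size, seed):
    """Draw a simple random sample that anyone can regenerate from the seed."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(ids), size)


sample_a = draw_sample(collected_ids, size=50, seed=2014)
sample_b = draw_sample(collected_ids, size=50, seed=2014)

# Publishing the seed, the sample size and the sampling frame lets another
# researcher rebuild exactly the same sample.
print(sample_a == sample_b)
```

With platform-supplied or purchased data, none of these three elements is typically disclosed, which is why the resulting samples are hard to defend as representative.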
Impact of the Internet on Politics and Governance
Politics and governance are areas where one can observe the increasing use and impact of internet-based technologies. Social networking platforms like Twitter, YouTube, and Facebook allow communication and dissemination of information to a large number of people, at little or no cost. India first saw large scale use of social media tools and technologies for political mobilisation during the general elections of 2014. Various political parties used social media to campaign, organise support, and raise funds. Systematically capturing political interactions and activities on the internet can enable political scientists to observe or make reasonable inferences about formal and informal communication flows, the dissemination of ideas across different social groups, the formation of consensus, and the actual network structures underlying communication. That so many people are tweeting, liking, and posting messages simultaneously gives an exciting possibility to see how discourses are created and propaganda spread.
Further, during the recent unrest in Gujarat and Kashmir, the internet was blocked for a few days to prevent the spread of violence and dissent. Good data on the timing and place of protests, as well as the timing and point of origin of online discussions, might help political scientists to study various nuances and perhaps resolve controversies over whether the internet is essential or inessential to explaining the fanning of dissent and unrest. Broader research can examine who participates and whose voices are heard in online forums. However, the challenges of getting access to machine-generated data for independent research are similar to those described in the previous section. Data is proprietary, difficult to obtain, available only with restricted access and a possibility of selection bias, or inadequate for making satisfactory inferences. For example, during the general elections of 2014, a certain report (IRIS Knowledge Foundation, IRIS and IAMAI) claimed that in at least 160 out of 543 constituencies in India, social media like Facebook and Twitter would play a role in deciding outcomes. A subsequent study (Chakravarty 2014) found that 71% of the top two candidates in the 160 constituencies did not have a presence on social media for their election campaign. Since the use of social media for election campaigning is a strategically important market segment for many internet platforms, it would be worthwhile to investigate whether the perceived influence of the internet on political mobilisation, and on electoral outcomes, is exaggerated or real.
India has always been at the forefront of the deployment of IT for e-governance and citizen engagement. In the case of the former, a lot of data is available publicly. In fact, the data repository available at www.e-taal.gov.in automatically records e-governance transactions all over the country and can be considered a true big data repository related to India. While the data is publicly available on the website, using it for research still takes effort. Firstly, the data is not available in an easily downloadable format, and secondly, manually downloading the data page by page requires a lot of patience and time. One would expect such data sets to be available to all via the open data platform. The Open Government Data Platform India (data.gov.in) is a joint initiative of the Government of India and the United States Government to make available data sets and documents published by various ministries and departments of the Government of India. It was intended to increase transparency in the functioning of the government and also to open avenues for more innovative uses of government data that give a different perspective. A study (Web Foundation 2015) examining the usefulness of the data and information for identifying areas of improvement in public service delivery highlighted some of the typical data-related issues of such repositories. To quote,
“critical datasets are unavailable on data.gov.in, available datasets are often outdated, duplicated, incomplete, inadequately referenced and lack common terms used to describe the data. Top level metadata such as data collection methodology and a description of the variables are also either missing or incomplete. These shortcomings make it difficult to compare and analyse datasets properly.” (Web Foundation 2015)
The Data Challenge for Studying Internet in India
The rapid diffusion of the internet and related technologies brings forth a range of new topics and new ways to study their impact on economic, social, cultural and political spheres. However, social science researchers who want to study the internet in India using data analytic techniques are constrained and challenged in multiple ways. Current social science research makes use of small data approaches to study the internet in India and is yet to exploit the full potential offered by automatic content generation and data collection. Even where there is intent, a social scientist encounters issues in data access and analysis. Specifically, the data challenges of studying the internet in India range from the non-availability of relevant data (not collected by either public or private agencies) and the inadequacy of available data (partial information, unsuitable formats) to restricted access (paid institutional access, restrictive or paid APIs, no control over sampling, non-replicable data) and data that is inaccessible even to those willing to pay (data as a commodity).
Depending on where one is positioned ideologically, and often on where one is located institutionally, opinions on access to data vary from advocating that data be freely available for research to favouring limited, restrictive access to data collected by various agencies. It is argued that internet-based companies need to survive and that it is only natural that they should be able to generate money from the data they collect as part of their operations. It would be too ambitious to expect easy access to data in the near future, but what can be done is to make a concerted effort towards the collection and release of data through public agencies. This might also involve devising a mechanism by which data collected by private agencies is made available for social science research.