On social media data access for researchers
Data API shutdowns and price hikes by Twitter/X and Reddit harm public-interest social media research and damage the ability to make sense of the public conversation
In recent months, two major social media platforms, X/Twitter and Reddit, have shut down longstanding free data access APIs and replaced them with new offerings that carry hefty price tags. These changes drastically limit the ability of independent researchers to conduct social media analysis at scale and study the public conversation. Additionally, restricting third-party data access undercuts any claims of transparency made by the platforms themselves, as there is no way for users to validate that such claims are accurate. Although various excuses have been offered for the sudden introduction of these restrictions (a deluge of spam bots, content harvesting by AI startups), I do not believe that these explanations hold up to scrutiny, and I consider the removal of researcher access to public social media data to be a harmful development on multiple fronts.
The variety of research made possible (or drastically more efficient) by the availability of bulk public data via free or inexpensive APIs is vast. Finding the source of a rumor or narrative, searching for potential coordination in its spread, and identifying the major influencers and communities involved are all far more straightforward tasks when one has the option to programmatically analyze content, accounts, and interactions at scale. Detecting common forms of astroturfing such as fake followers, repetitive spambots, and retweet networks is far more feasible when one can easily check for patterns in traits such as creation dates, activity times, text content, and use of specific types of images (among many, many other things). Warning users of scams such as Round Year Fun or groups of accounts impersonating government officials or journalists is considerably easier when one can provide evidence to support one’s claims.
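As a rough illustration of the kind of pattern check described above, here is a minimal Python sketch. It assumes account and post records have already been collected as plain dictionaries; the field names (`handle`, `created_at`, `text`), the thresholds, and the sample data are all hypothetical and do not correspond to any platform's actual API. It simply groups accounts by creation day and looks for identical text posted by multiple accounts, two of the simpler signals mentioned.

```python
from collections import defaultdict
from datetime import datetime


def creation_date_clusters(accounts, min_size=3):
    """Return {creation day: [handles]} for days on which at least
    min_size accounts were created (hypothetical record format)."""
    by_day = defaultdict(list)
    for account in accounts:
        day = datetime.fromisoformat(account["created_at"]).date()
        by_day[day].append(account["handle"])
    return {day: handles for day, handles in by_day.items()
            if len(handles) >= min_size}


def repeated_texts(posts, min_accounts=3):
    """Return {normalized text: [handles]} for texts posted by at least
    min_accounts distinct accounts. Normalization here is only
    lowercasing and collapsing whitespace."""
    authors = defaultdict(set)
    for post in posts:
        normalized = " ".join(post["text"].lower().split())
        authors[normalized].add(post["handle"])
    return {text: sorted(handles) for text, handles in authors.items()
            if len(handles) >= min_accounts}


# Invented sample data, for illustration only.
accounts = [
    {"handle": "a1", "created_at": "2023-05-01T10:00:00"},
    {"handle": "a2", "created_at": "2023-05-01T10:05:00"},
    {"handle": "a3", "created_at": "2023-05-01T23:59:00"},
    {"handle": "b1", "created_at": "2019-02-11T08:30:00"},
]
posts = [
    {"handle": "a1", "text": "Great  product!"},
    {"handle": "a2", "text": "great product!"},
    {"handle": "a3", "text": "GREAT PRODUCT!"},
    {"handle": "b1", "text": "Something else entirely"},
]

clusters = creation_date_clusters(accounts)
duplicates = repeated_texts(posts)
print(clusters)
print(duplicates)
```

Neither signal is conclusive on its own (plenty of legitimate accounts share a creation date or repost the same text), so output like this is best treated as a list of leads to examine by hand rather than as a verdict.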
Additionally, open API access allows independent researchers and everyday users to provide transparency into claims made by platforms, media, and other researchers. If a social media company claims to have reduced the frequency of hateful slurs or a major media organization breathlessly announces that researchers have discovered that 45% of the accounts posting about COVID-19 are bots (for example), third-party researchers with API access can validate or refute these conclusions based on their own analyses.
Without access to data, that transparency vanishes, as does a significant portion of evidence-based research into social media behavior and content, along with any factual media reporting based upon said research. While the loss of reliable research is harmful enough on its own, it also creates an information vacuum wherein malicious actors can generate wildly biased or outright fraudulent “research” with impunity (and more easily get it published as “news” by pliant media outlets).
The platforms have offered two primary justifications for shutting off data access. When X/Twitter first announced its upcoming API shutdown, it was presented as a measure for combating spam bots. This is obviously absurd, however, as X/Twitter still permits the free creation of bots that post automatically (in fact, automated posting is the platform’s only free API offering at this point in time). The more recent excuse, offered by Reddit, is that AI companies are using social media APIs to obtain large volumes of training data for use in commercial products. While this is a legitimate concern, it could be handled in far less heavy-handed ways, such as restrictions in the license agreement(s) attached to the APIs and associated data.
What should be done? My view (which will likely surprise no one) is that the current trend should be reversed, and all major social media platforms should offer free or low-cost bulk access to public data for the purpose of non-commercial research. Ideally, this would be standardized across the industry in some fashion, and data would be made available both via APIs for those with coding skills and by direct download of CSV/Excel-format datasets for those who are more fluent in spreadsheet usage than in Python or R. (It is a common misconception that all social media researchers are also programmers.) In the event that the current trend of locking down platform data continues, it may be wise to explore legislation that mandates open data access for public platforms once those platforms grow beyond a certain number of users.
Really glad that I found you on Substack! I used to regularly read your submissions on Twitter before I left the platform. I appreciate the time and effort put into the information that you share. A while back I really enjoyed using that "Allegedly" site that you had up. Is there any chance that it makes a comeback, or is it permanently down, if you don't mind me asking?
Thank you for everything you do and for keeping us informed. You are greatly appreciated.