I'm sorry, this spam network cannot generate inappropriate or offensive content
The presence of ChatGPT error messages in the tweets of a network of spammy Twitter accounts is a potential sign of AI-generated text
Sometimes the best way to detect malicious or deceptive use of AI tools is to look for silly mistakes on the part of the humans piloting the tools in question. For example, the popular large language model-based chatbot ChatGPT spits out error messages such as “I’m sorry, I cannot generate inappropriate or offensive content” when given certain prompts. When these error messages show up in a large corpus of content that someone is trying to pass off as human-written material, they indicate that other (seemingly authentic) text in that corpus is likely also output from the large language model that produced the error.
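As a rough illustration, here’s a minimal sketch of what scanning a tweet corpus for this kind of refusal message might look like in Python. The refusal phrases, field names, and the tweets.jsonl input file are illustrative assumptions, not the exact strings or data involved in this research.

```python
import json
import re

# Refusal/error phrases to look for; these specific strings are assumptions
# chosen for illustration, not an exhaustive or authoritative list.
REFUSAL_PATTERNS = [
    r"i'?m sorry,? (?:but )?i cannot generate",
    r"as an ai language model",
    r"i cannot (?:create|produce) inappropriate or offensive content",
]
refusal_re = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

def find_refusals(path="tweets.jsonl"):
    """Yield (screen name, text) pairs for tweets matching a refusal phrase."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            if refusal_re.search(tweet.get("text", "")):
                yield tweet.get("screen_name", "?"), tweet["text"]

if __name__ == "__main__":
    for screen_name, text in find_refusals():
        print(screen_name, text, sep="\t")
```

Any account that tweets a match is a candidate for closer inspection, but as noted below, the matches alone don’t delineate the network.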
Here’s a look at a Twitter spam network that, based on the presence of ChatGPT error messages in some of the tweets it posts, appears to be using ChatGPT to generate tweet content. The error messages on their own are not sufficient to identify the set of accounts that make up the network, however, for the following reasons:
other spam networks may be tweeting the same error messages
ChatGPT errors have become something of a meme among human users
most of the accounts in the network have only tweeted a few times and have yet to tweet an error message
Conveniently, the accounts in this network recently followed a set of large accounts en masse (mostly US and European political and media accounts) and have an abnormal creation date distribution (almost no accounts created after 2016), so we can use these traits to map the network.
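Here’s a minimal sketch of how that mapping step might be coded, assuming one has candidate accounts with creation dates and following lists in hand; the anchor account names, threshold, field names, and input file are all hypothetical.

```python
import json
from datetime import datetime

# Hypothetical stand-ins for the large accounts the network mass-followed.
ANCHOR_ACCOUNTS = {"example_politician", "example_news_outlet"}
MIN_ANCHORS_FOLLOWED = 2  # illustrative threshold

def looks_like_network_member(account):
    """Check both network traits: pre-2017 creation date and mass-following."""
    created = datetime.strptime(account["created_at"], "%Y-%m-%d")
    if not 2010 <= created.year <= 2016:
        return False
    followed = set(account.get("following", []))
    return len(followed & ANCHOR_ACCOUNTS) >= MIN_ANCHORS_FOLLOWED

with open("candidate_accounts.jsonl", encoding="utf-8") as f:
    members = [a for a in map(json.loads, f) if looks_like_network_member(a)]

print(f"{len(members)} accounts match both network traits")
```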
This spam network consists of (at least) 59,645 Twitter accounts, the vast majority of which were created between 2010 and 2016. All recent content from the accounts in this network was allegedly tweeted via the Twitter Web App. (This doesn’t demonstrate that the accounts are human-driven, as there are a variety of ways to automate or emulate a web browser.) Some of the accounts also have older tweets, sent prior to 2016 via a variety of apps. The old content is highly varied and appears unrelated to what the accounts are tweeting in April 2023, suggesting that the accounts in this network were hacked, hijacked, or purchased.
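For completeness, here’s a quick sketch of how the creation-year and source-app patterns described above might be tabulated; the field names, date formats, and input files are illustrative assumptions.

```python
import json
from collections import Counter
from datetime import datetime

# Count account creation years (expect a spike between 2010 and 2016).
creation_years = Counter()
with open("network_accounts.jsonl", encoding="utf-8") as f:
    for line in f:
        account = json.loads(line)
        creation_years[datetime.strptime(account["created_at"], "%Y-%m-%d").year] += 1

# Count source apps for recent tweets (expect only "Twitter Web App").
source_apps = Counter()
with open("network_tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        if tweet.get("date", "").startswith("2023"):
            source_apps[tweet.get("source", "unknown")] += 1

print("Accounts created per year:", dict(sorted(creation_years.items())))
print("Source apps for 2023 tweets:", source_apps.most_common())
```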
The vast majority of this network's tweets are single-sentence tweets from April 2023 that, based on the presence of ChatGPT error output in the same corpus of tweets, seem likely to have been generated with ChatGPT. Other than the error messages, this set of tweets contains very few exact duplicates. (This is one of the potential advantages of using large language models for spam: one can generate a nearly endless variety of outputs from a given prompt.) Although the exact prompts used are unknown, the words most commonly found in the tweets are “ChatGPT”, “China”, and “Erzurum”.
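Here’s a minimal sketch of the two checks mentioned above, counting exact duplicates and the most common words in the April 2023 tweets; the crude tokenization and the input file are simplified assumptions.

```python
import json
import re
from collections import Counter

with open("april_2023_tweets.jsonl", encoding="utf-8") as f:
    texts = [json.loads(line)["text"] for line in f]

# Exact-duplicate count: tweets whose full text appears more than once.
duplicates = sum(n - 1 for n in Counter(texts).values() if n > 1)
print(f"{duplicates} of {len(texts)} tweets are exact duplicates of another tweet")

# Crude word-frequency tally (case-folded, letters only).
words = Counter()
for text in texts:
    words.update(re.findall(r"[a-z]+", text.lower()))
print("Most common words:", words.most_common(10))
```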
The remainder of the network’s content is cryptocurrency spam; specifically, quote tweets of giveaway tweets from cryptocurrency/NFT accounts. These tweets begin with short recurring phrases such as “wow i am happy” and generally tag between three and six other Twitter accounts. There is currently no evidence that ChatGPT or any other AI model was involved in the generation of these quote tweets.
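One plausible way to separate this portion of the content is to look for the recurring opening phrases and the typical three-to-six tagged accounts; the phrase list, field names, and input file below are illustrative assumptions.

```python
import json
import re

# Recurring opening phrases seen in the crypto giveaway quote tweets;
# "wow i am happy" is from the article, others would be added as found.
OPENING_PHRASES = ("wow i am happy",)

def is_crypto_giveaway_spam(tweet):
    """Match tweets that start with a known phrase and tag 3-6 accounts."""
    text = tweet.get("text", "").lower()
    mentions = re.findall(r"@\w+", text)
    return text.startswith(OPENING_PHRASES) and 3 <= len(mentions) <= 6

with open("network_tweets.jsonl", encoding="utf-8") as f:
    crypto_spam = [t for t in map(json.loads, f) if is_crypto_giveaway_spam(t)]

print(f"{len(crypto_spam)} tweets match the cryptocurrency giveaway spam pattern")
```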
To conclude, I’ll loop back to the beginning of this article: sometimes the best way to detect malicious or deceptive use of AI is to look for the parts of the output that the human user(s) of the AI models in question failed to clean up. Actors who are engaged in account creation and automated content generation at scale simply aren’t going to clean up every bit of anomalous output that their tools generate; they’re likely to be focused on quantity of output rather than quality. This is also why, almost half a decade in, the trick for identifying StyleGAN faces by eye position is still extremely useful — yes, malicious actors can crop or rotate the faces and some do, but many don’t, because it’s less effort and the resulting “faces” are still sufficient to fool the people who encounter them often enough to be worth using.
The research in this article was originally presented in this Twitter thread.