Spam in the firehose

Using repeated text in the Bluesky firehose to detect spam accounts in near-real time

Dec 06, 2024

image of a spraying firehose with screenshots of Bluesky account searches superimposed — not a literal depiction of the Bluesky firehose

Every public action on the social media platform Bluesky is published via a stream of events known as the Bluesky firehose. This stream can be used to monitor Bluesky in near-real time for various behaviors indicative of spam or other inauthentic activity. For example, accounts that are created in bulk often use the same names and biographies over and over, and this repetition can be tracked by programmatically watching the firehose for profile updates. Over the course of five days, monitoring the firehose for repeated biographies flagged 2234 spam accounts, over half of which belong to a single network.

# MONITOR FIREHOSE

import atproto
import atproto_firehose as hose
from atproto_firehose.models import MessageFrame
from atproto_client.models import get_or_create
import json
import pandas as pd
import time
import warnings
warnings.filterwarnings ("ignore")

out_path = "bsky_monitoring/"
profile_queue = []
post_queue = []


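def retry (method, params):
    # call method (params), retrying with exponential backoff;
    # returns None if all attempts fail
    retries = 5
    delay = 1
    while retries > 0:
        try:
            return method (params)
        except Exception as e:
            print ("    error (" + str (e) + "), sleeping " \
                    + str (delay) + "s")
            time.sleep (delay)
            delay = delay * 2
            retries = retries - 1
    return None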


def get_profiles (actors, client):
    profiles = []
    while len (actors) > 0:
        if len (actors) > 25:
            batch = actors[:25]
            actors = actors[25:]
        else:
            batch = actors
            actors = []
        r = retry (client.app.bsky.actor.get_profiles,
                {"actors" : batch})
        # retry returns None when every attempt fails; skip that batch
        if r is not None:
            profiles.extend (r.profiles)
    return profiles


def on_message (message, test_function, handler):
    message = hose.parse_subscribe_repos_message (message)
    if isinstance (message, 
            atproto.models.ComAtprotoSyncSubscribeRepos.Commit):
        blocks = atproto.CAR.from_bytes (message.blocks).blocks
        for op in message.ops:
            uri = atproto.AtUri.from_str ("at://" + message.repo \
                    + "/" + op.path)
            raw = blocks.get (op.cid)
            if raw:
                try:
                    record = get_or_create (raw, strict=False)
                    if record is not None and \
                            record.py_type is not None:
                        rdict = record.model_dump ()
                        item = {
                            "repo"       : message.repo,
                            "revision"   : message.rev,
                            "sequence"   : message.seq,
                            "timestamp"  : message.time,
                            "action"     : op.action,
                            "cid"        : str (op.cid),
                            "path"       : op.path,
                            "collection" : uri.collection,
                            "record"     : rdict,
                            "type"       : "commit"
                        }
                        if test_function (item):
                            handler (item)
                except Exception as e:
                    print ("ERROR! " + str (e))
    elif isinstance (message, 
            atproto.models.ComAtprotoSyncSubscribeRepos.Handle):
        item = {
            "did"        : message.did,
            "sequence"   : message.seq,
            "timestamp"  : message.time,
            "handle"     : message.handle,
            "type"       : "handle"
        }
        if test_function (item):
            handler (item)   
    elif isinstance (message, 
            atproto.models.ComAtprotoSyncSubscribeRepos.Account):
        item = {
            "did"        : message.did,
            "sequence"   : message.seq,
            "timestamp"  : message.time,
            "active"     : message.active,
            "status"     : message.status,
            "type"       : "account"
        }
        if test_function (item):
            handler (item)   
    elif isinstance (message, 
            atproto.models.ComAtprotoSyncSubscribeRepos.Identity):
        item = {
            "did"        : message.did,
            "sequence"   : message.seq,
            "timestamp"  : message.time,
            "type"       : "identity"
        }
        if test_function (item):
            handler (item)   
            

def monitor_bsky_firehose (test_function, handler):
    firehose = hose.FirehoseSubscribeReposClient ()
    while True:
        try:
            print ("connecting to firehose...")
            firehose.start (lambda message: on_message (message,
                    test_function, handler))
        except Exception:
            # a bare except would also swallow Ctrl-C (KeyboardInterrupt)
            print ("firehose error, sleeping 20s...")
            time.sleep (20)
            

def is_profile_update (item):
    return  item["type"] == "commit" and \
            item["collection"] == "app.bsky.actor.profile" and \
            item["path"] == "app.bsky.actor.profile/self"


def is_post_create (item):
    return  item["type"] == "commit" and \
            item["collection"] == "app.bsky.feed.post" and \
            item["action"] == "create"


def to_profile_update (d):
    r = d["record"]
    return {
        "did"          : d["repo"],
        "revision"     : d["revision"],
        "sequence"     : d["sequence"],
        "timestamp"    : d["timestamp"],
        "created_at"   : r["created_at"],
        "description"  : "" if r["description"] is None else \
                         r["description"].strip (),
        "display_name" : "" if r["display_name"] is None else \
                         r["display_name"].strip ()
    }


def to_post_create (d):
    r = d["record"]
    return {
        "did"          : d["repo"],
        "revision"     : d["revision"],
        "sequence"     : d["sequence"],
        "timestamp"    : d["timestamp"],
        "path"         : d["path"],
        "created_at"   : r["created_at"],
        "text"         : "" if r["text"] is None else \
                         r["text"].strip (),
    }


def summarize_queue (label, queue, repeat_fields=None, 
                     min_repeat_count=5,
                     min_repeat_length=20,
                     max_queue=1000000,
                     unique="did"):
    df = pd.DataFrame (queue)
    if repeat_fields is None:
        repeat_fields = df.columns
    print ("*** " + label + " ***")
    print ("total events:    " + str (len (queue)))
    print ("unique dids: " + str (len (set (df["did"]))))
    for col in repeat_fields:
        df1 = df[[col, unique]].drop_duplicates ([col, unique])
        g = df1.groupby (col)
        df0 = pd.DataFrame ({"count" : g.size ()}).reset_index ()
        df0 = df0[df0["count"] >= min_repeat_count]
        df0 = df0[df0[col].fillna ("").str.len () >= min_repeat_length]
        if len (df0.index) > 0:
            keep = set (df0[col])
            results = []
            for value in keep:
                df1 = df[df[col] == value].drop_duplicates ([col,
                        unique])
                dids = list (set (df1["did"]))
                if len (dids) >= min_repeat_count:
                    print (value)
                    print ("dids: " + str (len (dids)))
                    profiles = [p.model_dump () \
                             for p in get_profiles (dids, client)]
                    print ("profiles: " + str (len (profiles)))
                    results.append ({
                        col        : value,
                        "dids"     : dids,
                        "records"  : df1.to_dict (orient="records"),
                        "profiles" : profiles
                    })
            if len (results) > 0:
                with open (out_path + label + "-" + col + "-" + \
                        str (time.time ()) + ".json", "w") as file:
                    json.dump (results, file, indent=2)
    print ()
    if len (queue) > max_queue:
        del queue[:max_queue // 10]
    

def monitor_repetition (item):
    if is_post_create (item):
        post_queue.append (to_post_create (item))    
        if len (post_queue) % 25000 == 0:
            summarize_queue ("posts", post_queue,
                    min_repeat_count=20,
                    min_repeat_length=60,
                    repeat_fields=["text"],
                    unique="path")
    elif is_profile_update (item):
        profile_queue.append (to_profile_update (item))
        if len (profile_queue) % 2500 == 0:
            summarize_queue ("profiles", profile_queue,
                    repeat_fields=["description", "display_name"])

            
client = atproto.Client ()
client.login ("**************", "**************")
monitor_bsky_firehose (lambda x: True, monitor_repetition)

The above Python code uses the atproto module to connect to the Bluesky firehose, and maintains queues of the most recent million profile updates and most recent million posts. These queues are periodically scanned for exact duplication of biographies, display names, and post text across multiple accounts. Of these three forms of simple repetition, non-trivial biographies duplicated by at least five accounts were by far the most accurate indicator of inauthentic activity, with 2234 of the 2380 accounts flagged (93.9%) being confirmed as spam via manual inspection. The remainder of this article will focus on these accounts; results of the scan for repeated post text will be revisited in a future analysis.
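The core of the duplicate-biography check can be illustrated without a firehose connection. The following is a minimal sketch rather than the code used for the analysis: `find_repeated_bios` and the sample records are hypothetical, and the thresholds simply mirror the `min_repeat_count=5` and `min_repeat_length=20` defaults of `summarize_queue` above.

```python
from collections import defaultdict


def find_repeated_bios(updates, min_accounts=5, min_length=20):
    """Map each sufficiently long biography to the sorted DIDs that use it,
    keeping only biographies shared by at least min_accounts accounts."""
    dids_by_bio = defaultdict(set)
    for update in updates:
        # Treat a missing biography as empty and ignore surrounding whitespace.
        bio = (update.get("description") or "").strip()
        if len(bio) >= min_length:
            # A set ensures repeated updates from one account count once.
            dids_by_bio[bio].add(update["did"])
    return {
        bio: sorted(dids)
        for bio, dids in dids_by_bio.items()
        if len(dids) >= min_accounts
    }


# Made-up profile updates: five accounts share one biography, one of them
# updates its profile twice, and one unrelated account has its own biography.
bio = "passionate about sharing unique perspectives in the field of travel"
sample = [{"did": "did:plc:example" + str(i), "description": bio}
          for i in range(5)]
sample.append({"did": "did:plc:example0", "description": bio})
sample.append({"did": "did:plc:example9",
               "description": "just here for the cat photos"})

repeated = find_repeated_bios(sample)
for text, dids in repeated.items():
    print(len(dids), text)
```

Counting distinct accounts rather than raw events matters here, since a single account that edits its profile several times would otherwise look like a cluster.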

pie chart breaking down a set of 2234 fake bluesky accounts by account type: porn, crypto spam, etc — slightly more than half of the accounts identified belong to a single botnet

The 2234 Bluesky accounts with repetitive biographies identified by monitoring the firehose include multiple networks (or portions thereof) and a variety of account types, with porn, crypto spam, and gray market account sales being recurring themes. Many of the porn accounts were suspended while the experiment was still underway; most of the accounts in the other groups are still online as of the time of this writing. Slightly over half of the accounts detected (1155 of 2234, 51.7%) belong to a single network.

collage of Bluesky accounts with biographies of the form "passionate about A in the field of B" — passionate about creating large numbers of similar accounts

The largest set of accounts detected by monitoring the firehose is a network of 1155 accounts with biographies of the form “passionate about <A> in the field of <B>”. Example biographies include “passionate about contributing to impactful projects in the field of culture”, “passionate about sharing unique perspectives in the field of travel”, and “passionate about exploring innovative ideas in the field of technology”. Each account’s display name consists of a first name and last name, and most of their handles match their display names, with random digits inserted somewhere.
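Because these biographies follow a fixed template, they can also be matched by pattern rather than by exact duplication. The regular expression below is a sketch based only on the example biographies quoted above; the names `TEMPLATE` and `parse_template_bio` are hypothetical, and the real network may use variations this pattern misses.

```python
import re

# Template inferred from the quoted examples:
# "passionate about <A> in the field of <B>"
TEMPLATE = re.compile(
    r"^passionate about (?P<activity>.+) in the field of (?P<field>.+)$",
    re.IGNORECASE,
)


def parse_template_bio(bio):
    """Return (activity, field) if the biography fits the template, else None."""
    match = TEMPLATE.match(bio.strip())
    if match is None:
        return None
    return match.group("activity"), match.group("field")


examples = [
    "passionate about contributing to impactful projects in the field of culture",
    "passionate about exploring innovative ideas in the field of technology",
    "I know looks aren't everything, but I have them just in case.",
]
parsed = [parse_template_bio(bio) for bio in examples]
for bio, result in zip(examples, parsed):
    print(result, "<-", bio)
```

A pattern like this could catch template biographies that have not yet been reused five times, at the cost of possibly matching a real person who happens to phrase a biography the same way.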

table of biographies used by the "passionate about..." spam network — 1155 accounts with 80 distinct yet oddly similar biographies

The 1155 accounts in the network have 80 distinct biographies between them, with each biography duplicated by at least five accounts. The most frequent biography is “passionate about contributing to impactful projects in the field of culture”, presently used by 25 of the accounts in the network. Thus far, none of the accounts in this network has posted anything, although most have followed a handful of accounts and a scattered few have picked up followers of their own here and there.

hourly bar chart of account creation times and table of accounts most frequently followed by the spam accounts — all of these accounts are recent creations

All of the 1155 accounts in the spam network were created between November 30th and December 3rd, 2024, with a batch of over 150 accounts created in a single hour on December 3rd. The creation of accounts shows no signs of slowing, so the network will likely be larger by the time you read this. Most of the accounts in the network follow somewhere in the neighborhood of five accounts belonging to real people; thus far, there is no discernible pattern to the accounts that the spam accounts follow.
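An hourly creation histogram of this kind can be computed by truncating each creation timestamp to the hour and counting. The sketch below assumes each account's creation time is available as an ISO 8601 string in UTC; the function name `creations_per_hour` and the sample timestamps are made up for illustration, and this is not necessarily how the chart above was produced.

```python
from collections import Counter
from datetime import datetime, timezone


def creations_per_hour(created_at_values):
    """Count account creations per UTC hour from ISO 8601 timestamp strings."""
    counts = Counter()
    for value in created_at_values:
        # Older Python versions reject a trailing "Z", so spell out the offset.
        ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
        hour = ts.astimezone(timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return counts


# Hypothetical creation timestamps, not real data from the network.
sample_times = [
    "2024-12-03T14:05:09.123Z",
    "2024-12-03T14:59:59Z",
    "2024-11-30T09:30:00Z",
]
counts = creations_per_hour(sample_times)
busiest = counts.most_common(1)[0]
print(busiest[0].isoformat(), busiest[1])
```

A sharp spike in a single hour, like the batch of over 150 accounts described above, stands out immediately in counts of this form.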

collage of accounts with duplicate biographies and the display name "Daisy" — we follow back all patriots and we know looks aren’t everything

The process of monitoring the Bluesky firehose turned up several smaller networks as well. Among these networks is a group of accounts with the display name “Daisy” and handles consisting of “daisy” with numbers attached to the beginning and end. The accounts in this network use two repeated biographies; one of these is political (“Welcome new friends and follow back all patriots…”), while the other is more general (“I know looks aren’t everything, but I have them just in case.”). Each of these accounts has posted or reposted a small number of image posts; some but not all have been suspended by Bluesky.
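A handle convention like this one lends itself to a simple pattern check. The following sketch assumes the numbers are attached directly to “daisy” in the first label of the handle, which is a guess from the description above; the handles shown are invented, and `looks_like_daisy_handle` is a hypothetical helper.

```python
import re

# Digits, then "daisy", then digits, with nothing else in the first label.
DAISY_HANDLE = re.compile(r"^\d+daisy\d+$")


def looks_like_daisy_handle(handle):
    """Check whether the first label of a handle is digits + 'daisy' + digits."""
    first_label = handle.lower().split(".", 1)[0]
    return DAISY_HANDLE.match(first_label) is not None


# Invented handles for illustration only.
handles = [
    "12daisy345.bsky.social",
    "daisy.bsky.social",
    "daisy99.bsky.social",
]
flags = [looks_like_daisy_handle(h) for h in handles]
print(list(zip(handles, flags)))
```

On its own, a match is weak evidence, since a real person could pick a similar handle; it is more useful combined with the repeated display name and biographies.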

unsurprisingly, the spam accounts use plagiarized photos

As tends to be the case, many of the spam accounts use stolen profile images. Google reverse image searches for three of the images from the “Daisy” network are shown above; the profile photos used by most of the other networks are likewise plagiarized. (A few of the networks use icons of various types as avatars rather than photographs.)

collage of 25 bluesky accounts with the biography "message for a handle transfer fee or your competitor's advertisements will be posted" — come for the spam, stay for the low-effort extortion attempts

As mentioned earlier, some of the spam accounts are for sale, although in the case of one network, “extortion” might be a better word than “sale”. 25 accounts with handles implying affiliation with various major corporations such as Netflix, Best Buy, and Progressive Insurance have the biography “message for a handle transfer fee or your competitor’s advertisements will be posted”. Since these spam accounts have few or no followers, this form of digital blackmail is unlikely to be effective (or even noticed by its intended victims), but apparently someone nonetheless felt it was worth trying.

collage of 19 bluesky accounts with the biography "Because One Checkmark Just Isn't Enough" and different color checkmarks as avatars — so many checkmarks, so little time

Finally, there are a few groups of accounts with identical biographies that are just plain bizarre, such as a set of 19 accounts with the biography “Because One Checkmark Just Isn’t Enough”, and checkmarks in every color of the rainbow as avatars. (Edit: responses on Bluesky have made it clear that these accounts are labelers intended to apply parody verification checkmarks to accounts on an opt-in basis.)

This experiment is relatively primitive, and there is plenty of room for future optimization and improvement in both the post- and profile-based detection techniques. Bluesky’s use of an open protocol makes projects of this sort relatively straightforward, and I plan to iterate further on the work described here. As a final note, it is unfortunate that established major platforms have made work like this more difficult by eschewing an open approach in favor of walled gardens and APIs with prohibitively massive price tags.

Conspirador Norteño
