Exploring Amazon recommendations with Python

How to automate the process of mapping the networks formed by Amazon's recommendation algorithm

Apr 06, 2024

Screenshot of six authors recommended by Amazon when viewing Brenda R Downey's author page. Four of these authors have GAN-generated faces. — Amazon suggests additional authors with GAN-generated faces when looking at an author with a GAN-generated face

A previous article on this blog covered a set of Amazon “authors” with StyleGAN-generated faces and numerous published books that appeared to be artificially generated as well. This group of authors was found with the aid of Amazon’s recommendation algorithm, which helpfully guides users from one inauthentic author to the next via the “Customers also bought items by” feature. The process of exploring these recommendations and mapping out the networks that they form can be automated with relative ease. This article describes how to do so using Python; similar code can be written in most popular programming languages.

screenshots of HTML of various portions of an Amazon author page, including the author name and recommended authors — the “Inspect” feature of your web browser is your friend

To map out Amazon’s “Customers also bought items by” recommendations, we can simply automate the same HTTP requests issued by an actual user’s web browser and parse the resulting HTML. In Python, this can be done using the requests and Beautiful Soup libraries. All modern web browsers have an “Inspect” feature that allows one to examine the rendered HTML of any content displayed in the browser, which we can use to find the relevant tags and attributes to extract the author’s name, unique ID, biography, and photo, as well as the set of algorithmically recommended authors. We can then repeat this process with the recommended authors, which in turn yields a new batch of recommended authors, and so on.

# note that the specific tags and attributes to search for may
# change over time as Amazon makes changes to their site

import bs4
import json
import requests
import time

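def retry (action, retries=6, delay=2):
    # call action up to `retries` times, doubling the delay after each
    # failure; the final attempt lets any exception propagate
    for i in range (retries - 1):
        try:
            return action (None)
        except Exception:
            # catch Exception rather than using a bare except, so that
            # Ctrl-C (KeyboardInterrupt) still stops the script
            print ("error, sleeping " + str (delay) + "s")
            time.sleep (delay)
            delay = delay * 2
    return action (None)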

def download_image (data, path):
    url = data["image_url"]
    file = path + data["id"] + url[url.rfind ("."):]
    with requests.get (url, stream=True) as r:
        r.raise_for_status ()
        with open (file, "wb") as f:
            for chunk in r.iter_content (chunk_size=8192): 
                f.write (chunk)
        
def fetch_amazon_author (author_id):
    url = "https://www.amazon.com/stores/author/" \
            + author_id + "?ref_=ast_author_cabib"
    print (url)
    r = requests.get (url)
    # treat HTTP error statuses as failures so retry () can back off
    r.raise_for_status ()
    soup = bs4.BeautifulSoup (r.text, "html.parser")
    name = soup.find ("span", {"itemprop" : "name"}).get_text ()
    image_url = soup.find ("div", {
        "class" : "Header__author-logo__NTjEd"
    }).find ("img")["src"]
    recs = []
    for e in soup.find_all ("div", 
            {"class" : "SimilarAuthors__author-card__uy2nT"}):
        rec_id = e.find ("a")["href"].replace ("/stores/author/", "")
        rec_id = rec_id[:rec_id.find ("?")]
        rec_name = e.find ("div", {
            "class" : "SimilarAuthors__author-card__link-name__UjB8Z"
        }).get_text ()
        rec_image_url = e.find ("img")["src"]
        recs.append ({
            "id"        : rec_id,
            "name"      : rec_name,
            "image_url" : rec_image_url
        })
    url = "https://www.amazon.com/stores/author/" \
            + author_id + "/about"
    r = requests.get (url)
    # treat HTTP error statuses as failures so retry () can back off
    r.raise_for_status ()
    soup = bs4.BeautifulSoup (r.text, "html.parser")
    bio = "\n\n".join ([p.get_text () for p in soup.find ("div", {
        "class" : "AuthorBio__author-bio__author-biography__WeqwH"
    }).find_all ("p")])
    return {
        "id"          : author_id,
        "name"        : name,
        "recommended" : recs,
        "image_url"   : image_url,
        "bio"         : bio
    }

def explore (author_ids, max_iterations=6, delay=1,
            image_path=None, image_path_small=None):
    results = []
    already = set ()
    if image_path and not image_path.endswith ("/"):
        image_path = image_path + "/"
    if image_path_small and not image_path_small.endswith ("/"):
        image_path_small = image_path_small + "/"       
    for i in range (max_iterations):
        queue = set ()
        for author_id in author_ids:
            data = retry (lambda x: fetch_amazon_author (author_id))
            already.add (author_id)
            results.append (data)
            if image_path:
                download_image (data, image_path)
            for rec in data["recommended"]:
                queue.add (rec["id"])
                if image_path_small:
                    download_image (rec, image_path_small)
            time.sleep (delay)
        author_ids = queue - already
        print (str (len (results)) + " results after " \
               + str (i + 1) + " iterations")
    return results

results = explore (["B0CL7X72YY", "B0B91BN53M", "B0CQXFL1KY"],
                  image_path="amazon_images_full",
                  image_path_small="amazon_images_small")
with open ("amazon_test.json", "w") as f:
    json.dump (results, f)

The Python code above takes a list of Amazon author IDs, downloads each author’s name, image, and biography, and collects the set of authors recommended in the “Customers also bought items by” section for each author. This process is repeated using the recommended authors as input until a desired number of iterations has been reached. In this example, three authors with GAN-generated faces (“Rosanne McClure”, “Brenda R Downey” and “Jason N. Martin N. Martin”) were used as the seed authors, which yielded 1917 authors after 6 iterations. Two of the three seed authors churn out dubious cookbooks, while the third specializes in repetitive books on making small talk.
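The iteration logic does not depend on Amazon itself, so it can be tried offline. The following is a minimal sketch, assuming a small invented recommendation table (`TOY_RECS`, with made-up IDs "A" through "F") in place of live requests. The helper `expand` is hypothetical and only mirrors the way `explore` handles the set of authors to visit next.

```python
# Toy stand-in for Amazon: maps an author ID to the IDs recommended on
# that author's page. These IDs and links are invented for illustration.
TOY_RECS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
}

def expand(seed_ids, max_iterations, fetch_recs):
    """Visit the seeds, then their recommendations, and so on, one
    round per iteration, never visiting the same author twice."""
    visited = []      # authors in the order they were fetched
    already = set()   # authors fetched so far
    frontier = list(seed_ids)
    for _ in range(max_iterations):
        queue = set()
        for author_id in frontier:
            if author_id in already:
                continue
            already.add(author_id)
            visited.append(author_id)
            queue.update(fetch_recs(author_id))
        # only authors not yet fetched go into the next round
        frontier = sorted(queue - already)
        if not frontier:
            break
    return visited

print(expand(["A"], 3, lambda a: TOY_RECS.get(a, [])))
```

With three iterations the toy run stops at "D"; raising `max_iterations` to 6 reaches every author in the table, which illustrates how each extra iteration pulls in one more layer of recommendations.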

71 GAN-generated faces used as portraits by Amazon authors — some of these GAN-generated faces have been resized or shifted off-center

Among these 1917 authors are 71 authors with GAN-generated faces (including the three initial seed authors). Some of these images have been edited, possibly in an effort to obscure their synthetic origins. Several have been zoomed or cropped, which confuses methods of detection that rely on the telltale eye positioning present in StyleGAN faces. In the case of at least two of the authors, “Bridget Bishop” and “Steven Carlson”, the abstract background of the GAN-generated face image has been replaced with a photograph of a real location.

Some of the authors with images other than GAN-generated faces look suspect as well. Several have cranked out large numbers of repetitive, low-quality books that, based on the text samples provided by Amazon, may well be composed of artificially generated text. Some of these suspicious authors use AI images produced by text-to-image models as their portraits (as pointed out by Erin Gallagher on X) rather than GAN-generated faces, while others use stock photos of real people.

network diagram showing the recommendation relationships between 1917 authors reachable via 6 degrees of separation or less from three authors with GAN-generated faces — nodes represent Amazon authors; edges indicate one author was recommended in the other’s “Customers also bought items by” section

Amazon’s recommendations can be visualized as a network, where each node represents one of the authors in the dataset, and two authors are connected by an edge if the “Customers also bought items by” section for one author contains a recommendation for the second author. The authors form clusters, largely based on the topics of their books. This dataset contains prominent clusters devoted to food preparation (green) and financial matters (red), with smaller clusters focused on puzzles (blue), Westerns (orange), and mysticism (purple) showing up as well. The sparse gray area toward the top of the figure includes several well-known real authors, such as Dune saga creator Frank Herbert.
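As a rough sketch of how the node and edge lists for such a network could be derived, the example below works on records shaped like the output of the `explore` function shown earlier. The records themselves (`sample_results`) use placeholder IDs and names, and `build_edges` is a hypothetical helper; the article does not state which tool was used to lay out or draw the graph.

```python
# Records in the same shape as the output of explore() above; the IDs
# and names are placeholders, not real Amazon authors.
sample_results = [
    {"id": "ID1", "name": "Author One",
     "recommended": [{"id": "ID2", "name": "Author Two"},
                     {"id": "ID3", "name": "Author Three"}]},
    {"id": "ID2", "name": "Author Two",
     "recommended": [{"id": "ID1", "name": "Author One"}]},
    {"id": "ID3", "name": "Author Three",
     "recommended": [{"id": "ID4", "name": "Author Four"}]},
]

def build_edges(results):
    """Return (nodes, edges), where each edge (a, b) means that
    author b was recommended on author a's page."""
    # nodes are limited to authors that were actually fetched
    nodes = {author["id"]: author["name"] for author in results}
    edges = set()
    for author in results:
        for rec in author["recommended"]:
            # skip recommendations that point outside the dataset
            if rec["id"] in nodes:
                edges.add((author["id"], rec["id"]))
    return nodes, edges

nodes, edges = build_edges(sample_results)
print(len(nodes), "nodes,", len(edges), "edges")
for source, target in sorted(edges):
    print(nodes[source], "->", nodes[target])
```

Keeping each edge as an ordered pair preserves the direction of the recommendation, and the resulting lists can be handed to whatever graph library or visualization tool one prefers.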

Topics were determined by keyword analysis of the authors’ biographies, with each author assigned to the category that has the most keywords appearing in their biography. Authors with no biography, along with authors whose biographies contain none of the keywords, are shown in gray. The prominence of the food preparation cluster isn’t entirely surprising, given that two of the three seed authors belong to this category.
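A keyword-based assignment of this kind might look roughly like the sketch below. The keyword lists in `TOPIC_KEYWORDS` are hypothetical placeholders, since the actual keywords are not given here, and `categorize` is an illustrative helper rather than the code used to produce the figures. It counts keyword occurrences; counting distinct keywords instead would be an equally plausible reading.

```python
import re

# Hypothetical keyword lists; the actual keywords are not published in
# the article, so these are placeholders for illustration only.
TOPIC_KEYWORDS = {
    "food":    ["recipe", "recipes", "cooking", "chef", "kitchen"],
    "finance": ["investing", "money", "finance", "wealth", "budget"],
    "puzzles": ["puzzle", "puzzles", "sudoku", "crossword"],
}

def categorize(bio):
    """Return the topic whose keywords occur most often in bio, or
    "uncategorized" when the bio is empty or matches no keyword."""
    words = re.findall(r"[a-z']+", (bio or "").lower())
    counts = {
        topic: sum(words.count(keyword) for keyword in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    # ties go to whichever topic is listed first in TOPIC_KEYWORDS
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "uncategorized"

print(categorize("A chef who loves cooking and sharing recipes."))
print(categorize("Writes about investing and building wealth."))
print(categorize(""))
```

The "uncategorized" result corresponds to the gray nodes in the diagram: authors with no biography or with a biography that matches none of the keywords.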

network diagram showing the recommendation relationships between 1917 authors, with the portraits of the 71 authors with GAN-generated faces included — beware cookbooks from authors with fake faces

The authors with GAN-generated faces are clustered particularly densely in the food preparation section of the network (green), with a more loosely connected group of GAN-faced authors showing up in the finance section (red). Note that the network is a directed graph, since recommendations are one-way relationships: just because author B shows up as a recommendation on author A’s page does not mean that author A will necessarily be recommended by the algorithm when viewing author B. This is particularly true when disparities in author popularity are in play. For example, the inauthentic authors with GAN-generated faces lead to the aforementioned Frank Herbert within six recommendations, but the same is not true in reverse: none of the GAN-faced authors turn up within six degrees of separation if Herbert is used as the seed author.
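The asymmetry can be checked with a depth-limited breadth-first search over the directed edges. The sketch below uses an invented toy graph (`TOY_GRAPH`, with made-up node names) in place of the real dataset, and `reachable_within` is a hypothetical helper written for this illustration.

```python
from collections import deque

# Invented toy graph: an edge a -> b means that b appears among the
# recommendations on a's page. "popular" never links back to the fakes.
TOY_GRAPH = {
    "fake1":   ["fake2", "midlist"],
    "fake2":   ["fake1"],
    "midlist": ["popular"],
    "popular": ["classic"],
    "classic": ["popular"],
}

def reachable_within(graph, start, max_steps):
    """Return {author: steps} for every author reachable from start
    by following at most max_steps recommendations."""
    distance = {start: 0}
    pending = deque([start])
    while pending:
        current = pending.popleft()
        if distance[current] == max_steps:
            continue  # do not look past the step limit
        for neighbor in graph.get(current, []):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                pending.append(neighbor)
    return distance

# forward: the fake author leads to the popular one
print("popular" in reachable_within(TOY_GRAPH, "fake1", 6))
# reverse: the popular author never leads back to the fake one
print("fake1" in reachable_within(TOY_GRAPH, "popular", 6))
```

Running the same search in both directions on the real edge list is one way to confirm that a path from one author to another exists in only one direction.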

network diagram showing the recommendation relationships between 570 authors reachable via 6 degrees of separation or less starting from Frank Herbert — no authors with GAN-generated faces here (appropriate, given that AI is illegal in the interstellar empire portrayed in the *Dune* saga)

2 Comments

WaltFrench@EarthLink.Net

Apr 6 · Liked by Conspirador Norteño

Amazon’s intentional facilitation of this garbage reminds me of Bannon’s “flood the zone with shit” designed to destroy people’s ability to form understanding and trust

If Amazon wanted to be a trusted seller of products, it’d simply require each author/seller/store to assert “this is my actual likeness,” “I personally wrote this book,” “my US tax ID number is XXX…” and it’d be the basis for customers’ ability to pursue fraud cases against exploitative items

But no; Amazon manages to take no responsibility for deception and it must be that they ENJOY being the lowest common denominator

Anyway, thanks for showing the extent of new & improved zone-flooding

FINTEL

I wonder if any of these actually had any sales? I guess one should wonder what their ultimate purpose is/was? Was it one of those viral inspirations from TikTok (10 things to make money now) or something more sinister? The StyleGAN images seem to be the go-to for nefarious actors, so I wonder what the purpose was for these books, other than to tell people they have books published. Great work as always.