The Curious Case of BNN
Rephrasing news articles from major outlets is much easier and cheaper than actually hiring journalists all over the world
BNN (the "Breaking News Network", a news website operated by tech entrepreneur and convicted domestic abuser Gurbaksh Chahal) allegedly offers independent news coverage from an extensive worldwide network of on-the-ground reporters. As is often the case, things are not as they seem. A few minutes of perfunctory Googling reveals that much of BNN's "coverage" appears to be mildly reworded articles copied from mainstream news sites. For science, here’s a simple technique for algorithmically detecting this form of copying.
Many of the articles on the BNN website appear to have been created by copying and rewording individual paragraphs from articles published on major media sites. Much of the original language is preserved, however, and this can be detected in a variety of ways. One (extremely simple but reasonably effective) algorithm is as follows:
Split each article (both the BNN articles and the articles being compared to) into paragraphs, convert to lowercase, and eliminate all punctuation so that each paragraph becomes a list of consecutive words.
Convert each paragraph into the set of trigrams (sequences of three consecutive words) it contains. (N-grams of other lengths work as well.)
To compare two articles, compare the trigram set of each paragraph in the first article against the trigram set of each paragraph in the second. The score for a pair of paragraphs is the percentage of trigrams in the first paragraph that are also present in the second. The pair of articles is considered a match if the average of the highest score obtained by each paragraph in the first article is at least 25%, and no paragraph in the first article has a highest score of 0%.
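To make the scoring step concrete, here is a minimal sketch of the paragraph-level trigram overlap calculation, using two made-up sentences. The trigrams helper and the example text are purely illustrative and are not part of the analysis code, which appears in full later in the post.

import re

def trigrams (text):
    # lowercase, strip punctuation, split into words, then build the set of trigrams
    words = [w for w in re.split (r"\W+", text.lower ()) if w]
    return set (" ".join (words[i:i + 3]) for i in range (len (words) - 2))

# invented example sentences, purely for illustration
original = "The quick brown fox jumps over the lazy dog near the riverbank."
reworded = "A quick brown fox jumps over the lazy dog close to the riverbank."

t1 = trigrams (reworded)
t2 = trigrams (original)
# percentage of the reworded paragraph's trigrams that also appear in the original
print (round (100 * len (t1 & t2) / len (t1), 1))   # prints 54.5

Despite the light rewording, over half of the second sentence's trigrams survive intact, which is exactly the kind of signal the article-level comparison aggregates.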
The major media sites included in this analysis are Reuters, AP (Associated Press), Fox News (including Fox Business), TASS, and the BBC. To gather article text from these websites (as well as BNN itself), Wayback Machine, Archive.today, and Twitter search results were used to compile a list of unique article URLs for each site. The archive sites were also used to harvest the text of (most of) the articles as raw HTML, which was then parsed into paragraphs and stored in CSV format using the following Python code (apologies for the formatting):
import bs4
import os
import pandas as pd
import sys

# command-line arguments: directory of saved HTML files, directory for the output CSVs
path = sys.argv[1]
out_path = sys.argv[2]

# each parser takes a BeautifulSoup object and the article URL and returns a
# list of paragraph rows (or None if the page couldn't be parsed)
def bnn (soup, url):
    elements = soup.find_all ("div", {"class" : "post-content"})
    if len (elements) == 1:
        rows = []
        elements = elements[0].find_all ("p")
        pos = 0
        times = soup.find_all ("time", {"class" : "post-date"})
        t = times[0]["datetime"].replace ("T", " ")
        for element in elements:
            text = element.text.strip ()
            rows.append ({"url" : url,
                          "text" : text,
                          "paragraphID" : pos,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
        return rows
    else:
        return None

def reuters (soup, url):
    times = soup.find_all ("time")
    if times is None or len (times) == 0:
        return None
    if times[0].has_attr ("datetime"):
        t = times[0]["datetime"]
    else:
        spans = times[0].find_all ("span")
        if len (spans) == 4:
            t = spans[1].text + " " + spans[2].text
        elif len (spans) == 1:
            t = spans[0].text
        else:
            t = times[0].text
    elements = soup.find_all ("p")
    pos = 0
    rows = []
    for element in elements:
        # only keep <p> tags that Reuters marks as article paragraphs
        para = None
        if element.has_attr ("data-testid"):
            test = element["data-testid"]
            if test.startswith ("paragraph-"):
                para = int (test.replace ("paragraph-", ""))
        elif element.has_attr ("id"):
            test = element["id"]
            if test.startswith ("paragraph-"):
                para = test.replace ("paragraph-", "")
        elif element.has_attr ("class") and element["class"][0].startswith ("Paragraph-paragraph"):
            para = pos
        if para is not None:
            rows.append ({"url" : url,
                          "text" : element.text,
                          "paragraphID" : para,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
    return rows

def ap_news (soup, url):
    times = soup.find_all ("span", {"data-key" : "timestamp"})
    if len (times) == 1:
        t = times[0]["data-source"]
    else:
        return None
    articles = soup.find_all ("div", {"class" : "Article"})
    if len (articles) == 0:
        articles = soup.find_all ("article")
    if len (articles) != 1:
        return None
    else:
        elements = articles[0].find_all ("p", recursive=True)
        rows = []
        pos = 0
        for element in elements:
            rows.append ({"url" : url,
                          "text" : element.text,
                          "paragraphID" : pos,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
        return rows

def fox (soup, url):
    times = soup.find_all ("time")
    if times is None or len (times) == 0:
        return None
    if times[0].has_attr ("datetime"):
        t = times[0]["datetime"]
    else:
        t = times[0].text
    articles = soup.find_all ("div", {"class" : "article-body"})
    if len (articles) != 1:
        return None
    else:
        elements = articles[0].find_all ("p", recursive=True)
        rows = []
        pos = 0
        for element in elements:
            rows.append ({"url" : url,
                          "text" : element.text,
                          "paragraphID" : pos,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
        return rows

def tass (soup, url):
    times = soup.find_all ("dateformat")
    if times is None or len (times) == 0:
        print ("no time: " + url)
        return None
    else:
        t = times[0]["time"]
    articles = soup.find_all ("div", {"class" : "text-block"})
    if len (articles) != 1:
        print ("articles " + str (len (articles)) + " " + url)
        return None
    else:
        elements = articles[0].find_all ("p", recursive=True)
        rows = []
        pos = 0
        for element in elements:
            rows.append ({"url" : url,
                          "text" : element.text,
                          "paragraphID" : pos,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
        return rows

def bbc (soup, url):
    times = soup.find_all ("time")
    if times is None or len (times) == 0:
        t = ""
    elif times[0].has_attr ("datetime"):
        t = times[0]["datetime"]
    else:
        t = times[0].text
    articles = soup.find_all ("article")
    if len (articles) != 1:
        # fall back to the <main> element for alternate BBC layouts
        articles = soup.find_all ("main")
        if len (articles) != 1:
            return None
        else:
            elements = articles[0].find_all ("div", {"dir" : "ltr"},
                                             recursive=False)
            if len (elements) == 0:
                elements = articles[0].find_all ("div", {"dir" : "rtl"},
                                                 recursive=False)
            rows = []
            pos = 0
            for element in elements:
                rows.append ({"url" : url,
                              "text" : element.text,
                              "paragraphID" : pos,
                              "paragraph" : pos,
                              "utcTime" : t})
                pos = pos + 1
    else:
        elements = articles[0].find_all ("div",
                                         {"data-component" : "text-block"},
                                         recursive=False)
        rows = []
        pos = 0
        for element in elements:
            rows.append ({"url" : url,
                          "text" : element.text,
                          "paragraphID" : pos,
                          "paragraph" : pos,
                          "utcTime" : t})
            pos = pos + 1
    return rows

# map each domain to its parser function and a human-readable site name
parsers = {
    "bnn.network" : [bnn, "BNN"],
    "reut.rs" : [reuters, "Reuters"],
    "www.reuters.com" : [reuters, "Reuters"],
    "apnews.com" : [ap_news, "AP"],
    "www.foxbusiness.com" : [fox, "Fox News"],
    "www.foxnews.com" : [fox, "Fox News"],
    "tass.com" : [tass, "TASS"],
    "bbc.com" : [bbc, "BBC"],
    "www.bbc.com" : [bbc, "BBC"],
}

data = {}
errors = {}
domains = []
# filenames encode the article URL with "/" replaced by "_", so the portion
# before the first underscore is the domain
for f in os.listdir (path):
    domain = f[:f.find ("_")]
    url = "https://" + f.replace ("_", "/")
    domains.append (domain)
    if domain in parsers:
        parser = parsers[domain]
        try:
            with open (path + f, "r") as file:
                html = file.read ()
            soup = bs4.BeautifulSoup (html)
            result = parser[0] (soup, url)
            if result is not None:
                if len (result) == 0:
                    print ("zero paragraphs: " + url)
                if parser[1] in data:
                    data[parser[1]].extend (result)
                else:
                    data[parser[1]] = result
            else:
                print ("error parsing: " + url)
                if parser[1] in errors:
                    errors[parser[1]] = errors[parser[1]] + 1
                else:
                    errors[parser[1]] = 1
        except Exception:
            print ("unknown error: " + f)

# write one CSV of paragraphs per site and print a brief summary
for site in data.keys ():
    print (site)
    df = pd.DataFrame (data[site])
    print (str (len (set (df["url"]))) + " articles successfully parsed")
    print (str (len (df.index)) + " paragraphs total")
    print (str (errors[site] if site in errors else 0) + " articles with errors")
    df.to_csv (out_path + site + "_content.csv", index=False)
    print ("----------")
From there, the parsed articles from each news outlet were compared to the parsed articles from every other outlet, with matches written to a CSV file (along with the average and per-paragraph scores for each pair of matched articles). Python code for the comparison process is below; the matching threshold and the n-gram length can both be tweaked with minimal effort. (Note that this is a very basic algorithm; you can easily find papers and blog posts on a variety of plagiarism detection techniques by typing “plagiarism detection” into your search engine of choice.)
import os
import pandas as pd
import re
import sys
import time

# lowercase the text and split it into words, discarding punctuation
def parse_words (text):
    return re.split (r"\W+", text.lower ())

# convert each sufficiently long paragraph of an article into its set of
# n-grams (trigrams by default), caching the result on the article dict
def get_phrases (article, window=3, min_words=10):
    if "phrases" in article:
        return article["phrases"]
    phrases = []
    for text in article["paragraphs"]:
        if len (text) >= min_words:
            phrases.append (set ([" ".join (text[i:i + window])
                                  for i in range (len (text) - window + 1)]))
    article["phrases"] = phrases
    return phrases

# fraction of the first paragraph's n-grams that also appear in the second
def phrase_similarity (phrases1, phrases2):
    if len (phrases1) == 0 or len (phrases2) == 0:
        return None
    common = len (set.intersection (phrases1, phrases2))
    return common / len (phrases1)

# for each paragraph in article1, record the best score against any paragraph in article2
def compare_articles (article1, article2):
    scores = []
    for phrases1 in get_phrases (article1):
        score = 0
        for phrases2 in get_phrases (article2):
            pscore = phrase_similarity (phrases1, phrases2)
            if pscore is not None:
                score = max (pscore, score)
        scores.append (score)
    return scores

# compare every article from site1 against every article from site2 and keep
# the pairs whose average per-paragraph score clears the cutoff
def compare_article_sets (site1, site2, articles1, articles2,
                          cutoff=0.2, max_zeroes=0):
    matches = []
    for article1 in articles1:
        for article2 in articles2:
            scores = compare_articles (article1, article2)
            zeroes = 0
            for score in scores:
                if score == 0:
                    zeroes = zeroes + 1
            score = 0 if len (scores) == 0 else sum (scores) / len (scores)
            if score >= cutoff and zeroes <= max_zeroes:
                matches.append ({"site1" : site1,
                                 "site2" : site2,
                                 "article1" : article1["url"],
                                 "article2" : article2["url"],
                                 "scores" : scores,
                                 "score" : score})
                print ("potential match " + str (scores) + " " + article1["url"] + " <- " + article2["url"])
    return matches

def count_and_sort (df, columns):
    g = df.groupby (columns)
    return pd.DataFrame ({"count" : g.size ()}).reset_index ().sort_values (
        "count", ascending=False).reset_index ()

start_time = time.time ()
data = {}
path = sys.argv[1]
site1 = sys.argv[2]
out_file = sys.argv[3]

# load each site's parsed paragraphs and tokenize them into word lists
for f in os.listdir (path):
    df = pd.read_csv (path + f)
    df["text"] = df["text"].fillna ("")
    site = f.replace ("_content.csv", "")
    articles = []
    for url in set (df["url"]):
        articles.append ({"url" : url,
                          "paragraphs" : [parse_words (text) for text in
                              df[df["url"] == url].sort_values ("paragraph")["text"]]})
    data[site] = articles
    print (site + ": " + str (len (articles)))

# compare the target site against every other site
matches = []
sites = set (data.keys ())
for site2 in sites:
    if site1 != site2:
        print (site1 + " : " + site2)
        matches.extend (compare_article_sets (site1, site2, data[site1], data[site2]))

print (str (len (matches)) + " potential matches")
df = pd.DataFrame (matches)
df = df.sort_values ("score", ascending=False)
df = df[["site1", "site2", "score", "scores", "article1", "article2"]]
print (str (len (df.index)) + " rows")
print (count_and_sort (df, ["site1", "site2"]))
for site in sites:
    print (site + ": " + str (len (set (df[df["site1"] == site]["article1"]))) + " articles match")
df.to_csv (out_file, index=False)
print (str (time.time () - start_time) + " seconds")
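This second script likewise takes its configuration from the command line: the directory containing the *_content.csv files produced by the parsing step, the name of the site whose articles are being checked (BNN for this analysis), and a path for the output CSV of matches. An invocation might look something like python compare_articles.py parsed/ BNN matches.csv (again, the script name is arbitrary).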
Results: 287 of 1259 BNN articles analyzed (22.8%) appear to have been copied from one of these five sites (Reuters, AP, Fox News, TASS, and the BBC). This is likely just the tip of the iceberg in terms of the total amount of non-original content on the site; in most cases, simply plugging a sentence or two from a BNN article into a search engine leads to an article on an established news site from which the BNN article was likely copied and rephrased. A major exception to this rule is articles about BNN itself (or its founder Gurbaksh Chahal), all of which seem to be original material.
As a sanity check, the major outlets were compared to each other (as well as BNN) using the same algorithm used to compare BNN to the major outlets. The results were markedly different: only ~0.01% of articles in the mainstream outlets were flagged as potentially copied, as opposed to 22.8% of the BNN articles.