Seeing isn't always believing: video edition
The era of text-to-video generative AI is upon us, bringing with it new twists on old problems
Earlier this week, OpenAI announced the development of a new generative AI model named Sora, which is capable of generating videos up to a minute in length from a text prompt. Much of the video generated by Sora is of sufficient quality to be mistaken for real footage by a casual viewer, making it a powerful tool for those who wish to use video content for deceptive purposes. Additionally, the mere existence of this technology makes it easier for dishonest actors to falsely claim that real video footage is artificially generated (a phenomenon known as the “liar’s dividend”).
Although the quality of the videos produced by Sora is impressive, a variety of anomalies indicate the synthetic origin of the content. For example, a brief video shared by OpenAI CEO Sam Altman of a woman cooking includes a segment where a spoon appears out of nowhere in the cook’s hand for just long enough to stir the contents of the bowl in front of her, after which the spoon spontaneously vanishes. There are other oddities present as well: why does this person store eggs precariously on the edge of a shelf rather than in the refrigerator? What’s the deal with the diagonal rolling pin in the background, and what principle of physics holds it in place? The cook’s body language is also unnatural — the head in particular doesn’t stay in sync with the body over the course of the video.
One of the most technically impressive AI-generated video clips shared by OpenAI is a 60-second clip generated from the prompt “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. she wears a black leather jacket, a long red dress, and black boots, and carries a black purse. she wears sunglasses and red lipstick. she walks confidently and casually. the street is damp and reflective, creating a mirror effect of the colorful lights. many pedestrians walk about”. Despite the overall quality of this video, there are nonetheless some problems, particularly with the pedestrians in the background. For example, two people walking toward the camera near a crosswalk in the background make no progress during the portion of the video where they are obscured by the protagonist, who moves noticeably forward during the same time period.
Another issue with Sora: the geometry of objects sometimes changes over the course of a given video. In an artificially generated aerial video of a construction site, a vehicle crosses what appear to be some relatively flat timbers early in the clip. By the end of the video, however, the same pile of timbers has magically grown in height and looks far too tall for a small vehicle to drive over. One of the vertical beams supporting the vehicle’s roof also vanishes over the course of the video, and there are some odd inconsistencies in the size of the components of the structures being built.
A video generated from the prompt “A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.” contains several instances of the Sora model struggling with scenes that contain multiple people interacting with furniture or other objects. The person in the background on the left wearing a shirt with thick vertical stripes has a table passing directly through their torso, for instance. Additionally, most of the chairs in the video are geometrically nonsensical when closely examined. Several aspects of the clothing depicted are also physically implausible, such as the structure of the sleeves on the yellow shirt worn by the person just to the right of the middle of the image.
Few things in life compare to being woken up in the morning by the paws of a playful cat, but Sora’s rendition of this particular experience has a few issues. Partway through the video, an additional front paw materializes out of nowhere and joins the paw already pressed against the nose of the sleepy human. The interaction between paw and face also results in the human’s visible nostril being unrealistically enlarged at several points in the video, and the red collar seen briefly at the beginning is an abstract blob that does not resemble any commonly used type of collar.
A Sora-generated video of a walkthrough of an art gallery looks reasonably convincing, as long as one doesn’t look too closely at the alleged works of art. One painting appears to show a person with three legs, while others contain incoherent jumbles of fabric and hands rendered in a style that loosely resembles an oil painting.
While this article is hardly an exhaustive list of potential anomalies in Sora-generated videos, it hopefully illustrates that, although these videos are in many ways photorealistic, they can still be identified as synthetically generated by a discerning eye. Caution is warranted, however: an overabundance of suspicion, and the overeager diagnosis of real video (or other types of fake video) as AI-generated, can actually serve to enable dishonest actors by increasing the level of distrust of all video, including real footage of real events. It’s also worth keeping in mind that, while this technology is new and impressive, the notion of deceiving people with video has been with us for quite some time, and plain old deceptive editing, when competently executed, can be just as effective a tool of manipulation as deepfakes.