Seeing isn't always believing: video edition
The era of text-to-video generative AI is upon us, bringing with it new twists on old problems
Earlier this week, OpenAI announced the development of a new generative AI model named Sora, which is capable of generating videos up to a minute in length from a text prompt. Much of the video generated by Sora is of sufficient quality to be mistaken for real footage by a casual viewer, making it a powerful tool for those who wish to use video content for deceptive purposes. Additionally, the mere existence of this technology makes it easier for dishonest actors to falsely claim that real video footage is artificially generated (a phenomenon known as the “liar’s dividend”).
Although the quality of the videos produced by Sora is impressive, a variety of anomalies indicate the synthetic origin of the content. For example, a brief video shared by OpenAI CEO Sam Altman of a woman cooking includes a segment where a spoon appears out of nowhere in the cook’s hand for just long enough to stir the contents of the bowl in front of her, after which the spoon spontaneously vanishes. There are other oddities present as well: why does this person store eggs precariously on the edge of a shelf rather than in the refrigerator? What’s the deal with the diagonal rolling pin in the background, and what principle of physics holds it in place? The cook’s body language is also unnatural — the head in particular doesn’t stay in sync with the body over the course of the video.
One of the most technically impressive AI-generated video clips shared by OpenAI is a 60-second clip generated from the prompt “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. she wears a black leather jacket, a long red dress, and black boots, and carries a black purse. she wears sunglasses and red lipstick. she walks confidently and casually. the street is damp and reflective, creating a mirror effect of the colorful lights. many pedestrians walk about”. Despite the overall quality of this video, there are nonetheless some problems, particularly with the pedestrians in the background. For example, two people walking toward the camera near a crosswalk in the background make no progress during the portion of the video where they are obscured by the protagonist, who moves noticeably forward during the same time period.
Another issue with Sora: the geometry of objects sometimes changes over the course of a given video. In an artificially generated aerial video of a construction site, a vehicle crosses what appear to be some relatively flat timbers early in the clip. By the end of the video, however, the same pile of timbers has magically grown in height and looks far too tall for a small vehicle to drive over. One of the vertical beams supporting the vehicle’s roof also vanishes over the course of the video, and there are some odd inconsistencies in the size of the components of the structures being built.
A video generated from the prompt “A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.” contains several instances of the Sora model struggling with scenes that contain multiple people interacting with furniture or other objects. The person in the background on the left wearing a shirt with thick vertical stripes has a table passing directly through their torso, for instance. Additionally, most of the chairs in the video are geometrically nonsensical when closely examined. Several aspects of the clothing depicted are also physically implausible, such as the structure of the sleeves on the yellow shirt worn by the person just to the right of the middle of the image.
Few things in life compare to being woken up in the morning by the paws of a playful cat, but Sora’s rendition of this particular experience has a few issues. Partway through the video, an additional front paw materializes out of nowhere and joins the paw already pressed against the nose of the sleepy human. The interaction between paw and face also results in the human’s visible nostril being unrealistically enlarged at several points in the video, and the red collar seen briefly at the beginning is an abstract blob that does not resemble any commonly used type of collar.
A Sora-generated video of a walkthrough of an art gallery looks reasonably convincing, as long as one doesn’t look too closely at the alleged works of art. One painting appears to show a person with three legs, while others contain incoherent jumbles of fabric and hands rendered in a style that loosely resembles an oil painting.
While this article is hardly an exhaustive list of potential anomalies in Sora-generated videos, it hopefully illustrates that, although these videos are in many ways photorealistic, they can still be identified as synthetically generated by a discerning eye. Caution is warranted, however: an overabundance of suspicion, and the overeager diagnosis of real video (or other types of fake video) as AI-generated, can actually serve to enable dishonest actors by increasing the level of distrust of all video, including real footage of real events. It’s also worth keeping in mind that, while this technology is new and impressive, the notion of deceiving people with video has been with us for quite some time, and plain old deceptive editing, when competently executed, can be just as effective a tool of manipulation as deepfakes.