The new Microsoft Video Authenticator rates videos with a percentage chance that the content has been artificially manipulated.
In the age of eager reporters surfing the internet for scandalous scoops, who is helping defend TV station newsrooms by detecting and tagging fake pictures and videos before they air, and how are they doing it? Hint: It’s not ‘golden eyes’ or government regulators.
A few years after photography was invented in the 1830s, people learned how to physically manipulate it for fun, profit and deceit. Nearly two hundred years and thousands of phony UFO photos and physically doctored publicity shots later, Adobe introduced Photoshop. The popular digital raster graphics editing software made it possible for anyone with a computer, talent, inspiration and patience to create or recreate a realistic but manipulated photo-reality.
In the years since Photoshop’s 1990 debut, digital video, machine learning (ML), artificial intelligence (AI), and deep neural networks trained by processing examples have brought similar and newer powers for manipulating and managing video. In TV stations, ML, AI and neural networks have all become important components in media management, editing, captioning, monitoring and compliance solutions. What is becoming TV broadcasting’s new best friend can also be its foe.
These new digital learning powers are also ripe for nefarious actors to produce and distribute fake videos promoting a particular political agenda. If your newsroom isn’t concerned about ‘Deepfake’ videos, it should be, because Deepfakes have quickly evolved beyond what the human eye can detect. They are dangerous because it’s easy for one person to create fake news that passes the video display eye-test.
Fake video technology debuted commercially in 2018 with FakeApp software, originally intended for users to have fun creating and sharing videos with their faces swapped. Like nearly all new forms of entertainment technology, as it got better and easier to access, its uses branched out from wholesome fun to deceit and disinformation. During 2019 and 2020, FakeApp was superseded by several alternative and rapidly evolving, off-the-shelf Deepfake production systems that produce excellent results on COTS hardware and continue improving.
There are also several generative adversarial network (GAN)-based face swapping methods, published with accompanying code, that qualify as Deepfakes. Fake video producers with similar intentions but without ML, AI or neural network capabilities make so-called ‘cheapfake’ videos, which are easier to produce but perhaps more likely for a viewer to spot. A couple of inexpensive fake video systems reside entirely in the cloud. How photorealistic does a fake parody video need to be?
Deepfake software running on COTS systems has led to an explosion of Deepfake parody videos and targeted political attacks that look incredibly real. In addition to the pandemic, the 2020 challenge for TV newsrooms is to confirm that all media and clips are real before airing them as factual. Deepfake technology’s progress and ease of access make all digital media suspect.
Digital forensics experts can analyze unique, high-impact videos for evidence of manipulation, but they can’t scale to review each of the hundreds of thousands of videos uploaded to the internet or social media platforms every day. Checking that worldwide dynamic content for Deepfakes necessitates automated scalability, and computer vision or multimodal models specifically designed for the task.
Microsoft recently announced its new Video Authenticator, not to be confused with the Microsoft Authenticator phone app for multi-factor verification. The new Video Authenticator technology detects manipulated content to help assure people that what they are viewing is authentic.
According to Microsoft, “Video Authenticator can analyze a still photo or video to provide a percentage chance, or confidence score, that the media is artificially manipulated. In the case of a video, it can provide this percentage in real-time on each frame as the video plays. It works by detecting the blending boundary of the Deepfake and subtle fading or greyscale elements that might not be detectable by the human eye.”
The new technology has two components. First is a Microsoft Azure tool that enables content producers to add digital hashes and certificates to their content as metadata. The second component is a reader that checks the certificates and matches the hashes with a high degree of accuracy. It verifies that the content is authentic and unchanged, and it provides details about who produced it. The reader can be a browser extension or in other forms.
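The two-component workflow described above can be sketched in a few lines. This is a hypothetical illustration, not Microsoft’s actual format: a real system would hash individual media segments and sign them with X.509 certificates, whereas here a single SHA-256 hash and an HMAC with a producer key stand in for the certificate.

```python
import hashlib
import hmac

PRODUCER_KEY = b"example-producer-secret"  # placeholder signing key (hypothetical)

def certify(content: bytes, producer: str) -> dict:
    """Producer side: compute a content hash and sign it as metadata."""
    digest = hashlib.sha256(content).hexdigest()
    signature = hmac.new(PRODUCER_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return {"producer": producer, "sha256": digest, "signature": signature}

def verify(content: bytes, metadata: dict) -> bool:
    """Reader side: re-hash the content and check it against the certificate."""
    digest = hashlib.sha256(content).hexdigest()
    expected = hmac.new(PRODUCER_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return digest == metadata["sha256"] and hmac.compare_digest(expected, metadata["signature"])

clip = b"\x00\x01example-clip-bytes"
meta = certify(clip, "Example Newsroom")
print(verify(clip, meta))         # unchanged content verifies: True
print(verify(clip + b"!", meta))  # any alteration fails: False
```

The key property is the second call: flipping even one byte of the clip changes the hash, so a reader can flag tampering without ever having seen the original.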
Facebook is countering the emerging Deepfake threat by constructing an extremely large swap video dataset to enable the training of detection models. Facebook also organized the DeepFake Detection Challenge (DFDC) Kaggle competition. Kaggle is an online community of data scientists and machine learning engineers, and a Google subsidiary.
The DFDC has enabled experts from around the world to come together, benchmark their deepfake detection models, try new approaches, and learn from each others’ work. Unlike previous databases, all recorded subjects used in the DFDC agreed to participate and to allow their likeness to be manipulated and modified during construction of the DFDC dataset.
The DFDC dataset is the largest face swap video dataset currently available to the public. It includes more than 100,000 clips featuring 3,426 paid actors, produced with several Deepfake, GAN-based and non-learned methods. Deepfake detection models trained only on the DFDC can generalize to real “in-the-wild” Deepfake videos, making them valuable tools when analyzing potentially Deepfaked videos.
The first-generation datasets contain fewer than 1,000 videos and fewer than 1 million frames. They also lack agreements with the individuals appearing in them for the rights to use their likenesses. Many of the videos were sourced from YouTube and feature public individuals.
The second-generation datasets contain between 1,000 and 10,000 videos and 1 to 10 million frames, and the videos have better perceptual quality than those in the first generation. During generation two, subjects appearing in a dataset without their consent became an ethical concern. In response, the preview version of this second-generation dataset using consenting actors was released. Shortly thereafter, the similar Google DFD dataset was released, containing 28 paid actors. However, the datasets in this generation do not contain enough identities to allow for sufficient detection generalization.
Facebook recently proposed a third generation of datasets that have significantly more frames and videos than the second generation. The most recent DeepFake datasets, DeeperForensics-1.0 and Facebook’s DFDC Dataset, contain tens of thousands of videos and tens of millions of frames. Paid actor consent has continued for both datasets.
DF-1.0 is the dataset most similar to the DFDC Dataset, but there are still major differences between them. Each of the 100,000 fake videos in the DFDC Dataset is a unique target/source swap; ignoring perturbations, DF-1.0 contains only 1,000 unique fake videos.
Methods and Models
Most Deepfake videos contain a target and a source. The target is the base video; the source is the content used to swap onto a face in the target video. All the face-swapped videos in the dataset were created with one of several methods. The set of models selected was designed to cover some of the most popular face swapping methods at the time the dataset was created.
Some methods with less-realistic results were included in order to represent low-effort Deepfakes and cheapfakes. The number of videos per method is not equal; most face-swapped videos were created with the Deepfake Autoencoder (DFAE). This choice was made to reflect the distribution of public Deepfaked videos.
About 95% of faked videos are created with DeepFaceLab off-the-shelf software. Similar software includes Zao, FaceSwap, and Deepfakes web.
DFAE methods were generally the most flexible and produced the best results. A DFAE uses one shared encoder but two separately trained decoders, one for each identity in the swap. In addition, the shared portion of the encoder extends one layer beyond the bottleneck, and the upscaling functions used are typically PixelShuffle operators: non-standard, non-learned functions that map channels in one layer to spatial dimensions in the next layer.
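The PixelShuffle rearrangement mentioned above can be shown in isolation. This sketch (in NumPy, mirroring the semantics of PyTorch’s `torch.nn.PixelShuffle`) is not the DFAE itself, only the non-learned channel-to-space mapping it relies on for upscaling.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    Pure reindexing with no learned weights: each group of r*r channels
    becomes an r-by-r spatial block in the upscaled output.
    """
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave channel blocks spatially
    return x.reshape(c, h * r, w * r)

# Four 1x1 feature maps become one 2x2 map, upscaling by a factor of 2.
feat = np.arange(4.0).reshape(4, 1, 1)
out = pixel_shuffle(feat, 2)
print(out.shape)  # (1, 2, 2)
```

Because the operator has no parameters, it adds upscaling capacity to the decoder without increasing the number of weights that must be trained per identity pair.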
The morphable mask/neural network (MM/NN) method performs swaps with a custom frame-based morphable-mask model. Facial landmarks in the target image and source image are computed, and the pixels from the source image are morphed to match the landmarks in the target image. The MM/NN method was able to produce convincing single-frame images but tended to produce discontinuities in the face.
The FSGAN method uses GANs to perform face swapping (and reenactment) of a source identity onto a target video, accounting for pose and expression variations. FSGAN applies an adversarial loss to generators for reenactment and inpainting, and trains additional generators for face segmentation. FSGAN tends to produce flat-looking results and works better with good lighting.
The StyleGAN method is modified to produce a face swap between a given fixed identity descriptor onto a video by projecting this descriptor on the latent face space. This process is executed for every frame, and it produced the worst overall results at the frame and video levels.
The NTH model is a GAN-like method that produced the most consistent quality. It can generate realistic talking heads of people in few- and one-shot learning settings. It consists of two distinct training stages: a meta-learning stage and a fine-tuning stage. A pre-trained model is fine-tuned with pairs of videos in the raw DFDC set: the landmark positions are extracted from the driving video and fed into the generator to produce images with the appearance of the person in the other video.
Using a sharpening filter on the blended faces greatly increased the perceptual quality in the final video at no additional expense.
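The article does not say which sharpening filter was applied, so the sketch below uses a standard 3x3 sharpening kernel (an assumption) to illustrate the kind of cheap, non-learned post-process that can boost perceptual quality on blended faces.

```python
import numpy as np

# A common 3x3 sharpening kernel; the weights sum to 1, so flat regions
# are unchanged while local contrast around edges is amplified.
KERNEL = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)

def sharpen(img: np.ndarray) -> np.ndarray:
    """Convolve a grayscale image with KERNEL, edge-padding the borders."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (KERNEL * padded[i:i + 3, j:j + 3]).sum()
    return out

# A flat region passes through unchanged; only detail is emphasized.
flat = np.ones((4, 4))
print(np.allclose(sharpen(flat), flat))  # True
```

Because the filter is a fixed convolution, it adds essentially no compute or training cost, which is why it improves the final video "at no additional expense."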
An initial set of 857 subjects were selected into the training set, and there are over 360,000 potential pairings within this group. Training all pairs would require almost 1,000 GPU-years, assuming it takes one day to train a DFAE model on one GPU.
Subjects were paired within a set with those with similar appearances, as this tended to give better results for models like the DFAE. Over 800 GPUs were used to train 6,683 pairwise models (which required 18 GPU-years), as well the more flexible models such as NTH or FSGAN that only required a small amount of fine-tuning per subject. Finally, a subset of 10 second clips were selected from the output of all models, and the overall distribution of gender and appearance was balanced across all sets and videos.
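The scale figures above can be sanity-checked with quick arithmetic, assuming (as stated) one GPU-day to train one pairwise DFAE model:

```python
from math import comb

subjects = 857
pairs = comb(subjects, 2)       # all unordered pairings of 857 subjects
print(pairs)                    # 366796, i.e. "over 360,000"

# Training every pair at 1 GPU-day each would need nearly 1,000 GPU-years.
print(round(pairs / 365))       # ~1005 GPU-years

# Training the 6,683 selected pairwise models is far more tractable.
print(round(6683 / 365))        # ~18 GPU-years
```

This is why pair selection mattered: training only similar-looking pairs cut the cost by roughly a factor of 50 while improving swap quality.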
Making fake media is easy. Detecting it is complicated. The industry’s best experts are at work on detection solutions to help people and broadcasters not be misled by it.