AI In The Content Lifecycle: Part 3 - Gen AI In Post Production
This series takes a practical, no-hype look at the role of Generative AI technology in the various steps of the broadcast content lifecycle – from pre-production through production, post and delivery, to regulation and ethics. Here we examine how Generative AI is impacting post-production, bringing capabilities that were previously impossible or infeasible, automating special effects, generating content at the post-production stage, and adding audio to video.
Post-production has long had more scope for automation, and more to gain from it, than any other part of the audio/video workflow, so it is not surprising that it was early to employ contemporary forms of AI from around 2012. Yet the surface has barely been scratched given the huge scope and complexity of the task, and now a fresh wave of advances is breaking with Generative AI.
Post-production processes typically take considerably longer than the actual production, so a strong motivation for AI has been to reduce the time taken by tasks such as best-shot selection, addition of special effects, and removal of undesired artefacts from the audio or video, while also enhancing content with newly generated snippets. The second area for AI is to increase the scope of processes such as color correction, enabling higher levels of quality control and enhancement. The final step is to go further still and enable improvements that were simply not possible or feasible before, which could include incorporation of more sophisticated effects around Extended Reality (XR).
As this all might suggest, there is a trend towards converging the various stages of production from pre through to post to reduce time, improve efficiency and incorporate special effects with the primary content generation. Gen AI will figure here by enabling greater overlap and streamlining across the whole production workflow.
The stage had already been set by developments in nonlinear editing (NLE), breaking with the traditional post-production process that stepped through the material sequentially, scene by scene. Having once been an analog process, NLE was digitized in the shape of video editing software, allowing scenes to be edited as a whole for consistent application of special effects and emotional overlays such as color tinting.
But the broader tasks of video editing, enhancements such as CGI, color finishing, and final sound mixing with incorporation of special effects were still sequential, each having to finish before the next began. This could be restrictive in a creative sense, while failing to capitalize on the time savings that overlap could achieve.
Video creation tools from the leading Gen AI companies are finding application in post-production, generating fragments or even complete scenes that can be integrated and edited as a whole in a more nonlinear fashion. The larger language-based foundation platforms are also being applied across the post-production workflow, as well as to origination.
Examples include Microsoft’s Azure Video Indexer and Google DeepMind’s Veo, the latter launched in May 2024. These platforms have been incorporated by traditional vendors of post-production systems, or soon will be.
Such platforms may operate as cloud-based video analytics services, sometimes “at the edge”, meaning closer to the point of use to minimize latency or to comply with local privacy regulations. They can then generate actionable insights from the stored video, which could include identification of faces and logos, or understanding of on-screen graphics.
The number of people in a scene can be counted, or characteristics such as time of day determined by combining elements – all traditionally time-consuming post-production tasks. These processes can then be integrated with overall production, shrinking previously distinct temporal stages into a single time slot.
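To ground the idea, the snippet below is a minimal, self-contained sketch of that kind of analysis using OpenCV’s bundled face detector, counting visible people in sampled frames of a clip. It is not any vendor’s actual API, and the file path is purely illustrative.

```python
# Minimal sketch: counting faces per sampled frame of a clip with OpenCV.
# This stands in for the kind of analysis a cloud video-indexing service
# performs at far larger scale; the clip path is hypothetical.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("dailies/scene_042.mp4")  # hypothetical clip
frame_idx, counts = 0, {}
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 25 == 0:  # sample roughly once per second at 25 fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        counts[frame_idx] = len(faces)
    frame_idx += 1
cap.release()

print(counts)  # e.g. {0: 2, 25: 2, 50: 3, ...} - people visible per sampled frame
```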
Some such systems have also embodied substantial progress in video to audio (V2A), which has many intriguing applications. They make synchronized audiovisual generation possible for the first time as a coherent process, of interest for primary content generation but also in post-production for adding realistic sound effects or retrospective dialogue.
Post-production can also be applied after much longer lag times to archived content, either to restore old audio or create sound that was not there before. It can even generate soundtracks for old silent movies, which might have the likes of Charlie Chaplin and Buster Keaton turning in their graves. It was after all Charlie Chaplin who lamented that sound came along just when silent movies were becoming good. Nevertheless, there will be interest in experimenting with sound generation for old content as the technology improves.
A key point for success in post-production especially is that the V2A should operate down at the pixel level without any preconceived idea of what the video comprises and what objects are contained within each frame. It is worth considering how Gen AI is being applied here and what some of the outstanding technical challenges are.
The favored approach for audio generation from video, certainly at DeepMind, is the diffusion model, which has so far been found to generate the most compelling as well as accurate audio from raw video input. In essence, the video is first compressed, which seems to improve the process by paring it down to its essential parts and reducing redundancy. Natural language prompts are then issued to generate synchronized, realistic audio aligned as closely as possible with the prompt.
Finally, the audio output is decoded, turned into an audio waveform and combined with the original video input to yield the complete AV product.
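The shape of that pipeline can be sketched as follows. Every function here is a stub standing in for the real compression, diffusion and decoding models, so only the order of operations is illustrated; none of it corresponds to an actual product API.

```python
# Sketch of the V2A pipeline shape described above. The stubs return dummy
# arrays so the control flow is runnable end to end; real models replace them.
import numpy as np

def compress_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the video encoder that pares frames down to latents."""
    return frames.reshape(frames.shape[0], -1)[:, :64]

def diffuse_audio(video_latents: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the diffusion model that generates audio latents
    conditioned on the compressed video and the natural-language prompt."""
    rng = np.random.default_rng(len(prompt))
    return rng.standard_normal((video_latents.shape[0], 32))

def decode_waveform(audio_latents: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder that turns audio latents into a waveform."""
    return np.tanh(audio_latents).reshape(-1)

frames = np.zeros((250, 64, 64, 3), dtype=np.float32)       # dummy 10 s clip
latents = compress_video(frames)                             # 1. compress the video
audio_latents = diffuse_audio(latents, "rain on a tin roof") # 2. prompt-conditioned diffusion
waveform = decode_waveform(audio_latents)                    # 3. decode to a waveform
# 4. the waveform would then be muxed back with the original video input
print(waveform.shape)
```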
The key to success lies in the diffusion approach to machine learning applied in these generative models, which simulates the way information in a real-world system spreads across its representation as data points, gradually blending and transforming them to yield ever-evolving samples. This can be tuned readily to AV, which flows sequentially between frames, each closely related to its predecessor but changing progressively.
The diffusion model begins by learning the latent structure of its subject dataset, that is, the dynamics of data point movement through what is called its latent space. That space is created by embedding items such that the more closely a pair of items resemble each other, the nearer together they are located.
This sets the dynamic background for studying how data points flow through that space and how the whole is transformed over time, providing the basis for training the model to generate realistic audio matching these flows. More detailed information is then added to the training process, including transcriptions of spoken dialogue matching certain sequences.
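As a rough illustration of the training objective, the sketch below runs a single denoising step conditioned on a video embedding, using a tiny placeholder network. The dimensions, noising schedule and model are simplified assumptions, not how any production V2A system is actually built.

```python
# Illustrative single training step for a denoising diffusion objective,
# conditioned on a video embedding. The tiny MLP and linear noising schedule
# are placeholders; real V2A models are vastly larger and operate on latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_dim, video_dim, batch = 128, 256, 8

denoiser = nn.Sequential(
    nn.Linear(audio_dim + video_dim + 1, 512),
    nn.ReLU(),
    nn.Linear(512, audio_dim),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

clean_audio = torch.randn(batch, audio_dim)      # stand-in audio latents
video_embed = torch.randn(batch, video_dim)      # stand-in video embedding
t = torch.rand(batch, 1)                         # random diffusion time in [0, 1]

noise = torch.randn_like(clean_audio)
noisy_audio = (1 - t) * clean_audio + t * noise  # simple linear noising schedule

pred_noise = denoiser(torch.cat([noisy_audio, video_embed, t], dim=1))
loss = F.mse_loss(pred_noise, noise)             # learn to predict the added noise

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```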
In the Google Veo case, training on video, audio and additional annotations has enabled the system to learn that specific audio events are associated with various visual scenes. This provides a foundation for the audio generation, which can then be tuned further.
So far, such systems have been shown capable of processing video at the raw pixel level, which makes them more generally applicable, and can automatically align the generated sound with the video. There is no need for manual adjustment of sound elements and visuals, or their relative timing.
ML algorithms can already recognize regions within video footage where sound effects could be added to enhance a scene, so there is potential for combining these with V2A generation under some automation. But V2A itself is still a work in progress, because there are some distinct limitations associated with the original source video.
If that video is of poor quality, with artefacts and distortions, as would often be the case with archived footage, the generated audio will in turn be of low fidelity. There is potential for improvement, perhaps through preprocessing of the video, given that AI has already been applied in this way to restore old footage.
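As a simple illustration of what such preprocessing might involve, the snippet below cleans up a single archive frame with OpenCV’s non-local means filter before it would be handed on to a V2A model; the file paths and filter strengths are assumptions for illustration only.

```python
# Minimal sketch: denoising one archive frame before it enters a V2A pipeline.
# Paths and filter strengths are illustrative; real restoration is far richer.
import cv2

frame = cv2.imread("archive/silent_reel_07_frame0001.png")  # hypothetical still
if frame is not None:
    restored = cv2.fastNlMeansDenoisingColored(frame, None, 10, 10, 7, 21)
    cv2.imwrite("archive/restored_frame0001.png", restored)
```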
Another issue is lip synchronization, given that the generated audio does not follow directly from the characters or objects with which it should be associated. Bringing spoken audio into sync with lip movement where relevant is another complex research challenge.
The ability to generate both audio and video from scripts is also filtering into both production and post-production. With Gen AI there is growing capability to apply multimodal models that add granular detail to scenes, again straddling the borderline between production and post. This can include refining costumes, incorporating new shot angles, or changing aspects of movement. AI has also improved what can be called temporal comprehension, meaning that relevant changes applied to a given scene or small cluster of frames will automatically percolate through the whole movie or other content item.
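The underlying goal of temporal consistency can be illustrated with a deliberately simple sketch: an adjustment defined once, here a per-channel color gain, is applied uniformly across every frame of a scene. Gen AI systems handle far more complex edits, but the principle of a single change percolating through the clip is the same; the array shapes are illustrative.

```python
# Sketch of propagating one adjustment across a whole clip: a color gain
# defined for a reference frame is applied to every frame in the scene.
import numpy as np

def apply_grade(frames: np.ndarray, gain: np.ndarray) -> np.ndarray:
    """Apply a per-channel gain (e.g. a warm tint) to every frame."""
    return np.clip(frames * gain, 0.0, 1.0)

scene = np.random.rand(120, 64, 64, 3).astype(np.float32)  # 120 dummy RGB frames
warm_tint = np.array([1.08, 1.0, 0.92], dtype=np.float32)  # defined once
graded = apply_grade(scene, warm_tint)                      # applied everywhere
print(graded.shape)
```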
The ambition is that Gen AI will take automation to a higher level, first by recommending changes to scenes or aspects such as color, and then applying them automatically, with suitable annotation and explanation where relevant.
There is certainly growing consensus in the post-production field that Gen AI will bring further disruption and change. A survey conducted by Variety Intelligence Platform (VIP+) in collaboration with media research firm HarrisX in May 2024 found that among 150 US media and entertainment decision makers, 37% thought that Gen AI would have a major impact on film editing over the next two years, and a further 36% a minor impact.