Machine Learning (ML) For Broadcasters: Part 11 - Generative AI In Content Generation

Machine Learning, under the banner of Generative AI, is encroaching on creative aspects of Audio and Visual content generation, even down to production of movie scripts and screenplays. This builds on established applications of ML lower down the production chain, such as generation of clips and quality enhancement.

Unprecedented computational power, allied to advances in deep machine learning, is taking generative AI into creative areas beyond automation of more mundane tasks across the video lifecycle. So far, examples of AI and ML in creative content production have been confined to short films and documentaries, but the potential not merely to generate Audio and Visual content from textual inputs, but even to produce original scripts, has already been demonstrated to varying degrees.

While humans still reign supreme at the pinnacle of creativity in the Audio and Visual arts, there is definite concern, as well as hope, over the impact of inevitable further advances in generative AI capabilities. As in other applications of AI discussed in this series, we are talking here mainly about various shades of ML where fields are represented and analyzed as deep nested networks of nodes and weighted connections that allow identification of patterns and convergence around desired predictive models.

Generative AI has evolved from humble beginnings in relatively closed fields and branched out into domains such as production of marketing copy and essay writing to order, which immediately exposed some of the contentious areas, such as plagiarism and diminution of skills thought to be exclusively human. In generative AI, plagiarism arises not so much through direct theft of ideas or content as through indirect convergence around texts or inputs selected from the internet for training purposes. Such plagiarism is hard to pin down to specific sources but is a running sore for generative AI.

ML in the broader field of assisted content production has evolved along several other fronts, some of which are not yet fit for commercial deployment in full scale production but are now on the radar screen and promise to converge as part of the overall Audio and Visual workflow. One important thread is on the textual and audio side, where text-to-speech and voice synthesis techniques have come together to enable creation of more authentic audio from text for applications such as podcasting. In such cases text is converted to speech, which can be based on a synthesized voice or even an adaptation of the author’s voice, avoiding the need for actual dictation of the words. Podcastle.AI, a free extension for Google Chrome, is an example of such a utility for converting blog posts into podcasts, or indeed any text into speech.

The state of the art in the field of content generation from prior sources is text-to-video, which is very much work in progress because it is more complex, fuzzier and more computationally intensive than, say, text-to-voice. Some of the earlier efforts have taken what may well prove to be an ultimately futile route of going straight from text to video, bypassing the intermediate image or single-frame step, by creating large sets of text/video pairs culled from the internet. These data sets are used to train the ML algorithms that would subsequently generate video from fresh text. Some of these models have also gone straight to full resolution video, which itself adds computational complexity as well as uncertainty.

The way forward almost certainly is to break text-to-video generation into at least three steps. The first is to generate still images from the text, which then serve as input frames to the second step. The second is to expand the images across the spatiotemporal domain to create video sequences, initially at low resolution. Finally, a different deep neural network is applied to upscale the video to a higher resolution, also adding greater color depth and dynamic range.
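The three steps can be sketched as a chain of functions, each handing its output to the next. This is only a structural illustration: the stage functions are placeholders that track the shape of the clip rather than run any model, and the names, the upscaling factor and the bit depths are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Shape of a generated clip; pixel data is omitted in this sketch."""
    frames: int
    height: int
    width: int
    bit_depth: int


def text_to_image(prompt: str) -> Clip:
    # Step 1 (placeholder): a real system would run a text-to-image model.
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return Clip(frames=1, height=64, width=64, bit_depth=8)


def image_to_video(still: Clip, frames: int = 16) -> Clip:
    # Step 2 (placeholder): extend the still across time at low resolution.
    return Clip(frames, still.height, still.width, still.bit_depth)


def upscale(video: Clip, factor: int = 4, bit_depth: int = 10) -> Clip:
    # Step 3 (placeholder): a separate network raises resolution and depth.
    return Clip(video.frames, video.height * factor,
                video.width * factor, bit_depth)


def generate(prompt: str) -> Clip:
    # Compose the three stages in order.
    return upscale(image_to_video(text_to_image(prompt)))


result = generate("a dog running on a beach")
print(result)
```

Keeping the stages separate in this way means any one of them can be retrained or replaced without touching the others.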

The best-known manifestation of this approach to date is Make-A-Video from Meta (formerly Facebook), unveiled in October 2022 and still in beta release, which first generates single-frame images from an input text. These images are then expanded across the spatial and temporal planes through extra neural network layers.

The single-frame images are created with an encoder pre-trained for the task, using images loosely coupled with text descriptions rather than full scripts, as well as completely unlabelled videos to develop knowledge of how objects of all kinds interact and move. The model then converts the input text into an image embedding, followed by decoding into a sequence of 16 frames at just 64x64 pixels. Unsupervised learning is then applied to the video data without text labels to develop an ML model capable of upsampling the generated low-frame-rate, low-resolution video to a higher frame rate and pixel density.
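To make the two kinds of upsampling concrete, the sketch below raises the frame rate and the pixel density of a synthetic 16-frame, 64x64 clip. It uses fixed arithmetic (averaging neighbouring frames and repeating pixels) purely as a stand-in for the learned networks described above, which are not reproduced here; the function names and the factor of 4 are assumptions.

```python
def interpolate_frames(video):
    """Insert the average of each neighbouring pair of frames between them."""
    out = []
    for current, following in zip(video, video[1:]):
        out.append(current)
        midpoint = [[(a + b) / 2 for a, b in zip(row_a, row_b)]
                    for row_a, row_b in zip(current, following)]
        out.append(midpoint)
    out.append(video[-1])
    return out


def upscale_frame(frame, factor):
    """Nearest-neighbour upscaling: repeat each pixel in both directions."""
    out = []
    for row in frame:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Synthetic clip: 16 frames of 64x64 pixels, brightness equal to frame index.
low_res = [[[float(t)] * 64 for _ in range(64)] for t in range(16)]

smooth = interpolate_frames(low_res)                  # higher frame rate
high_res = [upscale_frame(frame, 4) for frame in smooth]  # higher density

print(len(high_res), len(high_res[0]), len(high_res[0][0]))
```

A learned model would synthesize plausible new detail and motion rather than simply averaging or copying existing pixels, which is why it needs training on large volumes of video.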

Unsupervised learning involves training on pre-existing unlabelled data sets without preconceptions to identify classes or enable generation of related output from subsequent novel input data. This is valuable for training video generation models partly because it does not require large sets of text/video pairs for training at a computational complexity that challenges even the most powerful systems available today. It allows the different stages to be optimized independently, which makes it easier to achieve consistent high-quality results.
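As a toy illustration of learning from unlabelled data, the sketch below groups clips by a single made-up "motion" measurement using k-means clustering. This is a deliberately simple stand-in: k-means is just one classic unsupervised technique, not the method used by any video model mentioned here, and the data values and function name are invented for the example.

```python
def k_means_1d(values, k=2, iterations=20):
    """Group unlabelled numbers into k clusters around learned centroids."""
    ordered = sorted(values)
    # Start with centroids spread across the sorted data.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical per-clip motion scores, with no labels attached.
motion = [0.1, 5.0, 0.2, 5.2, 0.15, 4.8]
centroids, clusters = k_means_1d(motion)
print(centroids)
print(clusters)
```

No one told the algorithm which clips were "static" and which were "fast"; the two groups emerge from the data alone, which is the property that makes unlabelled video useful for training.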

Furthermore, at this stage the ambition is limited to generation of video from pre-existing text, ultimately perhaps a screenplay for a movie. A further stage lies in generating the screenplay from a source further removed, such as a novel, or even completely from scratch.

It is sometimes argued that AI cannot generate genuinely original stories because computers lack the emotional intelligence associated with empathy, hope and motivation. Yet, rather as psychopaths are supposed to be able to do, ML models can simulate emotions quite convincingly given sufficient training and power, so it is reasonable to assume that the creative frontier will be rolled back to some extent.

Meanwhile, ML is already being employed or at least evaluated for various subordinate roles in original content creation with potential for adding considerable value, as well as providing a glimmer of the future. One of these is creation of movie trailers by condensing the full feature length version, which IBM claimed to have done first as early as 2016 with its Watson AI research supercomputer.

Better known perhaps for its feats defeating humans in quiz shows, Watson created the trailer for the science fiction horror movie Morgan, made by 20th Century Fox, since renamed 20th Century Studios. This was more of a proof of concept, in which the model was first trained on over 100 other horror movie trailers to analyze individual scenes for sound, image and composition, and then set to work on the 90 minutes of the Morgan movie.

Human editors were still required to patch the individual scenes together, but IBM claimed that the overall process had been shortened from around 20 days to just 24 hours. Since then, further progress has been made in creating trailers, such that the model can also stitch the scenes together, almost eliminating the need for human intervention.
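The scene selection step can be pictured as scoring every scene on the three attributes mentioned above and keeping the best ones in running order. The sketch below is a simplified illustration of that idea only; the weights, scores and function names are hypothetical, and a trained model would learn its scoring from example trailers rather than use fixed weights.

```python
from typing import NamedTuple


class Scene(NamedTuple):
    start_s: int        # position in the movie, in seconds
    sound: float        # hypothetical model scores between 0 and 1
    image: float
    composition: float


# Hypothetical weights; a trained model would learn these from trailers.
WEIGHTS = {"sound": 0.4, "image": 0.4, "composition": 0.2}


def score(scene: Scene) -> float:
    """Weighted sum of the per-attribute scores."""
    return sum(weight * getattr(scene, name)
               for name, weight in WEIGHTS.items())


def pick_trailer_scenes(scenes, count):
    """Keep the highest-scoring scenes, then restore chronological order."""
    best = sorted(scenes, key=score, reverse=True)[:count]
    return sorted(best, key=lambda scene: scene.start_s)


scenes = [
    Scene(120, 0.2, 0.3, 0.5),
    Scene(900, 0.9, 0.8, 0.7),
    Scene(2400, 0.6, 0.9, 0.4),
    Scene(4800, 0.1, 0.2, 0.1),
]
selected = pick_trailer_scenes(scenes, 2)
print([scene.start_s for scene in selected])
```

Stitching the selected scenes into a finished trailer, with transitions and pacing, is the part that originally still needed human editors.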

There are various ways ML is becoming involved further upstream at the movie creation stage, and even in casting. Although ML may, strictly speaking, fail to emulate human emotions, it has already proved competent at identifying their surrogates written in facial expressions, evoked in tones of voice, or even implicit in written text.

Such capabilities are the subject of research in the subfield known as Emotional AI, which gained early application in marketing and product development, aiming to assess reactions of humans in product trials. It has even been tried in finance to predict stock market movements by analyzing emotional reactions of traders to unfolding events.

In movie making and entertainment content creation generally, emotional AI is being evaluated or used for scene selection from alternative clips, aiming to identify those likely to resonate most with audiences, or that fit the narrative most effectively. In such ways AI, or strictly ML, is increasingly infiltrating overall video workflow.

One other emerging application of ML in content creation lies in video and audio enhancement, which raises some qualitative and ethical issues. An old truism in casting, relevant especially for series or individual movies that span long periods of time, was that makeup could make actors look older far more easily than it could make them look younger. For some years now CGI has been applied with increasing success to break this rule by taking years off an actor’s appearance, extending as it were the life of ageing actors such as Robert De Niro. Yet the results have been criticized for being unrealistic, especially in cases where the actor never was genuinely young in the part, which has led to adoption of a lighter touch.

In one example, Disney+ created Obi-Wan Kenobi as an American television miniseries, as part of the Star Wars franchise. This stars Ewan McGregor as Obi-Wan Kenobi, reprising his role from the Star Wars prequel trilogy. This involved just light application of CGI de-aging, which was welcomed by some critics as being sufficient to demonstrate that some time had passed but not to take decades off the actor’s appearance, which would have been incongruous.

ML can be applied to enhance audio as well, filtering out unwanted background effects such as the siren of a passing emergency vehicle, in an audio equivalent of Photoshop. It can also be used to reconstitute voices from old recordings originally on physical media that had deteriorated, creating clean versions for re-archiving.
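The underlying idea of separating wanted from unwanted sound can be shown with a much simpler, non-ML technique: subtracting a single interfering tone of known frequency. In the sketch below, the "voice" and "siren" are pure sine waves and the frequencies, sample rate and function names are all assumptions; a real siren sweeps in pitch and overlaps with speech, which is precisely where learned models earn their place over a fixed filter like this one.

```python
import math


def tone(freq_hz, amplitude, sample_rate, count):
    """Generate `count` samples of a sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(count)]


def remove_tone(samples, sample_rate, freq_hz):
    """Estimate a steady tone at freq_hz by projection and subtract it."""
    n = len(samples)
    cos_ref = [math.cos(2 * math.pi * freq_hz * i / sample_rate)
               for i in range(n)]
    sin_ref = [math.sin(2 * math.pi * freq_hz * i / sample_rate)
               for i in range(n)]
    # Amplitudes of the cosine and sine parts of the unwanted tone.
    a = 2 / n * sum(s * c for s, c in zip(samples, cos_ref))
    b = 2 / n * sum(s * r for s, r in zip(samples, sin_ref))
    return [s - a * c - b * r
            for s, c, r in zip(samples, cos_ref, sin_ref)]


RATE = 8000                                # samples per second (assumed)
voice = tone(200, 1.0, RATE, RATE)         # stand-in for the wanted audio
siren = tone(1000, 0.5, RATE, RATE)        # stand-in for the interference
mixed = [v + s for v, s in zip(voice, siren)]

cleaned = remove_tone(mixed, RATE, 1000)
worst_error = max(abs(c - v) for c, v in zip(cleaned, voice))
print(worst_error < 1e-9)
```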

At a more advanced level, generative ML can be combined with voice synthesis to help edit the audio and reduce the need for retakes, an approach already used in tools for cleaning up podcasts, as noted earlier. In these cases, the tools do not just remove pauses and smooth over stutters, but can in principle generate words that fill in and enhance the narrative.
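The simplest of these clean-up operations, removing pauses, can be sketched without any ML at all by measuring the energy of short frames of audio and dropping long silent runs. The frame length, threshold and function name below are illustrative assumptions rather than the workings of any particular podcast tool; generating replacement words is a far harder problem that this sketch does not attempt.

```python
def remove_pauses(samples, frame_len, threshold, max_silent_frames):
    """Drop silent frames once a pause exceeds max_silent_frames in a row."""
    kept = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude is a simple measure of loudness.
        energy = sum(x * x for x in frame) / len(frame)
        if energy < threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                continue  # pause already long enough, skip this frame
        else:
            silent_run = 0
        kept.extend(frame)
    return kept


speech = [0.5, -0.5] * 50          # 100 samples of "speech"
pause = [0.0] * 500                # a long silence
recording = speech + pause + speech

trimmed = remove_pauses(recording, frame_len=100, threshold=0.01,
                        max_silent_frames=1)
print(len(recording), len(trimmed))
```

Leaving one silent frame in place keeps a natural breath between phrases instead of butting the words together.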

Audio and video generation can even in principle be applied to create content featuring actors who are dead, or who have not actually engaged directly in the making of the material. This leads to the deepfake angle, which has become an increasing concern for actors as it becomes possible to create digital avatars that simulate words and actions they have never performed and might not sanction.

Such concerns led the UK’s actors’ union Equity in April 2022 to launch a campaign called "Stop AI Stealing the Show", highlighting examples such as voice-overs in adverts clearly designed to resemble a given actor, or indeed another well-known celebrity, without explicitly naming them.

Such cases highlight the dichotomy posed by AI in content generation, as in other fields, between great opportunity but also unknown consequences and risks, some of which are already apparent.
