Big Data Retrieval

Just having metadata is not sufficient. One needs to be able to effectively search it.

As the volume of file-based media grows, the requirement for metadata advances significantly. Simultaneously, the number of sources of metadata available expands as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is growing so quickly that no one person will be able to access all of the available data in a coherent way.

Traditionally, metadata consisted of a few scrawled notes on a piece of camera tape stuck to the side of a video tape box, or written on a piece of paper. Producers used to visit the tape librarian to find footage for their edit, and a stack of tapes would be given to the edit assistant, who would dutifully load them onto a trolley and wheel them to the edit suite, a process that could take many hours or even days.

As the volume of file-based media continues to grow, the requirement for metadata advances significantly. Browsing online, a producer can find library media, provide timecode for edit points and make the material available to an editor in a matter of minutes, sometimes even editing the programme themselves. The power of these search facilities comes from stored tags and descriptions derived from the metadata element of the media file. The number of sources of metadata available continues to grow as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is growing so quickly that no one person will be able to access all of the available data in a coherent way.

Workflows rarely respect metadata embedded in media files. For example, the camera information in a recorded file, such as f-stop, lens focal length, time and GPS location, can easily be stripped out during ingest to non-linear editing software. It’s not uncommon for a separate XML file to be created containing further metadata, which then has to be kept in sync with the video and audio essence. Even embedded MXF doesn’t solve the problem, as a file’s metadata can change during an edit or transcode, and valuable original data could be lost or modified.
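One way to notice when a sidecar file has drifted out of sync with its essence is to stamp the sidecar with a digest of the media it describes. The following is a minimal sketch only: the `sidecar` and `item` element names, the `essence_sha256` attribute and the sample camera fields are all invented for illustration and do not come from any standard or product.

```python
import hashlib
import xml.etree.ElementTree as ET


def essence_digest(essence: bytes) -> str:
    """Return a SHA-256 hex digest of the raw video/audio essence."""
    return hashlib.sha256(essence).hexdigest()


def build_sidecar(essence: bytes, camera: dict) -> str:
    """Serialise camera metadata to XML, stamped with the essence digest."""
    # Hypothetical layout: one <item name="..."> element per metadata field.
    root = ET.Element("sidecar", {"essence_sha256": essence_digest(essence)})
    for key, value in camera.items():
        ET.SubElement(root, "item", {"name": key}).text = str(value)
    return ET.tostring(root, encoding="unicode")


def sidecar_matches(essence: bytes, sidecar_xml: str) -> bool:
    """True if the sidecar was written for exactly these essence bytes."""
    root = ET.fromstring(sidecar_xml)
    return root.get("essence_sha256") == essence_digest(essence)


# Stand-in bytes; a real workflow would read the media file from storage.
rushes = b"\x00\x01 stand-in bytes for a media file"
sidecar = build_sidecar(rushes, {"f_stop": 2.8, "focal_length_mm": 35})
```

Here `sidecar_matches(rushes, sidecar)` is true, while any edit or transcode that changes the bytes makes it false. That flags that the sidecar describes a different rendition of the media, although it cannot by itself say which metadata is still valid.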

Retrieving the metadata and running meaningful searches has its own challenges. Searching the media files alone can be time-consuming and cumbersome. Should you search the rushes or the edited master? The transmission transcode or the library archive? Data storage is only as effective as data retrieval, and searching takes time and requires local knowledge of the system.

Historic data can be stored in databases that are no longer supported and would require a highly competent programmer to extract any meaningful information, assuming the operating system is still available to retrieve the data in the first place. Metadata could be spread over many databases, from the media asset management system to the transmission logs, creating further problems with searching.

A media asset’s value can only be truly monetized if clients can find the material they’re looking for. As well as having robust storage systems, the tag and search information has to be there in the first place. Having a person sit and watch every piece of media to transcribe the dialogue is unrealistic.

As we have seen, the very act of ingesting, editing or transcoding a file can change its contents by stripping out and creating new metadata. An automated way of extracting and recording metadata has to be provided at each workflow node in the chain, so that information is no longer lost during processing. The system has to be scalable beyond our usual understanding of scalability; that is, we have to be able to parse file formats we haven’t yet designed.
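As a rough illustration of capturing metadata at every node while staying open to new formats, the sketch below keeps a registry of extractor functions and an append-only log of snapshots. The names `EXTRACTORS`, `extractor` and `capture`, the record layout and the sample payloads are all hypothetical, and the JSON "parser" is only a stand-in for a real media parser.

```python
import json
import os
from typing import Callable, Dict, List

# Registry of extractors keyed by file extension. A new format is supported
# by registering another function, not by changing the capture code.
EXTRACTORS: Dict[str, Callable[[bytes], dict]] = {}


def extractor(extension: str):
    """Decorator that registers a metadata extractor for one extension."""
    def register(func):
        EXTRACTORS[extension.lower()] = func
        return func
    return register


@extractor(".json")
def extract_json(payload: bytes) -> dict:
    # Stand-in parser: here the "file" is simply JSON-encoded metadata.
    return json.loads(payload.decode("utf-8"))


def capture(log: List[dict], asset_id: str, node: str,
            filename: str, payload: bytes) -> dict:
    """Append a metadata snapshot for one workflow node; never overwrite."""
    ext = os.path.splitext(filename)[1].lower()
    parse = EXTRACTORS.get(ext)
    record = {
        "asset_id": asset_id,
        "node": node,
        "format": ext,
        "parsed": parse is not None,  # unknown formats are still logged
        "metadata": parse(payload) if parse else {},
    }
    log.append(record)
    return record


log: List[dict] = []
capture(log, "clip-001", "ingest", "clip-001.json",
        b'{"f_stop": 2.8, "gps": [40.71, -74.01]}')
capture(log, "clip-001", "transcode", "clip-001.xyz", b"unknown future format")
```

Because each node appends rather than overwrites, the camera data recorded at ingest is still in the log after the transcode. The unrecognised `.xyz` file is logged with `parsed` set to false, so it can be reprocessed once an extractor for that format is registered.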

Metadata storage and retrieval is much more involved than just providing access to XML files or embedded MXF data. Historically, users have created metadata-rich filenames in an attempt to overcome retrieval problems, an unsustainable practice as filenames are limited in length and cannot be easily parsed. The more information we can provide about a media asset, the more useful it is to us, and the more people will be willing to buy it.

Companies such as GrayMeta have taken the initiative with big data farming and retrieval. The software works across many systems: public cloud, private cloud and on-prem. It is also expandable through modularity, so that it can deal with future formats we haven’t yet designed.

Using big data systems, producers can now search archive media and rushes with unprecedented granularity, to quickly find obscure and interesting shots. For instance, consider a scenario where a producer wants a shot of a cloudy day over New York at 8am. The camera would have recorded the time and GPS position of the shot, a publicly available weather system would be able to provide historical data about the weather conditions at that time, and the big data retrieval engine would be able to join and match all of this information and provide a link to the media. This fundamentally requires the camera’s data to still be available. As the metadata would have been recorded by the big data system at the point of ingest, the information created by the camera is maintained, even after the rushes have been edited or transcoded.
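The join described in the New York example can be sketched in a few lines. Everything here is invented sample data: the clip records, the weather table, the place coordinates, the function names and the 25 km search radius are assumptions for illustration, not the behaviour of any particular retrieval engine or weather service.

```python
from datetime import date, datetime
from math import asin, cos, radians, sin, sqrt

# Hypothetical camera metadata captured at the point of ingest.
CLIPS = [
    {"id": "clip-001", "shot_at": datetime(2016, 3, 14, 8, 5), "lat": 40.758, "lon": -73.985},
    {"id": "clip-002", "shot_at": datetime(2016, 3, 15, 8, 10), "lat": 40.730, "lon": -73.997},
    {"id": "clip-003", "shot_at": datetime(2016, 3, 14, 8, 20), "lat": 51.507, "lon": -0.128},
]

# Hypothetical historical observations: (place, date, hour) -> condition.
WEATHER = {
    ("New York", date(2016, 3, 14), 8): "cloudy",
    ("New York", date(2016, 3, 15), 8): "clear",
    ("London", date(2016, 3, 14), 8): "cloudy",
}

PLACES = {"New York": (40.7128, -74.0060), "London": (51.5074, -0.1278)}


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def find_shots(place, hour, condition, radius_km=25.0):
    """Join camera time and GPS with weather history for one place."""
    centre_lat, centre_lon = PLACES[place]
    matches = []
    for clip in CLIPS:
        if clip["shot_at"].hour != hour:
            continue
        if distance_km(clip["lat"], clip["lon"], centre_lat, centre_lon) > radius_km:
            continue
        observed = WEATHER.get((place, clip["shot_at"].date(), hour))
        if observed == condition:
            matches.append(clip["id"])
    return matches


cloudy_ny = find_shots("New York", 8, "cloudy")
```

With this sample data only `clip-001` matches: `clip-002` was shot in New York on a clear morning and `clip-003` was shot in London. Neither the time nor the GPS filter would work if the camera data had been stripped out before it was recorded.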

Future-proofing legacy systems will be a giant task. The traditional method has been to copy SQL databases by decoding their data and tables, reformatting to the new database design, and then transferring the data, hoping there are no corruptions, ambiguities or errors on the way. Another method is to leave the legacy database alone and build an API to extract information from it. This will work to a certain extent, but the database is generally heavily tied to the underlying operating system, which at some point will no longer be supported by the manufacturer, causing a major headache for the IT department, especially with security and compliance. It’s quite common for a large client to audit an IT system, and if they find unsupported operating systems, they will simply refuse to deal with you because of their concerns about security.
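The second method, leaving the legacy database in place and wrapping it in an API, might look something like the sketch below. An in-memory SQLite database stands in for the legacy system, and the `TAPELIB` table, its column names, the sample row and the 25 frames-per-second default are all invented for the example.

```python
import sqlite3


def make_legacy_db() -> sqlite3.Connection:
    """Build a stand-in for a legacy tape-library database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TAPELIB (TAPE_NO TEXT, TTL TEXT, DUR_FRAMES INTEGER)")
    conn.execute("INSERT INTO TAPELIB VALUES ('T0042', 'Harbour at dawn', 1500)")
    conn.commit()
    return conn


def find_assets(conn: sqlite3.Connection, title_contains: str, fps: int = 25) -> list:
    """Read-only query that maps legacy columns to neutral field names."""
    rows = conn.execute(
        "SELECT TAPE_NO, TTL, DUR_FRAMES FROM TAPELIB WHERE TTL LIKE ?",
        (f"%{title_contains}%",),  # parameterised to avoid SQL injection
    ).fetchall()
    return [
        {"asset_id": tape_no, "title": title, "duration_s": frames / fps}
        for tape_no, title, frames in rows
    ]


legacy = make_legacy_db()
results = find_assets(legacy, "dawn")
```

Callers only ever see the neutral field names, so the cryptic legacy schema stays hidden behind the adapter. It does nothing, however, to address the unsupported operating system underneath, which is the limitation described above.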

A single solution that is easily accessible, can parse data from different sources including scripts, rights databases, XML files and emails, can meaningfully join them together, and can make the search results available is key to the library and archive systems of the future.

