Big Data Retrieval

As the volume of file-based media grows, the requirement for metadata advances significantly. Simultaneously, the number of sources of metadata available expands as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is increasing exponentially, and no one person will be able to access all of the available data in a coherent way.

Traditionally, metadata consisted of a few scrawled notes on a piece of camera tape stuck to the side of a video tape box, or written on a piece of paper. Producers used to visit the tape librarian to find footage for their edit, and a stack of tapes would be given to the edit assistant, who would dutifully load them onto a trolley and wheel them to the edit suite, a process that could take many hours or even days.

As the volume of file-based media continues to grow, the requirement for metadata advances significantly. Browsing online, a producer can find library media, provide timecode for edit points and make the material available to an editor in a matter of minutes, sometimes even editing the programme themselves. The power of these search facilities comes from stored tags and descriptions derived from the metadata element of the media file. The number of sources of metadata available continues to grow as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is increasing exponentially, and no one person will be able to access all of the available data in a coherent way.

Workflows rarely respect metadata embedded in media files. For example, the camera information in a recorded file, such as f-stop, lens focal length, time and GPS location, can easily be stripped out during ingest to non-linear editing software. It’s not uncommon for a separate XML file to be created containing further metadata, which has to be kept in sync with the video and audio essence. Even metadata embedded in MXF doesn’t solve the problem, as a file’s metadata can change during an edit or transcode, and valuable original data could be lost or modified.

Retrieving the metadata and doing meaningful searches has its own challenges. Searching the media files alone can be time-consuming and cumbersome. Should you search the rushes or the edited master? The transmission transcode or the library archive? Data storage is only as effective as data retrieval, and searching takes time and requires local knowledge of the system.

Historic data can be stored in databases that are no longer supported and would require a highly competent programmer to extract any meaningful information, assuming the operating system was still available to retrieve the data in the first place. Metadata could be spread over many databases, from the media asset management system to the transmission logs, creating further problems with searching.

A media asset’s value can only be truly monetized if clients can find the material they’re looking for. As well as having robust storage systems, the tag and search information has to be there in the first place. Having a human sit and watch every piece of media to transcribe the dialogue is unrealistic at this scale.

As we have seen, the very act of ingesting, editing or transcoding a file can change its contents by stripping out and creating new metadata. An automated way of extracting and recording metadata has to be provided at each workflow node in the chain, thus removing the problem of losing information during processing. The system has to be scalable beyond our usual understanding of scalability; that is, we have to be able to parse file formats we haven’t yet designed.
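One way to picture extraction at every node, with room for formats that do not exist yet, is a registry of pluggable extractors feeding an append-only log. The following is a minimal sketch only: the names `register`, `MetadataLog` and the toy `key=value` text format are hypothetical and are not taken from any product mentioned in this article.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical, chosen for illustration only.
Extractor = Callable[[bytes], Dict[str, str]]

# Maps a file extension to the function that knows how to read it.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(extension: str):
    """Decorator that registers an extractor for one file extension."""
    def wrap(func: Extractor) -> Extractor:
        _EXTRACTORS[extension.lower()] = func
        return func
    return wrap


@register(".txt")
def extract_key_value(data: bytes) -> Dict[str, str]:
    """Toy extractor: reads 'key=value' lines from a text payload."""
    result: Dict[str, str] = {}
    for line in data.decode("utf-8").splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


@dataclass
class MetadataLog:
    """Append-only record of the metadata seen at each workflow node."""
    entries: List[dict] = field(default_factory=list)

    def record(self, asset_id: str, node: str, filename: str, data: bytes) -> dict:
        extension = os.path.splitext(filename)[1].lower()
        extractor = _EXTRACTORS.get(extension)
        # Unknown formats are still logged, with empty metadata, so a
        # newly registered extractor can be applied to them later.
        metadata = extractor(data) if extractor else {}
        entry = {"asset_id": asset_id, "node": node, "metadata": metadata}
        self.entries.append(entry)
        return entry

    def history(self, asset_id: str) -> List[dict]:
        """Return every snapshot recorded for one asset, oldest first."""
        return [e for e in self.entries if e["asset_id"] == asset_id]


def demo() -> List[dict]:
    log = MetadataLog()
    # Invented demo values: the camera data is present at ingest...
    log.record("clip-001", "ingest", "clip-001.txt", b"f_stop=2.8\ngps=40.71,-74.00")
    # ...and gone after the transcode, but the ingest snapshot survives.
    log.record("clip-001", "transcode", "clip-001.txt", b"codec=h264")
    return log.history("clip-001")
```

Because each node appends a new snapshot rather than overwriting the last one, the camera values recorded at ingest remain searchable after a later node has stripped them from the file. Supporting a new format means registering one more extractor, with no change to the log itself.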

Metadata storage and retrieval is much more involved than just providing access to XML files or embedded MXF data. Historic practices have seen users creating metadata-rich filenames in an attempt to overcome retrieval problems, an unsustainable practice as filenames are limited in length and cannot be easily parsed. The more information we can provide about a media asset, the more useful it is to us, and the more people will be willing to buy it.

Companies such as GrayMeta have been able to take the initiative with big data farming and retrieval. The software works across many systems: public cloud, private cloud and on-prem. It is also expandable through modularity, allowing it to deal with the future formats we haven’t yet designed.

Using big data systems, producers can now search archive media and rushes with unprecedented granularity to quickly find obscure and interesting shots. For instance, consider a scenario where a producer wants a shot of a cloudy day over New York at 8am. The camera would have recorded the time and GPS position of the shot, a publicly available weather system would be able to provide historical data about the weather conditions at that time, and the big data retrieval engine would be able to join and match all of this information and provide a link to the media. This fundamentally requires the camera’s data to still be available. As the metadata would have been recorded by the big data system at the point of ingest, the information created by the camera is maintained, even after the rushes have been edited or transcoded.
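The New York scenario is, at heart, a join between two data sets on place and time. The sketch below illustrates the idea with an in-memory SQLite database. The table layout, the function names and every row of data are invented for the example, and it assumes the camera's GPS position has already been resolved to a city name at ingest. A real retrieval engine would work very differently at scale.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding invented demo data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE shots (media_link TEXT, city TEXT, date TEXT, hour INTEGER);
        CREATE TABLE weather (city TEXT, date TEXT, hour INTEGER, condition TEXT);
    """)
    # Camera metadata captured at ingest (hypothetical values).
    conn.executemany("INSERT INTO shots VALUES (?, ?, ?, ?)", [
        ("media/ny_0001.mxf", "New York", "2018-03-01", 8),
        ("media/ny_0002.mxf", "New York", "2018-03-02", 8),
        ("media/ldn_0001.mxf", "London", "2018-03-02", 8),
        ("media/ny_0003.mxf", "New York", "2018-03-02", 17),
    ])
    # Historical weather from an external source (hypothetical values).
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", [
        ("New York", "2018-03-01", 8, "clear"),
        ("New York", "2018-03-02", 8, "cloudy"),
        ("London", "2018-03-02", 8, "cloudy"),
        ("New York", "2018-03-02", 17, "cloudy"),
    ])
    return conn


def find_shots(conn: sqlite3.Connection, city: str, hour: int, condition: str) -> list:
    """Join camera metadata with weather records on place and time."""
    query = """
        SELECT s.media_link
        FROM shots AS s
        JOIN weather AS w
          ON w.city = s.city AND w.date = s.date AND w.hour = s.hour
        WHERE s.city = ? AND s.hour = ? AND w.condition = ?
        ORDER BY s.media_link
    """
    return [row[0] for row in conn.execute(query, (city, hour, condition))]
```

Calling `find_shots(build_demo_db(), "New York", 8, "cloudy")` returns only the link for the one clip shot in New York at 8am on a day recorded as cloudy. The join only works because the time and location recorded by the camera were kept; without them there is nothing to match the weather records against.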

Future-proofing legacy systems will be a giant task. The traditional method has been to copy SQL databases by decoding their data and tables, reformatting to the new database design, and then transferring the data, hoping there are no corruptions, ambiguities or errors on the way. Another method is to leave the legacy database alone and build an API to extract information from it. This will work to a certain extent, but the database is generally heavily tied to the underlying operating system, which at some point will no longer be supported by the manufacturer, causing a major headache for the IT department, especially with security and compliance. It’s quite common for a large client to audit an IT system, and if they find unsupported operating systems, they will simply refuse to deal with you because of their concerns about security.

A single, easily accessible solution that can parse data from different sources, including scripts, rights databases, XML files and emails, then meaningfully join them together and make the search results available, is key to the library and archive systems of the future.
