Delivering Timing For Live Cloud Productions - Part 2
Highly accurate timing has always been required for video and audio signals due to their sampling requirements, and this timing has traditionally been delivered in hardware. But as we move to IP and cloud, we now have the opportunity to evaluate whether sub-nanosecond timing is really needed and to consider the alternatives.
Video Frame Referencing
If, as an alternative, we instead think in terms of frame referencing, then we have a system that is much easier to work with, is much more flexible, and delivers many more infrastructure options.
This can be achieved by timestamping each frame, but instead of using a frame synchronizer to align each video input to a common sync reference, we instead change the offset in the timestamp. The result is a system that is no longer reliant on clock-synchronous transport streams such as SDI and that can work in COTS IP infrastructures as well as public clouds.
One timing reference could be derived by synthesizing the optimal timing point. Assuming all the other video sources have a similar frame rate, they will be displaced temporally on the timeline relative to the reference input. Each video input timestamp would be normalized to match the nominated reference video and adjusted so that each frame aligns in time. Clearly we cannot move a video frame in time, but we can send it to a buffer where it is delayed to temporally match the nominated video source reference.
The playout engine will have visibility of the contents of each buffer and will be able to read out the appropriate frame at the designated timestamp. And by choosing the input that has the “oldest” timestamp, we can use the buffer to hold back the other inputs’ frames. In effect we’ve created a one-frame buffer, but instead of moving the video frame through the buffer’s memory, which is incredibly resource hungry and therefore inefficient, we change the pointer, or timestamp, so the playout engine knows where to retrieve the video frame from. This results in a highly efficient system, as we reduce the amount of data we move around in memory.
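As an illustration of this pointer-based approach, here is a minimal sketch in Python, with hypothetical names throughout, as the article does not prescribe an implementation. Frames are stored once, keyed by timestamp, and aligning an input to the reference is just an offset applied at read-out; the frame data itself never moves.

```python
# Minimal sketch of frame referencing: hypothetical names, timestamps
# assumed to be in nanoseconds. Illustrative only.

class FrameBuffer:
    """Per-input frame store: frames are written once and never copied."""

    def __init__(self):
        self.frames = {}   # arrival timestamp -> frame data
        self.offset = 0    # normalization offset relative to the reference

    def push(self, timestamp, frame):
        self.frames[timestamp] = frame

    def read(self, reference_ts, tolerance):
        """Return the frame whose normalized timestamp lands on the
        reference read-out point; only the lookup moves, not the data."""
        for ts, frame in self.frames.items():
            if abs((ts + self.offset) - reference_ts) <= tolerance:
                return frame
        return None


def align_to_reference(buffers, latest_ts):
    """Nominate the input holding the 'oldest' latest frame as the
    reference, then set every other input's offset so its frames are
    held back to line up with it. Re-timing is just pointer arithmetic."""
    reference = min(latest_ts, key=latest_ts.get)
    for name, buf in buffers.items():
        buf.offset = latest_ts[reference] - latest_ts[name]
    return reference
```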
The challenge is that the streams of video frames are not locked; they are asynchronous relative to each other. If we take one to be our reference, the other streams will create more or fewer frames than the reference over a given time period. However, this system assumes the frame rates are similar, which is a reasonable assumption in broadcast television. They are not identical, as they would be in an SPG (Sync Pulse Generator) locked infrastructure, but they are close enough. Figure 2 (shown in Part 1) shows that a standard off-the-shelf 74.25MHz oscillator with an accuracy of 150ppm, used as an HD clock source in a camera, exhibits a frame drop or frame duplication once every two minutes. Would a viewer at home see this? We know they don’t, because this is how a frame synchronizer works when used to synchronize an outside broadcast video feed for a studio. The adding and dropping of frames are processed differently for a contribution feed in the studio compared to the received signal at the consumer, and the CODECs employed are wonderful at smoothing frame anomalies so viewers don’t see them.
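As a back-of-envelope check on that two-minute figure, the short calculation below assumes two free-running sources whose 150ppm errors pull in opposite directions, giving a 300ppm worst-case relative drift:

```python
# Back-of-envelope check of the frame slip interval quoted above,
# assuming two free-running clocks whose 150ppm errors are in
# opposite directions (300ppm worst-case relative drift).

FRAME_RATE_HZ = 30000 / 1001            # 29.97 fps HD frame rate
PPM_PER_CLOCK = 150                     # off-the-shelf oscillator tolerance

relative_drift = 2 * PPM_PER_CLOCK * 1e-6   # 300ppm as a fraction
frame_period_s = 1 / FRAME_RATE_HZ          # ~33.37 ms

slip_interval_s = frame_period_s / relative_drift
print(f"One frame dropped or repeated every {slip_interval_s:.0f} s "
      f"(~{slip_interval_s / 60:.1f} minutes)")   # ~111 s, about 2 minutes
```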
Key to understanding this is remembering what we are trying to synchronize and why. In the IP world, we no longer have to frequency and phase lock color subcarriers or SDI transport stream clocks. Instead, we just need to maintain fluidity of motion, which can be achieved at the frame layer where tolerances are much more forgiving.
Operational Latency
Another area where we can think differently about time is human response times. When switching between video sources there will be some delay. In SDI broadcast facilities we didn’t give this much thought as all the signals and control equipment were relatively close and the propagation times of signal paths were very low. However, as we move to IP and internet operation, we cannot take these latencies for granted and must take a closer look at operational controls.
Figure 3 – Human response times are much longer than we may think. This provides the opportunity to integrate network and cloud processing latency into the workflow without any noticeable effects.
There is a tendency to over-use the word “instantaneous” in relation to switching response times, especially when controlling equipment such as production switchers. We like to see an “instantaneous” response when switching between video sources, but the switching response time has never been instant; there has always been some delay between switching an input on the production switcher and seeing the change on the program monitor.
Research has demonstrated that the average human takes about 240ms from the triggering of an event to recognizing the response. For example, as seen in Figure 3, if the director calls for the operator to switch from CAM1 to CAM2 on the program bus, the operator, on average, will not recognize the change in video output on the program monitor for about 240ms, or roughly eight frames of 30fps video.
The profound impact of considering human factors is the realization that it is, in fact, human factors that lead to the best time-base management of audio and video streams. Because these streams are buffered and carry timestamps, which can be as simple as the RTP timestamp marking the top of frame in SMPTE ST 2110, or as powerful as the PTP timestamps from which those RTP values are derived, it is possible to manage overall latency for a worldwide distributed system.
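To make that relationship concrete, here is a minimal sketch of deriving a frame’s RTP timestamp from PTP time; ST 2110 references its 90kHz video RTP clock to the PTP epoch. The names and values below are illustrative, not a conformant implementation.

```python
# Minimal sketch: an ST 2110-style RTP timestamp derived from PTP time.
# Video uses a 90 kHz RTP clock referenced to the PTP epoch; treat this
# as an illustration rather than a conformant implementation.

RTP_VIDEO_CLOCK_HZ = 90_000      # 90 kHz media clock for video
RTP_WRAP = 2 ** 32               # RTP timestamps are 32-bit and wrap

def rtp_timestamp(ptp_time_s: float) -> int:
    """Map PTP time (seconds since the PTP epoch) to a 32-bit RTP timestamp."""
    return round(ptp_time_s * RTP_VIDEO_CLOCK_HZ) % RTP_WRAP

# Two frames one 29.97 fps frame period apart differ by 3003 ticks,
# which is how a receiver can align streams at the frame layer.
t0 = 1_700_000_000.000           # arbitrary example PTP time
t1 = t0 + 1001 / 30_000          # one frame period later
print(rtp_timestamp(t1) - rtp_timestamp(t0))   # -> 3003
```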
This opens up a whole range of new possibilities for remotely controlling video processing equipment. If we reframe our definition of “instantaneous” not in terms of nanosecond timing but in terms of video frames, we can see that “instantaneous” now means around eight frames of 30fps video. Consequently, operational control over the internet can be achieved in many cases. In other words, the operational latency must take into consideration the actual expectation of the users, in this case the production switcher operator.
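Putting numbers on this, the sketch below works through an illustrative latency budget for remote control. The 240ms perception figure comes from the research cited above; the network and processing delays are assumptions chosen purely for illustration.

```python
# Illustrative latency budget for remote switcher control. The 240 ms
# perception figure is from the article; the network and processing
# delays below are hypothetical placeholders.

FRAME_RATE_HZ = 30                  # 30fps program output
PERCEPTION_MS = 240                 # average human recognition time

frame_ms = 1000 / FRAME_RATE_HZ     # ~33.3 ms per frame
network_rtt_ms = 60                 # assumed control-path round trip
processing_ms = 2 * frame_ms        # assumed switch + encode latency

budget_ms = PERCEPTION_MS - network_rtt_ms - processing_ms
print(f"Perception threshold: {PERCEPTION_MS / frame_ms:.1f} frames")
print(f"Headroom for extra delay: {budget_ms:.0f} ms "
      f"({budget_ms / frame_ms:.1f} frames)")
```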
Conclusion
Timing in live production environments has always been something we’ve strived to improve. Nanosecond timing and near-zero latencies have been assumed to be fundamental requirements, but as we’ve seen, these assumptions are rooted in historical technological constraints and folklore. When we question them, it becomes clear that there are simpler and more effective ways of working that achieve scalability, flexibility, and resilience, especially in the world of IP.