PTP Explained - Part 2 - Redundancy In Media Driven PTP Networks

In the first part of this four-part series we described the basic principles of the Precision Time Protocol. In part two, we investigate PTP redundancy, specifically for media networks.

PTP Explained - Part 1 - Network Architectures For Media Focused PTP Deployments

In part one we learned that a continuous bi-directional exchange of messages carrying time information between all nodes within a network can synchronize their local clocks with each other. With appropriate hardware support, accuracies well below 1 µs can be reached, making PTP ideally suited as a synchronization technology for the All-IP Studio, where all devices exchanging essence according to the SMPTE ST 2110 series of standards have to rely on an accurate and highly reliable common notion of time. Beyond accuracy, we therefore need to investigate the reliability and fault tolerance aspects of PTP in detail.

Autonomous Master Election In PTP

The Best Master Clock Algorithm (BMCA) is a key component of the PTP protocol. It allows all end nodes to (re-)select one common Master for the whole network at any time, i.e. the device with the most accurate (best) clock. This is done autonomously without any user interaction. The process is triggered whenever the currently active Master fails or a device with a better clock is connected to the network. As long as at least two devices are configured to be eligible Grandmasters, the network should always maintain a common time. Thus, PTP can cope with the loss of the Grandmaster very effectively.
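The election can be illustrated with a minimal sketch. It assumes a simplified comparison that ranks candidates only by the Grandmaster attributes carried in Announce messages (priority1, clockClass, clockAccuracy, offsetScaledLogVariance, priority2, clockIdentity, lower values winning) and leaves out the topology tie-break of the full algorithm. The class name, function name and all attribute values are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AnnouncedClock:
    """Grandmaster attributes as advertised in an Announce message."""
    priority1: int
    clock_class: int
    clock_accuracy: int
    offset_scaled_log_variance: int
    priority2: int
    clock_identity: bytes

    def rank(self):
        # Fields are compared in this order; the lower value wins.
        return (self.priority1, self.clock_class, self.clock_accuracy,
                self.offset_scaled_log_variance, self.priority2,
                self.clock_identity)


def best_master(candidates):
    """Return the clock every node would elect, or None without candidates."""
    return min(candidates, key=AnnouncedClock.rank, default=None)


# Two GNSS-locked Grandmasters (illustrative values); gm_a is preferred
# because of its lower priority2.
gm_a = AnnouncedClock(128, 6, 0x21, 0x4E5D, 128, bytes.fromhex("0011223344556677"))
gm_b = AnnouncedClock(128, 6, 0x21, 0x4E5D, 129, bytes.fromhex("0011223344556600"))
elected_normal = best_master([gm_a, gm_b])

# gm_a loses its reference and advertises a worse clockClass: gm_b takes over.
gm_a_holdover = replace(gm_a, clock_class=7)
elected_after_loss = best_master([gm_a_holdover, gm_b])
```

Because every node evaluates the same advertised attributes in the same order, all nodes arrive at the same choice without any coordination.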

Since a fail-over by the BMCA is triggered exclusively by the absence of Announce messages, PTP can only detect a complete loss of communication with the current Master; other, more subtle error conditions remain undetected unless they are taken care of separately. If PTP messages carrying time information are no longer generated or forwarded while Announce messages still arrive, all nodes will remain in their current state while their local clocks start to drift apart from each other. This is no hypothetical error condition. It can happen due to software or hardware problems within the Master and/or within the interconnecting network, regardless of whether the network is PTP aware or not.
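A Slave can guard against this case with a separate watchdog. The following is a minimal sketch, assuming a hypothetical MasterWatchdog class that tracks the arrival times of Announce and Sync messages; the interval values and the Sync threshold are assumptions chosen for illustration, not values mandated by a profile.

```python
from dataclasses import dataclass


@dataclass
class MasterWatchdog:
    announce_interval_s: float = 0.25   # assumed Announce interval
    announce_receipt_timeout: int = 3   # intervals without Announce before fail-over
    sync_interval_s: float = 0.125      # assumed Sync interval
    sync_timeout_intervals: int = 8     # hypothetical local threshold
    last_announce: float = 0.0
    last_sync: float = 0.0

    def on_announce(self, now):
        self.last_announce = now

    def on_sync(self, now):
        self.last_sync = now

    def status(self, now):
        announce_limit = self.announce_receipt_timeout * self.announce_interval_s
        sync_limit = self.sync_timeout_intervals * self.sync_interval_s
        if now - self.last_announce > announce_limit:
            return "announce_timeout"   # the BMCA would elect a new Master
        if now - self.last_sync > sync_limit:
            return "sync_stalled"       # invisible to the BMCA, clock free-runs
        return "ok"


# Announce messages keep arriving every 0.25 s, but Sync stops after t = 2 s.
wd = MasterWatchdog()
for i in range(21):
    now = i * 0.25
    wd.on_announce(now)
    if now <= 2.0:
        wd.on_sync(now)

state_during_stall = wd.status(5.0)    # "sync_stalled"
state_after_silence = wd.status(6.0)   # "announce_timeout"
```

The "sync_stalled" state is the one a basic implementation would miss: the Master still looks alive, yet no usable time information reaches the Slave.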

It should further be noted that certain error conditions are undetectable within a basic PTP implementation. One example is a deterioration in the quality of the time information, caused either by erratic behaviour of PTP aware network devices or simply by overloading PTP unaware network devices. Finally, PTP can be subjected to malicious attacks, either by tampering with the packets on their transmission paths or by tampering with the time reference itself, for example via GPS spoofing. The latter is a generic attack scheme which is independent of PTP and may affect any time distribution system relying on GNSS as a reference.

Diagram 1 – Only one Grandmaster is the timing source for the network, but if its time source quality deteriorates, the second Master clock becomes Grandmaster

Improved PTP Fault Tolerance

A basic first step to improve the fault tolerance of PTP is to add several PTP Grandmasters to a network. Ideally, these devices are not co-located, so that the network as a whole can cope at least partially with localized outages. Furthermore, this approach can improve the resilience of the network against some attacks, such as attempts to jam the GNSS signals. To further improve fault tolerance, PTP time information can be provided via more than one network path. This can be done either by providing sufficient redundancy within the interconnecting network to cope with single network failures while still maintaining only one network connection at every end node, or by actually building a fully redundant network infrastructure.

PTP In Redundant Networks

Providing more than one physical path for receiving data eliminates several shortcomings a network infrastructure can suffer from. A broken network path caused by, for example, a faulty network connection can be detected and remedied reliably by lower layer network protocols like the Rapid Spanning Tree Protocol (RSTP). However, this is always associated with a transient communication loss. This is tolerated by PTP but may not be acceptable for transmitting media. Therefore, the complete interconnecting network has to be doubled and every end node has to be equipped with two physical network ports. The same data is sent over both channels, while every receiver uses only one copy of each data packet (i.e. the first one to arrive) and discards the other one. For applications where zero packet loss is a crucial requirement, this scheme is a viable and effective solution to counteract single-fault conditions. The full network redundancy scheme is specified in the SMPTE ST 2022-7 standard.
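The "first copy wins" rule can be sketched as follows. This is a simplified illustration, not the ST 2022-7 receiver model: it assumes packets are identified by a plain sequence number, ignores sequence number wrap-around, and uses a hypothetical SeamlessReceiver class with an arbitrarily chosen history window.

```python
from collections import deque


class SeamlessReceiver:
    """Accepts the first copy of each packet and drops later duplicates."""

    def __init__(self, window=1024):
        self._seen = set()
        self._order = deque()
        self._window = window   # how many sequence numbers to remember

    def accept(self, seq):
        """Return True if this is the first copy of packet `seq`."""
        if seq in self._seen:
            return False
        self._seen.add(seq)
        self._order.append(seq)
        if len(self._order) > self._window:
            # Forget the oldest sequence number to bound memory use.
            self._seen.discard(self._order.popleft())
        return True


# Arrival order on the two networks; packet 2 is lost on network A.
arrivals = [(1, "A"), (1, "B"), (2, "B"), (3, "A"), (3, "B")]
rx = SeamlessReceiver()
delivered = [(seq, path) for seq, path in arrivals if rx.accept(seq)]
# delivered == [(1, "A"), (2, "B"), (3, "A")]
```

The output stream is complete even though one network dropped a packet, and the receiver never needs to know which path a given packet took.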

This mechanism works perfectly well for packets and protocols where the transmission time is of no further relevance (unless it exceeds certain absolute bounds). In the case of PTP, however, the transmission time is absolutely crucial for obtaining a high level of accuracy. Therefore, a solution where the decision whether to use data from network A or B is made on a per-packet basis is by no means applicable to PTP. It simply cannot be assumed that the absolute delay via network A is identical to the one via network B. Even when identical infrastructure is deployed for both networks (which in some cases is avoided deliberately to limit the effects of network induced errors), this assumption cannot be made, because the two distinct networks may operate under different conditions and in different states.
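A short numerical sketch shows the effect. It assumes a Slave that is perfectly aligned with the Master and has measured its mean path delay on network A, and then computes the offset as (t2 - t1) minus that delay for Sync messages arriving alternately via A and B. The delay figures are hypothetical.

```python
DELAY_A_NS = 10_000   # hypothetical one-way delay through network A
DELAY_B_NS = 25_000   # hypothetical one-way delay through network B
TRUE_OFFSET_NS = 0    # the Slave clock is actually in perfect alignment


def offset_from_master(t1_ns, t2_ns, mean_path_delay_ns):
    """Offset as the Slave computes it from one Sync message."""
    return (t2_ns - t1_ns) - mean_path_delay_ns


def apparent_offsets(paths, measured_delay_ns):
    """Offsets seen by a Slave that takes each Sync from the listed path."""
    delays = {"A": DELAY_A_NS, "B": DELAY_B_NS}
    offsets = []
    for i, path in enumerate(paths):
        t1 = i * 125_000_000                       # Sync sent every 125 ms
        t2 = t1 + delays[path] + TRUE_OFFSET_NS    # arrival on the Slave clock
        offsets.append(offset_from_master(t1, t2, measured_delay_ns))
    return offsets


# Path delay was measured on A, but Sync messages alternate between A and B.
per_packet_switching = apparent_offsets(["A", "B", "A", "B"], DELAY_A_NS)
# per_packet_switching == [0, 15000, 0, 15000]
```

Although the clock never moved, the Slave sees its offset jump by 15 µs with every second packet, which is far outside the sub-microsecond accuracy PTP is expected to deliver.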

As every end node has to maintain one single notion of time, it cannot simply process the time information on both channels independently by running distinct and separate instances of the PTP stack.

Special care has to be taken over how the various PTP Grandmasters are attached to the two networks: should every Grandmaster be connected to both networks via separate ports, or should separate Grandmasters be provided for each network? The two solutions result in distinctly different behaviour of the BMCA, so there is no guarantee that all PTP Slaves select the same PTP Grandmaster. Fortunately, this shortcoming can be easily mitigated by forcing all nodes to use time information only from Grandmasters (GMs) connected to a traceable time source like GPS.
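Such a restriction could be expressed as a simple acceptance filter on received Announce messages. The sketch below assumes a policy that accepts only Grandmasters advertising clockClass 6 (synchronized to a primary reference) together with the timeTraceable flag; the Announce class, the function name and the policy itself are assumptions, and a real deployment might, for instance, also accept Grandmasters in holdover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announce:
    grandmaster_id: str    # hypothetical label standing in for the clock identity
    clock_class: int
    time_traceable: bool


def acceptable(msg):
    """Assumed policy: only GMs locked to a traceable primary reference."""
    return msg.time_traceable and msg.clock_class == 6


received = [
    Announce("gm-network-a", 6, True),      # locked to GNSS
    Announce("gm-network-b", 6, True),      # locked to GNSS
    Announce("gm-free-running", 248, False),
]
usable = [msg.grandmaster_id for msg in received if acceptable(msg)]
# usable == ["gm-network-a", "gm-network-b"]
```

If both networks only ever offer Grandmasters that pass this filter, it matters far less which of them a given Slave selects, because all of them trace back to the same reference time.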

It has to be clearly stated that parallel redundancy is out of scope of IEEE 1588-2008, and neither ST 2059-2 nor ST 2110-10 covers this use case. Therefore, it is left to implementers to devise the most effective solution, which in turn may have significant implications for the roll-out and interconnection of PTP Grandmasters and may thus jeopardize the interoperability of PTP within ST 2022-7 networks. Finally, it has to be pointed out that the complexity of this problem is not significantly reduced by using PTP aware network devices; quite the contrary. Transparent Clocks (TCs) do account for varying forwarding times; however, the transmission time via the physical channel may still be distinctly different in networks A and B. Boundary Clocks (BCs), on the other hand, may select different Masters on either network, depending on topology and network state.

Possible Solutions

To make best use of redundant ST 2022-7 networks, the implementation of the PTP stack on the Slaves should be improved. Such a Slave has to be capable of running two independent instances of the PTP stack, one for each network interface. However, every end node has only one local hardware clock, which the PTP stack strives to align with the Master. If both instances adjusted it, they would interfere with each other, causing an effect similar to switching back and forth on a per-packet basis. Therefore, only one of the two PTP stack instances should adjust the clock while the other merely processes PTP messages. Switching back and forth between the two networks can be done “manually”, i.e. triggered simply by a permanent loss of data or link on either network. The switching criterion could be enhanced by reacting to a variety of operating conditions on both networks, for example missing PTP event messages or better transmission quality on the stand-by network.
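The division of labour between the two instances can be sketched as follows. This is one possible arrangement under stated assumptions, not a prescribed design: the RedundantSlave class is hypothetical, the clock servo is reduced to recording the correction it would apply, and switching is non-revertive to avoid flapping between networks.

```python
class RedundantSlave:
    """Two PTP instances share one clock; only the active one steers it."""

    def __init__(self):
        self.healthy = {"A": True, "B": True}
        self.last_offset_ns = {"A": None, "B": None}
        self.active = "A"
        self.clock_corrections = []   # stand-in for a real clock servo

    def on_offset(self, network, offset_ns):
        # Both instances process messages and keep their measurement current...
        self.last_offset_ns[network] = offset_ns
        # ...but only the active instance is allowed to adjust the clock.
        if network == self.active:
            self.clock_corrections.append(-offset_ns)

    def on_health(self, network, is_healthy):
        self.healthy[network] = is_healthy
        if network == self.active and not is_healthy:
            standby = "B" if network == "A" else "A"
            if self.healthy[standby]:
                self.active = standby   # non-revertive: stays until it fails


slave = RedundantSlave()
slave.on_offset("A", 120)     # applied
slave.on_offset("B", 90)      # recorded only
slave.on_health("A", False)   # link or data lost on network A
slave.on_offset("B", 80)      # applied, B is now active
# slave.clock_corrections == [-120, -80]
```

Because the stand-by instance keeps tracking its own offset, it is ready to take over immediately, and its measurements could also feed a more elaborate switching criterion.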

The concept of redundant time transfer can be enhanced even further by providing multiple active time sources to every end node. By making use of the concept of PTP domains, a Slave can concurrently receive time information from multiple Masters that operate independently of each other with respect to the BMCA. Normally, all PTP devices have to operate in a single PTP domain, which has to be configured manually or set according to the rules of a specific PTP profile. If the PTP stack is extended to accept data from multiple domains, it can process time information from different sources and even compare their respective qualities. Using any of a number of different approaches, Slaves can combine these feeds to maintain a highly reliable and accurate notion of time. This approach can address Byzantine faults as well, i.e. discard the data from Grandmasters sending false time information as a result of having been tampered with.
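One of the many possible combining strategies is a median-based vote. The sketch below assumes that the Slave already holds one offset measurement per domain; it rejects every source that deviates from the median by more than a tolerance and averages the rest. The function name, the domain numbers and the tolerance are illustrative assumptions.

```python
from statistics import median


def combine_offsets(offsets_ns, tolerance_ns=1_000):
    """Combine per-domain offsets; return (combined offset, rejected domains)."""
    if not offsets_ns:
        raise ValueError("no time sources available")
    mid = median(offsets_ns.values())
    trusted = {domain: off for domain, off in offsets_ns.items()
               if abs(off - mid) <= tolerance_ns}
    if not trusted:
        raise ValueError("time sources do not agree within the tolerance")
    rejected = sorted(set(offsets_ns) - set(trusted))
    combined = sum(trusted.values()) / len(trusted)
    return combined, rejected


# Three domains; the Grandmaster of domain 2 reports a time that is 5 ms off.
measurements = {0: 120, 1: 80, 2: 5_000_000}
combined_ns, rejected_domains = combine_offsets(measurements)
# combined_ns == 100.0, rejected_domains == [2]
```

With three independent sources, a single faulty or manipulated Grandmaster is outvoted by the two that agree, which a single-domain Slave following the BMCA alone could not achieve.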

Conclusions

Inherently, PTP offers a certain degree of fault tolerance, being able to cope with the loss of the time reference without any user interaction. Its resilience against a range of faults and malicious attacks can be improved significantly in various ways without sacrificing IEEE 1588 standard compliance. It is equally important to continuously monitor time transfer. How to accomplish this with minimal effort and interference will be covered in part three of this series.

Other related articles posted on The Broadcast Bridge.

PTP Explained - Part 1 - Network Architectures For Media Focused PTP Deployments
