Troubleshooting IP-centric technologies can be a new and challenging undertaking for engineers. Often it becomes a case of "you don't know what you don't know" until it is too late. Consultant Gary Olson describes his experience in helping staffs troubleshoot new IP systems. Perhaps what he learned can save you some pain if similar issues find their way into your systems.
One of my current projects continues to be a rich source of the challenges that come along with any new generation of technology.
What are the steps or thought processes to diagnose a problem when the server cannot be reached from any remote system and the application appears frozen? Here are a few of my experiences.
Problem: The remote monitoring system sent out an alert that the Asset Library Database Server was not reachable. Air was not yet in jeopardy. However, assets in the look-ahead playlists were not accessible to the playout server, and acquisition and archive were offline, because this was the asset management database server. The server is connected to a KVM matrix, is also accessible by Remote Desktop, and is enabled with VNC and TeamViewer, which have been used to provide vendor access for updates and maintenance. Only one small problem: communication to the server over any of these was not working. The server has two network connections, one for media and one for management.
Troubleshooting the problem. Diagnosis: We needed to have a real person go to the server room and physically connect a display to the device. Once a display was connected directly to the server, it appeared to be hung on a blue screen with an internal battery error dialog box open. How many maintenance departments keep spare internal server batteries in stock? How many technicians know that servers actually have an internal battery that can fail?
Solution: After running diagnostics and a few reboots, it seemed the error was a “false positive”, meaning that the battery was supposedly fine, but the server still froze and it took a few reboots to clear the error. Note to self: How many servers of the same type, with the same battery, are in the system? GET SPARES!
Then of course the vendor had to remote in and recover the database and re-establish connections to the different asset storage locations once the database was restored.
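As a rough illustration of the first diagnostic step in a case like this, the sketch below probes the remote-access ports on both of the server's network connections and offers a first guess about where to look. The IP addresses are hypothetical placeholders, and the ports are only the usual defaults for RDP (3389) and VNC (5900); a site may use different ones. It is a sketch under those assumptions, not a replacement for a monitoring system.

```python
import socket

# Hypothetical addresses: substitute the server's real management and media IPs.
INTERFACES = {"management": "10.0.10.21", "media": "10.0.20.21"}
# Default ports for RDP and VNC; your installation may differ.
SERVICES = {"RDP": 3389, "VNC": 5900}


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def survey(interfaces, services, probe_fn=probe):
    """Probe every service on every interface: {(interface, service): bool}."""
    return {
        (iface, name): probe_fn(addr, port)
        for iface, addr in interfaces.items()
        for name, port in services.items()
    }


def interpret(results):
    """Turn the raw results into a first guess about where to look."""
    if not any(results.values()):
        return "nothing answers on any interface: suspect the host itself"
    ifaces = {iface for iface, _ in results}
    silent = sorted(
        i for i in ifaces
        if not any(ok for (iface, _), ok in results.items() if iface == i)
    )
    if silent:
        return "no answer on " + ", ".join(silent) + ": suspect that network path or port"
    failed = sorted(f"{svc} on {iface}" for (iface, svc), ok in results.items() if not ok)
    if failed:
        return "host is up, but check these services: " + ", ".join(failed)
    return "all probed services answer"
```

Calling `interpret(survey(INTERFACES, SERVICES))` from a machine on the same network gives the one-line summary. When every port on both connections is silent, as it was here, the network is the less likely culprit and a walk to the server room is the next step. A successful TCP connection only shows that something is listening; it does not prove the application behind it is healthy.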
Troubleshooting IP systems requires both system knowledge and the proper tools. If either is missing, the odds for a winning game plan are small.
IP Adds A New Perspective in Troubleshooting and Maintenance
On the same project we are having a discussion about spares. What are considered reasonable spares for file-based and IP technology? Common hard drives? As storage continues to improve in speed and capacity, if a drive fails in a RAID stack do you replace it with the original, upgrade to more capacity for the same price, or move to SSD and higher throughput speeds? Will the server or storage technology support mixing capacity and format?
Here’s another one: the KVM server dongle failed at the same time as one of its cables. Chasing that was entertaining - NOT.
Here is a better question. If software is pre-loaded onto a server and considered an appliance, and if the software is supported by the SLA but the hardware is not, what happens if the hardware fails? Is the vendor responsible for migrating their software onto the new hardware if the license is current? Or is it considered a proprietary appliance, so that the entire server and application need to be replaced?
Diagnosing problems is a bit more challenging when it’s about applications communicating with each other. And where are the configuration settings hidden?
We are investigating a network performance issue and need to confirm the servers are communicating on the correct VLANs. There’s the management port and the media port on the server, and a management VLAN and media VLAN on the network. Easy, right?
Not so much. The ports on the device are unlabeled and there is a lot of traffic on both VLANs. A simple request was sent to the vendor asking which configuration file in which server should be examined. In this case the device is an automation system with multiple servers and multiple applications controlling multiple servers. My request to the vendor first resulted in a head scratch and then a suggestion to look in the hosts file in the server’s OS.
Did the user manual have this info? NOPE. And is there any documentation showing the interrelationships and interdependencies between the servers and applications? NOPE. But do I have line drawings, because they would be helpful? No, I don't think so.
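Following the vendor's suggestion, one way to make the hosts file useful is to list each name it defines alongside the VLAN its address falls in. The sketch below assumes the standard hosts file layout (an IP address followed by one or more names, with `#` starting a comment). The subnet ranges and host names are hypothetical placeholders; only the site's own addressing plan can say which range belongs to which VLAN.

```python
import ipaddress

# Hypothetical subnets: use the ranges actually assigned to your VLANs.
VLANS = {
    "management": ipaddress.ip_network("10.0.10.0/24"),
    "media": ipaddress.ip_network("10.0.20.0/24"),
}

# Made-up sample content standing in for a real hosts file.
SAMPLE = """
127.0.0.1   localhost
10.0.10.21  assetdb-mgmt   # management NIC
10.0.20.21  assetdb-media
10.0.20.30  playout1
"""


def parse_hosts(text):
    """Return a list of (address, [names]) from hosts-file text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, or an address with no name
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a valid IP address; skip the line
        entries.append((addr, fields[1:]))
    return entries


def vlan_for(addr, vlans=VLANS):
    """Name of the first VLAN whose subnet contains addr, else 'unknown'."""
    for name, net in vlans.items():
        if addr in net:
            return name
    return "unknown"


def report(text, vlans=VLANS):
    """One (hostname, address, vlan) row per name in the hosts file."""
    return [
        (name, str(addr), vlan_for(addr, vlans))
        for addr, names in parse_hosts(text)
        for name in names
    ]
```

On a real server the text would come from the hosts file itself (`/etc/hosts` on Linux, `C:\Windows\System32\drivers\etc\hosts` on Windows). A name that an application uses for media transfer but that resolves into the management range is exactly the kind of mismatch worth chasing. The hosts file only shows static entries; names resolved through DNS will not appear in it.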
Something as simple as an application workflow diagram can greatly help in troubleshooting an IP system. Does your operation have one? Image: Imagine Communications.
Cross Discipline Knowledge
In this example, we have an automation system communicating with an SDI router, tape machines, servers for ingest and playout, storage and an asset manager. There are APIs between vendors and middleware to handle file transfer. They are all on a mesh network with a 10G backbone and 1G device connections.
As we were diagnosing some of the issues, the conversation moved between network performance, device control through applications, device access via RDP and KVM, and system timing reference. As I re-organize the network, looking to solve the performance issue with network optimization and packet shaping, I am supported by two engineers. One engineer is highly versed in the automation and media servers and how they interface to the library manager. The other engineer is stronger on the SDI side and some network topology.
And then there’s that funny GAP about how it all works together, and where the two worlds meet. Why are we considering this a network issue if a server is not communicating, rather than looking at the application stack and making sure all the services are running correctly?
As we adopt computer-based technology for broadcast, we need to expand our knowledge and also adopt new tools to analyze and solve problems in file-based and IP infrastructures and facilities. We need to encourage the vendors to provide more and better training on how their systems are configured and integrated with other systems. The customer needs to know where the middleware is hosted and, if there are APIs, where they are stored.
Don't let your digital network crash into the dark. Perform full system checks to ensure the system can recover from power outages and other catastrophic events.
Figuring out how it works
We are all familiar with “pull the plug” testing. Typically this is to make sure all mission critical devices and systems are on protected power.
We did something different and interesting. After all the applications and systems were integrated, “properly” configured and operational, we decided to give the facility a “birthday”. Let’s turn everything OFF and then do a full system restore.
We asked each vendor for the dependencies of each of their systems, expressed as an order for both shutdown and, more importantly, a full restore. This proved to be an interesting exercise. First they had to think about it. Then they had to think about what other systems they were connected to and what happens when that connection is disrupted. Wow, they never thought of that.
First we turned everything off, correctly, and watched as the primary systems spun down to see whether the backup systems kicked in, before shutting those down too. As things shut down, we tested backups and redundancy, learning a few things. Then we started things up and learned a lot more. For instance, we learned that if systems are not restored in exactly the right order, some devices get lost. This is valuable information because it means that when a single device fails, you cannot necessarily replace it and move on. Sometimes a few of its friends and relationships need to be re-established.
And where is this documented? One engineer fully documented the entire shutdown and restore process and continues to update it as we learn more about the processes. As of now, it has grown to a 4-inch binder.
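The dependency lists collected from the vendors can also be kept in a machine-readable form alongside the binder, so that a restore order can be recomputed whenever a system is added. The sketch below is one way to do that with a topological sort from Python's standard `graphlib` module (Python 3.9 or later). The system names and their dependencies are invented for illustration and are not the facility's actual list.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be running
# before it can be started.
DEPENDS_ON = {
    "network core": [],
    "shared storage": ["network core"],
    "asset database": ["network core", "shared storage"],
    "middleware": ["asset database"],
    "automation": ["asset database", "middleware"],
    "playout servers": ["shared storage", "automation"],
}


def restore_order(depends_on):
    """Start-up order in which every system comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shutdown is the reverse: dependents go down before what they rely on."""
    return list(reversed(restore_order(depends_on)))


if __name__ == "__main__":
    for step, system in enumerate(restore_order(DEPENDS_ON), start=1):
        print(f"{step}. start {system}")
```

If two vendors' answers contradict each other (A needs B, and B needs A), `static_order()` raises `graphlib.CycleError`, which is a useful prompt to go back and ask the question again. The computed order only reflects what has been written down; it does not replace the pull-the-plug test that revealed the dependencies in the first place.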
There’s a lot more to IP than just which format should win the live contest. Look for more tutorials on IP system troubleshooting and documentation in my series published on The Broadcast Bridge.