IP Monitoring & Diagnostics With Command Line Tools: Part 10 - Example Monitoring Probes

A server will experience problems when the processing demands hit a resource limit. Observing trends by measuring and comparing results periodically can alert you before that happens.

Managing Potential Problems

Spotting a problem early, before it escalates and brings down a server, leaves enough time to solve it in advance. If important services cannot respond to incoming connections, this leads to client-side problems, and a failure in one machine may cascade to others. The usual cause is that most or all of a particular resource has been consumed:

• Available CPU capacity.
• Available physical memory.
• Available memory swap space.
• Memory leaks per process.
• Disk space filling up.
• Disks or shared file systems going offline.
• Process counts.
• Zombie processes.
• File buffer limits per process.
• Number of files per file system.
• Network latency.
• Network bandwidth/throughput.

Define sensible measurement thresholds for these characteristics and watch for resource allocations rising to meet them. Resource limits are managed under these categories:

• System-wide hard maximum limits are often configured into the kernel. Only the ops team should alter them and then reboot the server.
• User limits apply to an individual login session. The ops team will define maximum values. The ulimit command tunes your session within them. Only the ops team can increase the upper limits.
• Application specific limits are imposed by services such as database managers. Alter these with their config files. There may be maximum values that require ops team intervention. In rare cases, the limits may be defined in the application source code, which requires a rebuild to change.
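
As a rough illustration of the user-limit category, the bash built-in ulimit can report both the soft limit (the value currently in force) and the hard limit (the ceiling you cannot exceed) for the current session. The open-files limit (-n) and the variable names are chosen here only as an example:

```shell
#!/usr/bin/env bash
# Report the soft and hard limits on open files for this login session.
SOFT_OPEN_FILES=$(ulimit -Sn)   # value currently in force
HARD_OPEN_FILES=$(ulimit -Hn)   # upper bound for this session

echo "Open files: soft=${SOFT_OPEN_FILES} hard=${HARD_OPEN_FILES}"
```

Either value may be reported as the word "unlimited" rather than a number, so check for that before comparing them as integers.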

Designing Your Own Measuring Probes

Design your monitoring system in a disciplined way. Consider these decision points for each probe:

• What is being measured?
• How can it be measured?
• Which machines will deploy this test?
• How often will the measurement be taken?
• Capture and display a single value for a live status check or a time-based series for a trend?
• Cache the results in a file or database table?
• Is real-time feedback needed? Database caching is optimal for real-time data.
• Are the results pushed from the remote system or pulled by the central cortex?
• Is an HTTP web server end-point helpful for fetching results?
• Is file sharing useful for delivering results?

The examples demonstrate different measurement solutions. Dismantle them and reuse the ideas as a starting point for your own monitoring probes.

Don't forget to put a shebang at the top of every script so that it runs in the correct shell interpreter. These examples are all based on the bash shell.

Ping Testing For Reachability

The ping command will time-out and return a message that describes the packet loss if the target node is unreachable. Filter the result to detect that outcome.

MY_PACKET_LOSS=$(ping -c 1 {host-name-or-ip} |
tr ',' '\n' |
grep "packet loss" |
cut -d ' ' -f 2)

Split the result based on comma characters (,) by converting them to line-breaks (\n) with a tr command. Use grep to isolate the line with the "packet loss" message. Then cut will split the line using spaces as the delimiter. Take field 2 because there is a leading space on the line.

Now use an if test to check for 100% packet loss and record the target node as being unreachable.

if [ "${MY_PACKET_LOSS}" = "100.0%" ]
then
   echo "Unreachable" >> ./MACHINE_STATUS.log
else
   echo "Online" >> ./MACHINE_STATUS.log
fi

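Use an equals character (=) to compare text strings. The -eq test expects integer values and would be inappropriate in this case. Some ping implementations report "100%" rather than "100.0%", so match the string that your platform prints.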
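
The filtering and the test can be wrapped into two small functions. This is only a sketch: the function names are invented for illustration, both percentage formats are accepted, and an empty result (for example when the host name cannot be resolved and ping prints no statistics) is reported as "Unknown" rather than "Online".

```shell
#!/usr/bin/env bash
# Extract the packet-loss percentage from ping output supplied on STDIN.
packet_loss_from_ping() {
  tr ',' '\n' | grep "packet loss" | tr -s ' ' | cut -d ' ' -f 2
}

# Translate a packet-loss figure into a status word.
# Accepts "100%" as well as "100.0%" because the format varies by platform.
reachability_status() {
  case "$1" in
    "")          echo "Unknown" ;;      # ping printed no statistics line
    100%|100.0%) echo "Unreachable" ;;
    *)           echo "Online" ;;
  esac
}

# Hypothetical usage against a real host (TARGET_HOST is an assumed variable):
#   MY_PACKET_LOSS=$(ping -c 1 "${TARGET_HOST}" 2>&1 | packet_loss_from_ping)
#   reachability_status "${MY_PACKET_LOSS}" >> ./MACHINE_STATUS.log
```

Keeping the parsing separate from the ping call means the filter can be checked against saved output from each platform you monitor.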

Checking For Closed Network Ports

Use the nmap or netcat (nc) commands to check whether a port is open on a remote system.

These tools may need to be installed first because they are not always available by default.

The nc command is very easy to use:

nc -zv {host-name-or-ip} {port}

The -z flag tests whether a connection can be made without transmitting any data. This avoids waking up remote daemons and triggering spurious activity in the remote machine. The -v flag produces the verbose output that the filter needs. Filter the result for the "Connection refused" message.

MY_PORT_TEST=$(nc -zv {host-name-or-ip} {port} 2>&1 |
tr ':' '\n' |
grep "Connection refused" |
sort -u |
wc -l |
tr -d ' ')

if [ "${MY_PORT_TEST}" -eq "1" ]
then
   echo "Port closed"
else
   echo "Port open"
fi

The useful part of the result is an error message delivered on the STDERR stream. This is not passed to the rest of the toolchain when commands are piped together with a vertical bar (|). The STDERR stream must be redirected into the STDOUT stream (2>&1) first in order to access its content. The tr command adds line-breaks so the grep command can filter the result. The sort -u command removes duplicate results. Then the wc command yields an integer 1 or 0 as a result. The final tr removes the whitespace introduced by the wc command. The result is tested with the -eq option which compares integer values:

1 = Port closed
0 = Port open
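
Note that this test treats anything other than a refusal as an open port, so a connection that times out would also be reported as open. A possible refinement is sketched below. The function name is invented, and the "timed out" wording is an assumption about what your nc variant prints, so check it against real output before relying on it.

```shell
#!/usr/bin/env bash
# Classify the verbose nc output supplied on STDIN.
# The message wording is an assumption; nc variants differ.
port_state_from_nc() {
  local output
  output=$(cat)
  if echo "${output}" | grep -q "Connection refused"; then
    echo "closed"
  elif echo "${output}" | grep -qi "timed out"; then
    echo "no-response"
  else
    echo "open"
  fi
}

# Hypothetical usage (the 5 second -w timeout is an arbitrary choice):
#   nc -zv -w 5 "${TARGET_HOST}" "${TARGET_PORT}" 2>&1 | port_state_from_nc
```

Using grep -q tests for the message directly, which removes the need for the sort, wc and tr steps.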

Add a logging line to note the hostname, symbolic name and timestamp with the port status.
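
One way to write that logging line is sketched here. The function name, the argument order and the tab-separated layout are all illustrative choices rather than a fixed format:

```shell
#!/usr/bin/env bash
# Append one status record: UTC timestamp, target host, symbolic name,
# port number and port state, separated by tabs.
log_port_status() {
  local log_file="$1" target="$2" label="$3" port="$4" state="$5"
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    "${target}" "${label}" "${port}" "${state}" >> "${log_file}"
}

# Hypothetical usage:
#   log_port_status ./PORT_STATUS.log "10.0.0.5" "playout-db" 5432 "closed"
```

A tab-separated record is easy to split again later with cut, whose default delimiter is the tab character.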

Detecting Missing Disks

Entire disks may vanish if they are shared in from another system that suffers a network disconnect or shuts down. Hard drives can fail to spin up at boot time or dismount when they go wrong or overheat.

Filter the output of the df command and check whether the 'Mounted on' column lists the volume you expect to be there. A missing shared volume might indicate that one of your other machines is down.


Filesystem 1K-blocks    Used Available Use% Mounted on
/dev/root    2451064 1027040   1321624  44% /
none          512652       0    512652   0% /dev
/tmp          516844     812    516032   1% /tmp
/volume1      516844    6108    510736   2% /volume1
/dev/shm      516844       4    516840   1% /dev/shm

Test for a missing mount point like this:

df | grep "/volume1" | wc -l

The result counts will be:

1 = Disk mounted - everything OK
0 = Missing disk
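
A plain grep for "/volume1" would also match a mount point such as /volume10. If that matters on your system, an exact comparison against the last column is one alternative. This sketch uses awk instead of grep, the function name is invented, and it assumes that mount points contain no spaces:

```shell
#!/usr/bin/env bash
# Count the lines of df output (on STDIN) whose last column is exactly
# the expected mount point. The header line is skipped.
mount_count() {
  awk -v mount_point="$1" \
    'NR > 1 && $NF == mount_point { n++ } END { print n + 0 }'
}

# Hypothetical usage:
#   if [ "$(df | mount_count /volume1)" -eq 0 ]; then echo "Missing disk"; fi
```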

Measuring The Available Disk Space

Use grep to isolate the line you want, collapse the runs of space characters with a tr -s command, and then use cut to extract column 5, which holds the percentage of the space that is in use.

MY_VOLUME1=$(df |
grep "/volume1" |
tr -s ' ' |
cut -d ' ' -f 5)

Trigger a warning when the usage threshold is exceeded.
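
The value extracted from df still carries a percent sign, which must be removed before an integer comparison. A minimal sketch of the whole check follows. It uses awk rather than grep and cut, the function names are invented, and the 90% threshold in the usage line is an arbitrary figure:

```shell
#!/usr/bin/env bash
# Print the Use% figure, without the percent sign, for one mount point.
# Reads df output on STDIN; column 5 is assumed to hold the percentage.
usage_percent() {
  awk -v mount_point="$1" \
    'NR > 1 && $NF == mount_point { sub(/%/, "", $5); print $5 }'
}

# Compare a usage figure against a threshold (both integers).
disk_warning() {
  local used="$1" threshold="$2"
  if [ -z "${used}" ]; then
    echo "UNKNOWN: no usage figure found"
  elif [ "${used}" -ge "${threshold}" ]; then
    echo "WARNING: ${used}% used (threshold ${threshold}%)"
  else
    echo "OK: ${used}% used"
  fi
}

# Hypothetical usage:
#   disk_warning "$(df -P | usage_percent /volume1)" 90
```

The -P option asks df for the POSIX output format, which keeps each file system on one line with the percentage in column 5.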

Spotting Zombie Processes

Use a ps command to display a user defined format. This example presents just the process state, PID number and the command:

ps -eo stat,pid,comm

The stat column will contain a letter 'Z' if the process is Zombified.

Ss   11679 mobileassetd
Z    13192 networkserviceproxy
S    19596 nsurlsessiond
S    30572 nsurlstoraged
S    32727 opendirectoryd

Use grep to detect a letter 'Z' in the first column. The circumflex character (^) represents the start of the line. Tell grep to ignore upper/lower case (-i) when matching:

grep -i "^Z"

Add a word-count and wrap this in a command substitution to assign the result to a variable:

MY_ZOMBIE_COUNT=$(ps -eo stat,pid,comm |
grep -i "^Z" |
wc -l)
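
A slightly shorter variant lets grep do the counting with its -c option and also lists the affected PIDs so they can be logged. The function names are invented for this sketch:

```shell
#!/usr/bin/env bash
# Count zombie entries in "ps -eo stat,pid,comm" output supplied on STDIN.
# grep -c prints a bare integer but exits non-zero when the count is 0,
# so "|| true" keeps the function from reporting a failure in that case.
zombie_count() {
  grep -ci "^Z" || true
}

# List the PIDs of the zombie entries, one per line.
zombie_pids() {
  awk '$1 ~ /^[Zz]/ { print $2 }'
}

# Hypothetical usage:
#   MY_ZOMBIE_COUNT=$(ps -eo stat,pid,comm | zombie_count)
```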

Checking Physical Vs Swap Memory Space

The free -m command shows the available RAM and swap space. Install this add-on command if it is not already available by default. The swapon command is also useful but it provides less information.

free -m

      total   used     free   shared  buff/cache available
Mem:   1009    215       15       56         778       654
Swap:  2047      0     2047

Roughly a fifth of the physical memory (215 MB of 1009 MB) is in use. There has been no need to use any swap space so far.

Extract the most useful items like this:

TOTAL_MEMORY=$(free -m | grep "Mem:"  | tr -s ' ' | cut -d ' ' -f 2)
USED_MEMORY=$( free -m | grep "Mem:"  | tr -s ' ' | cut -d ' ' -f 3)
TOTAL_SWAP=$(  free -m | grep "Swap:" | tr -s ' ' | cut -d ' ' -f 2)
USED_SWAP=$(   free -m | grep "Swap:" | tr -s ' ' | cut -d ' ' -f 3)

If ${USED_MEMORY} is less than ${TOTAL_MEMORY} and ${USED_SWAP} is 0, you have sufficient physical memory. If ${USED_SWAP} increases significantly, add more physical memory.
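
Running free four times means the four values come from slightly different moments. As a sketch of an alternative, a single awk pass can pull everything from one snapshot and reduce the RAM figures to a percentage. The function name and the output layout are illustrative choices:

```shell
#!/usr/bin/env bash
# Read "free -m" output on STDIN and print three integers on one line:
# percentage of RAM used, swap used (MB) and swap total (MB).
memory_summary() {
  awk '
    $1 == "Mem:"  { mem_total = $2; mem_used = $3 }
    $1 == "Swap:" { swap_total = $2; swap_used = $3 }
    END {
      if (mem_total > 0)
        printf "%d %d %d\n", (mem_used * 100) / mem_total, swap_used, swap_total
    }'
}

# Hypothetical usage:
#   read -r MEM_PCT USED_SWAP TOTAL_SWAP < <(free -m | memory_summary)
```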

Detecting An Overloaded CPU

Use the top command to reveal CPU loading as well as memory usage.

top -b -n 1

  1 root      20   0    4.3m   2.6m   0.0  0.3   0:12.78 S /sbin/init
  2 root      20   0    0.0m   0.0m   0.0  0.0   0:00.00 S [kthreadd]

Column 7 indicates how much of the CPU capacity is consumed by each process. The memory usage is in column 8.

A process might occasionally require 100% of the CPU for a few moments before going back to zero. Processes that hog the CPU capacity continuously need to be inspected to find out why.

The ps command is also useful for observing CPU usage but it uses a different method to calculate the percentage used. This example ranks the processes by CPU usage and lists only the top 10 culprits as process IDs:

ps -eo %cpu=,pid= | sort -rn | head -10
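
To turn that listing into an alert, the same output can be filtered against a limit. This is a minimal sketch with an invented function name, and the 80% figure in the usage line is arbitrary:

```shell
#!/usr/bin/env bash
# Read "ps -eo %cpu=,pid=" output on STDIN and print the PIDs whose
# CPU share is at or above the limit given as the first argument.
cpu_hogs() {
  awk -v limit="$1" '($1 + 0) >= (limit + 0) { print $2 }'
}

# Hypothetical usage:
#   ps -eo %cpu=,pid= | cpu_hogs 80
```

Because a brief burst of activity is normal, it may be better to raise an alert only when the same PID is reported in several consecutive samples.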

Detecting Potential Memory Leaks

Observing an increasing memory usage over time for a constantly running process suggests a potential memory leak in that application. Fix the code to remove the leak.

In the meantime, regularly stopping and restarting the process may help avoid using up all the memory.
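
One simple way to observe that trend is to record the resident memory size of the process at regular intervals and compare the first and last samples. The sketch below assumes a log file with one "epoch-seconds rss-kilobytes" pair per line. The function name, the log layout and the PID file path in the comment are all hypothetical:

```shell
#!/usr/bin/env bash
# Given "epoch-seconds rss-kilobytes" samples on STDIN (oldest first),
# print the growth in kilobytes between the first and the last sample.
rss_growth_kb() {
  awk 'NR == 1 { first = $2 } { last = $2 } END { if (NR > 0) print last - first }'
}

# Hypothetical sampling line for a scheduled job:
#   echo "$(date +%s) $(ps -o rss= -p "$(cat /var/run/myservice.pid)")" >> ./RSS.log
# Hypothetical check:
#   rss_growth_kb < ./RSS.log
```

A figure that keeps rising across many samples is more telling than a single large jump, which may simply reflect a burst of legitimate work.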


Be aware of the differences in command output for each operating system and alter these examples accordingly.

Infer problems in another machine by attempting a connection to it or by detecting that a shared disk that it vends is not mounted.

Drill down into the results and track them over a time period. Measuring disk capacity on a daily basis allows you to predict when a disk will become full by observing the trend. Schedule a disk upgrade or clean-up to remove unnecessary content before that happens.
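
As a rough sketch of that prediction, the average daily growth between the first and last samples can be projected forward to 100%. It assumes one daily percentage figure per line, oldest first, and a steady rate of growth, which real disks rarely show. The function name is invented:

```shell
#!/usr/bin/env bash
# Read one daily Use% figure per line on STDIN (oldest first) and estimate
# the number of days until 100% from the average daily growth.
days_until_full() {
  awk '
    NR == 1 { first = $1 }
    { last = $1 }
    END {
      if (NR < 2) { print "unknown"; exit }
      growth = (last - first) / (NR - 1)
      if (growth <= 0) { print "not growing"; exit }
      printf "%d\n", (100 - last) / growth
    }'
}

# Hypothetical usage with a file of daily samples:
#   days_until_full < ./VOLUME1_DAILY.log
```

For example, samples of 40, 42 and 44 give a growth of 2 percentage points per day and an estimate of 28 days before the disk is full.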

Predicting failures and preventing them is much easier than waiting for them to happen and rectifying the damage afterwards. It is much kinder to your end-users and much less hassle for you.
