IP Monitoring & Diagnostics With Command Line Tools: Part 10 - Example Monitoring Probes

A server will experience problems when the processing demands hit a resource limit. Observing trends by measuring and comparing results periodically can alert you before that happens.

Managing Potential Problems

Predicting a problem early, before it escalates and brings down a server, allows sufficient time to solve it in advance. If important services cannot respond to incoming connections, client-side problems follow. A failure in one machine may cascade to others. The most likely cause is that most or all of a particular resource has been consumed:

• Available CPU capacity.
• Available physical memory.
• Available memory swap space.
• Memory leaks per process.
• Disk space filling up.
• Disks or shared file systems going offline.
• Process counts.
• Zombie processes.
• File buffer limits per process.
• Number of files per file system.
• Network latency.
• Network bandwidth/throughput.

Define sensible measurement thresholds for these characteristics and watch for resource allocations rising to meet them. Resource limits are managed under these categories:

• System-wide hard maximum limits are often configured into the kernel. Only the ops team should alter them and then reboot the server.
• User limits apply to an individual login session. The ops team will define maximum values. The ulimit command tunes your session within them. Only the ops team can increase the upper limits.
• Application specific limits are imposed by services such as database managers. Alter these with their config files. There may be maximum values that require ops team intervention. In rare cases, the limits may be defined in the application source code, which requires a rebuild to change.
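As a rough illustration of the session-level limits described above, the ulimit built-in can report both the soft limit (the value your session currently runs with) and the hard limit (the ceiling set by the ops team). This sketch looks at the open-file limit only, and the variable names are illustrative.

```shell
#!/bin/bash
# Report the soft and hard limits on open files for this login session.
# -S selects the soft limit, -H the hard limit, -n the open-file count.
MY_SOFT_FILES=$(ulimit -Sn)
MY_HARD_FILES=$(ulimit -Hn)

echo "Open files: soft=${MY_SOFT_FILES} hard=${MY_HARD_FILES}"

# The soft limit can be raised by the user, but only as far as the
# hard limit, for example:
# ulimit -Sn 4096
```

Either value may be reported as the word "unlimited" rather than a number, so test for that before making an integer comparison.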

Designing Your Own Measuring Probes

Design your monitoring system in a disciplined way. Consider these decision points for each probe:

• What is being measured?
• How can it be measured?
• Which machines will deploy this test?
• How often will the measurement be taken?
• Capture and display a single value for a live status check or a time-based series for a trend?
• Cache the results in a file or database table?
• Is real-time feedback needed? Database caching is optimal for real-time data.
• Are the results pushed from the remote system or pulled by the central cortex?
• Is an HTTP web server end-point helpful for fetching results?
• Is file sharing useful for delivering results?
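One way to act on these decisions is to give every probe the same small recording step. The sketch below assumes the results are cached in a plain text file that a central system can pull later. The function name record_sample, the cache file name and the field layout are all illustrative choices.

```shell
#!/bin/bash
# Append one measurement to a cache file as:
#   timestamp|hostname|probe-name|value
# Each run adds one line, which builds a time-based series. The last
# line doubles as the live status value.
record_sample() {
  probe_name="$1"
  probe_value="$2"
  cache_file="$3"
  echo "$(date '+%Y-%m-%dT%H:%M:%S')|$(uname -n)|${probe_name}|${probe_value}" >> "${cache_file}"
}

# Example: record the number of running processes.
# record_sample "process_count" "$(ps -e | wc -l | tr -d ' ')" ./PROBE_CACHE.log
```

A single delimiter that never appears in the values keeps the file easy to split with cut later on.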

The examples demonstrate different measurement solutions. Dismantle them and reuse the ideas as a starting point for your own monitoring probes.

Don't forget to put a shebang at the top of every script so that it runs in the correct shell interpreter. These examples are all based on the bash shell.

Ping Testing For Reachability

The ping command will time-out and return a message that describes the packet loss if the target node is unreachable. Filter the result to detect that outcome.

MY_PACKET_LOSS=$( ping -c 1 {host-name-or-ip} | tr ',' '\n' | grep "packet loss" | cut -d ' ' -f 2 )

Split the result based on comma characters (,) by converting them to line-breaks (\n) with a tr command. Use grep to isolate the line with the "packet loss" message. Then cut will split the line using spaces as the delimiter. Take field 2 because there is a leading space on the line.

Now use an if test to check for 100% packet loss and record the target node as being unreachable.

if [ "${MY_PACKET_LOSS}" = "100.0%" ]
then
  echo "Unreachable" >> ./MACHINE_STATUS.log
else
  echo "Online" >> ./MACHINE_STATUS.log
fi

Use an equals character (=) to compare text strings. The -eq test expects integer values and would be inappropriate in this case.
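The exact text varies between platforms. The macOS ping reports "100.0% packet loss" while the Linux version typically reports "100% packet loss", so a fixed string comparison only works on one of them. A more tolerant variation, sketched below, keeps just the integer part and compares it numerically. The function name is illustrative, and the output should be checked against the ping command on your own systems.

```shell
#!/bin/bash
# Read ping output on STDIN and print the packet loss as a whole number.
# Copes with both the "100% packet loss" and "100.0% packet loss" styles.
packet_loss_percent() {
  tr ',' '\n' | grep "packet loss" | sed 's/^ *//' | cut -d ' ' -f 1 | cut -d '%' -f 1 | cut -d '.' -f 1
}

# Typical use (the host name is a placeholder):
# MY_LOSS=$(ping -c 1 {host-name-or-ip} | packet_loss_percent)
# if [ "${MY_LOSS:-100}" -eq 100 ]
# then
#   echo "Unreachable" >> ./MACHINE_STATUS.log
# else
#   echo "Online" >> ./MACHINE_STATUS.log
# fi
```

The ${MY_LOSS:-100} default treats an empty result as unreachable, for example when the host name cannot be resolved and ping prints no statistics.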

Checking For Closed Network Ports

Use the nmap or netcat (nc) commands to check whether a port is open on a remote system.

These tools may need to be installed first because they are not always available by default.

The nc command is very easy to use:

nc -zv {host-name-or-ip} {port}

The -z flag checks the connection without transmitting any data if it succeeds. This avoids waking up remote daemons and triggering spurious activity in the remote machine. The -v flag provides the necessary verbose output as a result. Filter the result for the "Connection refused" message.

MY_PORT_TEST=$(nc -zv {host-name-or-ip} {port} 2>&1 | tr ':' '\n' | grep "Connection refused" | sort -u | wc -l | tr -d ' ')

if [ "${MY_PORT_TEST}" -eq "1" ]
then
  echo "Port closed"
else
  echo "Port open"
fi

The useful part of the result is an error message delivered on the STDERR stream. This is not passed to the rest of the toolchain when commands are piped together with a vertical bar (|). The STDERR stream must be redirected into the STDOUT stream (2>&1) first in order to access its content. The tr command adds line-breaks so the grep command can filter the result. The sort -u command removes duplicate results. Then the wc command yields an integer 1 or 0 as a result. The final tr removes the whitespace introduced by the wc command. The result is tested with the -eq option which compares integer values:

• 1 = Port closed
• 0 = Port open

Add a logging line to note the hostname, symbolic name and timestamp with the port status.
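A logging line of that kind might look like the sketch below. The function name, the log file name and the order of the fields are assumptions made for illustration.

```shell
#!/bin/bash
# Append a port status record containing the timestamp, the name of the
# machine running the probe, a symbolic name for the service, the
# target host and port, and the status.
log_port_status() {
  symbolic_name="$1"
  target_host="$2"
  target_port="$3"
  port_status="$4"
  log_file="$5"
  echo "$(date '+%Y-%m-%d %H:%M:%S') $(uname -n) ${symbolic_name} ${target_host}:${target_port} ${port_status}" >> "${log_file}"
}

# Typical use once MY_PORT_TEST has been set by the test above:
# if [ "${MY_PORT_TEST}" -eq 1 ]
# then
#   log_port_status "web-server" {host-name-or-ip} {port} "closed" ./PORT_STATUS.log
# else
#   log_port_status "web-server" {host-name-or-ip} {port} "open" ./PORT_STATUS.log
# fi
```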

Detecting Missing Disks

Entire disks may vanish if they are shared in from another system that suffers a network disconnect or shuts down. Hard drives can fail to spin up at boot time or dismount when they go wrong or overheat.

Filter the output of the df command and check whether the 'Mounted on' column lists the volume you expect to be there. A missing shared volume might indicate that one of your other machines is down.

df

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/root        2451064 1027040   1321624  44% /
none              512652       0    512652   0% /dev
/tmp              516844     812    516032   1% /tmp
/volume1          516844    6108    510736   2% /volume1
/dev/shm          516844       4    516840   1% /dev/shm

Test for a missing mount point like this:

df | grep "/volume1" | wc -l

The result counts will be:

• 1 = Disk mounted - everything OK
• 0 = Missing disk
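A plain grep matches its text anywhere on the line, so "/volume1" would also match a volume called "/volume10". A stricter variation, assuming the mount point is the last column as in the listing above, compares that column exactly. The function name is illustrative.

```shell
#!/bin/bash
# Read df output on STDIN and print 1 if the given mount point appears
# in the last column ("Mounted on"), otherwise print 0.
# Assumes the mount point name contains no spaces.
is_mounted() {
  awk -v mount_point="$1" '$NF == mount_point { found = 1 } END { print found + 0 }'
}

# Typical use:
# MY_DISK_OK=$(df | is_mounted "/volume1")
```

The result keeps the same meaning as the grep version: 1 for mounted and 0 for missing.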

Measuring The Available Disk Space

Use grep to isolate the line you want, collapse the runs of space characters with a tr -s command, then use cut to extract column 5, which holds the percentage of the space that is used.

MY_VOLUME1=$(df | grep "/volume1"| tr -s ' ' | cut -d ' ' -f 5)

Trigger a warning when the usage threshold is exceeded.
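The value captured in MY_VOLUME1 includes a trailing percent sign, which must be stripped before an integer comparison will work. A minimal sketch, assuming a hypothetical threshold of 90 percent and an illustrative function name:

```shell
#!/bin/bash
# Compare a "44%" style value against a numeric threshold.
# Prints a warning and returns 1 when the threshold is exceeded.
check_disk_usage() {
  usage_percent=$(echo "$1" | tr -d '%')
  threshold="$2"
  if [ "${usage_percent}" -gt "${threshold}" ]
  then
    echo "WARNING: disk usage ${usage_percent}% exceeds ${threshold}%"
    return 1
  fi
  echo "OK: disk usage ${usage_percent}%"
}

# Typical use:
# MY_VOLUME1=$(df | grep "/volume1" | tr -s ' ' | cut -d ' ' -f 5)
# check_disk_usage "${MY_VOLUME1}" 90
```

Returning a non-zero status lets a calling script react to the warning without parsing the message text.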

Spotting Zombie Processes

Use a ps command to display a user defined format. This example presents just the process state, PID number and the command:

ps -eo stat,pid,comm

The stat column will contain a letter 'Z' if the process is Zombified.

STAT   PID COMMAND
Ss   11679 mobileassetd
Z    13192 networkserviceproxy
S    19596 nsurlsessiond
S    30572 nsurlstoraged
S    32727 opendirectoryd

Use grep to detect a letter 'Z' in the first column. The circumflex character (^) represents the start of the line. Tell grep to ignore upper/lower case (-i) when matching:

grep -i "^Z"

Add a word-count and wrap this in a command substitution to assign the result to a variable:

MY_ZOMBIE_COUNT=$(ps -eo stat,pid,comm | grep -i "^Z" | wc -l)
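The count can then drive a warning. The sketch below does the filtering and counting in a single awk step, which also avoids the leading whitespace that some versions of wc add to their output. The function name and the threshold of zero are illustrative.

```shell
#!/bin/bash
# Read "stat pid comm" lines on STDIN and print how many have a
# process state beginning with Z (zombie).
count_zombies() {
  awk '$1 ~ /^Z/ { count++ } END { print count + 0 }'
}

# Typical use:
# MY_ZOMBIE_COUNT=$(ps -eo stat,pid,comm | count_zombies)
# if [ "${MY_ZOMBIE_COUNT}" -gt 0 ]
# then
#   echo "WARNING: ${MY_ZOMBIE_COUNT} zombie process(es) found"
# fi
```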

Checking Physical Vs Swap Memory Space

The free -m command shows the available RAM and swap space. Install this add-on command if it is not already available by default. The swapon command is also useful but it provides less information.

free -m

              total        used        free      shared  buff/cache   available
Mem:           1009         215          15          56         778         654
Swap:          2047           0        2047

About a fifth of the physical memory appears to be in use (215 of 1009 MiB). There has been no need to use any swap space so far.

Extract the most useful items like this:
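A minimal sketch, assuming the Linux layout of the free -m output shown above, where the rows are labelled "Mem:" and "Swap:" with the total in column 2 and the used amount in column 3. The helper name free_field is illustrative, and other platforms may lay the output out differently.

```shell
#!/bin/bash
# Read "free -m" output on STDIN and print one field.
# The first argument is the row label, the second is the column number.
free_field() {
  awk -v row="$1" -v col="$2" '$1 == row { print $col }'
}

# Typical use:
# MY_FREE=$(free -m)
# TOTAL_MEMORY=$(echo "${MY_FREE}" | free_field "Mem:" 2)
# USED_MEMORY=$(echo "${MY_FREE}" | free_field "Mem:" 3)
# USED_SWAP=$(echo "${MY_FREE}" | free_field "Swap:" 3)
```

Capturing the output once and extracting all three values from it keeps them consistent with each other.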

If ${USED_MEMORY} is less than ${TOTAL_MEMORY} and ${USED_SWAP} is 0, you have sufficient physical memory. If ${USED_SWAP} increases significantly, add more physical memory.

Detecting An Overloaded CPU

Use the top command to reveal CPU loading as well as memory usage.

top -b -n 1

  PID USER  PR NI  VIRT  RES %CPU %MEM    TIME+ S COMMAND
    1 root  20  0  4.3m 2.6m  0.0  0.3  0:12.78 S /sbin/init
    2 root  20  0  0.0m 0.0m  0.0  0.0  0:00.00 S [kthreadd]

Column 7 indicates how much of the CPU capacity is consumed by each process. The memory usage is in column 8.

A process might occasionally require 100% of the CPU for a few moments before going back to zero. Processes that hog the CPU capacity continuously need to be inspected to find out why.

The ps command is also useful for observing CPU usage but it uses a different method to calculate the percentage used. This example ranks the processes by CPU usage and lists only the top 10 culprits as process IDs:

ps -eo %cpu=,pid= | sort -rn | head -10
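Sorting numerically with sort -rn matters here because a plain text sort would rank 9.1 above 10.5. Building on the same ps output, the sketch below reports only the processes at or above a threshold rather than a fixed top 10. The function name and the 80 percent figure are illustrative.

```shell
#!/bin/bash
# Read "%cpu pid" lines on STDIN and print those at or above the
# given threshold, busiest first.
cpu_hogs() {
  awk -v limit="$1" '$1 + 0 >= limit + 0' | sort -rn
}

# Typical use, reporting anything using 80% or more of a CPU:
# ps -eo %cpu=,pid= | cpu_hogs 80
```

An empty result means nothing is over the threshold, which is simple to test for in a calling script.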

Detecting Potential Memory Leaks

Observing an increasing memory usage over time for a constantly running process suggests a potential memory leak in that application. Fix the code to remove the leak.

In the meantime, regularly stopping and restarting the process may help avoid using up all the memory.
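One way to observe that trend is to sample the resident memory of the process at regular intervals and compare the oldest and newest samples. The sketch below assumes a ps command that accepts the -o rss= and -p options, as the Linux and macOS versions do. The function names and the sample file name are illustrative.

```shell
#!/bin/bash
# Print the resident set size (RSS, in kilobytes) of one process.
process_rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# Read one RSS sample per line on STDIN (oldest first) and print the
# growth between the first and the last sample.
rss_growth_kb() {
  awk 'NR == 1 { first = $1 } { last = $1 } END { print last - first }'
}

# Typical use from a scheduled job ({pid} is a placeholder):
# process_rss_kb {pid} >> ./RSS_SAMPLES.log
# MY_GROWTH=$(rss_growth_kb < ./RSS_SAMPLES.log)
```

Steady growth across many samples is the signal to look for. A single large jump is more likely to be normal working memory.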

Conclusion

Be aware of the differences in command output for each operating system and alter these examples accordingly.

Infer problems in other machines by attempting a connection to them or by detecting that a shared disk they provide is no longer mounted.

Drill down into the results and track them over a time period. Measuring disk capacity on a daily basis allows you to predict when a disk will become full by observing the trend. Schedule a disk upgrade or clean-up to remove unnecessary content before that happens.
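The prediction can be a simple straight-line estimate. With one usage percentage recorded per day, the average daily growth is (last - first) / (samples - 1), and the days remaining are (100 - last) divided by that growth. The sketch below assumes the samples are stored one per line, oldest first. The function name and the file name are illustrative, and real usage rarely grows in a perfectly straight line.

```shell
#!/bin/bash
# Read one daily usage percentage per line on STDIN (oldest first) and
# estimate the whole days remaining until the disk reaches 100%.
# Prints "n/a" with fewer than two samples or when usage is not rising.
days_until_full() {
  tr -d '%' | awk '
    NR == 1 { first = $1 + 0 }
    { last = $1 + 0 }
    END {
      if (NR < 2 || last <= first) { print "n/a"; exit }
      daily_growth = (last - first) / (NR - 1)
      printf "%d\n", (100 - last) / daily_growth
    }'
}

# Typical use:
# MY_DAYS_LEFT=$(days_until_full < ./VOLUME1_USAGE.log)
```

For example, six daily samples rising from 40% to 50% give a growth of 2% per day, which leaves 25 days before the disk is full.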

Predicting failures and preventing them is much easier than waiting for them to happen and rectifying the damage afterwards. It is much kinder to your end-users and much less hassle for you.
