Training neural networks is one of the most important aspects of ML, but what exactly do we mean by this?
In previous articles in this series, we looked at the applications for ML, the comparison to the brain, and the method of repeated learning for humans. It is this repeated learning that is critical to training a neural network.
Figure 2 in Part 3 of ML for Broadcasters shows a very simple neural network with input, hidden, and output layers. With just a single output, it would probably be used for a classification type solution such as whether a video passes or fails a test. In reality, an NN like this will consist of thousands, or even tens of thousands, of neurons all connected together.
Training a network is an exercise in repeated learning. Just as a human repeats a task until it becomes second nature to them, we repeatedly apply the data to a neural network to facilitate training. This process is summarized in Figure 1.
Figure 1 – training consists of applying the training data to the model and comparing its predictions to the known data, after which, the weights and biases are updated
In this example we are using labelled datasets, that is, somebody has had to classify the data to provide the appropriate result, or label. When applying the data to the network, the act of updating the weights and biases in each neuron moves the result of the network closer to the labelled data. Consequently, there are two fundamental processes in training the network: the forward pass and the backwards propagation.
The forward pass is applying the data to the neural network and determining a result. Initially this will be massively different to our expected result, and the difference between the two is referred to as the loss. The intention is to make our loss as small as possible so that the model can accurately predict an output on data it has never seen before.
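As a minimal sketch of the forward pass and the loss, consider a single neuron with a sigmoid activation and a squared-error loss. The input values, weights, bias and label below are made up purely for illustration and are not taken from the author's model.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, weights, bias):
    """Forward pass of one neuron: weighted sum plus bias, then sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def squared_error(prediction, label):
    """Loss: how far the prediction is from the known label."""
    return (prediction - label) ** 2

# Hypothetical values chosen only for illustration.
inputs = [0.5, -1.2, 3.0]
weights = [0.1, 0.4, -0.2]   # untrained, arbitrary starting weights
bias = 0.0
label = 1.0                  # the known (labelled) result

prediction = forward(inputs, weights, bias)
loss = squared_error(prediction, label)
print(f"prediction={prediction:.3f} loss={loss:.3f}")
```

With untrained weights the prediction sits well away from the label, so the loss is large. Training is the process of shrinking that number.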
The backwards propagation is the process of taking the loss value and updating the weights and biases based on it. Anybody who can remember their school calculus will recall that the minimum of a function can be found from its rate of change, and this is exactly what we do with backwards propagation. Essentially, the process calculates the rate of change of the loss with respect to each weight and bias, and adjusts them to move the loss towards a minimum, ideally the global minimum. When the loss stops reducing, the model is said to be trained.
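The idea of following the rate of change downhill to a minimum can be sketched with a toy, one-weight loss curve. The function, starting point and step size (learning rate) here are assumptions chosen for illustration; a real network repeats the same step across thousands of weights.

```python
def loss(w):
    """A toy loss curve with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Rate of change of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0                # arbitrary starting weight
learning_rate = 0.1    # hypothetical step size

for step in range(100):
    # Move the weight a small step against the slope, i.e. downhill.
    w -= learning_rate * gradient(w)

print(f"w={w:.4f} loss={loss(w):.6f}")
```

Each step reduces the distance to the minimum by a fixed fraction, so the weight settles very close to 3, where the slope is zero.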
This is a highly simplified description of backwards propagation, and anybody wanting to understand it better would do well to look at the application of the chain rule found in intermediate calculus. The chain rule is used to update each individual weight and bias on each neuron. This is one of the reasons we use the sigmoid function (as highlighted in Part 3), as differentiating e^x with respect to x gives e^x. The sigmoid function both introduces the non-linearity into the model and is easy to differentiate when training with back propagation!
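To make the chain rule concrete, the sketch below differentiates the squared-error loss of a single sigmoid neuron with respect to its weight, and checks the answer against a numerical estimate. The neuron, the loss and all the values are assumptions for illustration. It relies on the standard result that the derivative of the sigmoid s(z) is s(z) * (1 - s(z)).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, b, x, y):
    """Squared-error loss of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw via the chain rule: dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dz_dw = x                  # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

# Hypothetical weight, bias, input and label.
w, b, x, y = 0.5, 0.1, 2.0, 1.0

# Numerical check: nudge the weight slightly and measure the change in loss.
eps = 1e-6
numeric = (loss_for(w + eps, b, x, y) - loss_for(w - eps, b, x, y)) / (2 * eps)
analytic = grad_w(w, b, x, y)
print(f"analytic={analytic:.6f} numeric={numeric:.6f}")
```

The two values agree closely, which is the point: the chain rule gives the exact slope cheaply, without having to nudge every weight in turn.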
Figure 2 shows an example of the training code written in Python for the author's own research into determining elephant flows in IP networks. Much of the model's detail, such as the neurons and configuration of the model, is hidden in the library, in this case PyTorch, but the overall structure of learning can be seen. The LSTM (Long Short-Term Memory) hinted at in line 229 is an advanced form of neural network which is used in time-series and sequence predictions. In this example, four or five IP packets of a TCP flow are applied to the model to predict very long TCP elephant flows. Determining the existence of an elephant flow early on will help network operators route the IP traffic more efficiently, reduce head-of-line blocking, and hence keep latency low.
Line 227 provides the repeated learning, in this case referred to as the epoch. The number of times the dataset is presented to the model is defined here and in the case of this design was set to 300.
Line 229 randomly loads segments of the dataset until the whole dataset has been presented to the model to facilitate training. Randomness is important in machine learning as it stops the model from inadvertently finding patterns that don't really exist.
Line 231 is the forward pass of the model. The segments of data are presented to the model and the outputs are determined. Line 233 shows the loss function that compares the predicted output to the known output. By comparing the two, the loss value is calculated which is then processed by line 235 to provide the backwards propagation. When this occurs, the weights and biases are updated within the model.
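The overall structure described above (epochs, randomly ordered segments of data, forward pass, loss, backwards propagation and update) can be sketched without any library. This is not the author's PyTorch code from Figure 2: it is a stand-in using a single sigmoid neuron, a made-up dataset and hypothetical settings, intended only to show the shape of the loop.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy labelled dataset (hypothetical): the label is 1 when x is above 0.5.
dataset = [(i / 20.0, 1.0 if i / 20.0 > 0.5 else 0.0) for i in range(21)]

w, b = 0.0, 0.0        # untrained weight and bias
learning_rate = 0.5    # hypothetical step size
batch_size = 4         # size of each segment of data
epochs = 300           # number of times the dataset is presented
history = []           # average loss per epoch

for epoch in range(epochs):
    random.shuffle(dataset)                   # present the data in random order
    epoch_loss = 0.0
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            pred = sigmoid(w * x + b)         # forward pass
            epoch_loss += (pred - y) ** 2     # loss against the known label
            # Backwards propagation: chain rule through loss and sigmoid.
            delta = 2.0 * (pred - y) * pred * (1.0 - pred)
            grad_w += delta * x
            grad_b += delta
        # Update the weight and bias using the batch-averaged gradients.
        w -= learning_rate * grad_w / len(batch)
        b -= learning_rate * grad_b / len(batch)
    history.append(epoch_loss / len(dataset))

print(f"first epoch loss={history[0]:.4f} last epoch loss={history[-1]:.4f}")
```

In a library such as PyTorch, the inner gradient arithmetic is replaced by the library's automatic differentiation and optimizer, but the nesting of the epoch loop around the batch loop stays the same.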
When the loss value does not reduce further, the model is said to have learned the training dataset, and a file containing the weights and biases is made available. A further process then takes place where data not seen before by the model is applied to it and the model's predictions are compared to the known labels. This then determines the accuracy of the model. Assuming the accuracy is good enough, the file containing the weights and biases is used by the forward prediction in line 231 of Figure 2 to provide the process that the end user or broadcaster is most interested in.
This whole process can take a great deal of time. In this example, training for 300 epochs takes about a day, even with fast GPU acceleration, and video-based models can take weeks to learn. However, the forward pass used in the final product, depending on the complexity of the model, only takes a few milliseconds to provide the result of the particular application. In the elephant flow classification described above, the detection only takes 2 milliseconds after the fourth IP packet has been processed by the forward pass code.
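Measuring accuracy on unseen data amounts to running the forward pass only and counting how often the prediction matches the label. In this sketch the trained weight and bias, the 0.5 decision threshold and the test samples are all hypothetical stand-ins for a real saved weights file and a real held-back dataset.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Forward pass only: no weights are updated at this stage."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

def accuracy(samples, w, b):
    """Fraction of samples where the prediction matches the known label."""
    correct = sum(1 for x, label in samples if predict(x, w, b) == label)
    return correct / len(samples)

# Hypothetical trained parameters, standing in for the saved weights file.
w, b = 10.0, -5.0

# Hypothetical labelled samples that were not used during training.
test_set = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.52, 0)]

print(f"accuracy={accuracy(test_set, w, b):.2f}")
```

Five of the six samples are classified correctly here. The last sample is deliberately labelled against the model's decision boundary to show that accuracy on unseen data is rarely perfect.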
A human training to become a concert violinist takes in excess of 10,000 hours of learning (practice), but the performance only takes a fraction of that time. And this is the power of machine learning.