An overhead automated camera tracks the field action based on deep-learning algorithms.
Deep learning technology is more common than one might think. It is used to identify objects in images, text or audio, achieving results that were not possible before. This article examines how deep learning is revolutionizing sports production, enabling low-cost, fully automated production for semi-professional and amateur sports broadcasts.
To understand how deep learning works, let's examine how our brains work. A human brain is made up of nerve cells, called "neurons," which are connected to each other in adjacent layers, forming an elaborate "neural network." In an artificial neural network, signals also travel between "neurons." Instead of firing an electrical signal, however, an artificial neural network assigns "weights" to the connections between neurons, and these weights determine how strongly one neuron's output influences the next.
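As a rough illustration of what a "weight" does, the sketch below models a single artificial neuron as a weighted sum of its inputs passed through a sigmoid function. The function name, the input values and the choice of sigmoid are assumptions made for this example; they are not taken from any particular production system.

```python
import math

def neuron_output(inputs, weights, bias):
    """Return the activation of one artificial neuron (illustrative only)."""
    # Each input is scaled by its weight; a larger weight means more influence.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The sigmoid squashes the sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Same inputs, different weights: training adjusts the weights, not the inputs.
weak = neuron_output([1.0, 0.5], [0.1, 0.1], 0.0)
strong = neuron_output([1.0, 0.5], [2.0, 2.0], 0.0)
```

Training a network amounts to adjusting many such weights until the outputs match the annotated examples.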
Deep learning neural networks can comprise as many as 150 connected layers. The more layers a network has, the "deeper" it is. Deep learning models are trained on large sets of labeled or annotated data. The neural network architectures learn features directly from the data, so you do not need to specify by hand the features used to classify images. The relevant features are not pretrained either; they are learned while the network trains on a collection of images. This automated feature extraction makes deep learning models highly accurate for computer vision tasks such as object classification.
Although there is no need to extract each feature manually, there is a need to create a large enough annotated training data set. To identify a ball, for example, you will need a data set of hundreds of thousands of unique images, which are annotated by humans and represent the "ground truth" for the deep learning model. Since you would usually annotate other elements as well, such as players, this can add up to millions of annotations. The result is a "trained model" that can identify the objects it was trained on.
Deep Learning in Sports Production
Deep learning is used to generate fully automated sports production that looks very similar to a professional sports broadcast, including camera zoom-ins on the action, panning, etc. The basis for any decent automated sports production is the ability to identify at least the ball and the players. Identifying the ball is not an easy task, considering that the ball can be on the ground, in the air, or held by a goalkeeper or a player (e.g. before taking a free kick).
Deep learning technologies enable software to identify all of the required elements of a sports broadcast to automate its live production.
In each of these situations the ball "looks" different, yet we, as humans, have no problem identifying it as a ball from a single frame. Identifying the players is not simple either, as the system has to distinguish between active players and referees, bench players, etc.
Identifying the Field/Court
In sports production, one way to help identify the ball and the players is to define for the system the area that constitutes the field/court. This process, called "calibration," limits the scope of options for the DL algorithm by establishing, within each frame, which pixels are part of the field or court and which are not. It then translates these pixels to physical dimensions based on real-world coordinates.
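One common way to translate image pixels into positions on a flat playing surface is a planar homography, a 3x3 matrix applied to pixel coordinates. The sketch below assumes such a matrix is already available from calibration; the matrix values, the function name and the use of metres are hypothetical and serve only to show the arithmetic. The article does not specify which mapping a given production system uses.

```python
# Hypothetical homography mapping pixels to metres on the field plane.
# In practice these nine numbers come out of the calibration step.
H = [
    [0.1, 0.0, -10.0],
    [0.0, 0.2, -20.0],
    [0.0, 0.001, 1.0],
]

def pixel_to_field(h, u, v):
    """Map pixel (u, v) to field coordinates (x, y) with homography h."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    # Dividing by w accounts for perspective: distant pixels cover more ground.
    return x / w, y / w

corner = pixel_to_field(H, 100, 100)
far_point = pixel_to_field(H, 600, 400)
```

With the assumed matrix, the pixel at (100, 100) lands on the field origin, while pixels lower in the frame are divided by a larger perspective term.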
Once the area of the field/court is established, it is possible to distinguish between players who are inside it and people who are outside it, such as bench players and spectators.
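Once detections are expressed in field coordinates, separating active players from everyone else can be as simple as a bounds check. The following is a minimal sketch under stated assumptions: the 105 m by 68 m pitch size, the dictionary layout of a detection and the function names are all illustrative choices, not details from the article.

```python
FIELD_LENGTH_M = 105.0  # assumed soccer pitch length
FIELD_WIDTH_M = 68.0    # assumed soccer pitch width

def is_on_field(x, y, length=FIELD_LENGTH_M, width=FIELD_WIDTH_M, margin=0.0):
    """Return True if (x, y), in metres, lies inside the field rectangle."""
    return -margin <= x <= length + margin and -margin <= y <= width + margin

def keep_on_field(detections):
    """Drop detections located outside the field (bench, stands, etc.)."""
    return [d for d in detections if is_on_field(d["x"], d["y"])]

detections = [
    {"label": "player", "x": 52.5, "y": 34.0},   # centre of the pitch
    {"label": "player", "x": 20.0, "y": -3.0},   # bench area beside the touchline
    {"label": "player", "x": 110.0, "y": 40.0},  # behind the goal line
]
active = keep_on_field(detections)
```

The optional `margin` parameter is there because a player taking a throw-in or corner stands slightly outside the lines; how much slack to allow is a tuning decision.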
Data Annotation for Sports
As mentioned above, training a deep learning model requires a large data set to establish the "ground truth" for the algorithm. This is a major undertaking that should be done on an ongoing basis as more data is gathered and the algorithm evolves.
There are several options to achieve this. A minimal number of frames must be annotated by humans. In addition, several methods require less effort, including:
- Google/YouTube images - it is possible to augment the data set by searching "soccer players" on Google or YouTube. This will yield frames or images that include soccer players, or, in other words, have been "pre-annotated" as soccer players.
- Unsupervised learning - this technique uses unlabeled data by applying an additional, non-deep-learning algorithm to first segment the areas where players are likely to be. For example, we can use well-known background subtractors such as MOG (Mixture of Gaussians) to roughly identify players.
- Augmentations - another commonly used technique is to modify or augment the images, for example by stretching them, changing angles, etc. These augmentations produce an additional data set that is already labeled.
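To show why an augmented image comes "already labeled," the sketch below mirrors a tiny image horizontally and applies the same transformation to its bounding-box annotation. The image is a plain nested list and the box format `(x, y, width, height)` is an assumption made for this example; real pipelines typically rely on an image library and many more transformations.

```python
def flip_horizontal(image, boxes):
    """Mirror an image left-to-right and move its box labels to match.

    image: list of pixel rows; boxes: list of (x, y, w, h) in pixels,
    where (x, y) is the top-left corner. Both formats are illustrative.
    """
    image_width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # A box starting at column x with width w ends up starting at
    # image_width - x - w after mirroring; y, w and h are unchanged.
    flipped_boxes = [(image_width - x - w, y, w, h) for (x, y, w, h) in boxes]
    return flipped_image, flipped_boxes

# A 4x2 "image" with a 1-pixel-wide object in the leftmost column.
image = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
boxes = [(0, 0, 1, 2)]
aug_image, aug_boxes = flip_horizontal(image, boxes)
```

The human annotated only the original frame; the label for the mirrored frame is computed, which is what makes augmentation inexpensive.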
One key to proper camera tracking is for the system to recognize the area of the field or court. The software must distinguish between players who are inside the field/court versus others outside of it.
As we've seen, deep learning technologies allow computers to understand the sports action, opening opportunities in sports production that were never possible before. At its highest level, this technology can mimic the decision-making process of a human camera operator and video editor, providing almost the same experience as a professional live sports broadcast at a fraction of the cost. This technological revolution will allow semi-professional and amateur sports clubs to broadcast their games to their fans and even monetize their content.
Yoav Liberman is Director of Computer Vision & Deep Learning Algorithms at Pixellot.