When AMD announced its Ryzen Zen 2 architecture in May 2019, the news went beyond the impressive performance of the new Series 3000 microprocessors: the new chips would also support PCI 4.0. Although I was pretty confident the step from 3.0 to 4.0 meant 2X greater bandwidth, I decided it was time to learn more about the PCIe bus.
The problem with most descriptions of PCIe: they present a top-down explanation of the PCIe version being described. Thus, an explanation of PCI 3.0 describes it as a “thing” rather than something built on more fundamental “things.” The more fundamental thing is a lane and every instance of a PCIe bus is a set of lanes.
Another problem with most PCIe bus descriptions is their focus on the length of desktop motherboard PCIe connectors. Figure 1 shows two motherboard PCIe connectors: EX4 (89mm) and EX1 (25mm). (A shorter PCIe card can be inserted into a longer PCIe connector.) Of course, slot lengths have no meaning for those of us who use laptops, since laptops don’t have PCIe connectors.
PCI 1.0 and 2.0 Lanes
While we correctly understand PCIe bus performance has increased over generations, fundamentally it’s lane performance that increases over time. So, the first technology to understand is how a lane functions.
A lane consists of four signal lines, arranged as two pairs. Each pair forms a differential line-pair that sends signals in one direction, so a lane can send and receive at the same time. A line-pair functions much like the twisted pair of wires in a cable carrying a balanced microphone signal.
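To see why a differential pair shrugs off noise, here is a minimal Python sketch. It is only an illustration: the function name, the 0.5-volt signal level, and the noise values are all hypothetical and are not taken from the PCIe specification. Each bit is sent as a voltage on one line and its mirror image on the other, identical noise is added to both, and the receiver looks only at the difference.

```python
def receive_differential(bits, noise):
    """Send each bit as a (+v, -v) pair, add the same noise to both lines, decode."""
    decoded = []
    for bit, n in zip(bits, noise):
        v = 0.5 if bit else -0.5   # hypothetical signal level, in volts
        line_p = v + n             # "positive" line picks up the noise
        line_n = -v + n            # "negative" line picks up the same noise
        diff = line_p - line_n     # subtraction cancels the noise, doubles the signal
        decoded.append(1 if diff > 0 else 0)
    return decoded


bits = [1, 0, 1, 1, 0, 0, 1, 0]
# Hypothetical common-mode noise, much larger than the signal itself.
noise = [0.9, -1.2, 0.3, -0.7, 1.5, -0.4, 0.0, 2.0]
print(receive_differential(bits, noise) == bits)
```

Because the noise appears equally on both lines, it drops out of the subtraction, which is the same trick a balanced microphone cable relies on.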
Were a lane simply a differential signal path, the only way to increase lane speed would be to increase the speed at which data passes over a lane. When the PCI Express (PCIe) bus was released in 2004, it was designed to transmit data not simply as ones and zeros representing the data itself. Rather, the data to be moved were coded using a scheme called “8b/10b.”
If you are curious, this scheme is described as a line-code that maps 8-bit words to 10-bit symbols to… provide enough state changes to allow reasonable clock recovery.
If you are really curious, 8b/10b coding goes back to IBM in 1983, and line-codes of this general kind are used, for example, to burn pits on optical discs. HDMI, DisplayPort, SATA, SD UHS-II, and USB 3.0 all employ 8b/10b coding.
At each end of a lane’s line-pair you will find an I/O processor that inputs or outputs 8-bit data to be moved from one point to another. (Lanes are point-to-point connections.) An input processor rapidly translates each of the 256 possible 8-bit data values to one of 1024 possible 10-bit symbols. At the other end of a lane’s line-pair, an output processor translates 10-bit symbols back to 8-bit data.
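One way to see why 10 bits leave enough room is to count how many of the 1024 possible symbols have an equal, or nearly equal, number of ones and zeros. The Python sketch below does only that counting. It is not an 8b/10b encoder, and the helper name `ones` is my own. As I understand the scheme, the real code table draws its symbols from this balanced and near-balanced pool and also rejects symbols with long runs of identical bits.

```python
def ones(symbol):
    """Count the 1-bits in a 10-bit symbol."""
    return bin(symbol).count("1")


# Symbols with exactly five ones and five zeros.
balanced = [s for s in range(1024) if ones(s) == 5]
# Symbols that are off by one bit either way (four or six ones).
near_balanced = [s for s in range(1024) if ones(s) in (4, 6)]

print(len(balanced), len(near_balanced))
```

The perfectly balanced symbols alone number 252, just short of the 256 values a byte can take, so the near-balanced symbols are needed as well.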
The speed of an I/O processor determines symbol transfer rate. A PCI 1.0 lane supports a symbol transfer rate of 2.5 gigatransfers per second (GT/s).
While 8b/10b coding is a clever way to obtain greater noise immunity, it isn’t free. Encoding imposes a 20-percent overhead because every byte becomes a 10-bit symbol. Thus, a symbol transfer rate of 2.5GT/s provides only a 2Gbps data-rate.
A 2Gbps bandwidth is equal to a unidirectional data-rate of 250MBps. A PCI 2.0 lane supports a 2X greater symbol transfer rate and thus provides a unidirectional 500MBps data-rate. (Figure 2.)
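The arithmetic for these first two generations is easy to check. Here is a small Python sketch; the function name `lane_rate_8b10b` is mine, and the only inputs are the transfer rates quoted above.

```python
def lane_rate_8b10b(gt_per_s):
    """Return (Gbps, MBps) of payload for one direction of one lane."""
    gbps = gt_per_s * 8 / 10   # 8 data bits ride in every 10-bit symbol
    mbps = gbps * 1000 / 8     # 8 bits per byte, 1000 Mb per Gb
    return gbps, mbps


for gen, rate in (("1.0", 2.5), ("2.0", 5.0)):
    gbps, mbps = lane_rate_8b10b(rate)
    print(f"PCIe {gen}: {gbps:.1f} Gbps, {mbps:.0f} MBps")
```

This prints 2.0 Gbps and 250 MBps for the first generation, and 4.0 Gbps and 500 MBps for the second.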
PCI 3.0 Lanes
To increase lane speed, lane processor I/O performance must be increased. (Lanes are backward compatible because a faster processor can translate lower symbol transfer rates.)
Were the move from PCI 2.0 to 3.0 to be accomplished the same way as the move from PCI 1.0 to PCI 2.0, the symbol rate would need to double to 10GT/s.
To avoid such a high symbol transfer rate, PCI 3.0 utilizes the more efficient “128b/130b” coding scheme. This scheme’s greater efficiency means the aggregate symbol transfer rate needs only to be increased by 60-percent from 5GT/s to 8GT/s.
Encoding overhead now falls from 20-percent to only 1.54-percent, so the 8GT/s symbol transfer rate carries about 7,880Mbps of data. This 7.9Gbps bandwidth is a unidirectional data-rate of 985MBps. Not quite a doubling of the data-rate, but close enough for marketing purposes. (Figure 3.)
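The 128b/130b numbers can be worked through the same way. In this Python sketch the function name `lane_rate_128b130b` is my own; the 8GT/s rate and the 128-in-130 ratio come from the text above.

```python
def lane_rate_128b130b(gt_per_s):
    """Return (Gbps, MBps) of payload for one direction of one lane."""
    gbps = gt_per_s * 128 / 130   # 128 data bits ride in every 130-bit block
    mbps = gbps * 1000 / 8
    return gbps, mbps


overhead = 1 - 128 / 130          # fraction of bits spent on coding
gbps, mbps = lane_rate_128b130b(8.0)
print(f"overhead {overhead:.2%}, {gbps:.2f} Gbps, {mbps:.0f} MBps")
```

Compared with the 500MBps of a PCI 2.0 lane, the result is a 1.97X increase rather than a full 2X.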
There are four things to know about PCI 4.0. First, as expected, it is twice as fast as PCI 3.0. To achieve this performance, the symbol transfer rate is doubled from 8GT/s to 16GT/s, which doubles the data bandwidth from 7,880Mbps to 15,760Mbps. This translates to a unidirectional data-rate of 1,970MBps. (Figure 4.)
Second, AMD has announced it will only support PCI 4.0 on motherboards based on their top-of-the-line X570 chipset. Thankfully, the well-regarded Gigabyte Aorus Elite WIFI motherboard employs the X570 chipset and yet sells for only $200. See Figure 5.
Third, the current highest performing graphics card, NVIDIA’s RTX-2080, can only slightly exceed the bandwidth provided by PCI 3.0. PCI 4.0 performance will not be needed until, for example, a card with a pair of 2080-class GPUs becomes available. Figure 6 shows a dual-GPU board Apple has announced for its new Mac Pro.
Fourth, PCI 4.0’s role in supporting the newly announced, very high-performance M.2 4.0 SSDs is unclear. See Figure 7.
Figure 8 presents Sequential-Access performance data from a Corsair MP600 Gen4 SSD plugged into an M.2 4.0 slot. Both Read and Write performance, as expected, are very high. The MP600’s Read data-rate is almost 5,000MBps. Figure 8 also shows the performance of a Gen3 Samsung 970 EVO Plus in a PCIe 3.0 slot. As expected, it provides lower Sequential-Access Read/Write performance.
However, when both drives are tested in a multi-tasking Random-Access Read/Write situation, the Gen4 MP600 performs about the same as the cheaper, but far less sexy looking, Gen3 Samsung 970 EVO Plus.
Figure 9 presents a matrix of five PCIe generations by the number of PCIe bus lanes. (Intel is expected to jump directly to PCI 5.0.) Specifically, an x1 PCIe bus carries a single lane. An x4 bus carries 4 lanes; an x8 bus carries 8 lanes; while an x16 bus carries 16 lanes. Each matrix cell provides unidirectional bandwidth values. Thus, for example, a PCI 4.0 x4 bus provides a nearly 8,000MBps Read or Write connection.
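A matrix of this kind can be rebuilt from the per-lane arithmetic. The Python sketch below is my own reconstruction, not a copy of Figure 9. The names `GENERATIONS`, `LANES`, and `bus_mbps` are hypothetical, and the PCI 5.0 row assumes the published 32GT/s rate with 128b/130b coding. Results are unrounded, so they can differ by a few MBps from values built on the rounded 985MBps figure.

```python
GENERATIONS = {  # generation: (GT/s per lane, payload bits, coded bits)
    "1.0": (2.5, 8, 10),
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
    "4.0": (16.0, 128, 130),
    "5.0": (32.0, 128, 130),   # assumption: published PCIe 5.0 figures
}
LANES = (1, 4, 8, 16)


def bus_mbps(gen, lanes):
    """Unidirectional payload bandwidth in MBps for a bus of the given width."""
    rate, payload, coded = GENERATIONS[gen]
    return rate * payload / coded * 1000 / 8 * lanes


for gen in GENERATIONS:
    row = "  ".join(f"x{n}: {bus_mbps(gen, n):>6.0f}" for n in LANES)
    print(f"PCIe {gen}  {row}")
```

For a PCI 4.0 x4 bus this works out to roughly 7,877MBps, which is the "nearly 8,000MBps" connection mentioned above.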
More About PCIe
Where do lanes originate? Intel microprocessors typically host 16 or 24 lanes. Very high-performance CPU chips, such as the AMD Threadripper, can host up to 60 lanes.
Via a Direct Media Interface (DMI), an Intel microprocessor connects to an Intel Platform Controller Hub (PCH) chip that typically can host an additional 24 lanes. (AMD systems operate in a similar manner.)
Unfortunately, DMI bandwidth is equivalent to only four PCIe 3.0 lanes. An Intel hub (chipset) thus acts only as a high-speed switch for the additional lanes: every device attached to the hub shares that one link to the microprocessor. Not only is a PCIe bus from a hub bandwidth-limited, it also introduces additional latency.
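To put a number on that bottleneck, here is a rough Python sketch. It assumes all 24 hub lanes are PCIe 3.0 and all are busy at once, which is a worst case and not something the figures above state. The names `LANE_MBPS_GEN3` and `hub_oversubscription` are mine.

```python
# Payload rate of one PCIe 3.0 lane, one direction (about 985 MBps).
LANE_MBPS_GEN3 = 8.0 * 128 / 130 * 1000 / 8


def hub_oversubscription(hub_lanes, uplink_lanes=4):
    """Compare total hub lane bandwidth with the DMI-equivalent uplink."""
    downstream = hub_lanes * LANE_MBPS_GEN3   # assumption: every hub lane is Gen3
    uplink = uplink_lanes * LANE_MBPS_GEN3    # DMI treated as four Gen3 lanes
    return downstream, uplink, downstream / uplink


down, up, ratio = hub_oversubscription(24)
print(f"{down:.0f} MBps of lanes share a {up:.0f} MBps uplink ({ratio:.0f}:1)")
```

In practice few systems drive every hub lane flat out, but a pair of fast SSDs hanging off the hub can already fill the uplink.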
Assuming a high-performance laptop, the discrete GPU chip is connected by an x16 PCIe bus to the microprocessor. This connection will be PCI 3.0, or 4.0 on AMD Series 3000 chips.
Modern laptop and desktop computers offer an M.2 slot to support an SSD. To communicate with an SSD, an M.2 slot employs the NVMe (Non-Volatile Memory Express) protocol. An M.2 slot provides an x4 PCIe connection. (Figure 10.) Many systems now provide a second M.2 slot. Again, the connection will be either PCI 3.0 or 4.0.
Figure 10: M.2 Slot with Four Mounting-posts for Different Length cards. (Courtesy MSI).
A microprocessor’s hub chip also supports 5Gbps USB 3.0 (aka USB 3.1 Gen 1) and 10Gbps USB 3.1 (aka USB 3.1 Gen 2). Both speeds can be supported by a USB-C connector.
Also using a USB-C connector: the 40Gbps Thunderbolt-3 bus. Because Thunderbolt-3 employs 8b/10b coding, its actual data transfer rate is 20-percent less—only 32Gbps. A PCI 3.0 x4 bus provides the 32Gbps unidirectional transfer rate employed by Thunderbolt-3.
Now that we have a good understanding of the PCIe bus, and we know how PCIe 4.0 performs, it’s not unreasonable to see the PCIe 4.0 and PCIe 5.0 (and, yes, PCIe 6.0) specifications as offering mostly marketing performance. Kind of a “…move on folks, nothing to see here—YET” technology.