6G NETWORK DELIVERY FOR WIRELESS COMMUNICATIONS
To serve our bandwidth-hungry society, 2G, 3G, and 4G systems have used frequencies up to approximately 6 GHz, while 5G systems exploit the sub-6 GHz range as efficiently as possible and combine it with bands in the 24–100 GHz range.
Recently, developers have realized that the current frequency bands may not be enough to serve the growing demand; for example, an uncompressed ultra-high-definition video may reach 24 Gb/s, and some 3D videos may reach 100 Gb/s. As a result, 6G will move above 100 GHz, and the new radio will consider not only the traditional sub-6 GHz band but also little-explored frequency resources such as the mmWave and terahertz bands, in order to overcome spectrum scarcity and provide wide bandwidths from hundreds of megahertz to several gigahertz and even up to terahertz.
In recent years, a flurry of research activity has been reported on the use of multiple high-frequency bands for ultrafast transmission, which are recommended as promising solutions for 6G. Terahertz (THz) band communications have attracted even greater interest and higher expectations as a way to meet the ever-increasing demand for wireless communication speed. The characteristics of electromagnetic waves propagating in the THz band make it one of the key technologies for satisfying the increasing demand for Terahertz Wireless Data Communication (ThWDC). Future terabit super channels realized through ThWDC can be implemented using bipolar phase-shift keying, which gives the best BER (bit error rate) with today's technology. Communication is possible in the 0.01 to 0.5 THz frequency range, and the best transmission windows in this range have been found to be x1 = [0.01–0.05 THz], x2 = [0.06–0.16 THz], and x3 = [0.2–0.3 THz].
We can currently deliver a data link layer processor for 6G wireless communication that exceeds 100 Gbps, using custom algorithms and codes with dedicated link adaptation, fragmentation, aggregation, and hybrid automatic repeat request (HARQ). The main advantage is the small chip area required to fabricate the processor, which is at least two times lower than the area of low-density parity-check (LDPC) decoders. Our solution loses only ∼1 dB of gain when compared to high-speed LDPC decoders. Moreover, with only 2.38 pJ/bit of energy consumption at 0.8 V, one of the best results in the class of comparable implementations has been achieved. For the baseband realization, we propose a parallel sequence spread spectrum and channel combining at the baseband level. Although the sub-terahertz band of 200–300 GHz allows channel bandwidths of several gigahertz to be allocated and supports a data rate of 100 Gbps, the wide bandwidth and high data rate require demanding processing. All analog components have to support high gain and linearity over a wide spectrum at ultra-high frequencies. The digital parts, in turn, have to deal with data rates of 100 Gbps and a bit processing time of less than 10 ps. Thus, we face many difficulties at each design step of such a transceiver. Moreover, we need to keep in mind that wireless communication is used in battery-powered devices and has to operate within strict energy limits.
Ultra-high-speed wireless systems require either very high bandwidth or very high bandwidth efficiency. Cellular architectures like LTE or 5G focus on high bandwidth efficiency because the bandwidth in the available radio bands is limited. Increasing the bandwidth efficiency requires a corresponding increase in signal processing power, which dramatically increases the complexity of the baseband processor. In THz bands, bandwidth is not the limiting factor, which is why these bands are now considered for ultra-high-speed communications. With 25 GHz of bandwidth, a bandwidth efficiency of 4 b/s/Hz is sufficient to reach 100 Gbps, which enables less complex baseband processing. THz channels are known to be highly attenuating and require high-gain antennas and highly efficient amplifiers. However, manufacturing the amplifiers and antennas is challenging. At such high frequencies, it is impossible to connect the antenna using wire bonding due to reflections, crosstalk, and attenuation on the bonding wires. Therefore, the antenna has to be integrated into the RF frontend.
This leads, however, to interference with the metal layers of the ASIC and to gain reduction. Further problems arise in the design of signal processing within the baseband (BB). For such fast links, typical digital design is inefficient, because digital technology consumes too much power and silicon area. Thus, the algorithms have to be simplified and do not work as effectively as for slower communication systems. Currently, all problems in the RF frontend and BB design are shifted to higher layers. In such a case, FEC and the data link layer (DLL) must tackle these problems. FEC and DLL are expected to repair channel impairments and, additionally, errors caused by the lower layers. Thus, the design of FEC and DLL for the targeted system becomes complex and requires a large chip area, high power, and cooling. A mobile transceiver for the targeted application has to consume less than 1 W, or equivalently ∼10 pJ/bit at a data rate of 100 Gbps. This limit includes the whole RF frontend, BB, and DLL processing. For example, the RF frontend of a 240 GHz system with an output power of −4.4 dBm consumes 1.2 W. This is enough to establish a PSK-modulated link of ∼23 Gbps at a distance of 15 cm. Although the output power, data rate, and distance are smaller than targeted, the RF frontend alone exceeds the power assumed for the whole transceiver (RF + ADC/DAC + BB + DLL + FEC). Evidently, increasing the data rate and range will increase the consumed power significantly. The high-speed FEC decoder presented in [12]–[14] is another example of the challenges encountered in the design of 100 Gbps transceivers. Even though the implementation exploits all known techniques to improve LDPC decoding efficiency, it still needs 12 mm² of silicon and consumes 5 W. Because of the consumed power and the low flexibility of a fixed code rate, today's solutions have to be revised on an algorithmic level.
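The power figures above can be put on a common scale by converting them to energy per bit. The following is only a back-of-the-envelope sketch; the function name is our own, and the inputs are the values quoted in the text (1 W at 100 Gbps for the target, 1.2 W at ∼23 Gbps for the 240 GHz frontend).

```python
def energy_per_bit_pj(power_w: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/bit: 1 mW per Gbps equals 1 pJ/bit."""
    return 1000.0 * power_w / data_rate_gbps


# Whole-transceiver target: 1 W at 100 Gbps
target_pj = energy_per_bit_pj(1.0, 100.0)

# 240 GHz RF frontend alone: 1.2 W at roughly 23 Gbps
frontend_pj = energy_per_bit_pj(1.2, 23.0)

print(f"target:   {target_pj:.1f} pJ/bit")
print(f"frontend: {frontend_pj:.1f} pJ/bit")
```

Under these numbers, the frontend alone is several times over the per-bit budget intended for the complete transceiver.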
Although it is possible to reduce the power of the FEC processor down to ∼600 mW by applying ultra-high-scaled 7 nm technology [15]–[17], this still leaves little of the targeted 1 W budget for the rest of the transceiver. This becomes even more challenging when code rates lower than 13/16 and data rates > 100 Gbps are considered. Lower code rates support higher gain but require much more computation. Thus, the DC power will be significantly higher than the estimated 1 W for the complete transceiver.
We first refer to two RF frontend implementations that successfully demonstrate THz and sub-THz communication. As our processor is equipped with compatible interfaces, either of them can be used in our design. Afterward, the other analog components required by the data link layer (DLL) processor are also introduced.
300 GHz RF FRONTEND
The proposed RF frontend operates at 300 GHz and is able to transfer up to 64 Gbps using QPSK modulation over a distance of 1 m. The chipset is expected to operate at higher data rates or longer distances as well, but the setup is limited by practical constraints of the employed instruments. The design uses horn antennas with 24.2 dBi gain, and the chip is realized in 35 nm GaAs mHEMT technology with fT and fmax of more than 500 and 1000 GHz, respectively.
240 GHz RF FRONTEND
The chip is fabricated in IHP 130 nm SiGe BiCMOS technology with fT and fmax of 300 and 500 GHz, respectively [21], and operates at 240 GHz. The recently published revision [20] occupies an RF-RX bandwidth of 55 GHz and an RF-TX bandwidth of 35 GHz, and supports a data rate of ∼25 Gbps with BER ≈ 2e-4 over a distance of 15–30 cm. The transceiver uses a double-folded dipole antenna combined with a 40 mm × 40 mm plastic (polyethylene) lens. Such an antenna set provides 14 dBi of combined gain, while the transmitter alone delivers −0.8 dBm of output power.
ADC AND DAC UNITS
The next step after the RF-frontend up- and down-conversion is the ADCs and DACs. The design of analog-to-digital and digital-to-analog converters for data rates approaching 100 Gbps is difficult as well. Therefore, at this level, we apply one of two proposed improvements. Instead of processing the whole bandwidth in a single AD or DA converter, we split and merge the signal at the baseband level in the analog domain, before the AD and DA conversions. For this purpose, "parallel sequence spread spectrum (PSSS)" and "channel combining" can be employed. We explain both methods in the next two subsections.
CHANNEL SPLITTING AND COMBINING
The baseband channel splitter divides the analog baseband signal into parallel streams. Each stream can be processed by an individual baseband core, and thus by a separate ADC. The bandwidth and data rate are therefore also divided between the streams, which significantly reduces the demands on the ADC, DAC, and baseband. We have incorporated channel splitters and combiners with five and three outputs and inputs into our proposed design. The combiners and splitters are realized as a set of analog mixers (Fig. 2) that are fed with different local oscillator frequencies, e.g., 3.75 GHz, 7.5 GHz, and 15 GHz [22]. The chip is fabricated in IHP 130 nm SiGe BiCMOS technology and can be directly integrated with the previously mentioned 240 GHz frontend. The drawback of the latest 3-input combining chip is its limited baseband bandwidth of 'only' 6.5 GHz. This problem is partially resolved in the next 5-input release, but we will still need more bandwidth. Therefore, we also work on a parallel sequence spread spectrum (PSSS), which is described in the next subsection.
PARALLEL SEQUENCE SPREAD SPECTRUM (PSSS)
PSSS has been proposed as a spreading technique for different communication systems. The input data bits are multiplied by direct sequence spread spectrum (DSSS) sequences (e.g., Barker codes or m-sequences) and then added together in the time domain. The result is a multilevel amplitude waveform that carries N bits in N multilevel chips (multilevel symbols).
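The spreading step can be illustrated with a small software model. This is a minimal sketch under our own assumptions, not the analog receiver discussed in this section: it assumes a length-15 m-sequence generated from the primitive polynomial x^4 + x + 1, one cyclic shift of that sequence per data bit, and bipolar (±1) signalling. All function names are hypothetical.

```python
N = 15  # spreading sequence length


def m_sequence():
    """Bipolar length-15 m-sequence from a[n+4] = a[n+1] XOR a[n]."""
    bits = [1, 0, 0, 0]
    while len(bits) < N:
        bits.append(bits[-3] ^ bits[-4])
    return [2 * b - 1 for b in bits]  # map 0 -> -1, 1 -> +1


SEQ = m_sequence()


def psss_encode(bits):
    """Weight N cyclic shifts of SEQ by the data bits and sum them chip by chip."""
    d = [2 * b - 1 for b in bits]
    return [sum(d[k] * SEQ[(n - k) % N] for k in range(N)) for n in range(N)]


def psss_decode(chips):
    """Recover the bits by cyclic correlation with each shifted sequence."""
    corr = [sum(chips[n] * SEQ[(n - k) % N] for n in range(N)) for k in range(N)]
    # For an m-sequence, corr[k] = (N + 1) * d[k] - sum(d). The sum of all
    # data symbols is recovered from the chip sum, since sum(SEQ) is +1 or -1.
    total = sum(chips) * sum(SEQ)
    return [1 if c + total > 0 else 0 for c in corr]


data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]
chips = psss_encode(data)      # 15 multilevel chips carrying 15 bits
recovered = psss_decode(chips)
```

In this noiseless model, 15 bits go in and 15 multilevel chips come out, and correlation recovers the bits exactly.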
Because N bits are carried in N chips, the rate of PSSS-modulated data is unchanged. The spreading is performed in the frequency domain, but not in the time domain. In our case, the frequency-spreading ability of PSSS is a drawback rather than an advantage, because spreading 100 Gbps signals in the frequency domain leads to an impractically large bandwidth. The very large PSSS bandwidth is cut off anyway by the RF frontend and the PSSS modulator itself. The circuits have limited bandwidth, which is usually much lower than the bandwidth of the resulting 100 Gbps PSSS signal. Instead of the spreading, we concentrate on the analog implementation of the PSSS-based receiver, which has two significant advantages. Firstly, most of the PSSS receiver can be implemented in the analog domain, which consumes less power and less chip area, and works much faster than in the digital domain. Secondly, we sample the baseband signal in parallel with N ADCs. Thus, the sampling clock is reduced N times, where N is the spreading sequence length (N = 15). These two advantages of PSSS allow us to cross the 100 Gbps data rate barrier at the baseband level.
In the previous parts, we identified the main challenges of high-speed communication and described two RF frontends and two baseband techniques that can be used for communication in the THz band. To build a complete transceiver, we additionally need a data link layer (DLL) processor with FEC. The design of these elements is explained in this section.
SUPPORTED FUNCTIONALITY
The implemented DLL processor supports three essential functionalities. Firstly, it aggregates the data into 64 KB frames and divides them into 1 KB fragments. Each 1 KB frame fragment is protected by an individual cyclic redundancy check (CRC) code. Secondly, it uses interleaved Reed-Solomon (RS) FEC codes and supports a hybrid automatic repeat request type I (HARQ-I) scheme with selective fragment retransmissions. Thirdly, it reduces the overhead of HARQ-I by a dedicated link adaptation algorithm and an acknowledge compression scheme.
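The fragment-level CRC protection can be sketched in software as follows. This is an illustrative model only: the text does not state which CRC polynomial or fragment header layout the processor uses, so the CRC-32 from Python's standard `zlib` module and the `(address, payload, crc)` tuple are assumptions, and the function names are hypothetical.

```python
import zlib

FRAGMENT_SIZE = 1024  # 1 KB fragments, as described in the text


def fragment_frame(frame: bytes):
    """Split a frame into (address, payload, crc) fragments."""
    fragments = []
    for addr, off in enumerate(range(0, len(frame), FRAGMENT_SIZE)):
        payload = frame[off:off + FRAGMENT_SIZE]
        fragments.append((addr, payload, zlib.crc32(payload)))
    return fragments


def failed_fragments(received):
    """Return the addresses of fragments whose payload fails the CRC check."""
    return [addr for addr, payload, crc in received if zlib.crc32(payload) != crc]


# Example: a 64 KB frame with one corrupted fragment
frame = bytes(range(256)) * 256            # 65536 bytes
fragments = fragment_frame(frame)

addr, payload, crc = fragments[5]
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]   # flip one bit
fragments[5] = (addr, corrupted, crc)

to_retransmit = failed_fragments(fragments)
```

Only the fragments listed in `to_retransmit` would need to be requested again, rather than the whole 64 KB frame.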
SELECTION OF FEC ALGORITHM
The FEC method used for 100 Gbps applications has to be selected very carefully to avoid hardware and power overhead. All algorithms are individually parametrized to typical configurations; thus, each implementation has a different code rate and shows different error correction performance. At this step, we briefly introduce the hardware complexity; the correction performance is discussed later in this section. All implementations are tested in a Kintex-7 FPGA, keeping in mind that the resources needed for an FPGA implementation are correlated with the ASIC area and hardware complexity. The largest RS decoder achieves 2.6 times higher normalized throughput than the 1/2-rate Viterbi decoder at the cost of a 1.5 dB loss, and 17.6 times higher normalized throughput than the LDPC(10368, 8448) code at the cost of a 0.2 dB loss. The overall performance of the FPGA-implemented LDPC decoders at the selected code rate is poor. Both implementations require large resources and provide relatively low correction performance and decoding throughput. Later, we compare our RS ASIC implementation to fully parallel, fully unrolled ASIC LDPC decoders, which achieve higher decoding and correction performance but, due to their high resource demand, are not targeted at FPGA designs. The selected turbo decoder provides the highest gain at the lowest code rate among the selected decoders, but it requires 28 times more resources than the largest RS decoder. Moreover, turbo decoders have internal decoding dependencies, and the design of a high-speed parallel implementation in hardware is difficult. Considering a hardware implementation in a Xilinx Virtex-7 VX690T FPGA, which has 433,200 LUTs, we would require more than 18 development boards to support turbo codes at a decoding throughput of 100 Gbps, assuming 100% chip utilization, which cannot be achieved in reality.
For LDPC we need more than 11 boards and for Viterbi more than 2 boards, while RS needs only 1 development kit, which we have already demonstrated. We compare hard-decision RS with selected soft-decision algorithms, which are suited to operate at significantly lower code rates (e.g., 1/2, 1/3). Thus, the comparison may lead to false conclusions. The 8-bit RS codes are suited for low overhead, and code rates below 0.874 are rarely used (more in section V.J).
In our application, however, we target high-speed communication (≥ 100 Gbps) with low power demand, and therefore 1-bit quantization and low redundancy overhead are required. For other applications, a 1/2-rate LDPC decoder will probably be a better choice, especially when soft-decision decoding, low code rates, and high gain are desired. To give a better overview of the advantages of RS codes, we additionally compare LDPC, BCH, and RS codes at similar code rates with 1-bit quantized input. In such a case, the decoding conditions for all algorithms are normalized. At a packet error rate (PER) equal to 0.5 (AWGN, BPSK), the LDPC(64800,57600) code of the DVB-S2 implementation operates at ∼12% higher BER than the RS. The LDPC decoder uses up to 50 decoding iterations and is based on a powerful sum-product algorithm (SPA) with floating-point arithmetic. The tested LDPC algorithm works on binary quantized input data, like RS and BCH, but the internal decoding stages are represented by floating-point variables and are performed by SPA (please do not confuse it with bit-flipping). Such algorithms are usually used for software realizations only; for hardware, the min-sum approximation with fixed-point logic is commonly employed, and the number of decoding iterations is significantly lowered. Thus, the presented DVB-S2 decoder realized in software shows very good correction performance by comparison. The loss to the BCH decoder, which operates on a block length very similar to that of the RS, is higher: the BCH corrects up to 25% more bit errors. This situation changes when the AWGN channel is replaced with an error characteristic that generates single and short-burst errors. To demonstrate this, we prepared a Markov chain BER generator, which produces such an error characteristic.
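The text does not give the structure or parameters of the Markov chain BER generator. As an illustration of the general idea only, a two-state (Gilbert-Elliott style) model is sketched below; the state names, transition probabilities, and per-state error rates are all assumptions chosen for the example, not values from the original work.

```python
import random


def burst_error_pattern(n_bits, p_good_to_bad, p_bad_to_good,
                        ber_good, ber_bad, seed=0):
    """Generate a 0/1 error pattern from a two-state Markov chain.

    The chain starts in the 'good' state. In each state, bits are flipped
    independently with that state's bit error rate.
    """
    rng = random.Random(seed)
    bad = False
    errors = []
    for _ in range(n_bits):
        # State transition for this bit interval
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        elif rng.random() < p_good_to_bad:
            bad = True
        ber = ber_bad if bad else ber_good
        errors.append(1 if rng.random() < ber else 0)
    return errors


# Hypothetical parameters: rare, short bursts with a high in-burst error rate
pattern = burst_error_pattern(10_000, p_good_to_bad=0.001, p_bad_to_good=0.1,
                              ber_good=0.0, ber_bad=0.5, seed=1)
```

XOR-ing such a pattern onto a coded bit stream produces clustered errors, in contrast to the independent errors of an AWGN channel with hard decisions.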
Under this burst error characteristic, the RS decoder shows better performance than the BCH as well as the complex SPA-LDPC, and this is the main advantage of RS codes. RS codes, in general, are very efficient against burst errors. Although their coding gain for the AWGN channel and their operable code rate range are very limited, they have low complexity and achieve high decoding throughput in hardware and software. We use RS codes as the basis of our data link layer processor and show that this lightweight FEC can be used for a low-power, high-speed data link layer.
INTERLEAVED RS CODES
Although the selected hard-decision interleaved RS codes have limited correction performance, we favor them over LDPC for two reasons. Firstly, RS requires very few resources to support high-speed decoding. Secondly, the PSSS baseband processor delivers only binary-quantized bits and cannot support soft-decision LDPC decoding. Thus, for our application, RS is the more practical solution. Furthermore, we try to mitigate the gain loss of the RS codes by two means. Firstly, we interleave the decoders, and therefore we can correct longer burst errors. In general, symbol interleaving improves correction performance for burst errors. Long sequences of errors are distributed among multiple decoders, and therefore the effective number of erroneous symbols per decoder is reduced. This is important for our application because at 100 Gbps any synchronization error or voltage ripple destroys tens or even hundreds of consecutive bits. Thus, an extremely strong correction performance against burst errors is desired. Secondly, we designed dedicated fragmentation and link adaptation schemes that improve the interleaved RS coding efficiency.
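The effect of symbol interleaving on a burst can be illustrated with a short model. It is a sketch under simplifying assumptions: symbols are dealt round-robin to the decoders, the burst is counted in whole symbols, and no other errors are present. The function names are our own.

```python
def interleave(symbols, n_decoders):
    """Deal symbols round-robin to n_decoders lanes."""
    return [symbols[i::n_decoders] for i in range(n_decoders)]


def worst_lane_errors(burst_start, burst_len, n_symbols, n_decoders):
    """Largest number of erroneous symbols any single decoder sees for one burst."""
    marks = [1 if burst_start <= i < burst_start + burst_len else 0
             for i in range(n_symbols)]
    return max(sum(lane) for lane in interleave(marks, n_decoders))


n_symbols = 16 * 255   # one RS(255, k) block per decoder, 16 decoders

# A burst of 40 consecutive symbol errors
single = worst_lane_errors(100, 40, n_symbols, n_decoders=1)
spread = worst_lane_errors(100, 40, n_symbols, n_decoders=16)

# RS(255,223) corrects up to (255 - 223) // 2 = 16 symbol errors per block,
# so a burst of up to 16 * 16 = 256 symbols stays correctable in this model.
limit = worst_lane_errors(0, 256, n_symbols, n_decoders=16)
```

With one decoder, the 40-symbol burst exceeds the correction capability of RS(255,223); with 16 interleaved decoders, each decoder sees at most three of those symbols.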
Although soft-decision LDPC codes provide higher correction performance, we should note that ultra-high-speed LDPC decoders for ≥ 100 Gbps use hardware-optimized decoding schemes and usually show lower error correction performance than sum-product (SPA) decoding. In the worst case, we lose only ∼1 dB compared to soft-decision LDPC, considering the AWGN channel and dedicated data fragmentation for RS codes. We need to keep in mind that a similar fragmentation can be proposed for LDPC, and that it is possible to implement an LDPC decoder with higher gain. The interleaver size depends on the word size and the interfaces available for the targeted technology. For the Virtex-7 FPGA, we usually interleave the data between eight RS decoders. This gives a processing speed of 64 bit/clk and fits the bus size of the high-speed serial transceivers, which are used as the main communication interfaces. Thus, for Xilinx FPGAs, the interleaver size is fixed to eight, or to a multiple of eight when the transceivers are combined in parallel. This gives the best power and area efficiency because the data do not have to be restructured and fit the communication interfaces perfectly. In such a case, only routing resources are required to construct the interleaver. In the case of an ASIC implementation, we have more freedom and can select an arbitrary size. Based on the results shown in section V, we know that to reach 100 Gbps with RS(255,223) coding, we need to combine at least 7 decoders (7 × 14.7 Gbps = 102.9 Gbps). Although a single RS decoder achieves up to 14.7 Gbps at 2.1 GHz, this mode is not recommended due to the dissipated energy. The chip needs to run at the highest voltage (1.1 V), and all power optimization options have to be disabled (e.g., clock gating, static power optimizations). This is reflected in the energy efficiency, which will be no better than ∼15 pJ/bit. Therefore, we increase the number of decoders and reduce the voltage and clock frequency.
Moreover, we enable clock gating and optimize static power (more in section V). In such a case, a single RS decoder runs at only 3.15 Gbps, but the energy is optimized to ∼2.4 pJ/bit at 0.8 V. At this operating point, we would need to place at least 32 decoders in parallel to reach 100 Gbps.
We decided to use 16 decoders. With the energy optimization features enabled (e.g., clock gating and static power optimizations), each decoder delivers at most 9.1 Gbps, so we need to place at least 11 decoders to reach 100 Gbps.
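The decoder counts quoted for the three operating points follow from dividing the 100 Gbps target by the per-decoder throughput and rounding up. A minimal check, with a helper name of our own choosing:

```python
import math


def decoders_needed(target_gbps: float, per_decoder_gbps: float) -> int:
    """Smallest number of parallel decoders that reaches the target rate."""
    return math.ceil(target_gbps / per_decoder_gbps)


# Per-decoder throughputs quoted in the text for RS(255,223)
full_speed = decoders_needed(100, 14.7)    # 2.1 GHz, 1.1 V, no power optimizations
low_voltage = decoders_needed(100, 3.15)   # 0.8 V, energy-optimized
optimized = decoders_needed(100, 9.1)      # clock gating and static power optimization
```

The chosen array of 16 decoders therefore leaves headroom over the minimum of 11 at the optimized operating point.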
DATA AGGREGATION
Data aggregation is a widely used technique that significantly increases transmission performance in wireless systems. In our implementation, we set the minimal transmission frame length to 64 KB. Thus, we avoid frames shorter than 64 KB by merging the data when the system is fully loaded. This, in turn, reduces the total number of frames and of the frame preambles that are attached to each frame. In short, we reduce the transmission overhead. The performance improvement of our method depends on the size of the data units that are transmitted over the link. For example, the throughput is increased by 47% when a typical 1.5 KB Ethernet data size is considered. In such a case, the aggregation module merges 43 Ethernet frames into a single wireless frame that is transmitted over the air (43 × 1.5 KB = 64.5 KB).
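The aggregation rule can be modelled as a greedy merge of queued packets. This is a simplified sketch, not the hardware implementation: it assumes "64 KB" and "1.5 KB" mean 64,000 and 1,500 bytes, ignores preambles and headers, and flushes a short tail frame when the queue runs empty. The names are hypothetical.

```python
MIN_FRAME_BYTES = 64_000  # assumption: 64 KB read as 64,000 bytes


def aggregate(packets, min_frame_bytes=MIN_FRAME_BYTES):
    """Greedily merge queued packets into frames of at least min_frame_bytes."""
    frames, current, size = [], [], 0
    for packet in packets:
        current.append(packet)
        size += len(packet)
        if size >= min_frame_bytes:
            frames.append(b"".join(current))
            current, size = [], 0
    if current:
        frames.append(b"".join(current))  # flush a short tail frame
    return frames


# 86 Ethernet-sized packets of 1,500 bytes each
queue = [bytes(1500)] * 86
frames = aggregate(queue)
```

With these sizes, 42 packets (63,000 bytes) fall just short of the minimum, so each wireless frame carries 43 packets, matching the example in the text.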
DATA FRAGMENTATION AND SELECTIVE FRAGMENT REPETITIONS
Although the 64 KB aggregation scheme significantly improves transmission efficiency for short frames, the aggregated frames are more sensitive to bit errors. Thus, we need efficient FEC and ARQ mechanisms that recover and retransmit corrupted data. Due to the targeted processing speed of 100 Gbps, we use the simplest HARQ-I method, enhanced by selective fragment repetitions and link adaptation. In our case, the 64 KB frames are logically divided into 1 KB fragments, which have unique addresses and CRC sums. Thus, in case of bit errors, our ARQ retransmits 1 KB data fragments instead of retransmitting whole 64 KB frames. The selected 1 KB retransmission size is a trade-off between optimality and the simplicity needed for practical realizations. From the data delivery efficiency point of view, the size should be equal to the message length of the employed FEC method. Then, the processor retransmits only the defective code words, and the number of transmitted headers and CRCs remains low. In our case, we use a set of 16 interleaved RS codes with a variable message length in the range of 3568 B – 4048 B (16 × 223 B – 16 × 253 B). This size depends on the BER, which influences the overhead generated by the RS encoders (link adaptation, more in section III.F). Thus, we would have to adapt the retransmission size to the BER continuously. Such an approach, however, leads to a very complex implementation. The ARQ module would need to re-fragment the user data each time the FEC code rate is changed and would need to keep track of irregular fragment addressing. By fixing the size to 1 KB, we significantly reduce the complexity of the ARQ at the cost of reduced transmission efficiency. For BER < 3e-3 and RS(255,223), an efficiency degradation of ∼1% is caused by redundant fragment headers and CRCs. For such a low BER, retransmissions are infrequent. For BER ∈ (3e-3, 6e-3), we lose up to 7% of efficiency due to the fragment retransmission mismatch.
For example, the ARQ has to retransmit a single interleaved RS(255,223) block with a length of 3568 B (16 × 223 B), but in our scheme, we need to retransmit 5 × 1 KB (5120 B) in the worst case. For BER > 6e-3, the wireless link is down regardless of the selected fragmentation scheme. From the statistical point of view, the probability of error-free transmission of small fragments is higher than the probability of error-free transmission of long frames.
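The worst-case figure of five fragments follows from how a 3568 B block can straddle 1 KB fragment boundaries. The short check below counts the fragments touched by a block for every possible starting offset; it ignores fragment headers and CRC bytes, and the function name is our own.

```python
def fragments_touched(offset: int, length: int, fragment_size: int = 1024) -> int:
    """Number of fixed-size fragments overlapped by a block at a byte offset."""
    first = offset // fragment_size
    last = (offset + length - 1) // fragment_size
    return last - first + 1


block = 16 * 223   # one interleaved RS(255,223) block: 3568 B of user data

aligned = fragments_touched(0, block)
worst = max(fragments_touched(off, block) for off in range(1024))
worst_bytes = worst * 1024
```

An aligned block touches four fragments, while an unaligned one can touch five, which gives the 5120 B retransmission quoted above.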
LINK ADAPTATION
The link adaptation algorithm tracks the link quality and finds the trade-off between FEC redundancy and ARQ data repetitions. In short, the algorithm selects one of the RS(255, k) codes, where k is in the range of 223 to 253, so that the fragment error rate and the FEC overhead are balanced. To keep the retransmission rate low with low FEC overhead, we solve two inequalities in real time.
The first inequality checks whether the fragment error rate in the receiving stream is higher than the RS redundancy. If the inequality is not satisfied, then the code rate has to be decreased and a more robust RS(255, k − 2) code has to be used to reduce the retransmissions. This means that more redundancy is added to the data frames.
The second inequality increases the FEC code rate and reduces the redundancy when the fragment error rate at the increased code rate will be low enough to satisfy the first inequality. For this, we need to predict the number of erroneous frame fragments at the increased code rate represented by RS(255, k + 2) coding, which is relatively difficult to calculate: the processor decodes the data using RS(255, k), so predicting the fragment error rate for RS(255, k + 2) decoding is challenging. Thus, we estimate the number of erroneous frame fragments for the RS(255, k + 2) code from the RS block errors. We simply count the number of symbol errors in each RS block and compare it with s. After that, minimum filtering is applied to improve the stability of the communication. As the BER increases, the algorithm reduces the code rate of the RS coders; that is to say, more redundancy is added to the frames. The uncomplicated HARQ-I method combined with link adaptation and selective fragment repetitions achieves good efficiency, so we avoid HARQ-II and HARQ-III schemes. We have already shown that implementing HARQ-II and HARQ-III at the targeted data rate is challenging. In our case, we achieve up to 20 Gbps higher throughput due to the link adaptation and a higher gain due to the fragmentation.
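Since the two inequalities themselves are not reproduced in the text, the control loop can only be sketched under stated assumptions. In the hypothetical model below, the code is made more robust when the measured fragment error rate exceeds the redundancy fraction (255 − k)/255, it is relaxed when the worst symbol error count per RS block would still be correctable by RS(255, k + 2), and the minimum filter takes the most cautious proposal over a short window. The thresholds, the window length, and all names are our assumptions, not the authors' algorithm.

```python
from collections import deque

N = 255
K_MIN, K_MAX, K_STEP = 223, 253, 2


def correctable_symbols(k: int) -> int:
    """RS(255, k) corrects up to (255 - k) // 2 symbol errors per block."""
    return (N - k) // 2


class LinkAdapter:
    """Hypothetical RS(255, k) selector with a minimum filter."""

    def __init__(self, k: int = K_MIN, window: int = 4):
        self.k = k
        self.history = deque(maxlen=window)  # recent per-frame proposals

    def update(self, fragment_error_rate: float, max_symbol_errors: int) -> int:
        proposal = self.k
        redundancy = (N - self.k) / N
        if fragment_error_rate > redundancy:
            # Too many retransmissions: switch to a more robust code
            proposal = max(K_MIN, self.k - K_STEP)
        elif (self.k + K_STEP <= K_MAX
              and max_symbol_errors <= correctable_symbols(self.k + K_STEP)):
            # Observed errors would still be correctable with less redundancy
            proposal = self.k + K_STEP
        self.history.append(proposal)
        self.k = min(self.history)  # minimum filter: follow the most cautious proposal
        return self.k


adapter = LinkAdapter(k=253)
k_after_errors = adapter.update(fragment_error_rate=0.5, max_symbol_errors=10)
```

Because the filter follows the lowest recent proposal, the code rate drops immediately when the link degrades but recovers only after the window has filled with better proposals, which is one simple way to obtain the stability mentioned above.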
TRANSMITTER AND RECEIVER IMPLEMENTATION
The design has a 128-bit architecture, which means that 128 bits of data are processed in each clock cycle. Prototyped in a Virtex-7 FPGA, it is able to achieve ∼9.9 Gbps at a clock frequency of 156 MHz [10], [30]. When synthesized in 28 nm technology, we set the clock rate to 1.3 GHz, which gives a user data rate of 165 Gbps with RS(255,253) coding and 145.5 Gbps with RS(255,223) coding. Due to the similar TX and RX architectures, the transmitter and receiver each occupy the same chip area of ∼1.04 mm² in 28 nm and achieve the same clock speed and throughput. In fact, both the transmitter and the receiver contain complete transmitting and receiving hardware because of ARQ and acknowledge processing. The ARQ requires bidirectional communication even if the user data flow is unidirectional, and therefore the data link layer transmitter has a complexity similar to that of the receiver. Both units have a parallel array of sixteen RS encoders and decoders with an aggregated processing speed equivalent to 16 × 10.313 = 165 Gbps. All other processing is fast enough to handle the 165 Gbps data rate in a single thread.
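The quoted user data rates are consistent with a simple model in which the 128-bit datapath carries coded bits and only the fraction k/255 of them is user data. The check below uses that assumed model; the function name is our own.

```python
def user_rate_gbps(clock_ghz: float, k: int, n: int = 255, bus_bits: int = 128) -> float:
    """User data rate for a bus_bits-wide datapath carrying RS(n, k) coded data."""
    return bus_bits * clock_ghz * k / n


rate_253 = user_rate_gbps(1.3, 253)   # highest code rate, RS(255,253)
rate_223 = user_rate_gbps(1.3, 223)   # lowest code rate, RS(255,223)
per_decoder = rate_253 / 16           # share of each of the 16 decoders

print(f"RS(255,253): {rate_253:.1f} Gbps")
print(f"RS(255,223): {rate_223:.1f} Gbps")
```

Under this model, the per-decoder share at RS(255,253) comes out close to the 10.313 Gbps used in the aggregate figure above.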
ASIC SYNTHESIS AND LAYOUT
The design is fully implemented in VHDL and synthesized with the Genus software. All power and energy results are estimated with real signal activity files (VCD) and are obtained from the chip layout, considering typical process conditions. Each data word shifted into the chip is randomly generated and differs from the previous word in 50% of its bits. Thus, the measurements give a realistic overview of the power and energy consumption. To achieve the reported throughput of 165 Gbps and 4.47 pJ/bit, we performed the following netlist and layout optimizations:
The dual-port static RAM memories (FIFOs) needed for the RS implementation are replaced by flip-flop (FF) arrays. This solution sounds insane from the power and area point of view, but the memories are the main throughput bottleneck in our design. Moreover, planning a chip with memories is more difficult than placing pure logic alone. In our case, we need to place 64 memories, each with a size of 256 × 8 bits. Replacing the memories with FF arrays increases the clock speed from 600 MHz up to 2100 MHz, which corresponds to a throughput improvement from 67 Gbps up to 235 Gbps. The performance is increased, but we also increase the chip area from 0.57 mm² to 1.02 mm² and the power from 0.286 W up to 3.5 W.
In the next steps, power optimizations are performed. Mainly, we need to reduce the energy dissipated in the FF-arrays emulating the memory blocks.
In the next step, clock gating is added to reduce the very high dynamic power. In each clock cycle, we read and write just a single byte in each FF memory. This means that we access only ∼0.19% of the total memory registers in each clock cycle. Thus, we can significantly reduce the power by inserting clock gates and deactivating ∼99.81% of the memory registers. Although the clock gates increase the area by ∼0.02 mm² and reduce the clock by 600 MHz (2100 MHz → 1500 MHz), the power is reduced as desired, from the initial 3.5 W to 0.928 W.
In the next step, we reduce the static power dissipated by the chip. This is achieved by performing multi-threshold voltage optimizations. In short, for all critical paths, the transistors with the lowest voltage switching threshold are inserted, while for non-critical paths transistors with a high threshold and reduced leakage are used. This reduces the power from 928 mW to 602 mW.
After this step, the chip area remains almost unchanged, but the clock frequency is reduced by ∼200 MHz (1500 MHz → 1300 MHz). The layout of the chip is shown in Fig. 19. We use a doubled VDD-VSS power ring around the placed logic. The IO pads are excluded from the area and power analysis. We highlighted a single RS decoder entity and the code word memories that belong to it. The input memory is placed close to the chip edge because of the input signal routing.
The corrected code word, after fixing the evaluated error, is stored in the memory placed next to the decoder. It is possible to reduce the memory size by 25% by removing the bypass FIFOs, which are used to shift-out the originally received code word, in the case when the decoder cannot correct all bit errors.
ENERGY CONSUMPTION
As mentioned in the introduction, energy and power consumption are among the most critical parameters of high-speed transceivers. In our case, the energy and power depend on the channel BER and the selected FEC code. The energy is mostly consumed by the RS decoders, and it is indeed related to the code rate curve.
For BER < 1e-5, which corresponds to the highest code rate of RS(255,253), the processor consumes an extremely low DC power of 29.7 mW, equivalent to 0.22 pJ/bit. As the BER increases, the DC power also increases and saturates at 602 mW, or equivalently 4.47 pJ/bit. This high-power mode corresponds to the lowest code rate of RS(255,223). It should be noted that the energy depends on the number of Galois field multiplications and additions, which in the case of RS(255,k) decoding grows asymptotically on the order of n² and can be determined for our implementation.
VOLTAGE SCALING
Although the throughput of 165 and 145 Gbps for RS(255,253) and RS(255,223) satisfies our needs, the maximum energy consumption of 4.47 pJ/bit exceeds our targeted limit. As mentioned in the introduction, our goal is to design a complete 100 Gbps transceiver (RF + BB + DLL) within a 1 W power envelope. Assuming that the power is equally distributed between RF, BB, and DLL, we set a power limit of ∼333 mW for our DLL implementation. This corresponds to a maximum of ∼3.8 pJ/bit (information bit) with RS(255,223) coding. One workaround is to adjust the throughput, clock speed, and voltage in order to save some of the consumed energy. The voltage range for the targeted process is 0.8–1.1 V. Surprisingly, the clock speed of our design scales almost linearly with the voltage, which is not observed for LDPC decoders realized in comparable technologies. The energy per bit appears to be a linear function as well, but in the targeted range of 0.8–1.1 V, a quadratic function fits the points more precisely. In our case, we need to reduce the throughput to ∼115 Gbps at ∼1.01 V to meet the limit of ∼3.8 pJ/bit at BER ≈ 6.3e-2 with RS(255,223). The BER value of 6.3e-2 is the lowest achievable BER for the AWGN channel as well as the worst case from the energy consumption point of view. Assuming the lowest possible voltage of 0.8 V, the processor achieves 50.4 Gbps and consumes at most 2.38 pJ/bit.
Above, we presented a 28 nm data link layer processor for 100 Gbps wireless communication in the THz band. This processor uses lightweight interleaved RS codes and requires at least two times less chip area than LDPC decoders at the cost of ∼1 dB of gain. Additionally, we showed dedicated link adaptation, aggregation, fragmentation, and ARQ with selective fragment repetitions. In our case, these methods improve the user data throughput by up to 20 Gbps and the gain by ∼0.55 dB.
ASIC post-layout results show that the processor easily achieves 165 Gbps and 145 Gbps at 1.1 V with RS(255,253) and RS(255,223), respectively. Energy consumption is as low as 2.38 pJ/bit at 0.8 V with RS(255,223). The methods achieve a good trade-off between throughput, energy consumption, and error correction performance for applications that do not require maximal coding gain and soft-decision decoding. Additionally, we mentioned two novel baseband architectures, as well as two RF frontends capable of working in the THz band. Challenges to high-speed wireless transmission are addressed as well.