

Journal of Engineering Science and Technology Review 14 (3) (2021) 76 - 84

**Research Article** 

JOURNAL OF Engineering Science and Technology Review

www.jestr.org

# Low Latency NoC Switch using Modified Distributed Round Robin Arbiter

Ashok Kumar. K<sup>1,\*</sup>, P Dananjayan<sup>2</sup> and Karunakar Reddy Vanga<sup>1</sup>

<sup>1</sup>Department of ECE, Matrusri Engineering College, Hyderabad, India-500059

<sup>2</sup>St. Peters Institute of Higher Education and Research, Chennai, India-600054

Received 13 October 2020; Accepted 26 June 2021

### Abstract

Switch design is a critical issue in the on-chip communication network of an SoC because the switch largely determines power consumption, area and latency. The performance of a NoC switch can be enhanced by reducing the number of input and output ports. The proposed switch modifies and upgrades the Minimal Adaptive Circuit Switching (MACS) NoC to improve both performance and flexibility. To achieve low latency and low area overhead, a switch based on a modified distributed round robin arbiter is introduced, designed with store-and-forward packet switching and minimal adaptive routing. The proposed NoC switch is implemented using Xilinx ISE 14.7 on a Virtex-6 (V6) FPGA and the results are compared with MACS. The proposed design shows a 58% higher operating frequency and a 25% lower area overhead than MACS.

Keywords: NoC; Arbiter; FIFO buffer; Adaptive Routing; FPGA

### 1. Introduction

In recent years, Field Programmable Gate Arrays (FPGAs) have improved massively in size and performance, with thousands of logic elements fabricated on a chip. According to Moore's law, the number of devices on a chip doubles every 18 months. Chip Multi Processors (CMPs) have become popular with a limited number of processors, but their performance degrades as the number of processors and interconnections increases. Erno Salminen et al [1] presented an overview of bus-based communication in system on chip (SoC) and concluded that it offers limited flexibility as the number of processors grows. To improve flexibility, the Network on Chip (NoC) paradigm was proposed, improving system parameters such as scalability, flexibility, bandwidth and reliability. Typically, a NoC is composed of switches and processing elements (PEs), where each PE is connected to a dedicated switch. Hence, performance depends on the intercommunication of PEs and on the data transfer between source and destination PEs through intermediate switches. A generic NoC switch consists of four directional ports and a local port to transfer data from source to destination ports.

Dally et al [2] and Benini et al [3] introduced the concept of NoC and showed improved latency compared with buses and dedicated wires. The design and performance of a NoC switch depend on parameters such as topology (2D/3D, regular/irregular), arbitration (fixed, round-robin), buffering strategy (buffered/buffer-less), switching (circuit/packet) and routing (deterministic/minimal). A lot of work has been done on different topologies (i.e. mesh, torus, ring, fat-tree and others) [4-6]. Wang et al [5] compared them and concluded that the mesh topology is suitable for NoC because it has better performance than other topologies. Despite the drawbacks of the mesh topology, most switches use mesh-based designs because the architecture is flexible when data is transferred between shared resources.

\*E-mail address: kashok483@gmail.com

ISSN: 1791-2377 © 2021 School of Science, IHU. All rights reserved. doi:10.25103/jestr.143.09

The NoC switch is characterised by its design parameters. Packet switching (PS) [7-9] provides higher bandwidth and better resource utilization than circuit switching (CS) because communication channels are distributed dynamically. The disadvantages of PS are the size and processing of the buffers, which affect the overall area, latency and saturation point of a NoC-based SoC: as the network grows, throughput stops increasing and latency grows exponentially. CS [10-14] provides guaranteed throughput and reduces latency. Its disadvantages are lower transmission rates and higher channel setup latency in larger networks, which limit bandwidth and throughput. Mixed switching [15, 16] was proposed to avoid the disadvantages of both PS and CS. Moderassi et al [16] showed a 19% reduction in packet latency but a 10% increase in area utilization for the mixed design.

The performance of a NoC is improved by advancing its design parameters. A NoC switch should improve bandwidth utilization and data transfer speed when a larger number of PEs is used: as communication requests and the number of paths increase, bandwidth utilization increases. A generic NoC switch with deterministic X-Y routing does not exploit the benefits of the mesh topology [17], because the packet route is fixed by the source and destination switches, which increases packet latency. In adaptive routing [18], a data packet can be routed in multiple directions on the basis of various parameters (e.g. congestion awareness, shortest path). Chiu et al [19] proposed a partially adaptive routing algorithm, the Odd-Even turn model, which works without virtual channels. Nasiri et al [20] proposed fully adaptive routing with virtual channels (VCs) for transferring data packets, achieving latency improvements of 65.3% and 35.4%.

In this work, a NoC switch for a 2D mesh topology is proposed with minimal adaptive routing and store-and-forward switching, where each switch is attached to a PE pair called a producer-consumer pair. The producer-consumer pair communicates without an additional switch involved, which reduces the number of switches required, improves bandwidth utilization and provides more flexibility than other designs. The switch creates dynamic links between PE pairs in the following steps: providing an efficient communication path for the PE pair through adaptive routing, improving channel latency through a modified distributed round robin arbiter, and achieving higher bandwidth utilization through an FSM module. To analyse area, power and delay, the switch is simulated in Verilog HDL and implemented on a Virtex-6 FPGA device, and the results are compared with recent designs. The remainder of the paper is organized as follows: Section 2 reviews previous work on NoC switches. Section 3 discusses the internal structure of the improved NoC switch. Section 4 describes Finite State Machine (FSM) routing. Section 5 presents results and discussion with a comparison against recent work. Section 6 concludes and indicates the future scope.

### 2. Previous Work

Dally et al [2] proposed an efficient NoC switch with packet switching and discussed the wire overhead of packets during transmission; they reported reduced latency and increased bandwidth at a 6.6% area overhead. D. Wiklund and D. Liu [21] discussed various switching techniques with various topologies and showed that the 2D mesh topology is best suited for NoC. Wiklund et al [13] proposed a modified switching technique, packet connected circuits (PCC). A round robin arbiter selects the best path, which is locked until the packet reaches the destination. As a result of this round robin selection, SOCBUS provides fewer transmission channels. Zheng et al [22] proposed a circuit-switched Time Division Multiplexing (TDM) switch to increase the number of transmission channels, which is useful in the construction of communication channels. Hence this NoC provides more transmission paths than SOCBUS, but TDM suffers from higher packet delay and lower efficiency.

Bobda et al [23] presented a reconfigurable multi bus (RMB) on chip (RMBoC) based on circuit switching to reduce communication bottlenecks. The RMBoC switch is implemented with horizontal and vertical bus controllers that manage communication requests using surrounding X-Y (S-XY) routing on both 1D and 2D mesh topologies. S-XY routing causes high packet delay, increased area overhead and low operating frequency in RMBoC. Carara et al [24] provided a 4×4 switch with hardware and software designs, using mixed switching with a fixed packet size in the buffered cells. They proposed double parallel paths per switch, i.e. two communication paths in every direction from source switch to destination switch, transferred data using multi-layer sessions, and showed enhanced bandwidth allocation through improved resource allocation. As a result, more communication requests could be served, bandwidth utilization increased and the area consumed was reduced. However, the switch of Carara [24] uses X-Y routing and, being a fixed architecture, is difficult to customize. Abelardo Jara et al [25] proposed the SCORES architecture to make the NoC switch architecture more customizable. It has a stream-based model for data transfer between the computation modules. SCORES exposes tunable architecture parameters, which give better results than recent switch designs. Its producer interface and consumer interface pair improves operating frequency and area consumption, but the 1D topology of the switch design leaves fewer communication channels available, resulting in lower bandwidth utilization.

Jing Lin et al [26] proposed express circuit switching with buffer-less NoC (EBLESS) to enhance bandwidth utilization. The design provides low area, reduced power consumption and improved latency over dedicated links, because the switch design has extra connections. EBLESS showed 19.43% lower packet latency than a buffered NoC, but provides a smaller saturation point. Lusala et al [15] provided an SDM-TDM based circuit-switching NoC to improve the saturation point. This NoC shares communication links among multiple connections, which increases the number of links and the bandwidth utilization, but packet latency and area requirement increase due to X-Y routing. Teimouri et al [27] proposed partial circuits for the switch, which help reduce the area requirement; the network is divided into two sub-networks, one dealing with packet switching and the other with circuit switching. Compared with a conventional NoC, this switch showed 25% and 37% reductions in power consumption and packet latency respectively, but a 2% increase in area overhead due to X-Y routing and decreased bandwidth utilization because not all available paths are used. R. Kumar et al [28] proposed the MACS switch to use the available links efficiently and increase bandwidth utilization. Every switch with a PE pair uses shortest-path routing, giving a $6 \times$ improvement in saturation point, a $2 \times$ to $7 \times$ improvement in link setup latency and a $1.7 \times$ to $2 \times$ reduction in area consumption compared with the generic NoC switch. However, its distributed round robin arbiter increased the area overhead and provided less bandwidth utilization. To guarantee service and throughput, Hansson et al. [30] proposed a flit-synchronous Time Division Multiplexing (TDM) NoC called aelite, which uses a TDM arbiter to achieve global synchronization of the NoC. Stefan et al. [31] presented dAElite, a new configuration infrastructure based on TDM that provides guaranteed services at low cost.

To eliminate communication bottlenecks, a NoC switch is proposed based on a parallel distributed round robin arbiter and store-and-forward switching. This work is derived from MACS, Lusala and Carara: it enhances the MACS NoC switch with adaptive routing, which selects the best path with increased link utilization, and with a modified distributed round robin arbiter, which reduces channel latency. The distributed round robin arbiter is enhanced by reducing the size of the multiplexer and counter and by grouping the multiplexer-counter blocks in parallel. Adaptive routing is executed after the arbiter has selected the input and output ports, and the routing mechanism is the same as that proposed in MACS [28]. In MACS, the input and output ports are built from registers, and data transmission is controlled by signal forwarders, which delays transmission while the availability register and the connectivity register are checked. This overhead is reduced in the proposed work by using FIFO and FSM controllers.



Fig. 1. 3×3 NoC design with 2D Mesh topology where each switch has 2 PEs and four neighbour switches

### 3. Switch Architecture

Fig. 1 shows a $3 \times 3$ NoC architecture with 2D mesh topology, in which switches with PE pairs are arranged in rows and columns, making communication between switches simple. Each switch is connected to a producer-consumer pair through local ports (left local and right local) and to its neighbouring switches (east, west, north and south). The switch ports have bi-directional channels, i.e. an input port and an output port each, and data is transferred between them through a crossbar switch. Each input port receives data from a neighbour switch, whereas each output port sends data to a neighbour switch when that switch is available [29]. The FIFO buffer of an input port raises a request based on its status, and requests are served in a round robin manner. The request is forwarded from neighbour switch to neighbour switch by the adaptive routing algorithm until it reaches the destination switch.
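The port layout of the mesh can be illustrated with a short Python sketch. It is only an illustration: the function name `neighbour_ports`, the `(x, y)` addressing with `(0, 0)` in the south-west corner, and the port names are assumptions made here, not part of the published design.

```python
def neighbour_ports(x, y, cols=3, rows=3):
    """Return the directional ports of switch (x, y) that reach a neighbour.

    Assumption: x grows eastwards and y grows northwards.
    """
    candidates = {
        "east": (x + 1, y),
        "west": (x - 1, y),
        "north": (x, y + 1),
        "south": (x, y - 1),
    }
    # Keep only the neighbours that fall inside the mesh boundary.
    return {
        port: pos
        for port, pos in candidates.items()
        if 0 <= pos[0] < cols and 0 <= pos[1] < rows
    }


# Every switch additionally has two local ports for its producer-consumer pair.
LOCAL_PORTS = ("left_local", "right_local")
```

In this sketch the centre switch `(1, 1)` of the 3×3 mesh reaches four neighbours, while a corner switch such as `(0, 0)` reaches only two; the two local ports are present on every switch.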

The NoC switch is proposed with a PE pair to improve bandwidth utilization and increase the speed of data packet transfer. Each input port is equipped with a buffer that stores the data packet temporarily to avoid deadlock. In addition to packet switching, FSM-based routing is proposed for data transfer from the source PE to the destination PE. The improved design is implemented for a 3×3 mesh-based NoC and achieves low latency.

#### 3.1 Operation of Switch

Fig. 2 shows the internal structure of a single switch with 6×6 input and output ports. The input port of the switch starts transmitting data to the output port once the entire data packet is available from the source PE. The buffer of the input port stores the entire data packet, so the packet never experiences deadlock. A Finite State Machine (FSM) controller controls the data transfer based on the buffer status. When only one port is ready to transfer, the data packet passes from the input port to the crossbar directly. If two or more ports are ready to transfer data, the contention is resolved by the arbiter, which provides the grant signal to one of the ports based on the scheduling algorithm [33]. The crossbar switch transfers the data packet to the output port of the switch. To avoid congestion delay, the acknowledgment signal from the arbiter is given to the crossbar. Section 3.2 describes the FIFO buffer for the input and output ports, Section 3.2.1 the FSM controller for data transmission, and Section 3.2.2 the modified distributed round robin scheme, which selects among the arbitration ports.

#### 3.2 FIFO Structure

A buffer stores the data packets arriving at a port from a neighbour port. A data packet needs to be stored in the buffer when the neighbour switch is busy; otherwise the packet would be dropped or lost on the interconnection links. The size of the buffer depends on the size of the data packet. A full buffer indicates that the entire data packet has been received, and further data must wait at the corresponding port. An empty buffer indicates that the entire data packet has been sent towards the destination. If the buffer of an input port is full, that port is given high priority among the arbitration ports.
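The full/empty behaviour of the port buffer can be sketched in a few lines of Python. This is a behavioural illustration only: the class name `PortFifo` and the choice of making the depth equal to the packet length in words are assumptions, and the actual design is written in Verilog HDL.

```python
from collections import deque


class PortFifo:
    """Fixed-depth FIFO with full/empty status flags (illustrative)."""

    def __init__(self, depth):
        self.depth = depth      # assumed equal to the packet length in words
        self._words = deque()

    @property
    def full(self):
        # Full means the whole packet has been received.
        return len(self._words) == self.depth

    @property
    def empty(self):
        # Empty means the whole packet has been sent on.
        return not self._words

    def write(self, word):
        # A full buffer refuses new data; the sender must wait at its port.
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        # Reading from an empty buffer yields nothing.
        return None if self.empty else self._words.popleft()
```

A port whose `full` flag is set would be the one that raises a high-priority request to the arbiter.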



Fig. 2 Architecture of Single Switch with Bi-Directional Ports

##### 3.2.1 FSM Controller

The FSM controller manages data transfer depending on the FIFO status signals and the requests from the ports. It uses control signals such as request, acknowledgement, data in and grant/deny together with the FIFO. Each input and output port has an FSM controller along with its FIFO.



Fig. 3 Input/ Output Port Structure for 8bit NoC switch

The transfer of data packets is controlled by the FSM controller. The neighbour output port sends a request to the input port, and the controller returns an acknowledgement depending on the FIFO status. Data transfer then proceeds using the data in, receive/deny and buffer full/empty signals. The read/write operations of the FIFO are controlled by the FSM controller: the status signals report whether the FIFO is empty or full, and the control signals direct the read/write operations. Fig. 3 shows the input/output port structure and the controller-based data transmission and reception. In MACS [28], the ports are built from registers and data communication is controlled by two separate logic blocks, the external signal forwarder (EXSIF) and the internal signal forwarder (INSIF). During data transmission in MACS, EXSIF and INSIF check the port against the availability register, whereas in the proposed work data is transferred according to the state of the FSM controller. For link establishment, the EXSIF and INSIF control blocks depend on the connectivity register. In this work, data transfer depends only on the FIFO status: the FSM controller establishes a link between the ports and is responsible for the establishment, maintenance and release of that link.
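The request/acknowledge handshake described above can be sketched as a small state machine in Python. The state names (`IDLE`, `RECEIVE`, `READY`), the class `InputPortController` and its method names are hypothetical and chosen here for illustration; the sketch only assumes that a request is acknowledged when the FIFO is empty and that the port becomes ready once the FIFO is full.

```python
from enum import Enum, auto


class PortState(Enum):
    IDLE = auto()     # waiting for a request from the neighbour output port
    RECEIVE = auto()  # request acknowledged, words are written into the FIFO
    READY = auto()    # FIFO full: the whole packet is stored


class InputPortController:
    def __init__(self, depth):
        self.depth = depth
        self.fifo = []
        self.state = PortState.IDLE

    def request(self):
        """Neighbour asks to send; acknowledge only if the FIFO is free."""
        if self.state is PortState.IDLE and not self.fifo:
            self.state = PortState.RECEIVE
            return True   # acknowledgement
        return False      # deny

    def data_in(self, word):
        """Write one word; returns False when the word is not accepted."""
        if self.state is not PortState.RECEIVE:
            return False
        self.fifo.append(word)
        if len(self.fifo) == self.depth:
            self.state = PortState.READY
        return True

    def drain(self):
        """Read the stored packet out and release the link."""
        packet, self.fifo = self.fifo, []
        self.state = PortState.IDLE
        return packet
```

The point of the sketch is that link establishment, maintenance and release are all decided from the FIFO status alone, with no separate availability or connectivity register.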

##### 3.2.2 Modified Distributed Round Robin Scheme

When several input ports assert a request signal, the crossbar switch is configured according to the arbiter's decision. Many arbitration methods have been proposed. Round robin is a scheme that provides high fairness and is simple to implement. In each cycle, one port is granted based on an updated priority in which the most recently serviced port has the lowest priority. Round robin is simple when the number of input requests is small, but its complexity increases with the number of ports. To reduce this complexity, the port requests are handled in a distributed manner. The arbiter assigns the highest priority to a port when its FIFO is full and masks the port when its FIFO is not full. Fig. 4 shows the distributed round robin scheme with a cyclically rotating priority. The multiplexer output is used as the enable signal that controls the counter: if the enable signal is low the counter stops counting, and otherwise it counts. The counter output drives the multiplexer select lines, and the high-priority request is passed to the output section as a grant signal. In MACS [28], the efficiency of the arbiter drops as the number of inputs and outputs increases. In the proposed work, the multiplexer is reduced to $2 \times 1$ and the counter to 1 bit, and identical blocks are connected in parallel for all inputs. As a result, the efficiency of the arbiter does not drop even when the number of inputs and outputs increases. The arbiter is faster because requests and grants are handled in parallel groups of two. The modified distributed round robin arbiter (MDRRA) reaches a clock frequency of 1338 MHz, better than the previous arbiters; the comparison with the distributed round robin arbiter (DRRA) is shown in Table 1. It also shows higher fairness than DRRA because all requests are serviced periodically within a fixed amount of time. Because the requests are distributed, each request is serviced according to its priority within a shorter period than in DRRA. If six input ports request service, three $2 \times 1$ distributed round robin blocks are enabled. The three blocks are serviced in round robin fashion, so all requests are serviced periodically.
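A behavioural sketch of this arbitration scheme is given below in Python. It is one possible reading of the description, not the authors' Verilog: each $2 \times 1$ block keeps a 1-bit priority counter, all blocks evaluate their request pair independently, and a block-level pointer rotates among the blocks. The class names, the block-level pointer and the assumption that the request inputs have already been masked by the FIFO-full status are all introduced here for illustration.

```python
class TwoInputBlock:
    """2x1 round-robin block: a 1-bit counter selects the priority request."""

    def __init__(self):
        self.counter = 0  # 1-bit priority pointer

    def select(self, req0, req1):
        """Return the local index (0 or 1) of the winning request, or None."""
        reqs = (req0, req1)
        for offset in range(2):
            idx = (self.counter + offset) % 2
            if reqs[idx]:
                return idx
        return None

    def served(self, idx):
        # The request just served gets the lowest priority next time.
        self.counter = (idx + 1) % 2


class ModifiedDistributedRoundRobin:
    def __init__(self, num_ports=6):
        assert num_ports % 2 == 0, "sketch assumes an even number of ports"
        self.blocks = [TwoInputBlock() for _ in range(num_ports // 2)]
        self.next_block = 0  # block-level rotating pointer (assumption)

    def arbitrate(self, requests):
        """Grant one port per cycle; returns the granted port index or None."""
        # Every block evaluates its own pair of requests independently.
        local = [
            blk.select(requests[2 * i], requests[2 * i + 1])
            for i, blk in enumerate(self.blocks)
        ]
        n = len(self.blocks)
        for offset in range(n):
            b = (self.next_block + offset) % n
            if local[b] is not None:
                self.blocks[b].served(local[b])
                self.next_block = (b + 1) % n
                return 2 * b + local[b]
        return None
```

With all six ports requesting continuously, this sketch grants the ports in the order 0, 2, 4, 1, 3, 5, so every request is serviced once within six cycles, which matches the periodic servicing described in the text.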

| Ports | DRRA [28] |  |  |  |  | MDRRA |  |  |  |  |
|-------|-----------|---|---|---|---|-------|---|---|---|---|
|       | Occupied slices | LUT-FF pairs | IOBs | Delay (ns) | Power consumption (mW) | Occupied slices | LUT-FF pairs | IOBs | Delay (ns) | Power consumption (mW) |
| 5×5   | 3  | 7  | 12 | 1.11  | 3.53  | 3  | 3  | 15 | 0.74 | 3.61  |
| 6×6   | 4  | 7  | 14 | 1.11  | 3.57  | 3  | 3  | 17 | 0.74 | 3.79  |
| 7×7   | 5  | 8  | 16 | 1.12  | 3.87  | 4  | 4  | 20 | 0.74 | 5.32  |
| 8×8   | 5  | 8  | 18 | 1.13  | 4.03  | 4  | 4  | 22 | 0.74 | 5.54  |
| 9×9   | 6  | 10 | 20 | 1.142 | 4.18  | 3  | 5  | 25 | 0.74 | 6.29  |
| 10×10 | 5  | 10 | 22 | 1.147 | 4.35  | 5  | 5  | 27 | 0.74 | 7.56  |
| 16×16 | 8  | 13 | 34 | 1.18  | 7.38  | 8  | 8  | 42 | 0.74 | 11.52 |
| 32×32 | 16 | 23 | 58 | 1.20  | 14.44 | 16 | 16 | 82 | 0.74 | 22.72 |

Table 1. Performance comparison of DRRA and Modified DRRA
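The delay values in Table 1 can be related to clock frequency with a short calculation. The sketch below assumes that the maximum frequency is simply the reciprocal of the reported delay, which may not be exactly how the quoted 1338 MHz figure was obtained; the variable names are illustrative.

```python
def max_frequency_mhz(delay_ns):
    """Clock frequency bound implied by a critical-path delay in ns."""
    return 1e3 / delay_ns


drra_6x6_delay_ns = 1.11   # Table 1, DRRA [28], 6x6 ports
mdrra_6x6_delay_ns = 0.74  # Table 1, MDRRA, 6x6 ports

f_drra = max_frequency_mhz(drra_6x6_delay_ns)
f_mdrra = max_frequency_mhz(mdrra_6x6_delay_ns)
speedup = f_mdrra / f_drra
print(f"DRRA: {f_drra:.0f} MHz, MDRRA: {f_mdrra:.0f} MHz, ratio: {speedup:.2f}")
```

Under this assumption the 0.74 ns delay corresponds to roughly 1351 MHz, close to the quoted 1338 MHz given that the table delays are rounded, and the MDRRA is about 1.5 times faster than the DRRA for 6×6 ports.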

### 4. Finite State Machine Routing

Adaptive routing is used to transfer a data packet from the source port to the destination port [32]. The routing is based on finding the shortest distance through intermediate switches. In a 2D mesh, the switch address consists of an X-coordinate and a Y-coordinate, which are used to identify the best path. The number of shortest routes is $\frac{(\Delta X + \Delta Y)!}{(\Delta X)! \times (\Delta Y)!}$, where $\Delta X = |X_{destination} - X_{source}|$ and $\Delta Y = |Y_{destination} - Y_{source}|$; $X_{source}$ and $Y_{source}$ are the coordinates of the source node, and $X_{destination}$ and $Y_{destination}$ are those of the destination node. Adaptive routing finds the best path for a data packet travelling from the neighbour switch to the destination switch. The crossbar switch is a combination of multiplexer and demultiplexer circuits $[(6 \times 1) - (1 \times 6)]$: the multiplexer receives the data packet from the input port, and the demultiplexer forwards it according to the selection lines, which are driven by the grant signals of the distributed round robin arbiter.
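The route-counting expression and the choice of minimal next hops can be made concrete with a small Python sketch. The function names and the `(x, y)` coordinate convention (x grows eastwards, y grows northwards) are assumptions for illustration, and absolute coordinate differences are used so that the factorials are defined.

```python
from math import comb


def shortest_path_count(src, dst):
    """Number of minimal routes between two switches in a 2D mesh."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    # (dx + dy)! / (dx! * dy!) is the binomial coefficient C(dx + dy, dx).
    return comb(dx + dy, dx)


def minimal_next_hops(cur, dst):
    """Output directions that bring a packet one hop closer to dst."""
    hops = []
    if dst[0] > cur[0]:
        hops.append("east")
    if dst[0] < cur[0]:
        hops.append("west")
    if dst[1] > cur[1]:
        hops.append("north")
    if dst[1] < cur[1]:
        hops.append("south")
    return hops
```

For example, between opposite corners of the 3×3 mesh, `(0, 0)` and `(2, 2)`, there are six minimal routes, and at the source both `east` and `north` are valid next hops; a minimal adaptive router can pick whichever of the two output ports is available.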



Fig. 4 Distributed Round Robin Arbiter with 6 Request and 6 Grant Signals

The crossbar switch transfers data packets from the input port to the output port with the help of an efficient multiplexer-demultiplexer pair, which provides better performance than other crossbar designs. The multiplexer selects the high-priority input port and passes its data to the demultiplexer. The demultiplexer selects the output port, which passes the data to the input port of the neighbour switch. This crossbar is simpler and easier to implement for NoC than other designs.
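The multiplexer-demultiplexer datapath can be illustrated with a minimal Python sketch. The function `crossbar_transfer` and its argument names are hypothetical; the sketch assumes the multiplexer select comes from the arbiter grant and the demultiplexer select from the routing decision.

```python
def crossbar_transfer(inputs, grant, out_sel, num_ports=6):
    """Route the granted input word to one output port.

    inputs  : list of words waiting at the input ports
    grant   : index of the input port granted by the arbiter (mux select)
    out_sel : index of the output port chosen by routing (demux select)
    """
    word = inputs[grant]          # 6x1 multiplexer picks the granted input
    outputs = [None] * num_ports  # 1x6 demultiplexer, idle outputs carry nothing
    outputs[out_sel] = word
    return outputs
```

For instance, `crossbar_transfer(list("abcdef"), grant=2, out_sel=4)` places the word `"c"` on output port 4 and leaves the other outputs idle.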



Fig. 5 Multiplexer and de-multiplexer based crossbar

Fig. 6 shows data routing by a finite state machine in four stages: request stage, grant/mask stage, data transfer stage and release stage. In the request stage, the source switch sends a request towards the destination switch through the intermediate (neighbour) switches, each of which compares the address of the destination switch with its own address to check whether they match [34]. The request is also checked to determine whether it matches one of the local ports (P0 or P1). If not, it is sent to the next switch, and the process repeats until the request locates the destination switch. While the request passes through intermediate switches, each switch finds the shortest path to the destination switch. When the shortest path is available, the request is forwarded to the grant/mask stage; otherwise it returns to the initial stage. When both local ports (P0 and P1) offer a shortest path, the port is selected on the basis of its FIFO status. In the grant/mask stage, a request that has not received a grant signal from the distributed round robin arbiter waits until its buffer is full; otherwise it moves to the next stage. The arbiter provides grant signals on the basis of the FIFO buffer status. In the data transfer stage, the entire data packet is transferred to the destination PE. Once the data transfer completes, the routing state returns to the initial stage for the next data packet. FSM routing ensures data packet transfer without data loss through the use of store-and-forward switching, and prevents deadlock by using the buffer status in the distributed round robin arbiter. FSM-based routing is proposed with the shortest-path algorithm to improve data transfer from the source PE to the destination PE; it also checks the availability of channels before transferring data.
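The stage transitions described above can be summarised as a transition function. The Python sketch below is an interpretation of the text: the state names, the explicit `INITIAL` state and the boolean condition names (`new_packet`, `path_available`, `granted`, `transfer_done`) are assumptions introduced for illustration and are not taken from Fig. 6.

```python
from enum import Enum, auto


class RouteState(Enum):
    INITIAL = auto()
    REQUEST = auto()
    GRANT_MASK = auto()
    TRANSFER = auto()
    RELEASE = auto()


def next_state(state, *, new_packet=False, path_available=False,
               granted=False, transfer_done=False):
    """Return the next routing state for the given conditions."""
    if state is RouteState.INITIAL:
        return RouteState.REQUEST if new_packet else RouteState.INITIAL
    if state is RouteState.REQUEST:
        # No shortest path available: fall back to the initial stage.
        return RouteState.GRANT_MASK if path_available else RouteState.INITIAL
    if state is RouteState.GRANT_MASK:
        # Masked requests wait here until the arbiter grants them.
        return RouteState.TRANSFER if granted else RouteState.GRANT_MASK
    if state is RouteState.TRANSFER:
        return RouteState.RELEASE if transfer_done else RouteState.TRANSFER
    # RELEASE: the FSM goes back to the initial stage for the next packet.
    return RouteState.INITIAL
```

A packet that finds a shortest path and is granted immediately would pass through `REQUEST`, `GRANT_MASK`, `TRANSFER` and `RELEASE` in turn before the FSM returns to `INITIAL`.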



Fig. 6 Routing state diagram for data packet with states

### 5. Results and Discussion

A NoC switch with $6 \times 6$ ports and the modified distributed round robin arbiter is described in Verilog Hardware Description Language (HDL) and simulated and synthesized using Xilinx ISE 14.7. The simulation parameters are listed in Table 2. The synthesis results of the buffered and buffer-less NoC switch in terms of area (occupied slices, LUT-FF pairs, bonded IOBs), delay (ns) and power consumption (mW) are presented in Table 3 for data widths of 8, 16 and 32 bits; the results for the input port, modified round robin arbiter, crossbar switch and output port are presented in Tables 3.a and 3.b.

Table 2. Simulation parameters

| Simulation Parameter | Values                           |
|----------------------|----------------------------------|
| Topology             | 2D Mesh                          |
| Arbiter              | Modified Distributed Round Robin |
| Switching            | Store and Forward                |
| Routing              | Minimal Adaptive                 |
|                      | 8 bit                            |
|                      | Yes                              |
|                      | Riviera-Pro                      |
|                      | Uniform Random, Transpose        |
|                      | Poisson                          |
|                      | No                               |

Table 3 clearly shows that the NoC switch performs better without buffers than with buffers. However, a buffer-less switch is prone to deadlock when data packet congestion is high, which can cause the whole system to fail. The synthesis results of the individual components of the NoC switch are shown in Tables 3.a and 3.b. The results of the MDRRA do not change with data width, and neither does the delay of the crossbar switch, because these components contain no buffer memory.

Table 3. Performance of NoC switch for data width of 8, 16, 32 bit with buffer and without buffer

| NoC switch (6×6) | Occupied slices |  |  | LUT-FF pairs |  |  | Bonded IOBs |  |  | Delay (ns) |  |  | Power consumption (mW) |  |  |
|------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|-------|-------|-------|
| Data width (bit) | 8   | 16  | 32  | 8   | 16  | 32  | 8   | 16  | 32  | 8    | 16   | 32   | 8     | 16    | 32    |
| With buffer      | 98  | 164 | 232 | 281 | 512 | 639 | 192 | 352 | 592 | 5.97 | 6.22 | 6.73 | 28.84 | 49.04 | 85.28 |
| Without buffer   | 63  | 121 | 183 | 146 | 469 | 573 | 109 | 201 | 393 | 3.45 | 3.62 | 3.77 | 15.62 | 25.45 | 43.52 |
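The cost of buffering can be quantified directly from the 8-bit columns of Table 3. The short Python calculation below is only worked arithmetic on the tabulated values; the dictionary keys and the helper name `reduction_percent` are illustrative.

```python
# 8-bit values copied from Table 3.
with_buffer = {"slices": 98, "delay_ns": 5.97, "power_mw": 28.84}
without_buffer = {"slices": 63, "delay_ns": 3.45, "power_mw": 15.62}


def reduction_percent(buffered, bufferless):
    """Relative saving of the buffer-less switch, in percent."""
    return 100.0 * (buffered - bufferless) / buffered


savings = {
    key: reduction_percent(with_buffer[key], without_buffer[key])
    for key in with_buffer
}
for key, value in savings.items():
    print(f"{key}: {value:.1f}% lower without buffer")
```

For the 8-bit switch this gives roughly 36% fewer occupied slices, 42% lower delay and 46% lower power without buffers, which is the price paid for the deadlock protection that the buffers provide.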

Table 3.a. Synthesis results of components of NoC switch with buffer and without buffer

| NoC component         | Occupied slices |    |     |             |    |     | LUT-FF pairs |     |     |             |     |     | Bonded IOBs |     |     |             |     |     |
|-----------------------|-----|----|-----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Buffer/Buffer-less    | Buffer |    |     | Buffer-less |    |     | Buffer |     |     | Buffer-less |     |     | Buffer |     |     | Buffer-less |     |     |
| Data width (bit)      | 8   | 16 | 32  | 8  | 16 | 32  | 8   | 16  | 32  | 8   | 16  | 32  | 8   | 16  | 32  | 8   | 16  | 32  |
| MDRRA                 | 3   | 3  | 3   | 3  | 3  | 3   | 7   | 7   | 7   | 7   | 7   | 7   | 14  | 14  | 14  | 14  | 14  | 14  |
| Input port            | 7   | 10 | 14  | 0  | 0  | 0   | 19  | 25  | 29  | 0   | 0   | 0   | 26  | 42  | 74  | 18  | 34  | 66  |
| Output port           | 9   | 13 | 16  | 0  | 0  | 0   | 22  | 30  | 34  | 0   | 0   | 0   | 20  | 36  | 68  | 18  | 34  | 66  |
| Crossbar switch (6×6) | 53  | 98 | 157 | 53 | 98 | 157 | 160 | 298 | 541 | 160 | 298 | 541 | 109 | 205 | 397 | 109 | 205 | 397 |


| NoC component         | Delay (ns) |      |      |             |      |      | Power consumption (mW) |       |       |             |      |       |
|-----------------------|------|------|------|------|------|------|-------|-------|-------|------|------|-------|
| Buffer/Buffer-less    | Buffer |      |      | Buffer-less |      |      | Buffer |       |       | Buffer-less |      |       |
| Data width (bit)      | 8    | 16   | 32   | 8    | 16   | 32   | 8     | 16    | 32    | 8    | 16   | 32    |
| MDRRA                 | 0.74 | 0.74 | 0.74 | 0.74 | 0.74 | 0.74 | 3.75  | 3.75  | 3.75  | 3.57 | 3.57 | 3.57  |
| Crossbar switch (6×6) | 0.74 | 0.74 | 0.74 | 0.74 | 0.74 | 0.74 | 3.95  | 8.06  | 12.47 | 3.95 | 8.98 | 12.47 |
| Input port            | 1.70 | 1.73 | 1.85 | 0.74 | 0.79 | 0.83 | 15.41 | 20.57 | 30.97 | 4.81 | 9.12 | 19.05 |
| Output port           | 1.74 | 1.77 | 1.92 | 0.74 | 0.79 | 0.83 | 4.83  | 8.98  | 22.05 | 2.73 | 2.86 | 4.45  |

Table 3.b. Synthesis results of components of NoC switch with buffer and without buffer

Fig. 7 compares the area utilization per PE of the proposed NoC switch with that of the MACS switch for different data widths. Although the proposed switch uses buffers, its area overhead is 25% lower than that of the MACS switch because of its simple architecture.



Fig. 7 Comparison of area utilization of NoC switch (#2) with MACS switch (#1) for data widths of 8, 16 and 32 bits

Fig. 8 compares the clock frequency of the NoC switch with that of the MACS switch for different data widths. The figure shows that the NoC switch achieves 58% higher operating frequencies than the MACS switch, because of parallel execution in the MDRRA and because the modified FSM-based switch selects the shortest path between source and destination.

Table 4 compares the improved NoC switch with a generic switch and MultiCS [35] in terms of the number of sub-networks, the number of channels, the maximum data frequency and the power consumption. The improved NoC switch shows an improvement in both maximum data frequency and power consumption.



Fig. 8 Comparison of clock frequency of NoC switch (#2) with MACS switch (#1) for data widths of 8, 16 and 32 bits

Fig. 9 compares the power consumption of the NoC switch (#2) with that of MACS (#1) for data widths of 8, 16 and 32 bits. The modified NoC switch consumes less power than the existing work because of the distributed round robin router.



Fig. 9 Comparison of power consumption of NoC switch (#2) with MACS switch (#1) for data widths of 8, 16 and 32 bits

**Table 4.** Comparison of performance of NoC switch with MultiCS [35]

| S.no | NoC (16-bit) | Number of sub-networks | Number of channels | Max. data frequency (GHz) | Power consumption (mW) |
|------|--------------|------------------------|--------------------|---------------------------|------------------------|
| 1    | Generic      | 1                      | 2                  | 0.527                     | 47.32                  |
| 2    | MultiCS [35] | 1                      | 4                  | 1.116                     | 50.1                   |
| 3    | This work    | 1                      | 2                  | 1.392                     | 18.57                  |
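The relative improvements implied by Table 4 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the values are copied from the 16-bit rows of Table 4, while the dictionary and function names (`TABLE4`, `percent_change`, `compare_with`) are hypothetical and not part of the original design flow.

```python
# Values copied from Table 4 (16-bit NoC). All names here are illustrative
# assumptions, not identifiers from the paper.
TABLE4 = {
    "Generic": {"freq_ghz": 0.527, "power_mw": 47.32},
    "MultiCS [35]": {"freq_ghz": 1.116, "power_mw": 50.1},
    "This work": {"freq_ghz": 1.392, "power_mw": 18.57},
}


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


def compare_with(baseline, proposed="This work"):
    """Frequency gain and power saving of `proposed` relative to `baseline`."""
    base, prop = TABLE4[baseline], TABLE4[proposed]
    return {
        "freq_gain_pct": percent_change(prop["freq_ghz"], base["freq_ghz"]),
        # A saving is a negative change in power, so flip the sign.
        "power_saving_pct": -percent_change(prop["power_mw"], base["power_mw"]),
    }


if __name__ == "__main__":
    for name in ("Generic", "MultiCS [35]"):
        result = compare_with(name)
        print(f"{name}: +{result['freq_gain_pct']:.1f}% frequency, "
              f"-{result['power_saving_pct']:.1f}% power")
```

With the Table 4 values, this gives roughly 24.7% higher maximum data frequency and 62.9% lower power consumption than MultiCS [35].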

**Table 5.** Comparison of 3×3 mesh based NoC with existing NoCs

| NoC                       | Target Device | Number of Slices | LUTs  | Delay (ns) |
|---------------------------|---------------|------------------|-------|------------|
| dAElite [31]              | Virtex-6      | 3098             | 10026 | 8.19       |
| aelite [30]               | Virtex-6      | 5500             | 7665  | 8.379      |
| NoC switch ($3 \times 3$) | Virtex-6      | 2982             | 5217  | 5.97       |

A $3 \times 3$ mesh-based NoC is implemented with 9 NoC switches and 18 PEs. Table 5 compares the performance metrics of the $3 \times 3$ NoC with those of aelite [30] and dAElite [31]. The area occupied by the $3 \times 3$ NoC is 38% and 45% lower than that of aelite [30] and dAElite [31], respectively.

Different traffic patterns on the $3 \times 3$ mesh-based NoC are simulated with Riviera-PRO on Windows to analyze average latency and bandwidth utilization. The simulation is performed for 64 iterations with Uniform-Random and Transpose traffic patterns. It is observed that the average latency is reduced and the bandwidth utilization is increased. Fig. 10 compares the simulation results of MACS and the mesh-based $3 \times 3$ NoC for Uniform and Transpose traffic patterns. It is evident that latency and bandwidth utilization are improved by 25% in the mesh-based $3 \times 3$ NoC when compared with MACS, since the modified DRRA and the FSM-based routing state diagram are used.
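To make the two traffic patterns concrete, the following minimal sketch generates Uniform-Random and Transpose destinations for a $3 \times 3$ mesh and reports the average minimal hop count. It is not the Riviera-PRO testbench used for the reported results. It assumes that traffic is modelled at switch level (the PE pair attached to each switch is ignored), that Transpose means switch $(x, y)$ sends to switch $(y, x)$, and that the Manhattan distance stands in for a minimal route. All function names are hypothetical, and hop count is only a rough proxy for latency.

```python
import random

MESH = 3  # 3x3 mesh; assumption: one (x, y) coordinate per switch, PEs not modelled


def nodes(n=MESH):
    """All switch coordinates of an n x n mesh."""
    return [(x, y) for y in range(n) for x in range(n)]


def transpose_dest(src):
    """Transpose traffic: switch (x, y) sends to switch (y, x)."""
    x, y = src
    return (y, x)


def uniform_random_dest(src, rng, n=MESH):
    """Uniform-random traffic: every switch other than the source is equally likely."""
    return rng.choice([d for d in nodes(n) if d != src])


def hops(src, dst):
    """Hop count of a minimal route between two switches (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(pattern, iterations=64, seed=0):
    """Average minimal hop count over `iterations` rounds of one traffic pattern."""
    rng = random.Random(seed)
    total = count = 0
    for _ in range(iterations):
        for src in nodes():
            if pattern == "transpose":
                dst = transpose_dest(src)
            else:
                dst = uniform_random_dest(src, rng)
            if dst != src:  # diagonal switches have no transpose partner
                total += hops(src, dst)
                count += 1
    return total / count


if __name__ == "__main__":
    for pattern in ("uniform", "transpose"):
        print(pattern, round(average_hops(pattern), 3))
```

The 64 iterations mirror the simulation length quoted above. A cycle-accurate latency or bandwidth figure would additionally require the arbiter, buffer and routing behaviour of the switch, which this sketch does not model.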



Fig. 10 Simulation results in terms of average latency and bandwidth utilization of NoC (#2) and MACS switch (#1) in Uniform and Transpose traffic

### 6. Conclusion

A modified NoC switch with store-and-forward switching is proposed for a 2D mesh topology. The proposed switch uses a modified distributed round robin arbiter to reduce arbiter complexity and delay, and adaptive routing to identify the shortest path. Each switch serves a PE pair, which increases the number of channels available for data routing. The increased bandwidth utilization enhances the clock frequency, reduces the area overhead, reduces the circuit set-up latencies, guarantees delivery of data packets and prevents deadlock. Because the modified distributed round robin arbiter takes buffer status into account, packet latency is reduced and flexibility is improved. In addition, the proposed design shows a 58% higher operating frequency and a 25% lower area overhead than MACS. Reducing power and area with a CDMA crossbar is suggested as future work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License.



#### References

1. E. Salminen, V. Lahtinen et al., "Overview of bus-based system-on-chip interconnections", IEEE International Symposium on Circuits and Systems, 2002, Scottsdale, AZ, USA, pp. 372–375.
2. W. J. Dally, B. Towles, "Route packets, not wires: on-chip interconnection networks," Proceedings Design Automation Conference, 2001, USA, pp. 684–689.
3. L. Benini, G. D. Micheli, "Networks on chips: A new soc paradigm," Computer, 2002, vol. 35, no. 1, pp. 70–78.
4. Pan Hao, et al., "Comparison of 2D MESH Routing Algorithm in NOC", ASIC (ASICON), 2011, China, pp. 1–5.
5. Wang Zhang, et al., "Comparison Research between XY and Odd-Even Routing Algorithm of a 2-Dimension 3×3 Mesh Topology Network-on-Chip," WRI Global Congress on Intelligent Systems, 2009, China, pp. 329–333.
6. Erno Salminen, et al., "On network-on-chip comparison," 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, 2007, Germany, pp. 1–8.
7. L. Benini, D. Bertozzi, "Xpipes: A network-on-chip architecture for gigascale systems-on-chip," IEEE Circuits and Systems Magazine, 2004, vol. 4, no. 2, pp. 18–31.
8. T. Marescaux A, et al., "Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs," Springer-International Conference on Field Programmable Logic and Applications, 2002, Berlin, pp. 795–805.
9. Y. Lan, Shin Lo, et al., "BiNoC: A bidirectional NoC architecture with dynamic self-reconfigurable channel," Networks on Chip, 2009, USA, pp. 266–275.
10. N. E. Jerger, et al., "Circuit-Switched Coherence," Second ACM/IEEE International Symposium on Networks-on-Chip, 2008, UK, pp. 1–10.
11. P. T. Wolkotte, et al., "An energy efficient reconfigurable circuit-switched Network-on-Chip", 19th IEEE International Parallel and Distributed Processing Symposium, 2005, USA, pp. 1–8.
12. K. Goossens, et al., "Æthereal network on chip: Concepts, architectures, and implementations," IEEE Design & Test of Computers, 2005, vol. 22, no. 5, pp. 414–421.
13. D. Wiklund, D. Liu, "SoCBUS: switched network on chip for hard real time embedded systems", Proceedings. International Symposium Parallel and Distributed Processing, 2003, France, pp. 1–8.
14. C. Hilton, B. Nelson, "PNoC: A flexible circuit-switched NoC for FPGA-based systems," IEE Proceedings - Computers and Digital Techniques, 2006, vol. 153, no. 3, pp. 181–188.
15. A. K. Lusala, J.-D. Legat, "A sdm-tdm based circuit-switched router for on-chip networks," 6th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, 2011, France, pp. 1–8.
16. M. Modarressi, et al., "A hybrid packet-circuit switched on-chip network based on SDM," Design, Automation & Test in Europe Conference & Exhibition, 2009, France, pp. 566–569.
17. W. Ling, et al., "A Routing-Table-based Adaptive and Minimal Routing Scheme on Network-on-Chip Architectures," Computers & Electrical Engineering, 2009, vol. 35, no. 6, pp. 846–855.
18. M. Dehyadegari, et al., "An Adaptive Fuzzy Logic-based Routing Algorithm for Networks-on-Chip," Proceedings 13th IEEE/NASA-ESA International Conference on Adaptive Hardware and Systems (AHS), 2011, USA, pp. 208–214.
19. G. Chiu, "The Odd-even Turn Model for Adaptive Routing", IEEE Transactions on Parallel and Distributed Systems, 2000, vol. 11, no. 7, pp. 729–738.
20. Kamran Nasiri, Hamid R. Zarandi, "MWPF: A Deadlock Avoidance Fully Adaptive Routing Algorithm in Networks-On-Chip," 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016, Greece, pp. 734–741.
21. D. Wiklund, D. Liu, "Design of a system-on-chip switched network and its design support," International Conference on Communications, Circuits and Systems and West Sino Expositions, IEEE, 2002, China, pp. 1279–1283.
22. L. R. Zheng and H. Tenhunen, "A circuit-switched network architecture for network-on-chip," Proceedings. IEEE International SOC Conference, 2004, USA, pp. 55–58.
23. C. Bobda, A. Ahmadinia, "Dynamic interconnection of reconfigurable modules on reconfigurable devices," IEEE Design & Test of Computers, 2005, vol. 22, no. 5, pp. 443–451.
24. E. Carara, et al., "A new router architecture for high-performance intrachip networks," Journal Integrated Circuits System, 2008, vol. 3, no. 1, pp. 23–31.
25. A. Jara-Berrocal, A. Gordon-Ross, "SCORES: A scalable and parametric streams-based communication architecture for modular reconfigurable systems," Design Automation & Test in Europe Conference & Exhibition, 2009, France, pp. 268–273.
26. J. Lin and X. Lin, "Express circuit switching: Improving the performance of bufferless networks-on-chip," First International Conference on Networking and Computing, 2010, Japan, pp. 162–166.
27. N. Teimouri, et al., "Power and performance efficient partial circuits in packet-switched networks-on-chip," 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing, 2013, UK, pp. 509–513.
28. Rohit Kumar, Ann Gordon-Ross, "MACS: A Highly Customizable Low-Latency Communication Architecture", IEEE Transactions on Parallel and Distributed Systems, 2016, vol. 27, no. 1, pp. 237–249.
29. Ashok Kumar K, P. Dananjayan, "A Survey for Silicon on Chip Communication", Indian Journal of Science and Technology, 2017, vol. 10, no. 1, pp. 1–10.
30. A. Hansson, et al., "aelite: A flit-synchronous network on chip with composable and predictable services," Design, Automation & Test in Europe Conference & Exhibition, 2009, France, pp. 250–255.
31. R. A. Stefan, et al., "daelite: A tdm noc supporting qos, multicast, and fast connection set-up," IEEE Transactions on Computers, 2014, vol. 63, no. 3, pp. 583–594.
32. Wu, C. W., Lee, K. J., & Su, A. P. (2018). A hybrid multicast routing approach with enhanced methods for mesh-based networks-on-chip. IEEE Transactions on Computers, 67(9), 1231-1245.
33. Wang, L., Liu, L., Han, J., Wang, X., Yin, S., & Wei, S. (2019). Achieving Flexible Global Reconfiguration in NoCs using Reconfigurable Rings. IEEE Transactions on Parallel and Distributed Systems.
34. Xu, C., Liu, Y., & Yang, Y. (2019). SRNoC: An Ultra-fast Configurable FPGA-based NoC Simulator Using Switch-Router Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
35. Liu, S., Jantsch, A., & Lu, Z. (2015). MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels. Journal of Systems Architecture, 61(9), 423-434.