Improving the Reliability of 3D Network-on-Chip by Triple Modular Redundancy
Abstract: Network on Chip (NOC) is a new chip design paradigm that offers a promising architectural choice for future systems on chips. NOC architectures provide packet-switched communication among the functional cores on the chip. They also apply concepts from computer networks and organize on-chip communication among cores in layers, similar to the OSI reference model. We propose an MPLS fault recovery mechanism that uses Triple Modular Redundancy (TMR) in NOC switches to increase tolerance of permanent and transient faults in the NOC.
Index Terms: Integrated Circuit (IC), Systems-on-Chip (SOC), MPSoC, Network on Chip (NOC), Processing Element (PE), Triple Modular Redundancy (TMR), MPLS, fault tolerance.
1 INTRODUCTION
Moore's law predicts that it will be possible to integrate over a billion transistors on a single chip. Current core-based System-on-Chip (SOC) methodologies will not meet the needs of the billion-transistor era. Modern integrated circuits (ICs) are therefore becoming increasingly complex, and this complexity makes it difficult to design, manufacture and integrate high-performance ICs. The advent of multiprocessor SOCs makes it even more challenging for programmers to use the full potential of the computation resources on the chip [4], [5]. Network on Chip (NOC) is a new chip design paradigm proposed concurrently by several research groups [1], [2], [3]. Fault tolerance is fast becoming an integral part of system-on-chip and multi-core architectures. Another trend for such architectures is that the network-on-chip (NOC) is becoming a standard for on-chip global communication. In an earlier work, a generic fault-tolerant routing algorithm in the context of NOCs was presented [5].
2 LITERATURE REVIEW
Fault occurrences during the life cycle of an MPSOC can be classified along several dimensions. We list the fault model classification as follows.

Duration: In terms of duration, faults can be classified into transient faults and permanent faults [6]. In an MPSOC, both types of fault can occur during the chip life cycle. Crash failures are permanent faults that occur when a tile halts prematurely or a link disconnects, after having behaved correctly until the failure. Transient faults can be either omission failures, when links lose some messages and tiles intermittently omit to send or receive, or arbitrary failures (also called Byzantine or malicious), when links and tiles deviate arbitrarily from their specification, corrupting or even generating spurious messages [7].

Location: In general, MPSOC designs consist of two integrated parts, the Processing Elements (PEs) and the Network-on-Chip (NOC). Faults can occur in both parts. If a fault occurs in the PEs, the computation results will be erroneous. Dynamic fault detection and masking are needed to make sure the erroneous results do not contaminate the application environment. If a fault occurs in the communication path, such as a link failure or a scrambled message, a fault-tolerant communication protocol suite, including error-resilient coding schemes, is needed to ensure reliable delivery of on-chip messages on top of an unreliable on-chip communication substrate [8].

Time to Failure: Faults can occur throughout the lifetime of an IC. Using the point when the chip is packaged and tested as the watershed event, we distinguish between before-shelf faults and after-shelf faults. Currently, chips with before-shelf faults, i.e., defects discovered during testing, are invariably discarded. Only dies with no discovered defects are shipped as products. With shrinking feature sizes, it is becoming increasingly difficult to achieve a decent yield at reasonable cost. The low-yield problem will become more acute for 90 nm technology and beyond. On the other hand, the potential yield of the manufacturing process can increase tremendously if some defects on the die can be tolerated during the after-shelf life of the IC. Static fault masking and isolation techniques, both hardware and software based, can be used to put chips previously deemed bad to use in commercial products, such as picoChip [9]. For
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 5, MAY 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG
after-shelf faults, dynamic fault detection and recovery mechanisms are needed to keep the chip functioning correctly for as long as possible. Furthermore, graceful degradation of system performance is necessary for some mission-critical applications.

The three-dimensional NOC is a newer NOC architecture that has several advantages over the two-dimensional NOC, as listed below:
1- Reduced hop count
2- More links between switches
3- Support for more bandwidth
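The first two advantages can be checked numerically on small meshes. The sketch below is illustrative only: it assumes plain mesh topologies (no wrap-around links), minimal routing, and an arbitrary size of 64 switches, none of which is fixed by the text above. The function names `average_hops` and `link_count` are hypothetical helpers written in Python for this comparison.

```python
from itertools import product
from math import prod


def average_hops(dims):
    """Mean Manhattan distance between distinct switches of a mesh."""
    nodes = list(product(*(range(n) for n in dims)))
    total = 0
    pairs = 0
    for a in nodes:
        for b in nodes:
            if a != b:
                total += sum(abs(p - q) for p, q in zip(a, b))
                pairs += 1
    return total / pairs


def link_count(dims):
    """Number of bidirectional switch-to-switch links in a mesh."""
    # Along each axis of length n there are (n - 1) links per line of
    # switches, and prod(dims) / n such lines.
    return sum((n - 1) * prod(dims) // n for n in dims)


mesh_2d = average_hops((8, 8))       # 64 switches in a single layer
mesh_3d = average_hops((4, 4, 4))    # 64 switches in four stacked layers

print(f"2D 8x8:   {mesh_2d:.2f} hops on average, {link_count((8, 8))} links")
print(f"3D 4x4x4: {mesh_3d:.2f} hops on average, {link_count((4, 4, 4))} links")
```

For these two example sizes the 3D mesh needs about 3.81 hops on average against about 5.33 for the 2D mesh, and has 144 links against 112. More links between the same number of switches is also what gives the 3D arrangement more aggregate bandwidth.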
3.1 Types of fault
Two types of fault can occur in a NOC:

a) Transient fault
When this fault occurs in a NOC, only one or a few packets are corrupted. As shown in Figure 1, a three-dimensional NOC has more links between switches, so a copy of each packet can be sent to the destination along a different path. We therefore have three copies of each packet (row, column and height) that are forwarded to the destination switch. This gives a redundancy of three throughout the NOC architecture, with each copy of the packet forwarded along a different dimension. If one copy is corrupted or cannot reach the destination, another copy of that packet can still reach the destination node by another path.

b) Permanent fault
Crash failures are permanent faults that occur when a node halts prematurely or a link disconnects. Unlike a traditional network, where a failed router can be replaced by another, a failed node in a NOC cannot be replaced or moved, and the fault may occur while communication is in progress. This type of fault can occur in two places in a NOC:

Between switches, or in the middle of NOC switches: when a permanent fault occurs here, then depending on rerouting or packet redundancy, another copy of the packet can reach the destination node by another path.

Between a resource and its switch: when a permanent fault occurs here, there is only one link between the resource and the switch and no alternative path can replace it, so all packets are dropped and cannot flow to the destination node. We therefore propose that each resource be connected to three switches that are themselves connected to each other, which gives our three-dimensional NOC the following architecture.
As shown in Figure 2, when a resource communicates with another resource, it creates three copies of each packet and forwards them to three switches, toward the destination, along three different paths. Thus, if a link between a resource and a switch fails, another copy of that packet can still reach the destination over one of the other links between that resource and its switches.

A network-on-chip has heterogeneous or homogeneous cores that communicate with each other. In the best case only two cores communicate with each other, and in the worst case all cores communicate with each other. Other traffic may therefore interfere with a given communication and distort its packets or its bandwidth utilization. In an IP NOC, interfering traffic can cause packet loss, reordering, jitter and so on, whereas in an MPLS network-on-chip we can guarantee the bandwidth and support the quality of service that is needed [10]. To prevent interfering traffic and convergence of flows, and to support the quality of service (QOS) of the NOC, we propose to use MPLS path reservation for these three flows.
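Returning to the three copies of each packet: the text does not spell out how the destination chooses among the copies that arrive. One common TMR approach is a 2-of-3 majority vote, sketched below in Python under that assumption. The function name `vote` and the use of `None` to represent a copy that never arrived are illustrative choices, not part of the original design.

```python
from collections import Counter


def vote(copies):
    """Return the payload that at least two of the three copies agree on.

    A copy that was lost on its path is passed as None. The result is
    None when no two surviving copies match, which signals that the
    packet cannot be recovered by voting alone.
    """
    if len(copies) != 3:
        raise ValueError("TMR voting expects exactly three copies")
    received = [c for c in copies if c is not None]
    if not received:
        return None
    payload, count = Counter(received).most_common(1)[0]
    return payload if count >= 2 else None


# One copy corrupted in transit (transient fault): masked by the vote.
print(vote([b"data", b"data", b"dXta"]))

# One copy lost on a broken link (permanent fault): still recovered.
print(vote([b"data", None, b"data"]))

# Only one copy survives: a strict voter cannot confirm it.
print(vote([b"data", None, None]))
```

A strict voter tolerates one faulty path out of three. Accepting a single surviving copy, as the text suggests for lost packets, would additionally need a per-packet integrity check such as a checksum so that a corrupted lone copy is not accepted.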
4. SIMULATION METHODOLOGY
4.1. Simulating NoCs with NS-2
A SoC design process involves three major stages: behavioral design, structural design and physical design [11]. The behavioral design specifies the functionality of the system at a higher level of abstraction, whereas the structural and physical design views reduce the abstraction level to the logic gate and transistor level respectively. At the behavioral design level, a SoC is realized as a collection of components, modeled as blocks and connections, along with the protocols that govern the communication. Given this, NS-2 is a natural candidate for simulating and evaluating NoCs at the behavioral design level. The individual blocks of a NoC are defined as "nodes" and the connections as "links" in NS-2. Similarly, protocols can be defined over the blocks as "agents", with relevant applications if any. Using the graphical animation tool of NS-2 (NAM), the behavior of the protocols can be observed interactively. It is quite convenient to realize regular as well as irregular topologies using the Tcl scripting language used in NS-2. Any form of topology, from mesh, torus and fat tree to even a fully connected network, can easily be created in NS-2. In contrast to traditional networks, a NoC has considerably shorter wires (4.5 mm in a 20 mm x 20 mm chip, for instance) and very large bandwidth (ranging from 8 Gbit/s to 16 Gbit/s). This can be modeled by setting the delay and bandwidth attributes of the links accordingly in NS-2.

4.2. NS-2 network simulator
NS-2 is an open source, object-oriented, discrete event driven network simulator written in C++ and OTcl. It is a very common and widely used tool for simulating small and large area networks. Because of the similarities between NoCs and networks, NS-2 has been the choice of many NoC researchers for simulating and observing the behavior of a NoC at a higher abstraction level of design. It has a huge variety of protocols, and various topologies can be created with little effort. Moreover, customized protocols for NoCs can easily be incorporated into NS-2. The parameters for routers and links can easily be scaled down to reflect the real situation on a chip. On this basis, we have successfully simulated a hundred-node 2D mesh based NoC using our reliable protocol for safe delivery of packets. The purpose of this paper is to show the network community the similarities that exist between general networks and NoCs, and to show how NS-2 helps NoC designers realize new design paradigms for this novel communication architecture. Furthermore, we hope that this paper will motivate network researchers to make a valuable contribution toward NoCs, opening a new dimension of research.

NS-2 is an object-oriented, discrete event driven network simulator developed at UC Berkeley and written in C++ and OTcl [12]. NS-2 is a very common tool for simulating local and wide area networks. It implements network protocols such as TCP and UDP; traffic source behavior such as FTP, Telnet, Web, CBR and VBR; router queue management mechanisms such as Drop Tail, RED and CBQ; routing algorithms such as Dijkstra; and a lot more. NS-2 also implements multicasting and some of the MAC layer protocols for LAN simulations. The simulator is open source, allowing anyone to change the existing code as well as add new protocols and functionality to it. This makes it very popular in the networking community, which can easily evaluate new and novel designs for network research.

The simulator is developed in two languages: C++ and OTcl. C++ is used for detailed implementations of protocols such as TCP or any customized ones. Tcl scripting, on the other hand, is the front-end interpreter for NS-2, used for constructing commands and configuration interfaces. For example, to develop a new routing protocol, you write it in C++ and add it to the NS-2 library. To check the functionality of this protocol, you use Tcl scripting, through which you can create the required topology, define parameters for links and nodes, and run simulations to see your protocol in action. Besides the above mentioned functionality of NS-2, a Network AniMator (NAM) is also provided with NS-2 in order to visualize and interact with the
system at run-time. Finally, graphs can be created from the produced results to evaluate and analyze the performance of the system.
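To give a feel for the link parameters mentioned in Section 4.1, the following Python sketch computes how long it takes to clock one packet onto a link at the two bandwidths quoted there. The 64-byte packet size is an assumption chosen for illustration, and the helper name `serialization_delay_ns` is hypothetical; wire propagation delay, which depends on the technology, is left out.

```python
def serialization_delay_ns(packet_bytes, bandwidth_gbps):
    """Time to transmit one packet onto a link, in nanoseconds."""
    bits = packet_bytes * 8
    # bits divided by Gbit/s gives nanoseconds directly.
    return bits / bandwidth_gbps


PACKET_BYTES = 64  # assumed packet size, not taken from the paper

for gbps in (8, 16):
    delay = serialization_delay_ns(PACKET_BYTES, gbps)
    print(f"{gbps} Gbit/s link: {delay:.0f} ns per {PACKET_BYTES}-byte packet")
```

At these rates a 64-byte packet occupies a link for 64 ns and 32 ns respectively, which indicates the order of magnitude the link bandwidth and delay values in a simulation script need to reflect.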
5. SIMULATION DETAILS
All of the topology parameters can be described in a Tcl script file. The part of the ns-2 script that constructs the topology is shown below:

    ##---------------- Switches Nodes ----------------------
    set s 1
    for {set k 1} {$k <= $x} {incr k} {
        for {set i 1} {$i <= $x} {incr i} {
            for {set j 1} {$j <= $x} {incr j} {
                set sw([expr $k*1000+$i*100+$j*10+$s]) [$ns node]
                $sw([expr $k*1000+$i*100+$j*10+$s]) shape circle
                $sw([expr $k*1000+$i*100+$j*10+$s]) color "blue"
                $sw([expr $k*1000+$i*100+$j*10+$s]) label "sw([expr $k*1000+$i*100+$j*10+$s])"
            }
        }
    }

    #---------------- Resources Nodes ----------------------
    for {set k 1} {$k <= $x} {incr k} {
        for {set i 1} {$i <= $x} {incr i} {
            for {set j 1} {$j <= $x} {incr j} {
                set Res([expr $k*1000+$i*100+$j*10]) [$ns node]
                $Res([expr $k*1000+$i*100+$j*10]) shape square
                $Res([expr $k*1000+$i*100+$j*10]) color "red"
            }
        }
    }

    #Create links row-by-row
    # j is fixed at 1, so this links column 1 to column 2 in every row
    set j 1
    set s 1
    for {set k 1} {$k <= $x} {incr k} {
        for {set i 1} {$i <= $x} {incr i} {
            $ns duplex-link $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+$i*100+($j+1)*10+$s]) \
                $max_bandwidth_switch $linkDelay_switch DropTail
            $ns duplex-link-op $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+$i*100+($j+1)*10+$s]) \
                queuePos $queuePosition_switch
            $ns queue-limit $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+$i*100+($j+1)*10+$s]) \
                $Qlimit_switch_switch
            #add_route $sw([expr $k*1000+$i*100+$j*10+$s]) $sw([expr $k*1000+$i*100+($j+1)*10+$s])
        }
    }

    #Create links Column-By-Column
    # i is fixed at 1, so this links row 1 to row 2 in every column
    set i 1
    set s 1
    for {set k 1} {$k <= $x} {incr k} {
        for {set j 1} {$j <= $x} {incr j} {
            $ns duplex-link $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+($i+1)*100+$j*10+$s]) \
                $max_bandwidth_switch $linkDelay_switch DropTail
            $ns duplex-link-op $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+($i+1)*100+$j*10+$s]) \
                queuePos $queuePosition_switch
            $ns queue-limit $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+($i+1)*100+$j*10+$s]) \
                $Qlimit_switch_switch
            add_route $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr $k*1000+($i+1)*100+$j*10+$s])
        }
    }

    #Create links Page-By-Page
    # k is fixed at 1, so this links page 1 to page 2 at every position
    set k 1
    set s 1
    for {set i 1} {$i <= $x} {incr i} {
        for {set j 1} {$j <= $x} {incr j} {
            $ns duplex-link $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr ($k+1)*1000+$i*100+$j*10+$s]) \
                $max_bandwidth_switch $linkDelay_switch DropTail
            $ns duplex-link-op $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr ($k+1)*1000+$i*100+$j*10+$s]) \
                queuePos $queuePosition_switch
            $ns queue-limit $sw([expr $k*1000+$i*100+$j*10+$s]) \
                $sw([expr ($k+1)*1000+$i*100+$j*10+$s]) \
                $Qlimit_switch_switch
            #add_route $sw([expr $k*1000+$i*100+$j*10+$s]) $sw([expr ($k+1)*1000+$i*100+$j*10+$s])
        }
    }

    #Setup a UDP connection
    set udp [new Agent/UDP]
    $udp set fid_ 1357 ;#Red
    $ns attach-agent $Res(1110) $udp

    #Setup a CBR over UDP connection
    set cbr [new Application/Traffic/CBR]
    $cbr set type_ CBR
    $cbr set packet_size_ [expr $packet_size_byte]
    $cbr set rate_ $rate
    $cbr set random_ false ;#or 0
    $cbr attach-agent $udp

    set LM [new Agent/LossMonitor]
    $ns attach-agent $Res(2220) $LM
    #$ns attach-agent $sw(2223) $LM
    $ns connect $udp $LM
    #
    #--------------------------------------------------
    # routing manual
    add_route $Res(1110) $sw(1111)
    add_route $sw(1111) $sw(1121)
    add_route $sw(1121) $sw(2121)
    add_route $sw(2121) $sw(2221)
    add_route $sw(2221) $Res(2220)

    add_route $Res(1110) $sw(1111)
    add_route $sw(1111) $sw(2111)
    add_route $sw(2111) $sw(2211)
    add_route $sw(2211) $sw(2221)
    add_route $sw(2221) $Res(2220)

    add_route $Res(1110) $sw(1111)
    add_route $sw(1111) $sw(1211)
    add_route $sw(1211) $sw(1221)
    add_route $sw(1221) $sw(2221)
    add_route $sw(2221) $Res(2220)
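The three manual routes above each correct the page, row and column coordinates in a different order, which is what keeps their intermediate switches apart. The Python sketch below regenerates the switch sequences from that observation. It is an illustration rather than part of the simulation script: the helper names `switch_id` and `route` are hypothetical, and the index formula only stays unambiguous while every coordinate is a single digit.

```python
def switch_id(k, i, j, s=1):
    """Index used in the script: page k, row i, column j, suffix s."""
    return k * 1000 + i * 100 + j * 10 + s


def route(src, dst, order):
    """List switch ids from src to dst, correcting one axis at a time.

    src and dst are (k, i, j) tuples. order names the axes in the
    sequence in which they are corrected, for example "jki".
    """
    pos = dict(zip("kij", src))
    goal = dict(zip("kij", dst))
    path = [switch_id(pos["k"], pos["i"], pos["j"])]
    for axis in order:
        step = 1 if goal[axis] > pos[axis] else -1
        while pos[axis] != goal[axis]:
            pos[axis] += step
            path.append(switch_id(pos["k"], pos["i"], pos["j"]))
    return path


# The same source and destination switches as in the manual routing.
paths = [route((1, 1, 1), (2, 2, 2), order) for order in ("jki", "kij", "ijk")]
for p in paths:
    print(" -> ".join(f"sw({n})" for n in p))

# Intermediate switches (everything except the two endpoints) per path.
intermediates = [set(p[1:-1]) for p in paths]
```

For this source and destination the three orders produce the same switch sequences as the `add_route` lines, and the three sets in `intermediates` share no switch, so a single failed intermediate switch can affect at most one of the three copies.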
REFERENCES
[1] M. Sgroi, et al., "Addressing the System-on-a-Chip Interconnect Woes Through Communication-based Design", 38th Design Automation Conference, June 2001.
[2] Luca Benini, Giovanni De Micheli, "Network on Chips: A new SoC Paradigm", IEEE Computer, Jan. 2002.
[3] Shashi Kumar, et al., "A Network on Chip Architecture and Design Methodology", IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, Pennsylvania, USA, April 2002.
[4] S. D. Mediratta, J. Draper, "Performance Evaluation of Probe-Send Fault-tolerant Network-on-chip Router", IEEE, 2007.
[5] S. D. Mediratta, J. Draper, "Characterization of a Fault-tolerant NoC Router", IEEE, 2007.
[6] D. K. Pradhan, "Fault-Tolerant Computer System Design", Prentice-Hall, Inc., 1996.
[7] T. Dumitras, "On-Chip Stochastic Communication", Electrical and Computer Engineering, May 1st, 2003.
[8] X. Zhu, W. Qin, "Prototyping a Fault-Tolerant Multiprocessor SoC with Run-time Fault Recovery", DAC 2006, July 24-28, 2006, San Francisco, California, USA.
[9] W. Robbins, "Redundancy and binning of picoChip processors", Fall Processor Forum, 2004, San Jose, CA.
[10] M. R. Nouri Rad, M. Poyan, M. R. Nasab, R. Kourdy, "Improvement network-on-chip bandwidth utilization through multi protocol label switching by dividing bandwidth", 2010 The 2nd International Conference on Computer and Automation Engineering (ICCAE), 26-28 Feb. 2010, IEEE.
[11] Rochit Rajsuman, "System-on-a-chip: Design and Test", Artech House Publishers, 2000.
[12] Network Simulator (NS-2) web site: http://www-mash.cs.berkeley.edu/ns
Reza Kourdy received his B.Sc. degree in Computer Engineering and his M.Sc. degree in Computer Architecture both from Azad University of Arak, Iran, in 2002 and 2007, respectively. His research interests include Network-On-Chip Architecture and Fault-tolerance.
Mohammad Reza Nouri Rad received his B.Sc. degree in Computer Engineering (Software) from Azad University of Najafabad, Iran, in 2001, and his M.Sc. degree in Computer Software from Azad University of Arak, Iran, in 2010. His research interests include Network-On-Chip Architecture and Network Security. He is a program committee member of the following conferences: WICT 2011, CSNT 2011, CICN 2011, SocProS 2011, CSNT 2012, CICN 2012, BIC-TA 2012.