
2024 4th International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE)




Performance Analysis of 3D Stacked Memory Architectures in High Performance Computing

979-8-3503-6016-5/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICACITE60783.2024.10616405

Venkat Tulasi Krishna Gannavaram
School of Electrical, Computer and Energy Engineering
Arizona State University
Tempe, USA
[email protected]

Arun Kumar Gajula
Department of Electronics and Communication Engineering
Kakatiya Institute of Technology and Science
Warangal, India
[email protected]

Abstract: The speed of CPUs is increasing faster than that of RAM. This gives rise to the 'Memory Wall' problem, in which the processor stalls waiting for data. The I/O infrastructure is strained as the number of cores in CMPs rises, because more data is needed from the memory subsystem, and memory bandwidth becomes a performance bottleneck. By layering memories on top of logic, three-dimensional integrated circuits are proposed as a solution to this problem. We use the widely-used compute-in-memory benchmark framework 'NeuroSim' to investigate the benefits of a three-dimensional design for a Neural Network application.

Keywords: 3D Stacked Memory, Compute in Memory, Monolithic 3D Integration, Neural Networks

I. INTRODUCTION

Most computing tasks involve the processor (CPU) fetching data and instructions from main memory, usually external DRAM, and executing them in order. Processors have become more efficient at a rate of about 60% per year thanks to new methods and technologies. DRAM technology and access times, on the other hand, have improved at a rate of less than 10% per year. When more data needs to be retrieved from memory, the I/O system has to work harder to sustain sufficient memory bandwidth. In a conventional 2D integrated circuit, the main memory and processor sit on two different chips. Because the bus link between chips is long, it carries large capacitive loads, making the movement of data from main memory to the CPU slow and power-hungry. Because logic and memory do not scale together as well as they should, microprocessor makers have had to adopt complicated, energy-hungry architectures that allow out-of-order and speculative execution. To hide main-memory latency, computers have also been built with ever larger cache hierarchies. These problems are collectively called "Memory Wall problems." Applications that use a lot of data, such as machine learning, neural network computation, and real-time data analytics, are directly affected by the memory wall limit.

One way to deal with this problem is to bring the main memory closer to the processor die, thus reducing access latencies, by using 2.5D or 3D packaging. In a 2.5D structure, the processor and memory are placed side-by-side on a silicon interposer to achieve extremely high die-to-die interconnect density. In a 3D stacked structure, DRAM memory is stacked on top of the logic layer, bonded with vertical interconnects.

3D stacking increases bandwidth by using these dense interconnects instead of traditional I/O pins; allows dissimilar process technologies, such as high-speed CMOS and high-density DRAM, to be mixed, increasing on-chip memory capacity; and reduces power requirements by cutting the number of external I/O drivers and interconnects [1].

Building in the third dimension also raises issues: integrating different process technologies (and materials) can be challenging. Further, reliability problems can be induced by stress from higher tiers in the stack. Additionally, power dissipation has a major influence, as the high-power compute layers can create thermal hotspots in the stacked memory modules, especially those near the centre, which are far away from the heat sink.

II. BACKGROUND STUDY

G. H. Loh [2] examines the integration of 3D DRAM in multicore computers, with the control and peripheral access circuitry located on a distinct CMOS technology layer designated for this purpose. The bit cells are fabricated using vertically stacked NMOS technology, with Through-Silicon Vias (TSVs) employed for interconnecting the layers. This design implements a distributed architecture for DRAM ranks, utilising multiple layers instead of a single layer, and results in a 32% reduction in memory access time for a DRAM with five layers. The authors propose a vector bloom filter to enhance the L2 miss handling architecture (MHA) in order to exploit the additional capacity provided by the 3D-stacked memory system. The test findings indicate that the proposed memory organisation is 1.75 times more efficient than alternative 3D DRAM concepts when performing memory-intensive jobs on a quad-core CPU.

D. H. Woo et al. [3] suggest a 3D-stacked memory architecture with a vertical L2 network using a large array of high-density TSVs to improve memory bandwidth even further. Their results show at least a 1.27x speedup over traditional 3D stacked DRAM architectures.

D. Lee et al. [4] explore another approach to better utilize the total potential bandwidth increase offered by TSVs for 3D stacked DRAMs.



Their proposed architecture delivers an increase in internal DRAM bandwidth by accessing multiple DRAM layers simultaneously, thus making much greater use of the bandwidth that the TSVs offer.

In [5], the authors abandon die-to-die stacking, which uses TSVs as vertical interconnects, in favour of a monolithic 3D integration approach, where multiple tiers of devices are fabricated sequentially over one another. This technology uses monolithic inter-tier vias (MIVs) for vertical connections, which are over three orders of magnitude smaller than TSVs, allowing fine-grained vertical integration, and reports both performance and thermal advantages over TSV-based counterparts.

M. M. Sabry Aly et al. [6] introduce a new architecture incorporating monolithic 3D integration with new logic devices (such as carbon nanotube field-effect transistors) as well as high-density non-volatile memory (ReRAM and STT-RAM), improving the energy-delay product for common workloads by almost three orders of magnitude over conventional systems. Despite the potential benefits offered by 3D stacking, there are substantial concerns regarding its thermal effects. High power dissipation from the compute layer can create thermal hot spots in the stacked DRAM modules, especially those near the centre that are far away from the heat sink. This can lead to a higher peak temperature than in 2D chips, and higher temperatures hurt the performance, leakage power, and reliability of the circuit. The research presented in [7] reports that thermal constraints do place a limit on the operating frequency of 3D stacked memory, while still giving large performance benefits over traditional 2D designs.

III. PROPOSED WORK

We look at Compute-in-Memory (CIM) architectures, a popular use case of 3D integrated chips. CIM attempts to overcome the memory wall bottleneck by performing operations on data within the memory, where possible. The application we have selected is Deep Neural Networks (DNNs), a class of machine learning algorithms that employs multiple convolutional layers to perform inference. This structure leads to large memory bandwidth requirements and allows us to get a good comparison of the benefits of 3D integration. For the scope of this project, we will not delve deep into how CIM or DNNs work.

We employ NeuroSim [8], a commonly utilised benchmark platform for 3D integrated CIM accelerators specifically developed for DNN inference. NeuroSim is capable of facilitating both monolithic and heterogeneous 3D integration.

3.1 NeuroSim Framework

The desired DNN model is trained externally through a PyTorch wrapper, and the final trained weights are mapped to hardware synapses within NeuroSim. The tool takes a DNN network topology defined by the user and then, using hardware inferences, decides the floorplan of the CIM accelerator. This enables instruction-accurate evaluation of both the accuracy and the hardware performance of inference. The high-level chip floorplan is shown in figure 1.

Fig. 1. NeuroSim Chip Architecture (Peng et al. IEDM 2019)

The synaptic arrays contain one-transistor-one-resistor (1T1R) based resistive random-access-memory (RRAM) cells. Multiple such cells act together to mimic synapses in hardware. Additional peripheral circuits like multiplexers, analog-to-digital converters, shift-adders, etc. are present within the array block. The framework assumes that the on-chip memory is sufficient to store all weight data, while input data lies off-chip.
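As a concrete illustration of several cells acting together as one synapse, the following minimal Python sketch bit-slices a signed weight across single-bit 1T1R cells. The 4-bit width, the unsigned offset encoding, and the function names are our assumptions for illustration, not NeuroSim's internal representation.

```python
# Illustrative sketch only: one signed synaptic weight represented by
# several 1-bit 1T1R RRAM cells (bit slicing). Bit width and encoding
# are assumptions, not NeuroSim's internal scheme.

def weight_to_cells(w: float, n_bits: int = 4, w_max: float = 1.0) -> list[int]:
    """Quantize a weight in [-w_max, w_max] to n_bits and return one
    binary conductance state per 1T1R cell (MSB first)."""
    levels = 2 ** n_bits - 1
    # Shift to an unsigned range, since negative conductances do not
    # exist; a reference column or dual-array scheme restores the sign.
    q = round((w + w_max) / (2 * w_max) * levels)
    q = max(0, min(levels, q))
    return [(q >> b) & 1 for b in reversed(range(n_bits))]

def cells_to_weight(cells: list[int], w_max: float = 1.0) -> float:
    """Invert the mapping: binary cell states -> quantized weight."""
    q = 0
    for bit in cells:
        q = (q << 1) | bit
    return q / (2 ** len(cells) - 1) * 2 * w_max - w_max

cells = weight_to_cells(0.37)         # -> [1, 0, 1, 0]
print(cells, cells_to_weight(cells))  # round-trips to ~0.33 (4-bit step)
```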
The synaptic array size is decided by user input. Tile and PE sizes are iteratively optimized by NeuroSim to get the highest possible memory utilization. Multiple tiles can map to one layer, but multiple layers do not map to a single tile. Interconnects between modules use an H-tree based wiring structure, also visible in the figure.

Monolithic integration was chosen as the 3D integration method for this study. In monolithic 3D, multiple tiers are fabricated over each other sequentially, instead of die-stacking as in heterogeneous 3D. It utilizes finer-grained back-end-of-line (BEOL) monolithic inter-tier vias (MIVs) for inter-tier communication, resulting in higher memory bandwidth and lower access times compared to the through-silicon vias (TSVs) used in heterogeneous 3D.

NeuroSim implements a two-tiered chip structure for monolithic 3D, as shown in figure 2, where the bottom tier consists of logic elements (ADC and accumulation circuits) and the top tier is strictly dedicated to RRAM memory and its peripheries. This design keeps the area-consuming logic on a separate tier, allowing the use of advanced tech nodes (7 nm) for logic and older tech nodes (22 nm) for the memory tier.
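The array-count and utilization bookkeeping behind this mapping can be sketched in a few lines of Python. This mirrors the idea of tiling a layer's weight matrix onto fixed-size subarrays; it is not NeuroSim's actual floorplanning algorithm, and the layer dimensions below are an illustrative example.

```python
import math

# Sketch: tile a layer's (rows x cols) weight matrix onto fixed-size
# RRAM subarrays; utilization is the fraction of allocated cells that
# actually hold weights. Illustrative only, not NeuroSim's optimizer.

def subarrays_needed(rows: int, cols: int, size: int) -> int:
    """Number of size x size subarrays covering a rows x cols matrix."""
    return math.ceil(rows / size) * math.ceil(cols / size)

def utilization(rows: int, cols: int, size: int) -> float:
    """Fraction of allocated RRAM cells storing actual weights."""
    n = subarrays_needed(rows, cols, size)
    return (rows * cols) / (n * size * size)

# Example: a conv layer with 3x3 kernels, 64 input and 128 output
# channels unrolls to a (3*3*64) x 128 = 576 x 128 weight matrix.
rows, cols = 3 * 3 * 64, 128
for size in (64, 128, 256):
    print(size, subarrays_needed(rows, cols, size),
          f"{utilization(rows, cols, size):.1%}")
# 64 -> 18 arrays, 100.0%; 128 -> 5 arrays, 90.0%; 256 -> 3 arrays, 37.5%
```

Smaller arrays pack the matrix more tightly but multiply the number of array blocks (and hence peripherals), which foreshadows the trade-off measured in section IV.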



Fig. 2. Monolithic 3D Floorplan (Peng et al. IEDM 2020)

3.2 Optimizations by NeuroSim

To further improve memory utilization and the processing speed of the whole network as much as possible, weight duplication is introduced for each layer. Layer structures (such as input feature size, channel depth and kernel size) vary significantly within DNNs. Hence, NeuroSim iteratively decides the PE and tile sizes, and the possibilities of weight duplication among PEs. For example, if the weight matrix of a layer is smaller than the tile size, it is possible to duplicate the weight matrix, fetching in multiple neural activation vectors in parallel, thus speeding up the computation of this layer. Further, the slower shallow layers of the DNN can be sped up by using this parallelism of weight duplication, so that the deeper layers do not have to wait idly as long for the input feature maps (IFMs) to arrive [9].
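The duplication decision can be sketched under simplified assumptions of our own (copies are capped by a user-chosen speedup degree, and duplicated copies consume input vectors in parallel); the function names are ours, not NeuroSim's API.

```python
import math

# Sketch of weight duplication: if a layer's weight matrix is smaller
# than a tile, replicate it so several activation vectors are processed
# in parallel. Simplified illustration, not NeuroSim's exact optimizer.

def duplication_factor(weight_rows: int, weight_cols: int,
                       tile_rows: int, tile_cols: int,
                       speedup_degree: int = 8) -> int:
    """How many copies of the weight matrix fit in one tile,
    capped by the user-chosen speedup degree."""
    fit = (tile_rows // weight_rows) * (tile_cols // weight_cols)
    return max(1, min(fit, speedup_degree))

def layer_latency(n_input_vectors: int, dup: int,
                  latency_per_vector: float = 1.0) -> float:
    """Duplicated copies consume input vectors in parallel."""
    return math.ceil(n_input_vectors / dup) * latency_per_vector

# A shallow layer with a small 64x64 weight matrix on a 256x256 tile:
dup = duplication_factor(64, 64, 256, 256)  # 16 copies fit, capped at 8
print(dup, layer_latency(1024, dup))        # 8 copies: 128 vs. 1024 steps
```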
The novel weight mapping mechanism utilised by NeuroSim is detailed in [10]. The weights allocated to different sub-matrices at different spatial locations of each kernel are determined by the processing element (PE) size that is selected. The input data assigned to distinct spatial positions within each kernel is likewise conveyed to the corresponding sub-matrix.

One may designate a collection of subarrays that includes accumulation modules and input and output containers as a single processing element (PE). By enabling the recycling of input data across these PEs, this mapping reduces the need for inter-PE communication and, consequently, computational latency when compared to the conventional weight mapping method.

3.3 NeuroSim Parameters

The following parameters can be changed at run-time to tune the hardware implementation for an RRAM based monolithic 3D CIM accelerator.

● Network: A layer-by-layer network structure of the desired DNN, including the sizes of the IFM and weight matrices. This directly influences the floorplan in NeuroSim [10].

● Sub Array Size: Decides the size of each RRAM sub-array module in the chip architecture. Influences the PE and tile sizes [11].

● Speedup Degree: Allows NeuroSim to more aggressively perform weight duplication on the given network structure [12].

● Mapping Style: Choice between conventional and novel weight mapping.

Some additional hardware parameters include the selection of tech nodes for the logic and memory blocks, RRAM crossbar sizing, buffer and ADC types, etc. A hypothetical configuration capturing these knobs is sketched below.
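One way to picture these knobs together is a single configuration dictionary. All key names and the layer tuples here are hypothetical illustrations, not NeuroSim's actual flags or file format.

```python
# Hypothetical configuration mirroring the run-time parameters listed
# above. Key names and values are illustrative, not NeuroSim's flags.

config = {
    # Layer-by-layer network structure: (IFM_H, IFM_W, IFM_CH,
    # kernel_H, kernel_W, out_CH) per layer; two early VGG-8-style
    # conv layers are shown as an assumed example.
    "network": [
        (32, 32, 3,   3, 3, 128),
        (32, 32, 128, 3, 3, 128),
    ],
    "sub_array_size": (128, 128),  # RRAM sub-array rows x columns
    "speedup_degree": 8,           # cap on per-layer weight duplication
    "mapping_style": "novel",      # or "conventional"
    # Additional hardware parameters:
    "logic_tech_node_nm": 7,       # advanced node for the logic tier
    "memory_tech_node_nm": 22,     # older node for the RRAM tier
}
```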
IV. RESULTS

The NeuroSim inference implementation results for a VGG-8 model trained on the CIFAR-10 dataset are given in table I. It is evident from the results that 3D integration improves throughput and access energies for CIM systems, given the same sub-array architecture. Smaller synaptic array sizes improve memory utilization, as they lead to fewer arrays with empty memory cells, but the added need to communicate between blocks, as well as the increase in peripheral circuitry, reduces performance and increases power [13].

Table 1. Inference implementation results in NeuroSim for VGG-8 model trained on CIFAR-10 dataset.

Similarly, larger array sizes lead to an increase in performance, as more computation can take place within the array, reducing the amount of inter-module communication [14]. However, more arrays would now have empty cells due to the larger size, leading to low memory utilization. For this DNN structure, a sub-array size of 128x128 proves to be a balanced design option.

Further, we also compare different sub-array sizes layer by layer for the DNN in NeuroSim, as shown in table II. The layer in question is layer 7 of the VGG-8 network, a fully connected layer of size 1024. All three implementations have 100% memory utilization.

Table 2. NeuroSim latency and energy parameters for layer 7 of VGG-8 network.

The 64x64 implementation is the worst performing for this layer. This is possibly due to the energy leakage and access time latencies caused by the additional peripherals.



The 256x256 floorplan has the lowest read energy. This could be due to the reduction in the total number of tiles, which reduces the number of buffers required per module, so data needs to be read through fewer buffers in total. The leakage power also reduces, as there are fewer peripheral circuits for the entire layer. However, this design has the largest read latency. Finally, the 128x128 design has the lowest read latency, proving to be the perfect size for this layer [15]. The remaining parameters are also balanced in comparison to the other implementations. Other layers also seem to follow this trend, if we do not include weight duplication. Enabling speed-up for layers results in different amounts of weight duplication per layer depending on the sub-array sizes, and not just the speed-up degree. This can lead to varying trends across sub-array sizes, due to different degrees of latency and utilization benefits from parallelism.
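A quick back-of-the-envelope check shows why all three sub-array sizes reach 100% utilization on layer 7, and how strongly the peripheral count varies between them. We assume the flattened input of this fully connected layer is 4x4x512 = 8192, the usual VGG-8 topology; the latency and energy trends in table II then track the array counts below.

```python
# Assumed layer-7 weight matrix for VGG-8: 8192 inputs x 1024 outputs.
# Both dimensions divide evenly by 64, 128 and 256, so every sub-array
# is completely filled (100% utilization in all three designs) and only
# the array count, and with it the number of buffers and ADCs, changes.

rows, cols = 8192, 1024
for size in (64, 128, 256):
    n_arrays = (rows // size) * (cols // size)  # exact division here
    print(f"{size}x{size}: {n_arrays:4d} arrays")
# 64x64: 2048 arrays; 128x128: 512 arrays; 256x256: 128 arrays
```

Sixteen times fewer arrays at 256x256 means far fewer peripheral circuits to leak and fewer buffers to traverse per read, consistent with its lowest read energy, while the longer rows per array are consistent with its largest read latency.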
V. CONCLUSION

A monolithic 3D based CIM accelerator for DNNs was successfully simulated for various hardware parameters in NeuroSim. The improvements of a 3D based chip over traditional 2D architectures for data-hungry applications were tested [16]. The importance of proper hardware planning is also apparent from the results.

REFERENCES

[1] G. H. Loh and Y. Xie, "3D Stacked Microprocessor: Are We There Yet?" IEEE Micro, vol. 30, no. 3, pp. 60-64, May-June 2010, doi: 10.1109/MM.2010.45.
[2] G. H. Loh, "3D-Stacked Memory Architectures for Multi-core Processors," 2008 International Symposium on Computer Architecture (ISCA), Beijing, China, 2008, pp. 453-464, doi: 10.1109/ISCA.2008.15.
[3] D. H. Woo, N. H. Seong, D. L. Lewis and H.-H. S. Lee, "An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth," HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, Bangalore, India, 2010, pp. 1-12, doi: 10.1109/HPCA.2010.5416628.
[4] D. Lee, S. Ghose, G. Pekhimenko, S. Khan and O. Mutlu, "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," ACM Trans. Archit. Code Optim., vol. 12, no. 4, Article 63, January 2016, doi: 10.1145/2832911.
[5] A. I. Arka, B. K. Joardar, R. G. Kim, D. H. Kim, J. R. Doppa and P. P. Pande, "HeM3D: Heterogeneous Manycore Architecture Based on Monolithic 3D Vertical Integration," ACM Trans. Des. Autom. Electron. Syst., vol. 26, no. 2, Article 16, March 2021, doi: 10.1145/3424239.
[6] M. M. Sabry Aly et al., "The N3XT Approach to Energy-Efficient Abundant-Data Computing," Proceedings of the IEEE, vol. 107, no. 1, pp. 19-48, Jan. 2019, doi: 10.1109/JPROC.2018.2882603.
[7] G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood and K. Banerjee, "A thermally aware performance analysis of vertically integrated (3-D) processor-memory hierarchy," Proceedings of the 43rd Annual Design Automation Conference (DAC), 2006, pp. 991-996.
[8] X. Peng, S. Huang, Y. Luo, X. Sun and S. Yu, "DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies," 2019 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 2019, pp. 32.5.1-32.5.4, doi: 10.1109/IEDM19573.2019.8993491.
[9] X. Peng, W. Chakraborty, A. Kaul, W. Shim, M. S. Bakir, S. Datta and S. Yu, "Benchmarking Monolithic 3D Integration for Compute-in-Memory Accelerators: Overcoming ADC Bottlenecks and Maintaining Scalability to 7nm or Beyond," 2020 IEEE International Electron Devices Meeting (IEDM), 2020.
[10] X. Peng, R. Liu and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019.
[11] J. Sun, P. Houshmand and M. Verhelst, "Analog or Digital In-Memory Computing? Benchmarking Through Quantitative Modeling," 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 2023, pp. 1-9, doi: 10.1109/ICCAD57390.2023.10323763.
[12] J. Rhe, K. E. Jeon, J. C. Lee, S. Jeong and J. H. Ko, "Kernel Shape Control for Row-Efficient Convolution on Processing-In-Memory Arrays," 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 2023, pp. 1-9, doi: 10.1109/ICCAD57390.2023.10323749.
[13] Y. Halawani, H. Tesfai, B. Mohammad and H. Saleh, "FORSA: Exploiting Filter Ordering to Reduce Switching Activity for Low Power CNNs," 2023 IEEE 66th International Midwest Symposium on Circuits and Systems (MWSCAS), Tempe, AZ, USA, 2023, pp. 561-565, doi: 10.1109/MWSCAS57524.2023.10406115.
[14] L. Han, P. Huang, Z. Zhou, Y. Chen, X. Liu and J. Kang, "A Convolution Neural Network Accelerator Design with Weight Mapping and Pipeline Optimization," 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2023, pp. 1-6, doi: 10.1109/DAC56929.2023.10247977.
[15] O. Krestinskaya, L. Zhang and K. N. Salama, "Towards Efficient In-Memory Computing Hardware for Quantized Neural Networks: State-of-the-Art, Open Challenges and Perspectives," IEEE Transactions on Nanotechnology, vol. 22, pp. 377-386, 2023, doi: 10.1109/TNANO.2023.3293026.
[16] J. Song, X. Tang, X. Qiao, Y. Wang, R. Wang and R. Huang, "A 28 nm 16 Kb Bit-Scalable Charge-Domain Transpose 6T SRAM In-Memory Computing Macro," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 5, pp. 1835-1845, May 2023, doi: 10.1109/TCSI.2023.3244338.

