Basic Core of TPU
EEE 304 (January 2023), Section A2, Group 1, Final Project
DEPARTMENT OF ELECTRICAL AND ELECTRONIC ENGINEERING
Course Instructors:
1. Nafis Sadik, Lecturer
2. Sadat Tahmeed Azad, Part-Time Lecturer
8 Project Management and Cost Analysis
8.1 Bill of Materials
9 Future Work
10 References
1 Abstract
In the realm of artificial intelligence and machine learning, specialized hardware accelerators have
emerged as vital components to expedite the training and inference of complex neural network models.
Tensor Processing Units (TPUs) have garnered considerable attention for their exceptional performance in
machine learning workloads. In this academic project, we explore an innovative approach to hardware
acceleration by leveraging Field-Programmable Gate Arrays (FPGAs) to implement a simplified version of
a TPU architecture. Our FPGA-based system is designed to demonstrate a two-layer deep learning
architecture, comprising four neurons, and is specifically tailored for linear regression tasks.
The project unfolds with an in-depth analysis of TPUs and their hardware architecture, elucidating their
role in enhancing machine learning performance. Utilizing Verilog Hardware Description Language
(HDL), we meticulously implement a custom TPU-like architecture on an FPGA board. This
implementation enables the execution of a two-layer deep learning model, mimicking the principles of
TPUs, thus offering valuable insights into the workings of dedicated machine learning accelerators.
The core objective of our project is to showcase the feasibility of FPGA-based hardware acceleration for
neural network computations. We elucidate the design choices, optimizations, and trade-offs involved in
creating a functional deep learning architecture on FPGA. To validate our system's capabilities, we utilize
it to perform linear regression, a fundamental machine learning task, and present comparative performance
metrics with traditional computing platforms.
Our findings underscore the potential of FPGA-based hardware acceleration as a cost-effective and
customizable alternative to dedicated AI accelerators like TPUs. Furthermore, this project serves as a
stepping stone for future research and development in FPGA-based machine learning accelerators, opening
doors to new possibilities in edge computing and specialized hardware for AI applications.
In summary, our FPGA-based implementation of a two-layer deep learning architecture provides a
valuable contribution to the exploration of hardware acceleration in machine learning and demonstrates the
adaptability of FPGAs for such tasks. This project bridges the gap between hardware and machine
learning, offering a practical and insightful perspective on the fusion of these two domains.
2 Introduction
In the ever-evolving landscape of machine learning and artificial intelligence, the quest for
accelerating computational workloads has been relentless. The demand for faster and more efficient
solutions has led to the development of dedicated hardware accelerators tailored to the unique
requirements of neural networks. Among these accelerators, Tensor Processing Units (TPUs) have
emerged as a powerful force, offering unparalleled advantages over traditional Central Processing
Units (CPUs) and Graphics Processing Units (GPUs).
The exponential growth in data volumes and the increasing complexity of machine learning models
have made it imperative to seek novel approaches to computing. CPUs, once the workhorses of
general-purpose computing, and GPUs, optimized for graphics rendering and parallel processing,
have their limitations when confronted with the sheer scale and intricacy of modern neural networks.
Why TPUs Over CPUs and GPUs?
Specialization for Tensor Operations: TPUs are purpose-built to excel in the kinds of operations at
the heart of deep learning: tensor operations. Unlike CPUs and GPUs, which are designed for a
broader range of tasks, TPUs are finely tuned to execute matrix multiplications and other tensor-
based computations with exceptional efficiency. This specialization allows TPUs to perform
machine learning workloads significantly faster.
Parallelism and Scalability: TPUs embrace parallelism at an unprecedented level, enabling the
simultaneous processing of a vast number of mathematical operations. This inherent parallelism,
coupled with the ability to scale TPUs into clusters, empowers them to tackle massive datasets and
train complex models more rapidly than their CPU and GPU counterparts.
Energy Efficiency: TPUs exhibit remarkable energy efficiency, accomplishing more computations
per watt of power consumed compared to traditional CPUs and GPUs. This not only reduces the
operational costs but also aligns with the growing emphasis on environmentally sustainable
computing solutions.
Generic Benefits of TPUs
Beyond their specialization and performance advantages, TPUs offer several generic benefits:
Reduced Training Time: TPUs significantly shorten the training time of deep learning models,
enabling researchers and engineers to iterate and experiment more rapidly in the development of new
algorithms and models.
Improved Inference Speed: For real-time applications and edge computing scenarios, TPUs deliver
swift and responsive inference capabilities, making them ideal for tasks like image recognition and
natural language processing.
Scalability: TPUs can be easily integrated into cloud-based computing environments, allowing
organizations to scale their machine learning workloads as needed without substantial hardware
investments.
Competitive Advantage: Leveraging TPUs can provide a competitive edge in fields where rapid
model development and deployment are crucial, such as autonomous vehicles, healthcare, and
finance.
In this project, we delve into the intriguing world of TPUs, aiming to harness their computational
prowess in the realm of hardware acceleration. We endeavor to replicate their specialized
architecture on Field-Programmable Gate Arrays (FPGAs) and demonstrate their utility through a
two-layer deep learning model, with a focus on linear regression. By examining the capabilities and
advantages of TPUs, we seek to shed light on the transformative potential of specialized hardware in
the field of machine learning and artificial intelligence.
3 Design
Neural Network:
Neural networks, often referred to as artificial neural networks or simply neural nets, are
computational models inspired by the structure and function of the human brain. These versatile and
powerful machine learning algorithms are designed to recognize patterns, learn from data, and make
predictions or decisions. At their core, neural networks consist of interconnected nodes, known as
neurons, organized into layers. Information flows through these layers, with each neuron performing
a simple computation. Through a process called training, neural networks adjust their internal
parameters, such as weights and biases, to minimize prediction errors, making them adept at tasks
like image recognition, natural language processing, and complex data analysis. Neural networks
have played a pivotal role in revolutionizing fields like computer vision, speech recognition, and
autonomous robotics, making them a fundamental tool in the realm of artificial intelligence.
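As a concrete illustration of the simple computation each neuron performs, the following sketch expresses a single neuron as a weighted-sum (multiply-accumulate) unit in Verilog. The module name, port names, and bit widths are chosen purely for illustration and are not part of the project code.

// Illustrative only: one neuron computing a weighted sum of four signed 8-bit inputs.
// Names (neuron_mac, x0..x3, w0..w3) are assumptions, not the project's identifiers.
module neuron_mac (
    input  signed [7:0]  x0, x1, x2, x3,   // activations from the previous layer
    input  signed [7:0]  w0, w1, w2, w3,   // trained weights
    input  signed [15:0] bias,             // trained bias
    output signed [17:0] y                 // weighted sum (no activation function applied)
);
    assign y = x0*w0 + x1*w1 + x2*w2 + x3*w3 + bias;
endmodule

A layer is then simply a collection of such units operating in parallel on the same inputs.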
3.2 Literature Review
1. Execution Time: The time required for an M x N x K matrix multiplication on the TPU can be
estimated as:
Execution Time_MM_TPU = (M * N * K) / (Throughput_TPU * Batch_Size)
where M, N, and K are matrix dimensions, and Throughput_TPU represents the TPU's processing
capability.
2. Energy Efficiency: Quantitative analysis of energy efficiency involves measuring the
energy consumed (E) during DNN operations on TPUs:
Energy Efficiency (EE) = Throughput_TPU / E
3. Model Training Time: Quantitatively comparing the time (T) required to train DNN
models on TPUs versus other hardware:
Speedup_TPU = Training Time_CPU / Training Time_TPU
4. Memory Usage: Quantitative analysis of memory usage involves tracking memory
consumption (Mem) during forward and backward passes:
Mem_TPU = Mem_Input + Mem_Layer + Mem_Weights + Mem_Other
These expressions make it possible to quantify and analyze the performance, efficiency, and
suitability of DNNs and TPUs for various machine learning tasks, and they support data-driven
decisions in hardware selection, model design, and optimization.
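As a quick sanity check of the first expression, with purely illustrative numbers (M = N = K = 256, Batch_Size = 1, and an assumed Throughput_TPU of 10^9 multiply-accumulate operations per second):

Execution Time_MM_TPU = (256 * 256 * 256) / (10^9 * 1) ≈ 0.0168 s ≈ 16.8 ms

Similarly, if a model took 50 s to train on a CPU and 5 s on the TPU, Speedup_TPU = 50 / 5 = 10.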
a1 = [W1]m×n [x]n×1
a2 = [W2]p×m [a1]m×1
Here, all the variables inside the brackets denote matrices, with their dimensions given in the
subscripts; the remaining operations are simply additions and multiplications.
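As a purely illustrative example of one such matrix product (the numbers are arbitrary and not taken from the project data), a 2 x 2 weight matrix applied to a 2 x 1 input requires four multiplications and two additions:

[W] = | 2  1 |      [x] = | 4 |
      | 0  3 |            | 5 |

[W][x] = | 2*4 + 1*5 |   | 13 |
         | 0*4 + 3*5 | = | 15 |

Each entry of the result is a short chain of multiply-accumulate operations, which is the primitive that the processing cells of the TPU array are built around.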
module test(clk,reset,nLb1,nLb2,addr,rd,wr,Cen,en,out1,out2,out3,out4,out,re,nLo,in);
// TPU core, built around a 4x4 array of multiply-accumulate cells fed by the two operand register banks
TPU TPU_1(clk,reset,rb1_to_TPU_1,rb1_to_TPU_2,rb1_to_TPU_3,rb1_to_TPU_4,
          rb2_to_TPU_1,rb2_to_TPU_2,rb2_to_TPU_3,rb2_to_TPU_4,
          Wbus,
          out1,out2,out3,out4,TPU_addr,en,Cen);
// Operand register banks, loaded from the shared Wbus under the active-low load signals nLb1/nLb2
regB rb1(nLb1,clk,Wbus,rb1_to_TPU_1,rb1_to_TPU_2,rb1_to_TPU_3,rb1_to_TPU_4,regb_addr,rd,wr);
regB rb2(nLb2,clk,Wbus,rb2_to_TPU_1,rb2_to_TPU_2,rb2_to_TPU_3,rb2_to_TPU_4,regb_addr,rd,wr);
// 128 x 8-bit RAM holding the test data
ram1 RAM1(clk, rd, wr, re, RAM_addr, out);
// Output register on the shared Wbus
outreg outreg1(nLo, clk, Wbus, Wbus);
endmodule
// The 4-to-16 decoder generates the individual load-enable signals for the 16 cells of the array
decoder_4x16 dec(load,addr);
assign {load11,load21,load31,load41,load12,load22,load32,load42,load13,load23,load33,load43,load14,load24,load34,load44}=load;
// The module outputs are taken from the fourth column of the array
assign out1=out14;
assign out2=out24;
assign out3=out34;
assign out4=out44;
endmodule
//4 to 16 decoder
module decoder_4x16 (d_out, d_in);
(d_in == 4'b0100) ? tmp>>4:
(d_in == 4'b0101) ? tmp>>5:
(d_in == 4'b0110) ? tmp>>6:
(d_in == 4'b0111) ? tmp>>7:
(d_in == 4'b1000) ? tmp>>8:
(d_in == 4'b1001) ? tmp>>9:
(d_in == 4'b1010) ? tmp>>10:
(d_in == 4'b1011) ? tmp>>11:
(d_in == 4'b1100) ? tmp>>12:
(d_in == 4'b1101) ? tmp>>13:
(d_in == 4'b1110) ? tmp>>14:
(d_in == 4'b1111) ? tmp>>15: 16'bxxxx_xxxx_xxxx_xxxx;
endmodule
// parameter wordsize = 8;
// Register-bank write path: when the active-low load signal nLb is asserted,
// the value on the input bus is written into the addressed location.
begin
    if(!nLb) begin
        memB[addr] <= in;
    end
end
endmodule
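Only the write path of the register bank survives in the fragment above. For clarity, a minimal self-contained sketch of a register bank with the same role is given below; the module name, port names, and the four-entry depth are illustrative assumptions rather than the project's exact code.

// Illustrative register-bank sketch: four 8-bit registers written one at a time from the
// input bus (active-low load nLb) and exposed in parallel to the multiply-accumulate array.
module regB_sketch (
    input        clk, nLb,
    input  [7:0] in,
    input  [1:0] addr,
    output [7:0] q0, q1, q2, q3
);
    reg [7:0] memB [0:3];
    always @(posedge clk)
        if (!nLb)
            memB[addr] <= in;   // write the addressed register while the load signal is asserted
    assign q0 = memB[0];
    assign q1 = memB[1];
    assign q2 = memB[2];
    assign q3 = memB[3];
endmodule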
//$readmemh("ram.mem",rammem);
r0 <= 8'h06;
r1 <= 8'h06;
r2 <= 8'h03;
r3 <= 8'h04;
r4 <= 8'h03;
r5 <= 8'h05;
r6 <= 8'h03;
r7 <= 8'h02;
r8 <= 8'h02;
r9 <= 8'h00;
r10 <= 8'h06;
r11 <= 8'h01;
r12 <= 8'h04;
r13 <= 8'h00;
r14 <= 8'h07;
r15 <= 8'h00;
r16 <= 8'h01;
r17 <= 8'h03;
r18 <= 8'h06;
r19 <= 8'h05;
r20 <= 8'h05;
r21 <= 8'h03;
r22 <= 8'h03;
r23 <= 8'h07;
end
57: data<=r57;
58: data<=r58;
59: data<=r59;
60: data<=r60;
61: data<=r61;
62: data<=r62;
63: data<=r63;
64: data<=r64;
65: data<=r65;
66: data<=r66;
67: data<=r67;
68: data<=r68;
69: data<=r69;
70: data<=r70;
71: data<=r71;
72: data<=r72;
73: data<=r73;
74: data<=r74;
75: data<=r75;
76: data<=r76;
77: data<=r77;
78: data<=r78;
79: data<=r79;
80: data<=r80;
81: data<=r81;
82: data<=r82;
83: data<=r83;
84: data<=r84;
85: data<=r85;
86: data<=r86;
87: data<=r87;
88: data<=r88;
89: data<=r89;
90: data<=r90;
91: data<=r91;
92: data<=r92;
93: data<=r93;
94: data<=r94;
95: data<=r95;
96: data<=r96;
97: data<=r97;
98: data<=r98;
99: data<=r99;
100: data<=r100;
101: data<=r101;
102: data<=r102;
103: data<=r103;
104: data<=r104;
107: data<=r107;
108: data<=r108;
109: data<=r109;
110: data<=r110;
111: data<=r111;
112: data<=r112;
113: data<=r113;
114: data<=r114;
115: data<=r115;
116: data<=r116;
117: data<=r117;
118: data<=r118;
119: data<=r119;
120: data<=r120;
121: data<=r121;
122: data<=r122;
123: data<=r123;
124: data<=r124;
125: data<=r125;
126: data<=r126;
127: data<=r127;
endcase
if(ce) out<=data;
else out<=8'bzzzzzzzz;
end
//data <= ce? rammem1 : 8'bzzzzzzzz;
else if(wr&&ce)
begin
//data<=in;
case(addr)
0: r0<=data;
1: r1<=data;
2: r2<=data;
3: r3<=data;
4: r4<=data;
5: r5<=data;
6: r6<=data;
7: r7<=data;
8: r8<=data;
9: r9<=data;
10: r10<=data;
11: r11<=data;
12: r12<=data;
13: r13<=data;
14: r14<=data;
15: r15<=data;
16: r16<=data;
17: r17<=data;
18: r18<=data;
21: r21<=data;
22: r22<=data;
23: r23<=data;
24: r24<=data;
25: r25<=data;
28: r28<=data;
29: r29<=data;
30: r30<=data;
31: r31<=data;
32: r32<=data;
33: r33<=data;
34: r34<=data;
35: r35<=data;
36: r36<=data;
37: r37<=data;
38: r38<=data;
39: r39<=data;
40: r40<=data;
41: r41<=data;
42: r42<=data;
43: r43<=data;
44: r44<=data;
45: r45<=data;
46: r46<=data;
47: r47<=data;
48: r48<=data;
49: r49<=data;
50: r50<=data;
51: r51<=data;
52: r52<=data;
53: r53<=data;
54: r54<=data;
61: r61<=data;
62: r62<=data;
65: r65<=data;
68: r68<=data;
69: r69<=data;
70: r70<=data;
71: r71<=data;
72: r72<=data;
73: r73<=data;
74: r74<=data;
75: r75<=data;
76: r76<=data;
77: r77<=data;
80: r80<=data;
81: r81<=data;
82: r82<=data;
83: r83<=data;
84: r84<=data;
85: r85<=data;
86: r86<=data;
87: r87<=data;
88: r88<=data;
89: r89<=data;
90: r90<=data;
91: r91<=data;
92: r92<=data;
93: r93<=data;
94: r94<=data;
95: r95<=data;
96: r96<=data;
97: r97<=data;
100: r100<=data;
101: r101<=data;
102: r102<=data;
103: r103<=data;
104: r104<=data;
105: r105<=data;
108: r108<=data;
109: r109<=data;
110: r110<=data;
111: r111<=data;
112: r112<=data;
113: r113<=data;
114: r114<=data;
115: r115<=data;
119: r119<=data;
120: r120<=data;
121: r121<=data;
126: r126<=data;
127: r127<=data;
endcase
end
end
endmodule
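For reference, the behaviour that the long case statements above implement with 128 discrete registers (r0 through r127) can be written much more compactly using a Verilog memory array. The sketch below is a functional equivalent under stated assumptions (synchronous read and write, tri-stated data bus); the module name and port list are assumed rather than copied from the project.

// Compact functional sketch of a 128 x 8-bit RAM with a tri-stated data bus (illustrative only).
module ram_array (
    input        clk, rd, wr, ce,
    input  [6:0] addr,
    inout  [7:0] data
);
    reg [7:0] mem [0:127];   // replaces the discrete registers r0..r127
    reg [7:0] dout;

    always @(posedge clk) begin
        if (rd && ce)
            dout <= mem[addr];    // registered read
        else if (wr && ce)
            mem[addr] <= data;    // synchronous write from the bus
    end

    assign data = (rd && ce) ? dout : 8'bz;   // drive the bus only while reading
endmodule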
4 Implementation
4.1 Description
Introduction to Linear Regression:
Linear regression is a fundamental statistical technique used in machine learning and statistics for
modeling the relationship between a dependent variable (usually denoted as 'y') and one or more
independent variables (usually denoted as 'x'). It assumes a linear relationship between these
variables and aims to find the best-fitting linear equation (a straight line in simple linear regression)
that predicts the dependent variable based on the independent ones.
The linear regression equation typically takes the form:
y = mx + b
Where:
'y' is the dependent variable.
'x' is the independent variable.
'm' is the slope of the line (the coefficient).
'b' is the y-intercept.
The goal of linear regression is to find the optimal values of 'm' and 'b' that minimize the sum of
squared differences between the predicted values and the actual data points.
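For a dataset of n points (x_i, y_i), these optimal values have the standard closed-form (ordinary least-squares) solution, which any training procedure for this model ultimately approximates:

m = sum( (x_i - x_mean) * (y_i - y_mean) ) / sum( (x_i - x_mean)^2 )
b = y_mean - m * x_mean

where x_mean and y_mean are the sample means. As a purely illustrative example, for the three points (1, 2), (2, 3), and (3, 5): x_mean = 2 and y_mean = 10/3, so m = 3 / 2 = 1.5 and b = 10/3 - 1.5 * 2 ≈ 0.33.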
Execution of Linear Regression on TPU:
To execute linear regression on a Tensor Processing Unit (TPU), you can follow these steps:
1. Data Preparation: First, prepare your dataset with the independent and dependent variables.
Ensure that the data is properly preprocessed and cleaned.
2. Model Definition: Define a linear regression model. In this case, the model consists of a
linear equation (y = mx + b). TPUs are highly efficient at handling mathematical
computations, making them suitable for linear regression tasks.
3. Training: Use your dataset to train the linear regression model. During training, the TPU
will compute the optimal values for 'm' and 'b' by minimizing the loss function, typically
mean squared error (MSE), through gradient descent or another optimization method.
4. Inference: Once the model is trained, you can use it for inference. Provide new independent
variables ('x') to the model, and the TPU will quickly calculate the predicted values ('y')
based on the learned coefficients ('m' and 'b').
TPUs offer significant advantages for executing linear regression and other machine learning tasks
due to their high computational power and parallel processing capabilities. They can handle large
datasets and complex models efficiently, reducing the time required for training and inference,
which is particularly beneficial for real-time and large-scale applications.
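On the hardware side, the inference step described above reduces to a single multiply-accumulate per prediction. A minimal Verilog sketch of that datapath is shown below; the module name, port names, and fixed-point widths are illustrative assumptions rather than the project's own module.

// Illustrative inference datapath for y = m*x + b (names and widths are assumed).
module linreg_infer (
    input                    clk,
    input  signed [7:0]      x,    // new independent variable
    input  signed [7:0]      m,    // learned slope
    input  signed [15:0]     b,    // learned intercept
    output reg signed [16:0] y     // predicted value
);
    always @(posedge clk)
        y <= m * x + b;            // one multiply-accumulate per prediction
endmodule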
4.3 Data Analysis
The values obtained from the experiment were compared with the theoretically calculated values to
assess output quality.
4.4 Results
The matrix multiplication block and the overall TPU core successfully performed the required
numerical calculations.
5.1 Novelty
The project's novelty lies in its innovative approach to hardware acceleration using Field-
Programmable Gate Arrays (FPGAs) to replicate the functionality of Tensor Processing Units
(TPUs) for a specific deep learning architecture. While TPUs are well-established as dedicated
hardware accelerators for machine learning, the utilization of FPGAs to mimic TPU-like
operations represents a unique and forward-thinking concept.
This approach offers several key advantages. Firstly, it leverages the flexibility and re-
configurability of FPGAs to create a customizable and adaptable hardware accelerator. Unlike
fixed-function TPUs, FPGA-based TPUs can be reprogrammed to accommodate various deep
learning architectures, making them highly versatile for research and development purposes.
Furthermore, the project pioneers the implementation of a two-layer deep learning model on
FPGA, focusing on linear regression. By demonstrating the feasibility of FPGA-based
acceleration for machine learning tasks, it opens doors to new possibilities in the field of edge
computing, where customized hardware accelerators can enhance the performance of AI
applications in resource-constrained environments.
5.3 Investigations
7. Energy Efficiency: Depending on the FPGA's design and usage, it may not achieve the
same level of energy efficiency as dedicated TPUs, which are optimized for neural network
workloads.
8. Compatibility: FPGA-based TPUs do not seamlessly integrate with all machine learning
frameworks or software libraries, requiring additional development effort for compatibility.
9. Size and Form Factor: The physical size and form factor of FPGA boards may limit their
use in certain applications or deployment scenarios.
10. Scalability: Scaling up FPGA-based TPUs for large-scale machine learning tasks is
challenging and may require multiple FPGAs and complex interconnects.
Intellectual Property Rights: One of the foremost legal concerns is intellectual property rights. We
must be vigilant in ensuring that our project does not infringe upon any existing patents, copyrights,
or trademarks related to TPUs, FPGA technology, or machine learning algorithms. Proper due
diligence and clearance for the use of any proprietary technology or software are necessary.
Patent Applications: If our project results in novel inventions or innovations, we should consider
filing for patents to protect our intellectual property. Engaging with legal counsel experienced in
patent law can be beneficial in this regard.
Regulatory Compliance: Depending on the application of our project, we may need to consider
industry-specific regulations. For instance, in healthcare or autonomous vehicles, regulatory
compliance may be necessary to ensure safety and legal conformity.
5.7 Ethical Issues
Challenge - 1: Ensuring that the hardware components we used were ethically sourced, considering
factors like environmental impact and labor conditions.
Mitigation: We researched and selected suppliers with a commitment to ethical practices. Whenever
feasible, we opted for components with a minimal environmental footprint and adhered to ethical
labor standards.
Challenge -2: Ensuring the safety and reliability of our FPGA-based TPUs was paramount.
Mitigation: Rigorous testing, validation, and quality assurance measures were implemented to detect
and rectify any potential issues. We followed best practices in hardware design and testing, focusing
on safety and performance.
"Our team values diversity, fostering an inclusive environment where unique perspectives and
backgrounds enrich our collaborative efforts, fuel innovation, and drive our project's success."
7 Communication
8 Project Management and Cost Analysis
Project Management:
Effective project management was central to our success. We employed agile methodologies,
ensuring regular team meetings, task tracking, and adapting to evolving project needs. Clear
roles and responsibilities promoted collaboration, keeping us on track and focused.
Cost Analysis:
Our project's cost analysis encompassed FPGA development boards, hardware components, and
research materials.
The FPGA board was provided by the laboratory, so no additional expenditure was incurred on our part.
9 Future Work
Continuous Learning: Stay updated with evolving AI technologies and ethical practices.
10 References