Assign 01
Akash Maji

Question #1

Part A
The first variant uses (i, j, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 32.407530000 seconds for 2048x2048 matrix


Time Taken => 3020.189251000 seconds for 8192x8192 matrix

Loop Order: (i, j, k)


The second variant uses (j, i, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (j, i, k)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 28.018704000 seconds for 2048x2048 matrix


Time Taken => 2519.410180000 seconds for 8192x8192 matrix

Loop Order: (j, i, k)


The third variant uses (k, i, j) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (k, i, j)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 23.626690000 seconds for 2048x2048 matrix


Time Taken => 1630.848939000 seconds for 8192x8192 matrix

Loop Order: (k, i, j)


The order of efficiency in terms of time consumption (from slowest to fastest) is:
(i, j, k) < (j, i, k) < (k, i, j)
This is expected for row-major arrays: with (k, i, j) the innermost loop walks matrixB and matrixC along a row (unit stride), whereas with (i, j, k) the innermost loop walks matrixB down a column, touching a new cache line on almost every iteration.
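For reference, the times above can be obtained with a simple wall-clock harness around each kernel. The sketch below only illustrates the kind of measurement presumably used; the exact timing code is not shown in this report, and the use of clock_gettime with CLOCK_MONOTONIC is an assumption.

#include <cstdio>
#include <ctime>

// returns elapsed wall-clock seconds between two timespecs
static double elapsed_seconds(const timespec &start, const timespec &end){
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(){
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... run one of the (i, j, k) / (j, i, k) / (k, i, j) kernels here ...
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Time Taken => %.9f seconds\n", elapsed_seconds(t0, t1));
    return 0;
}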
Part B

For the 2048x2048 matrix multiplication using huge pages

/*
Some Calculation on large pages (2MB)
matrix size : 2048 * 2048 elements
: 2^22 ints
: 2^24 B
: 16 MB
: 8 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 8 huge pages each
: 24 huge pages
That means, we must have 24 huge pages each of size 2MB allocated
*/

For the 8192x8192 matrix multiplication using huge pages


/*
Some Calculation on large pages (2MB)
matrix size : 8192 * 8192 elements
: 2^26 ints
: 2^28 B
: 256 MB
: 128 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 128 huge pages each
: 384 huge pages
That means, we must have 384 huge pages each of size 2MB allocated
*/

// allocate memory dynamically to matrices A, B and C
// using mmap() with appropriate size and flags
int *matrixA = (int*)mmap(NULL,
                          NMAP_ALLOC_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                          -1, 0);
/*
Do your work here
*/

// deallocate memory allocated by mmap() using munmap()
munmap(matrixA, NMAP_ALLOC_SIZE);
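Note that mmap() with MAP_HUGETLB fails unless enough 2MB huge pages have been reserved in advance (24 pages for the 2048x2048 case, 384 for the 8192x8192 case, per the calculations above), and it reports failure through MAP_FAILED rather than NULL. Below is a minimal sketch with error checking; the helper name is illustrative, and it assumes a Linux system where <sys/mman.h> exposes MAP_HUGE_2MB.

// reserve the pool first, e.g.:  sudo sysctl vm.nr_hugepages=384
#include <sys/mman.h>   // on older toolchains MAP_HUGE_2MB may instead come from <linux/mman.h>
#include <cstdio>
#include <cstdlib>

// illustrative helper: allocate 'bytes' backed by 2MB huge pages, or exit on failure
static int* alloc_huge_2mb(size_t bytes){
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                   -1, 0);
    if(p == MAP_FAILED){              // mmap() signals failure with MAP_FAILED, not NULL
        perror("mmap(MAP_HUGETLB)");
        exit(EXIT_FAILURE);
    }
    return (int*)p;
}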
The first variant uses (i, j, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 34.480703000 seconds for 2048x2048 matrix


Time Taken => 5426.222405000 seconds for 8192x8192 matrix

Loop Order: (i, j, k)


The second variant uses (j, i, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (j, i, k)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 32.050390000 seconds for 2048x2048 matrix


Time Taken => 5272.878529000 seconds for 8192x8192 matrix

Loop Order: (j, i, k)


The third variant uses (k, i, j) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (k, i, j)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 19.787064000 seconds for 2048x2048 matrix


Time Taken => 1343.409960000 seconds for 8192x8192 matrix

Loop Order: (k, i, j)


The order of efficiency in terms of time consumption using huge pages (from slowest to fastest) is:
(i, j, k) < (j, i, k) < (k, i, j)
Part C
We now do the matrix multiplication in a tiled fashion, as shown below (here we only show the (i, j, k) loop order):
// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (i, j, k)
for(int i = 0; i < ROWS; i += BLOCK_SIZE){
    for(int j = 0; j < COLS; j += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 29.406290000 seconds for 2048x2048 matrix


Time Taken => 2086.445759000 seconds for 8192x8192 matrix
(k, i, j) order

// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (k, i, j)
for(int k = 0; k < SIZE; k += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int j = 0; j < COLS; j += BLOCK_SIZE){
            for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 28.364288000 seconds for 2048x2048 matrix


Time Taken => 2059.293776000 seconds for 8192x8192 matrix

(j, i, k) order
// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (j, i, k)
for(int j = 0; j < COLS; j += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 29.968686000 seconds for 2048x2048 matrix


Time Taken => 2103.699815000 seconds for 8192x8192 matrix
In this case, we do not see much time saving across the three loop orders; the differences in time taken are only marginal.

However, surprisingly, (i, j, k) now performs slightly better than (j, i, k), while (k, i, j) still takes the least amount of time. This is in contrast to the non-tiled versions, where (i, j, k) was the worst performer.
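One way to see why tiling narrows the gap between the loop orders is to estimate the working set of a single tile step; the cache sizes mentioned below are typical values, not measurements from the test machine.

/*
Tile working-set estimate (BLOCK_SIZE = 64, 4-byte ints):
one 64x64 tile      : 64 * 64 * 4 B = 16 KB
three tiles (A,B,C) : 3 * 16 KB     = 48 KB
48 KB comfortably fits in a typical L2 cache, so within one tile step every
element stays cached and is reused about 64 times regardless of how the
inner ii/jj/kk loops are ordered. The loop order therefore matters much
less, which is why the three tiled variants show only marginal differences.
*/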
Question #2

Part A

The 3 cloudsuite traces are:


1. Trace1: nutch_phase2_core1.trace.xz
2. Trace2: cloud9_phase1_core1.trace.xz
3. Trace3: classification_phase0_core3.trace.xz

Note: We are using 10M warmup and 500M actual simulation instructions.
The observed parameters are:

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace1   16.5163     1.36976    93.499%                Bimodal
Trace1   0.005922    2.05911    99.9977%               GShare
Trace1   0.00844     2.059      99.9967%               Perceptron
Trace1   0.006368    2.05922    99.9975%               TAGE

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace2   4.374       0.393449   97.9909%               Bimodal
Trace2   0.73988     0.409175   99.6602%               GShare
Trace2   0.416594    0.410222   99.8086%               Perceptron
Trace2   0.117964    0.409526   99.9458%               TAGE

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace3   4.42255     0.361737   96.6896%               Bimodal
Trace3   3.08063     0.364767   97.6941%               GShare
Trace3   2.48727     0.366098   98.1382%               Perceptron
Trace3   1.6329      0.370232   98.7777%               TAGE
Storage Justification:

Bimodal Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our bimodal predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64K Bytes / 2 bits
= 64 * 8 K bits / 2 bits
= 256 K
= 262144
Largest prime less than 262144 = 262139
*/

#define BIMODAL_TABLE_SIZE 262144
#define BIMODAL_PRIME 262139
#define MAX_COUNTER 3
int bimodal_table[NUM_CPUS][BIMODAL_TABLE_SIZE];
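For reference, a minimal sketch of how such a table of 2-bit saturating counters is consulted and updated. The function names are illustrative and this is not the exact ChampSim code; it assumes <cstdint> and the definitions above are in scope.

// illustrative bimodal lookup/update using the table declared above
uint32_t bimodal_hash(uint64_t ip){
    return ip % BIMODAL_PRIME;                        // hash the branch PC into the table
}

uint8_t bimodal_predict(uint8_t cpu, uint64_t ip){
    return bimodal_table[cpu][bimodal_hash(ip)] >= ((MAX_COUNTER + 1) / 2);  // taken if counter is in the upper half
}

void bimodal_update(uint8_t cpu, uint64_t ip, uint8_t taken){
    int &ctr = bimodal_table[cpu][bimodal_hash(ip)];
    if(taken && ctr < MAX_COUNTER) ctr++;             // saturate at MAX_COUNTER
    if(!taken && ctr > 0) ctr--;                      // saturate at 0
}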

GShare Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our gshare predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64 * 8 K bits / 2 bits
= 256 K
= 262144

*/
#define GLOBAL_HISTORY_LENGTH 16
#define GLOBAL_HISTORY_MASK ((1 << GLOBAL_HISTORY_LENGTH) - 1)
int branch_history_vector[NUM_CPUS];

#define GS_HISTORY_TABLE_SIZE 262144
int gs_history_table[NUM_CPUS][GS_HISTORY_TABLE_SIZE];
int my_last_prediction[NUM_CPUS];
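A minimal sketch of the gshare index computation, which folds the branch PC with the global history before indexing the counter table. This is illustrative rather than the exact ChampSim code, and it assumes <cstdint> and the definitions above are in scope.

// illustrative gshare indexing and history update
uint32_t gs_table_hash(uint64_t ip, int bh_vector){
    uint32_t hash = (uint32_t)(ip ^ (ip >> GLOBAL_HISTORY_LENGTH)) ^ (uint32_t)bh_vector;  // XOR PC bits with global history
    return hash % GS_HISTORY_TABLE_SIZE;
}

void update_global_history(uint8_t cpu, uint8_t taken){
    branch_history_vector[cpu] <<= 1;                   // shift in the newest outcome
    branch_history_vector[cpu] &= GLOBAL_HISTORY_MASK;  // keep only GLOBAL_HISTORY_LENGTH bits
    branch_history_vector[cpu] |= taken;
}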

Perceptron Predictor

/*
-----------------------
Storage Cost Estimation:
-----------------------
Each perceptron has N=32 weights and each weight takes 8 bits
Each perceptron is thus 32 Bytes
We have a table of NUM_PERCEPTRONS=2048 perceptrons
Perceptron Table Size = 2048 * 32 Bytes = 64 KB

Update Table Entry Size = 32 + 1 + 1 + 11 = 45 bits, i.e. at most 6 B

Update Table Size = 256 * 6 B = 1536 B < 2 KB
Total size taken is within 64 + 2 = 66 KB
*/
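The prediction behind these sizes is a dot product of one perceptron's weights with the global branch history. Below is a minimal sketch using the sizes assumed above (N = 32 signed 8-bit weights, 2048 perceptrons); all names here are illustrative, not the exact predictor code.

// illustrative perceptron prediction matching the storage estimate above
#include <cstdint>

#define NUM_WEIGHTS     32
#define NUM_PERCEPTRONS 2048

int8_t perceptron_weights[NUM_PERCEPTRONS][NUM_WEIGHTS];  // 2048 * 32 B = 64 KB
int    history_bits[NUM_WEIGHTS];                         // +1 for taken, -1 for not taken

int perceptron_output(uint64_t ip){
    int idx = ip % NUM_PERCEPTRONS;                       // select the perceptron for this branch
    int y = perceptron_weights[idx][0];                   // weight 0 acts as the bias
    for(int i = 1; i < NUM_WEIGHTS; i++)
        y += perceptron_weights[idx][i] * history_bits[i];
    return y;                                             // predict taken if y >= 0
}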

TAGE Predictor
/*
---------------------------
Storage Budget Justification:
---------------------------
We are using 12 TAGE tables with per-entry sizes (in bits):
[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12]
since the tag widths (in bits) are:
[19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7]
plus 5 bits for the 'ctr' and 'u' fields.

There are 2^11 entries in each TAGE table (TAGE_BITS = 11).

So TAGE Tables Size = sum([24, 22, 20, 20, 18, 18, 16, 16,
                           14, 14, 12, 12]) bits * 2^11
                    = 206 bits * 2048
                    = 421888 bits ~= 52 KB
Bimodal Table Size  = 2^13 entries each of 2 bits
                    = 2^14 bits
                    = 2^11 B
                    = 2 KB

Total Size = 52 + 2 = 54 KB, which is within the 64 KB budget
*/
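The per-entry sizes above correspond to a tagged entry of roughly the following shape. This is an illustrative sketch, not the exact predictor code; the field widths follow the calculation, with the 5 bits for 'ctr' and 'u' assumed to split as 3 + 2.

#include <cstdint>

// illustrative TAGE table entry: tag (7..19 bits depending on the table)
// + ctr (3 bits) + u (2 bits) = 12..24 bits per entry
struct tage_entry {
    uint32_t tag : 19;   // widest tag used; narrower tables store fewer tag bits
    uint8_t  ctr : 3;    // prediction counter
    uint8_t  u   : 2;    // usefulness counter for the replacement policy
};
// each of the 12 tables holds 2^11 = 2048 such entries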

How to build and run these traces:

We use ChampSim-IISc (a modified version of ChampSim adapted for this purpose) as our simulation tool.
We navigate to the folder containing all the files and run these commands:

# Build 4 predictors as:


# ./build_champsim_iisc.sh bimodal no no no next_line lru 1
# ./build_champsim_iisc.sh gshare no no no next_line lru 1
# ./build_champsim_iisc.sh perceptron no no no next_line lru 1
# ./build_champsim_iisc.sh tage no no no next_line lru 1

# Run 3 traces with bimodal


# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with gshare


# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with perceptron


# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with tage


# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace3.xz

Part B

We are first using a fixed number of TAGE Tables (say 8).


Case-1
We use TAGE tables of different sizes (at most 2^13 entries each) and the history lengths specified below, and first check whether this setting gives any improvement.

// Global history lengths according to the adjustment
History lengths = [4 6 10 16 26 42 67 107]

// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 10, 10};

// Clearly there are 8 TAGE tables
// Assume each has a different table size (max is 2^13 entries)
When we simulate 510M instructions (10M warmup + 500M simulation):

Trace    MPKI         IPC        Prediction Accuracy    Predictor Used
Trace1   0.00455686   2.05965    99.9982%               TAGE
Trace2   0.0957255    0.409588   99.956%                TAGE
Trace3   1.97284      0.368013   98.523%                TAGE
Best Trace
Case-2
We use TAGE tables of different sizes (again at most 2^13 entries each, with varying sizes as before) and history lengths in geometric progression, and then check whether this setting gives any improvement.

// Global history lengths according to the GP
History lengths = [4 8 16 32 64 128 256 512]

// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 11, 10};

// There are still 8 TAGE tables
// And each table has a different size (max is 2^13 entries)
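The geometric history lengths above follow L(i) = L(1) * r^(i-1) with L(1) = 4 and common ratio r = 2. A small standalone sketch of how they can be generated; the names here are illustrative and deliberately distinct from the predictor's own constants.

// generate the geometric history lengths [4 8 16 32 64 128 256 512]
#define NUM_TAGE_TABLES 8
int history_lengths[NUM_TAGE_TABLES];

void init_history_lengths(){
    int len = 4;                                 // L(1) = 4
    for(int i = 0; i < NUM_TAGE_TABLES; i++){
        history_lengths[i] = len;
        len *= 2;                                // common ratio r = 2
    }
}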

When we simulate with 500M instructions:

Trace    MPKI   IPC   Prediction Accuracy   Predictor Used
Trace1                                      TAGE
Trace2                                      TAGE
Trace3                                      TAGE
Best Trace
Part C
