Assign 01
Part A
The first variant uses (i, j, k) as the loop order
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
for(int j = 0; j < COLS; j++){
for(int k = 0; k < SIZE; k++){
matrixC[i][j] += (matrixA[i][k]*matrixB[k][j]);
}
}
}
/*
Huge page (2 MB) calculation
matrix size : 2048 * 2048 elements
: 2^22 ints
: 2^24 B (4 B per int)
: 16 MB
: 8 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 8 huge pages
: 24 huge pages
So 24 huge pages of 2 MB each must be allocated.
*/
}
}
}
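The huge page arithmetic above can be checked with a few lines of code. This is only a sketch: the helper names (`huge_pages_per_matrix`, `huge_pages_total`) are hypothetical and not part of the assignment code, and it assumes 4-byte ints and square matrices.

```cpp
#include <cstddef>
#include <cstdint>

// Size of one huge page in bytes (2 MB), as in the calculation above.
constexpr std::size_t HUGE_PAGE_BYTES = 2u * 1024 * 1024;

// Huge pages needed for one n x n matrix, rounded up to whole pages.
// Assumption: elements are 4-byte ints unless elem_bytes says otherwise.
constexpr std::size_t huge_pages_per_matrix(std::size_t n,
                                            std::size_t elem_bytes = sizeof(std::int32_t)) {
    return (n * n * elem_bytes + HUGE_PAGE_BYTES - 1) / HUGE_PAGE_BYTES;
}

// Huge pages needed for the three matrices A, B and C.
constexpr std::size_t huge_pages_total(std::size_t n) {
    return 3 * huge_pages_per_matrix(n);
}
```

For n = 2048 this gives 8 pages per matrix and 24 pages in total, matching the figures in the comment.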
(j, i, k) order
// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (j, i, k)
for(int j = 0; j < COLS; j += BLOCK_SIZE){
for(int i = 0; i < ROWS; i += BLOCK_SIZE){
for(int k = 0; k < SIZE; k += BLOCK_SIZE){
}
}
}
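The tile loops above are shown without their bodies. As a rough illustration of what a tiled multiplication with (j, i, k) tile order could look like, here is a minimal sketch. The function name `matmul_tiled_jik`, the `Matrix` alias, and the loop order inside each tile are assumptions for illustration and may differ from the code actually used in the assignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Tiled C += A * B with tile loop order (j, i, k).
// A is rows x inner, B is inner x cols, C is rows x cols.
// Assumption: block > 0; edge tiles are clipped with std::min.
void matmul_tiled_jik(const Matrix& A, const Matrix& B, Matrix& C, std::size_t block) {
    const std::size_t rows = A.size();
    const std::size_t inner = B.size();
    const std::size_t cols = B.empty() ? 0 : B[0].size();

    for (std::size_t j = 0; j < cols; j += block) {
        for (std::size_t i = 0; i < rows; i += block) {
            for (std::size_t k = 0; k < inner; k += block) {
                const std::size_t jEnd = std::min(j + block, cols);
                const std::size_t iEnd = std::min(i + block, rows);
                const std::size_t kEnd = std::min(k + block, inner);
                // Multiply one tile of A by one tile of B and
                // accumulate the result into the matching tile of C.
                for (std::size_t ii = i; ii < iEnd; ++ii) {
                    for (std::size_t kk = k; kk < kEnd; ++kk) {
                        const int a = A[ii][kk];
                        for (std::size_t jj = j; jj < jEnd; ++jj) {
                            C[ii][jj] += a * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}
```

C must be zero-initialized before the call, since the function only accumulates into it.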
Surprisingly, in the tiled versions (i, j, k) performs better than (j, i, k), and (k, i, j) takes the least time.
This contrasts with the non-tiled versions, where (i, j, k) was the worst-performing order.
Question #2
Part A
Note: We use 10M warmup instructions and 500M simulation instructions.
The observed parameters are:
Bimodal Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our bimodal predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64K Bytes / 2 bits
= 64 * 8 K bits / 2 bits
= 256 K
= 262144
Largest prime less than 262144 = 262139
*/
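To illustrate how a prime-sized table of 2-bit saturating counters might be indexed and updated, here is a small sketch. The struct `BimodalSketch`, the modulo hash on the branch address, and the initial counter value of 0 are assumptions for illustration, not necessarily what the submitted predictor does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Largest prime below 262144, as derived in the storage estimate above.
constexpr std::uint32_t BIMODAL_TABLE_SIZE = 262139;
constexpr std::uint8_t COUNTER_MAX = 3;  // 2-bit counter saturates at 3

// 2 bits per entry must fit in the 64 KB budget.
static_assert(BIMODAL_TABLE_SIZE * 2ull <= 64ull * 1024 * 8, "table exceeds 64 KB");

struct BimodalSketch {
    // One byte per counter for simplicity; the modelled hardware cost is 2 bits each.
    std::vector<std::uint8_t> counters = std::vector<std::uint8_t>(BIMODAL_TABLE_SIZE, 0);

    // Assumed hash: branch address modulo the prime table size.
    static std::size_t index(std::uint64_t ip) {
        return static_cast<std::size_t>(ip % BIMODAL_TABLE_SIZE);
    }

    // Predict taken when the counter is in its upper half (2 or 3).
    bool predict(std::uint64_t ip) const { return counters[index(ip)] >= 2; }

    // Move the counter toward the outcome, saturating at 0 and COUNTER_MAX.
    void update(std::uint64_t ip, bool taken) {
        std::uint8_t& c = counters[index(ip)];
        if (taken && c < COUNTER_MAX) ++c;
        else if (!taken && c > 0) --c;
    }
};
```

A prime table size means the modulo hash depends on all bits of the address rather than only the low ones, at the cost of leaving 5 of the 262144 possible entries unused.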
GShare Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our gshare predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64 * 8 K bits / 2 bits
= 256 K
= 262144
*/
#define GLOBAL_HISTORY_LENGTH 16
#define GLOBAL_HISTORY_MASK ((1 << GLOBAL_HISTORY_LENGTH) - 1)
int branch_history_vector[NUM_CPUS];
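As a rough sketch of how a 16-bit global history and a 262144-entry table could be combined, the snippet below XORs the branch address with the history and shifts each new outcome into the history register. The names `gshare_index` and `update_history`, and the exact hash, are assumptions for illustration; the real predictor may hash differently.

```cpp
#include <cstdint>

constexpr unsigned HISTORY_LENGTH = 16;
constexpr std::uint32_t HISTORY_MASK = (1u << HISTORY_LENGTH) - 1;
constexpr std::uint32_t GSHARE_TABLE_SIZE = 262144;  // 2^18 two-bit counters

// Assumed hash: XOR the branch address with the global history,
// then reduce modulo the table size.
constexpr std::uint32_t gshare_index(std::uint64_t ip, std::uint32_t history) {
    return static_cast<std::uint32_t>((ip ^ history) % GSHARE_TABLE_SIZE);
}

// Shift the newest outcome into the history and keep only the
// lowest HISTORY_LENGTH bits.
constexpr std::uint32_t update_history(std::uint32_t history, bool taken) {
    return ((history << 1) | (taken ? 1u : 0u)) & HISTORY_MASK;
}
```

With a 16-bit history and an 18-bit index, the history only perturbs the low 16 bits of the index; the top two bits come from the branch address alone.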
Perceptron Predictor
/*
-----------------------
Storage Cost Estimation:
-----------------------
Each perceptron has N=32 weights and each weight takes 8 bits
Each perceptron is thus 32 Bytes
We have a table of NUM_PERCEPTRONS=2048 perceptrons
Perceptron Table Size = 2048*32 Bytes = 64 KB
*/
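The storage estimate above can be made concrete with a sketch of one perceptron's prediction and training step. Several details here are assumptions rather than facts from the report: weight 0 is treated as a bias so that 31 history bits are used, the training threshold `theta` is passed in by the caller, and the names `perceptron_output` and `perceptron_train` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t NUM_WEIGHTS = 32;        // N = 32 weights of 8 bits each
constexpr std::size_t NUM_PERCEPTRONS = 2048;  // table entries
static_assert(NUM_PERCEPTRONS * NUM_WEIGHTS * sizeof(std::int8_t) == 64 * 1024,
              "perceptron table should be exactly 64 KB");

using Perceptron = std::array<std::int8_t, NUM_WEIGHTS>;

// Dot product of the weights with the history, where a taken bit counts
// as +1 and a not-taken bit as -1. weights[0] is the bias (assumption).
int perceptron_output(const Perceptron& w, std::uint32_t history) {
    int sum = w[0];
    for (std::size_t i = 1; i < NUM_WEIGHTS; ++i) {
        const bool bit = ((history >> (i - 1)) & 1u) != 0;
        sum += bit ? w[i] : -w[i];
    }
    return sum;  // predict taken when sum >= 0
}

// Train on a misprediction or when the output magnitude is within theta.
void perceptron_train(Perceptron& w, std::uint32_t history, bool taken, int theta) {
    const int out = perceptron_output(w, history);
    if ((out >= 0) == taken && std::abs(out) > theta) return;
    // Saturating step so each weight stays within its 8-bit range.
    auto bump = [](std::int8_t& weight, bool up) {
        if (up && weight < 127) ++weight;
        else if (!up && weight > -128) --weight;
    };
    bump(w[0], taken);
    for (std::size_t i = 1; i < NUM_WEIGHTS; ++i) {
        const bool bit = ((history >> (i - 1)) & 1u) != 0;
        bump(w[i], bit == taken);  // strengthen weights that agree with the outcome
    }
}
```

The `static_assert` restates the 2048 * 32 B = 64 KB figure so that a change to either constant is caught at compile time.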
TAGE Predictor
/*
---------------------------
Storage Budget Justification:
---------------------------
We use 12 TAGE tables with entry sizes (in bits) of:
[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12]
Each entry holds a tag of width (tag_width):
[19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7]
plus 5 bits for the 'ctr' and 'u' fields.
*/
We use ChampSim-IISc (a modified version of ChampSim adapted for this purpose) as our simulation tool.
We navigate to the folder containing all the files:
We run these commands:
Part B
// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 10, 10};
// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 11, 10};
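Since each table has 2^index_bits entries, the two index-bit configurations above can be compared by total entry count. The helper below is a hypothetical sketch (`tage_total_entries` is not from the assignment code) and counts entries only; converting to bytes would also need the per-entry widths for these 8-component configurations, which are not listed here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sum of 2^index_bits over all tagged tables.
template <std::size_t N>
std::uint64_t tage_total_entries(const std::array<std::uint8_t, N>& index_bits) {
    std::uint64_t total = 0;
    for (std::size_t t = 0; t < N; ++t) {
        total += std::uint64_t{1} << index_bits[t];
    }
    return total;
}
```

Under this count, {13, 13, 12, 12, 11, 11, 10, 10} gives 30720 entries and {13, 13, 12, 12, 11, 11, 11, 10} gives 31744, so the second configuration adds 1024 entries in the seventh table.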