Assign 01
Akash Maji

Question #1

Part A
The first variant uses (i, j, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 32.407530000 seconds for 2048x2048 matrix


Time Taken => 3020.189251000 seconds for 8192x8192 matrix

Loop Order: (i, j, k)


The second variant uses (j, i, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (j, i, k)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 28.018704000 seconds for 2048x2048 matrix


Time Taken => 2519.410180000 seconds for 8192x8192 matrix

Loop Order: (j, i, k)


The third variant uses (k, i, j) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (k, i, j)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i][j] += (matrixA[i][k] * matrixB[k][j]);
        }
    }
}

Time Taken => 23.626690000 seconds for 2048x2048 matrix


Time Taken => 1630.848939000 seconds for 8192x8192 matrix

Loop Order: (k, i, j)


The order of efficiency in terms of time consumption (from slowest to fastest) is:
(i, j, k) < (j, i, k) < (k, i, j)
This is expected for row-major arrays: with (k, i, j) the innermost loop walks matrixB and matrixC along a row (unit stride), whereas with (i, j, k) the innermost loop walks matrixB down a column, touching a new cache line on almost every iteration.
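For reference, the times above can be obtained with a simple wall-clock harness around each kernel. The sketch below only illustrates the kind of measurement presumably used; the exact timing code is not shown in this report, and the use of clock_gettime with CLOCK_MONOTONIC is an assumption.

#include <cstdio>
#include <ctime>

// returns elapsed wall-clock seconds between two timespecs
static double elapsed_seconds(const timespec &start, const timespec &end){
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(){
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... run one of the (i, j, k) / (j, i, k) / (k, i, j) kernels here ...
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Time Taken => %.9f seconds\n", elapsed_seconds(t0, t1));
    return 0;
}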
Part B

For the 2048x2048 matrix multiplication using huge pages

/*
Some Calculation on large pages (2MB)
matrix size : 2048 * 2048 elements
: 2^22 ints
: 2^24 B
: 16 MB
: 8 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 8 huge pages each
: 24 huge pages
That means, we must have 24 huge pages each of size 2MB allocated
*/

For the 8192x8192 matrix multiplication using huge pages


/*
Some Calculation on large pages (2MB)
matrix size : 8192 * 8192 elements
: 2^26 ints
: 2^28 B
: 256 MB
: 128 huge pages per matrix
total size : 3 matrices A, B, C
: 3 * 128 huge pages each
: 384 huge pages
That means, we must have 384 huge pages each of size 2MB allocated
*/

// allocate memory dynamically to matrices A, B and C
// using mmap() with appropriate size and flags
int *matrixA = (int*)mmap(NULL,
                          NMAP_ALLOC_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                          -1, 0);
/*
Do your work here
*/

// deallocate memory allocated by mmap() using munmap()
munmap(matrixA, NMAP_ALLOC_SIZE);
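Note that mmap() with MAP_HUGETLB fails unless enough 2MB huge pages have been reserved in advance (24 pages for the 2048x2048 case, 384 for the 8192x8192 case, per the calculations above), and it reports failure through MAP_FAILED rather than NULL. Below is a minimal sketch with error checking; the helper name is illustrative, and it assumes a Linux system where <sys/mman.h> exposes MAP_HUGE_2MB.

// reserve the pool first, e.g.:  sudo sysctl vm.nr_hugepages=384
#include <sys/mman.h>   // on older toolchains MAP_HUGE_2MB may instead come from <linux/mman.h>
#include <cstdio>
#include <cstdlib>

// illustrative helper: allocate 'bytes' backed by 2MB huge pages, or exit on failure
static int* alloc_huge_2mb(size_t bytes){
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                   -1, 0);
    if(p == MAP_FAILED){              // mmap() signals failure with MAP_FAILED, not NULL
        perror("mmap(MAP_HUGETLB)");
        exit(EXIT_FAILURE);
    }
    return (int*)p;
}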
The first variant uses (i, j, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (i, j, k)
for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 34.480703000 seconds for 2048x2048 matrix


Time Taken => 5426.222405000 seconds for 8192x8192 matrix

Loop Order: (i, j, k)


The second variant uses (j, i, k) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (j, i, k)
for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
        for(int k = 0; k < SIZE; k++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 32.050390000 seconds for 2048x2048 matrix


Time Taken => 5272.878529000 seconds for 8192x8192 matrix

Loop Order: (j, i, k)


The third variant uses (k, i, j) as the loop order:
// do the matrix multiplication C = A * B
// loop order is (k, i, j)
for(int k = 0; k < SIZE; k++){
    for(int i = 0; i < ROWS; i++){
        for(int j = 0; j < COLS; j++){
            matrixC[i * COLS + j] += (matrixA[i * COLS + k] * matrixB[k * COLS + j]);
        }
    }
}

Time Taken => 19.787064000 seconds for 2048x2048 matrix


Time Taken => 1343.409960000 seconds for 8192x8192 matrix

Loop Order: (k, i, j)


The order of efficiency in terms of time consumption using huge pages (from slowest to fastest) is:
(i, j, k) < (j, i, k) < (k, i, j)
Part C
We now do the matrix multiplication in a tiled fashion, as shown below (here we only show the (i, j, k) loop order):
// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (i, j, k)
for(int i = 0; i < ROWS; i += BLOCK_SIZE){
    for(int j = 0; j < COLS; j += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 29.406290000 seconds for 2048x2048 matrix


Time Taken => 2086.445759000 seconds for 8192x8192 matrix
(k, i, j) order

// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (k, i, j)
for(int k = 0; k < SIZE; k += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int j = 0; j < COLS; j += BLOCK_SIZE){
            for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 28.364288000 seconds for 2048x2048 matrix


Time Taken => 2059.293776000 seconds for 8192x8192 matrix

(j, i, k) order
// do matrix multiplication in a tiled fashion of tile-size=64
// loop order: (j, i, k)
for(int j = 0; j < COLS; j += BLOCK_SIZE){
    for(int i = 0; i < ROWS; i += BLOCK_SIZE){
        for(int k = 0; k < SIZE; k += BLOCK_SIZE){
            for(int jj = j; jj < j + BLOCK_SIZE; jj++){
                for(int ii = i; ii < i + BLOCK_SIZE; ii++){
                    for(int kk = k; kk < k + BLOCK_SIZE; kk++){
                        matrixC[ii][jj] += (matrixA[ii][kk] * matrixB[kk][jj]);
                    }
                }
            }
        }
    }
}

Time Taken => 29.968686000 seconds for 2048x2048 matrix


Time Taken => 2103.699815000 seconds for 8192x8192 matrix
In this case, we do not see much time saving across the three loop orders; the differences in time taken are only marginal.

However, surprisingly, (i, j, k) now performs slightly better than (j, i, k), while (k, i, j) still takes the least amount of time. This is in contrast to the non-tiled versions, where (i, j, k) was the worst performer.
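One way to see why tiling narrows the gap between the loop orders is to estimate the working set of a single tile step; the cache sizes mentioned below are typical values, not measurements from the test machine.

/*
Tile working-set estimate (BLOCK_SIZE = 64, 4-byte ints):
one 64x64 tile      : 64 * 64 * 4 B = 16 KB
three tiles (A,B,C) : 3 * 16 KB     = 48 KB
48 KB comfortably fits in a typical L2 cache, so within one tile step every
element stays cached and is reused about 64 times regardless of how the
inner ii/jj/kk loops are ordered. The loop order therefore matters much
less, which is why the three tiled variants show only marginal differences.
*/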
Question #2

Part A

The 3 cloudsuite traces are:


1. Trace1: nutch_phase2_core1.trace.xz
2. Trace2: cloud9_phase1_core1.trace.xz
3. Trace3: classification_phase0_core3.trace.xz

Note: We are using 10M warmup and 500M actual simulation instructions.
The observed parameters are:

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace1   16.5163     1.36976    93.499%                Bimodal
Trace1   0.005922    2.05911    99.9977%               GShare
Trace1   0.00844     2.059      99.9967%               Perceptron
Trace1   0.006368    2.05922    99.9975%               TAGE

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace2   4.374       0.393449   97.9909%               Bimodal
Trace2   0.73988     0.409175   99.6602%               GShare
Trace2   0.416594    0.410222   99.8086%               Perceptron
Trace2   0.117964    0.409526   99.9458%               TAGE

Trace    MPKI        IPC        Prediction Accuracy    Predictor Used
Trace3   4.42255     0.361737   96.6896%               Bimodal
Trace3   3.08063     0.364767   97.6941%               GShare
Trace3   2.48727     0.366098   98.1382%               Perceptron
Trace3   1.6329      0.370232   98.7777%               TAGE
Storage Justification:

Bimodal Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our bimodal predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64K Bytes / 2 bits
= 64 * 8 K bits / 2 bits
= 256 K
= 262144
Largest prime less than 262144 = 262139
*/

#define BIMODAL_TABLE_SIZE 262144
#define BIMODAL_PRIME 262139
#define MAX_COUNTER 3
int bimodal_table[NUM_CPUS][BIMODAL_TABLE_SIZE];
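For reference, a minimal sketch of how such a table of 2-bit saturating counters is consulted and updated. The function names are illustrative and this is not the exact ChampSim code; it assumes <cstdint> and the definitions above are in scope.

// illustrative bimodal lookup/update using the table declared above
uint32_t bimodal_hash(uint64_t ip){
    return ip % BIMODAL_PRIME;                        // hash the branch PC into the table
}

uint8_t bimodal_predict(uint8_t cpu, uint64_t ip){
    return bimodal_table[cpu][bimodal_hash(ip)] >= ((MAX_COUNTER + 1) / 2);  // taken if counter is in the upper half
}

void bimodal_update(uint8_t cpu, uint64_t ip, uint8_t taken){
    int &ctr = bimodal_table[cpu][bimodal_hash(ip)];
    if(taken && ctr < MAX_COUNTER) ctr++;             // saturate at MAX_COUNTER
    if(!taken && ctr > 0) ctr--;                      // saturate at 0
}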

GShare Predictor
/*
-------------------
Storage Cost Estimation:
-------------------
We want to have at most 64KB for our gshare predictor per CPU
We will use 2-bit saturating counter
So entry size = 2 bits
Number of entries = 64 * 8 K bits / 2 bits
= 256 K
= 262144

*/
#define GLOBAL_HISTORY_LENGTH 16
#define GLOBAL_HISTORY_MASK ((1 << GLOBAL_HISTORY_LENGTH) - 1)
int branch_history_vector[NUM_CPUS];

#define GS_HISTORY_TABLE_SIZE 262144
int gs_history_table[NUM_CPUS][GS_HISTORY_TABLE_SIZE];
int my_last_prediction[NUM_CPUS];
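A minimal sketch of the gshare index computation, which folds the branch PC with the global history before indexing the counter table. This is illustrative rather than the exact ChampSim code, and it assumes <cstdint> and the definitions above are in scope.

// illustrative gshare indexing and history update
uint32_t gs_table_hash(uint64_t ip, int bh_vector){
    uint32_t hash = (uint32_t)(ip ^ (ip >> GLOBAL_HISTORY_LENGTH)) ^ (uint32_t)bh_vector;  // XOR PC bits with global history
    return hash % GS_HISTORY_TABLE_SIZE;
}

void update_global_history(uint8_t cpu, uint8_t taken){
    branch_history_vector[cpu] <<= 1;                   // shift in the newest outcome
    branch_history_vector[cpu] &= GLOBAL_HISTORY_MASK;  // keep only GLOBAL_HISTORY_LENGTH bits
    branch_history_vector[cpu] |= taken;
}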

Perceptron Predictor

/*
-----------------------
Storage Cost Estimation:
-----------------------
Each perceptron has N=32 weights and each weight takes 8 bits
Each perceptron is thus 32 Bytes
We have a table of NUM_PERCEPTRONS=2048 perceptrons
Perceptron Table Size = 2048 * 32 Bytes = 64 KB

Update Table Entry Size = 32 + 1 + 1 + 11 = 45 bits, i.e. at most 6 B

Update Table Size = 256 * 6 B = 1536 B < 2 KB
Total size taken is within 64 + 2 = 66 KB
*/
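The prediction behind these sizes is a dot product of one perceptron's weights with the global branch history. Below is a minimal sketch using the sizes assumed above (N = 32 signed 8-bit weights, 2048 perceptrons); all names here are illustrative, not the exact predictor code.

// illustrative perceptron prediction matching the storage estimate above
#include <cstdint>

#define NUM_WEIGHTS     32
#define NUM_PERCEPTRONS 2048

int8_t perceptron_weights[NUM_PERCEPTRONS][NUM_WEIGHTS];  // 2048 * 32 B = 64 KB
int    history_bits[NUM_WEIGHTS];                         // +1 for taken, -1 for not taken

int perceptron_output(uint64_t ip){
    int idx = ip % NUM_PERCEPTRONS;                       // select the perceptron for this branch
    int y = perceptron_weights[idx][0];                   // weight 0 acts as the bias
    for(int i = 1; i < NUM_WEIGHTS; i++)
        y += perceptron_weights[idx][i] * history_bits[i];
    return y;                                             // predict taken if y >= 0
}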

TAGE Predictor
/*
---------------------------
Storage Budget Justification:
---------------------------
We are using 12 TAGE tables with per-entry sizes (in bits):
[24, 22, 20, 20, 18, 18, 16, 16, 14, 14, 12, 12]
since the tag widths (in bits) are:
[19, 17, 15, 15, 13, 13, 11, 11, 9, 9, 7, 7]
plus 5 bits for the 'ctr' and 'u' fields.

There are 2^11 entries in each TAGE table (TAGE_BITS = 11).

So TAGE Tables Size = sum([24, 22, 20, 20, 18, 18, 16, 16,
                           14, 14, 12, 12]) bits * 2^11
                    = 206 bits * 2048
                    = 421888 bits ~= 52 KB
Bimodal Table Size  = 2^13 entries each of 2 bits
                    = 2^14 bits
                    = 2^11 B
                    = 2 KB

Total Size = 52 + 2 = 54 KB, which is within the 64 KB budget
*/
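The per-entry sizes above correspond to a tagged entry of roughly the following shape. This is an illustrative sketch, not the exact predictor code; the field widths follow the calculation, with the 5 bits for 'ctr' and 'u' assumed to split as 3 + 2.

#include <cstdint>

// illustrative TAGE table entry: tag (7..19 bits depending on the table)
// + ctr (3 bits) + u (2 bits) = 12..24 bits per entry
struct tage_entry {
    uint32_t tag : 19;   // widest tag used; narrower tables store fewer tag bits
    uint8_t  ctr : 3;    // prediction counter
    uint8_t  u   : 2;    // usefulness counter for the replacement policy
};
// each of the 12 tables holds 2^11 = 2048 such entries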

How to build and run these traces:

We use ChampSim-IISc (a modified version of ChampSim adapted for this purpose) as our simulation tool.
We navigate to the folder containing all the files and run these commands:

# Build 4 predictors as:


# ./build_champsim_iisc.sh bimodal no no no next_line lru 1
# ./build_champsim_iisc.sh gshare no no no next_line lru 1
# ./build_champsim_iisc.sh perceptron no no no next_line lru 1
# ./build_champsim_iisc.sh tage no no no next_line lru 1

# Run 3 traces with bimodal


# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh bimodal-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with gshare


# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh gshare-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with perceptron


# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh perceptron-no-no-no-next_line-lru-1core 10 500 trace3.xz

# Run 3 traces with tage


# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace1.xz
# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace2.xz
# sudo ./run_cloudsuite_iisc.sh tage-no-no-no-next_line-lru-1core 10 500 trace3.xz

Part B

We are first using a fixed number of TAGE Tables (say 8).


Case-1
We use TAGE tables of different sizes (at most 2^13 entries each) and the history lengths specified below, and first check whether this setting gives any improvement.

// Global history lengths according to the adjustment
History lengths = [4 6 10 16 26 42 67 107]

// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 10, 10};

// Clearly there are 8 TAGE tables
// Assume each has a different table size (max is 2^13 entries)
When we simulate 510M instructions (10M warmup + 500M simulation):

Trace    MPKI         IPC        Prediction Accuracy    Predictor Used
Trace1   0.00455686   2.05965    99.9982%               TAGE
Trace2   0.0957255    0.409588   99.956%                TAGE
Trace3   1.97284      0.368013   98.523%                TAGE
Best Trace
Case-2
We use TAGE tables of different sizes (again at most 2^13 entries each, with varying sizes as before) and history lengths in geometric progression, and then check whether this setting gives any improvement.

// Global history lengths according to the GP
History lengths = [4 8 16 32 64 128 256 512]

// number of bits used to index into each table, indicating table size
const uint8_t TAGE_INDEX_BITS[TAGE_NUM_COMPONENTS] = {13, 13, 12, 12, 11, 11, 11, 10};

// There are still 8 TAGE tables
// And each table has a different size (max is 2^13 entries)
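The geometric history lengths above follow L(i) = L(1) * r^(i-1) with L(1) = 4 and common ratio r = 2. A small standalone sketch of how they can be generated; the names here are illustrative and deliberately distinct from the predictor's own constants.

// generate the geometric history lengths [4 8 16 32 64 128 256 512]
#define NUM_TAGE_TABLES 8
int history_lengths[NUM_TAGE_TABLES];

void init_history_lengths(){
    int len = 4;                                 // L(1) = 4
    for(int i = 0; i < NUM_TAGE_TABLES; i++){
        history_lengths[i] = len;
        len *= 2;                                // common ratio r = 2
    }
}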

When we simulate with 500M instructions:

Trace    MPKI   IPC   Prediction Accuracy   Predictor Used
Trace1                                      TAGE
Trace2                                      TAGE
Trace3                                      TAGE
Best Trace
Part C
