Archlab
Archlab
Processor
1 Introduction
In this lab, you will learn about the design and implementation of a pipelined Y86-64
processor, optimizing both it and a benchmark program to maximize performance. You
are allowed to make any semantics-preserving transformation to the benchmark program, or
to make enhancements to the pipelined processor, or both. When you have completed the
lab, you will have a deep understanding of the interactions between code and hardware that
affects the performance of your programs.
The lab is organized into three parts, each with its own handin. In Part A you will write
some simple Y86-64 programs and become familiar with the Y86-64 tools. In Part B, you
will first extend the SEQ simulator with new instructions, then explore the road of CPU
piplining. These two parts will prepare you for Part C, the heart of the lab, where you will
optimize the Y86-64 benchmark program and its architecture design.
2 Logistics
You will work on this lab alone. Any clarifications and revisions to the assignment will
be posted on the course Web page.
3 Handout Instructions
1. You can do the lab on a Linux system.
3. Then give the command: tar xvf archlab-handout.tar. This command extracts
files to the same directory of archlab-handout.tar. You should check README to find
out what these files are for.
4. Go to archlab-project directory and build the Y86-64 simulator tools with the help
of archlab-project/README.md.
1
4 Environment Installation
This project is written in Rust, so you’d have your Rust toolchain installed. If you
haven’t, please execute the following command to install rustup. This installation requires
network access. Therefore make sure to connect to the gateway via lcpu connect before
installation.
curl --proto ’=https’ --tlsv1.2 -sSf https://sh.rustup.rs | sh
You can verify the installation by executing the command rustup.
Now install the Rust toolchain by executing the following command (by the time of
writing, the latest stable version is 1.81):
rustup install 1.81
rustup default 1.81
6 Part A
You will be working in directory archlab-project/misc in this part. Your task is to
write and simulate the following three Y86-64 programs. The required behavior of these
programs is defined by the example C functions in examples.c. Be sure to put your name
and ID(e.g. 2300054321) in a comment at the beginning of each program. You can test your
programs by first assembling them with the program yas and then running them with the
instruction set simulator yis.
In all of your Y86-64 functions, you should follow the x86-64 conventions for passing
function arguments, using registers, and using the stack. This includes saving and restoring
any callee-save registers that you use.
2
sum.ys: Iteratively sum linked list elements
Write a Y86-64 program sum.ys that iteratively sums the elements of a linked list. Your
program should consist of some code that sets up the stack structure, invokes a function,
and then halts. In this case, the function should be Y86-64 code for a function (sum list)
that is functionally equivalent to the C sum list function. Test your program using the
following three-element list:
.align 8
Array:
.quad 0xbca
.quad 0xcba
.quad 0xacb
.quad 0xcab
.quad 0xabc
.quad 0xbac
3
7 Part B
You will be working in directory archlab-project/sim/src/architectures/extra.
To add these instructions, you will modify the file seq full.rs, which implements the
version of SEQ described in the CS:APP3e textbook. The format of iopq looks like a
combination of irmovq and opq. We’ve provide instruction code IOPQ for you. The function
code of iopq is the same as that of opq.
After implementing this extension, you can write iaddq,isubq,ixorq in your Y86 as-
sembly code.
• pipe_s2: An example of 2-stage pipeline, containing instruction fetch stage and all
the rest part as decode stage.
Starting from pipe s2, which is modified from the SEQ+ architecture, each architec-
ture makes some minor improvements based on the previous one, which effectively shows a
possible evolution of CPU pipelining.
Among the above architectures, except pipe s2, some expressions in the HCL descrip-
tions are masked by placeholders. Your task is to read the HCL descriptions of each architec-
ture and replace each placeholder by a correct expression. There are 3 types of placeholders:
BOOL PLACEHOLDER, U8 PLACEHOLDER, and U64 PLACEHOLDER. The prefix of
a placeholder indicates the value type of its expression.
After completing all architectures, you should gain a deeper understanding of pipelining,
which can help you overcome some challenges in part C.
Suggestion: Read the comments around each placeholder carefully.
4
HCL (textbook & previous labs) HCL-rs (current lab)
D icode, E icode, F predPC, ... D.icode, E.icode, F.pred pc, ...
INOP, IHALT, IRRMOVQ, IIRMOVQ NOP, HALT, CMOVQ, IRMOVQ
SADR, SINS, SBUB Adr, Ins, Bub
word f_icode = [...]; u8 f_icode = [...];
word f_predPC = [...]; u64 f_pred_pc = [...];
Notice
In order to provide a better lab experience with greater flexibility and modern develop-
ment practices, we reimplement the HCL parser and simulator in Rust. With the help of the
VSCode extension rust-analyzer, we can now enjoy syntax highlighting, auto-completion,
type checking, and error reporting when writing HCL source code.
As a result, the hardware control language used in this lab (HCL-rs) is a bit different from
previous one, whose specification can be found at archlab-project/assets/hcl-rs.pdf.
To give a brief summary, the table 1 shows the major differences.
You should only pay attention to the HCL description in sim_macros::hcl block, and
pipeline register definitions in crate::defin_stages block.
8 Part C
You will be working in directory archlab-project/misc and
archlab-project/sim/src/architectures/extra in this part.
The ncopy function copies a len-element integer array src to a non-overlapping dst,
returning a count of the number of positive integers contained in src.The C description of
ncopy is in misc/ncopy.c.
Your task in Part C is to modify archlab-project/misc/ncopy.ys and
archlab-project/sim/src/architectures/extra/ncopy.rs with the goal of making
ncopy run as fast as possible.
You will be handing in these two files. Each file should begin with a header comment
with the following information:
• A high-level description of your code. In each case, describe how and why you modified
your code.
5
Coding Rules
You are free to make any modifications you wish, with the following constraints:
• Your ncopy function must work for arbitrary array sizes. You might be tempted to
hardwire your solution for 64-element arrays by simply coding 64 copy instructions,
but this would be a bad idea because we will be grading your solution based on its
performance on arbitrary arrays.
• Your ncopy function must run correctly with yis. By correctly, we mean that it
must correctly copy the src block and return (in %rax) the correct number of positive
integers.
• Size of assembled version of ncopy plus stack size is limited to 4Kb.(A little less than
4Kb in fact, you can check the grader code for an exact value) We will set the stack
register and argument registers well before calling your ncopy.
• Your ncopy.rs implementation must pass the correctness tests for general y86-64 code.
Other than that, you are free to implement other instructions if you think that will help.
You may make any semantics preserving transformations to the ncopy function, such as
reordering instructions, replacing groups of instructions with single instructions, deleting
some instructions, and adding other instructions. You may find it useful to read about loop
unrolling in Section 5.8 of CS:APP3e.
cd archlab-project
cargo build
6
10 Evaluation
The lab is worth 190 points: 30 points for Part A, 60 points for Part B, and 100 points
for Part C. You can run the follow command to grade your implementation locally:
The remote machine uses the same grader and all the checks(including vali-
dation test and length test...) are included. Note that the score here does not include
the credits for your descriptions.
Part A
Part A is worth 30 points, 10 points for each Y86-64 solution program. Each solution pro-
gram will be evaluated for correctness, including proper handling of the stack and registers,
as well as functional equivalence with the example C functions in examples.c.
The programs sum.ys and rsum.ys will be considered correct if the graders do not spot
any errors in them, and their respective sum list and rsum list functions return the sum
0xfed in register %rax.
The program bubble.ys will be considered correct if the graders do not spot any errors
in them, and the bubble sort function sorts the 6 integers correctly in ascending order,
with the results in the same 48 bytes beginning at address Array, and does not corrupt
other memory locations.
Part B
This part is worth 40 points, 5 points for the implementation of seq full.rs, and 5
points for each of pipe s3a,pipe s3b,pipe s3c,pipe s3d,pipe s4a,pipe s4b,pipe s4c.
For seq full.rs, you need to pass the ISA checks extended with the iopq instruction.
For pipe s3a,pipe s3b,pipe s3c,pipe s3d,pipe s4a,pipe s4b, each of the architec-
ture is required to pass ISA checks. Moreover, for each of them, the runtime status of each
CPU cycle is compared with its corresponding ground truth architecture to verify the cor-
rectness of each placeholder expression. Since the ground truths are not provided in the
student’s handout, this check is performed on autolab server.
Part C
This part of the Lab is worth 100 points: You will not receive any credit if either your
code for ncopy.ys or your modified architecture ncopy.rs fails.
• 60 points for performance. To receive credit here, your solution should not fail to pass
ISA checks.
7
We will evaluate the performance based on cpe (cycles per element) and ac (architecture
cost) of your implementation.
• cpe: if the simulated code requires C cycles to copy a block of N elements, then the
CPE is C/N . Since some cycles are used to set up the call to ncopy and to set up
the loop within ncopy, you will find that you will get different values of the CPE for
different block lengths (generally the CPE will drop as N increases). We will therefore
evaluate the performance of your function by computing the average of the CPEs for
blocks ranging from 1 to 64 elements.
Simply run the command
to see what happens. For example, the baseline version of the ncopy function and
architecture has CPE values ranging between 23.00 and 12.04, with an average of
12.82.
• ac: the length of the critical path of the ncopy architecture. Formally, the critical path
of a CPU architecture is the longest path of combinational logic between clocked ele-
ments (like flip-flops). The length of the critical path can be used to mesure the CPU’s
clock frequency, which in turn can be used to estimate the architecture performance.
In this lab, the length of the critical path is simplified as: 1 plus the maximum number
of hardware devices (units) that line up in a path of the architecture. For example,
seq_std has a critical path of length 8, and pipe_std has a critical path of length 4.
You can execute
to inspect the length of the critical path and the devices execution order of an architec-
ture. This command will also generate an HTML file that visualizes the dependency
graph of the architecture.
8
11 Handin Instructions
• You will be handing in three sets of files:
– Part A:
archlab-project/misc/bubble.ys
archlab-project/misc/sum.ys
archlab-project/misc/rsum.ys
– Part B:
archlab-project/sim/src/architectures/extra/seq_full.rs
archlab-project/sim/src/architectures/extra/pipe_s3a.rs
archlab-project/sim/src/architectures/extra/pipe_s3b.rs
archlab-project/sim/src/architectures/extra/pipe_s3c.rs
archlab-project/sim/src/architectures/extra/pipe_s3d.rs
archlab-project/sim/src/architectures/extra/pipe_s4a.rs
archlab-project/sim/src/architectures/extra/pipe_s4b.rs
archlab-project/sim/src/architectures/extra/pipe_s4c.rs
– Part C:
archlab-project/misc/ncopy.ys
archlab-project/sim/src/architectures/extra/ncopy.rs
• Make sure you have included your name and ID in a comment at the top of each of
your handin files.
• To create your handin files for the lab, go to your archlab-handout. Run make handin
to create archlab-handin.tar. Upload this tar to autolab for grading.
12 Hints
• ysim -A [arch] -I can generate an HTML file that visualizes the architecture com-
putational dependency graph. -v can be used to print detailed information of each
cycle. --max-cpu-cycle can be used to limit the number of CPU cycles.
• yis simulates your program w.r.t. the Y86 ISA specification. Its output can be seen
as ground truth.
• In part B, you can rely on editor features to display the differences between two files.
For vim users you can use vimdiff. For VSCode users you can first open one of the
source files, then goto “Help > Show All Commands” and type “Compare Active File
9
With...”, and select another file for comparison (at this point you’re able to edit both
files concurrently).
10