Large x86-64 Assembly Programs
Rodrigo Robles
[email protected]
1. Abstract
This paper analyzes the viability of writing large programs in x86_64 Assembly, considering technical limitations, development and maintenance costs, and processor and memory usage.
2. Introduction
Just a few years after the first computers appeared, people started to design programming languages, aiming for better ways to write complex programs. But the better abstraction came at the price of lower performance, both in speed and in memory consumption.
Due to hardware limitations, it was usual to write complete programs in Assembly until the late eighties. It was hard to get good performance on personal computers with 8-bit processors and 64 KB of memory, so Assembly was mandatory in those days to get good results in commercial applications and games.
In the nineties, hardware performance was boosted, fueled by Moore's Law. These new, more powerful machines allowed less efficient code to be viable, trading expensive programmer time for cheap hardware. Since then, large programs are rarely written in Assembly.
Hardware continued to evolve, bringing new architectures with a much more powerful Assembly. This is the case of x86_64, with 16 general purpose 64-bit integer registers, another 16 256-bit SIMD registers and a powerful instruction set. Writing Assembly programs for this architecture is much easier than for the old ones. The 6502, for example, has half a dozen specialized 8-bit registers and no floating point support.
With this number of registers in x86_64 it is possible to write large functions using only register variables, much larger than the recommended sizes (Martin suggests 20 lines for high-level languages1; I would suggest 60 for Assembly). Register variables boost performance considerably because they avoid the latency of memory access.
All these features make Assembly much easier to write and allow us to make a bold claim: writing Assembly code today is almost as easy as writing code in a high-level language.
3. Experiment
To validate this hypothesis, I decided to write a program with roughly the same scope as others I have written in the past. In the last few years I have written three little retro games in JavaScript. The projects can be found on GitLab2. I always measure the time spent on development, so I have some data for comparison.
Table I. Development time of the retro games written in JavaScript
Average: 78 hours
4. Tools
Tool    Description
NASM    Assembler
5. Architecture
To minimize dependencies, the ideal would be to not use any external library. But in modern Linux it is hard to access video and audio without some basic libraries, so I chose the most basic ones possible: OpenGL/GLUT for graphics and keyboard input, and OpenAL for sound. The program depends only on these two libraries. Joystick input is handled directly through OS calls.
To ensure better code quality, a conservative approach was chosen. The code was structured in functions, and no jump outside a function was allowed.
Functions were limited to approximately 60 lines to comply with the SRP (a function must have only one responsibility6). Each function has a single exit point to guarantee the execution of the epilogue.
The calling convention for Linux x86_64 is the System V AMD64 ABI; however, this project deviates from the convention in some functions.
The System V AMD64 ABI states that integer parameters are passed in the registers RDI, RSI, RDX, RCX, R8 and R9. Additional parameters can be passed on the stack, but modern clean code guidelines state that six parameters is the acceptable limit anyway. The first eight floating point parameters are passed in the registers XMM0 through XMM7.
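As an illustration, a call following this convention might look like the sketch below (the drawcursor and setpixel names, their parameters and the intensity constant are hypothetical, used only to show where each argument goes):

section .data
intensity:  dq 0.75                  ; hypothetical floating point argument

section .text
drawcursor: sub rsp, 8
            mov rdi, 100             ; first integer parameter (x) goes in rdi
            mov rsi, 50              ; second integer parameter (y) goes in rsi
            movsd xmm0, [intensity]  ; first floating point parameter goes in xmm0
            call setpixel
            add rsp, 8
            ret

setpixel:   sub rsp, 8               ; hypothetical callee, body omitted
            add rsp, 8
            ret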
A real-time game like this has a lot of global state, so the program must rely on many global variables.
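As a minimal sketch (the names below are hypothetical), global variables are reserved in the .bss section when uninitialized, or defined in the .data section when they have an initial value:

section .bss
score:      resq 1                   ; current score
shipcount:  resq 1                   ; number of active ships

section .data
gravity:    dq 0.5                   ; initialized global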
Functions and symbols from external libraries are declared with the extern directive:
extern glClearColor
Structs are very useful; they are declared using the struc and endstruc macros:
struc ship
.x resq 1
.y resq 1
.prior resq 1
.next resq 1
.destroyed resq 1
.goalx resq 1
.goaly resq 1
.timedestroyed resq 1
endstruc
Here is an example of a typical procedure (a function that returns nothing). In this case there are also no parameters.
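A minimal sketch of such a procedure (the resetscore name and the score global are illustrative assumptions):

section .bss
score:      resq 1

section .text
resetscore: sub rsp, 8               ; prologue: realign the stack to 16 bytes
            mov qword [score], 0     ; update some global state
            add rsp, 8               ; epilogue: restore the stack pointer
            ret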
Notice that the function starts with the prologue (sub rsp, 8) and ends with the epilogue (add rsp, 8). Many x86_64 instructions require the stack to be aligned to 16 bytes, but every function call pushes a qword (the return address) onto the stack, so rsp must be realigned at the beginning of each function. A function does not strictly require this alignment if it uses none of these instructions, but it is good practice to have this prologue/epilogue in every function: the alignment-sensitive instructions will raise a SIGSEGV that can confuse the developer and, even worse, functions called by the misaligned one will raise a SIGSEGV, making debugging even harder.
Here is an example of a function that takes two parameters and returns an integer value:
;rdi - integer to convert
;rsi - pointer to the destination string
;return - length of the string (strlen)
inttostr:   sub rsp, 8               ; prologue
            mov rax, rdi             ; integer to convert
            mov rdi, rsi             ; destination buffer
            xor rcx, rcx             ; digit counter
            ; ... conversion loop omitted ...
            mov rax, rcx             ; return value (string length) in rax
            add rsp, 8               ; epilogue
            ret
To call dynamically linked external functions, we need to add the wrt ..plt suffix:
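A minimal sketch, assuming the program links against GLUT (the swapframe name is hypothetical):

extern glutSwapBuffers

section .text
swapframe:  sub rsp, 8
            call glutSwapBuffers wrt ..plt   ; call the external function through the PLT
            add rsp, 8
            ret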
Struct fields are accessed through their offsets. The construction STRUCTNAME.FIELD evaluates to the offset of the field inside the struct: if rsi is a pointer to the struct, [rsi + enemy.destroyed] addresses its destroyed field.
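For example, using the ship struct declared earlier (shipinstance and markdestroyed are hypothetical names):

section .bss
shipinstance: resb ship_size               ; one ship instance; ship_size is generated by struc

section .text
markdestroyed:
            sub rsp, 8
            mov rsi, shipinstance                ; rsi points to the ship
            mov qword [rsi + ship.destroyed], 1  ; mark it as destroyed
            mov rax, [rsi + ship.x]              ; load its x coordinate
            add rsp, 8
            ret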
System calls are made with the syscall instruction. Notice that syscall has a different ABI: the system call number is passed in rax and the parameters are passed in rdi, rsi, rdx, r10, r8 and r9.
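As a minimal sketch, a call to write on stdout (the message and the logframe name are illustrative):

section .data
msg:        db "frame done", 10
msglen:     equ $ - msg

section .text
logframe:   sub rsp, 8
            mov rax, 1               ; syscall number for write
            mov rdi, 1               ; file descriptor: stdout
            mov rsi, msg             ; buffer address
            mov rdx, msglen          ; buffer length
            syscall
            add rsp, 8
            ret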
6. Negative scope
During the experiment, the following downsides of developing entirely in x86_64 Assembly on Linux were identified:
The most recommended IDE for Assembly on Linux today is SASM. Unfortunately, at some point its integrated debugging stopped working, and it does not handle multi-file programs well. I quickly switched to the first generic text editor available.
NASM is the most recommended assembler for Linux today. Debug information generation broke in version 2.15, and I couldn't downgrade due to dependencies, so I was forced to use a virtual machine with version 2.14 to have some debugging.
The NASM forum has little activity, and even less for the x86_64 Linux platform.
Debugging directly in gdb is very primitive. DDD offers a better interface, but it is still primitive and somewhat buggy. The Turbo C of the eighties was at a higher level than today's tools for x86_64 Assembly on Linux.
Of course Assembly has no type checking, so it is easy to accidentally mix integers, floats and doubles. This could be mitigated by a good IDE, but such an IDE does not exist yet.
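A minimal illustration of the problem (value is a hypothetical memory location):

section .bss
value:      resq 1                   ; intended to hold a 64-bit double

section .text
            movsd [value], xmm0      ; stored as a 64-bit double
            movss xmm1, [value]      ; silently read back as a 32-bit float, no warning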
The advantage of multiple instruction sets is also a problem, because there are many ways to do the same thing with different instructions, and you need good knowledge of all the sets to decide which option is best for each need. Architectures such as ARC and RISC-V probably allow faster decisions due to their simpler design. The same applies to the learning curve.
The obligation to manually align the stack for some instructions is annoying for beginners. With a misaligned stack the developer gets a segmentation fault and can lose some time until getting used to it.
7.4. OpenGL
This is a downside of programs using OpenGL and is not related to Assembly itself. OpenGL is a complex state machine and is very tricky to use. I believe the major problems I had during development were related to OpenGL and not to Assembly.
8. Upsides
During the experiment, the following upsides of developing entirely in x86_64 Assembly on Linux were identified:
8.1. Easy to learn
The basics of Assembly are easy to learn. A C programmer can quickly learn how to translate C to Assembly, and soon after will be doing speed optimizations.
8.2. Good productivity
After learning the basics, the time needed to write a function became close to the time needed to write it in C.
With a little experience, the Assembly programmer can read a function as if it were a C function. It is also perfectly possible to keep functions short to enhance readability. The only exceptions are speed-optimized functions, where splitting into smaller functions could hurt performance.
Most bugs were quickly solved. The only exceptions were a couple of segmentation faults, caused by stack misalignment, which took more time due to my inexperience with this architecture.
I only needed the debugger for a few bugs, and after some time I could fix new bugs just by reading the code.
8.5. Performance
Performance reached impressive numbers, as expected. On an i7-3632QM at 2.2 GHz, running at a resolution of 1366x768, the game reaches 670 FPS, consuming on average 4% of CPU and 34 MB of memory. The program executable is 90 KB. Memory consumption could probably be reduced by using proper dynamic allocation instead of static memory for textures.
To run one of the comparison JavaScript games (at 60 FPS and at a much lower resolution of 284 x 176), Chrome spawned 13 processes, two of them each consuming the same 4% of CPU, with a total memory consumption of 276 MB.
This study analyzed whether it is possible to build a large x86_64 Assembly application for Linux with acceptable effort.
Building a new retro game using pure x86_64 Assembly consumed 98 hours, 26% above the average and coincidentally the same time I spent on one of my three JavaScript games.
This version of the game had a total of 2,146 lines of code. Using the IFPUG table7 to estimate function points, at the average of 320 lines per function point for Assembly, this program has 2,146 / 320 = 6.7 FP.
The hours/FP figure was 98 / 6.7 = 14.63. This is much less than the number in the same document for Assembly development (61.18), and even less than the number it presents for the C language (26.27). It is unlikely that Assembly development can actually be faster than C development, but these numbers show that x86_64 Assembly is almost as easy to write as a high-level language. Curiously, these results were achieved without any kind of framework, previous code or libraries. Everything was written from scratch.
So the conclusion is that it is possible to build large, fully Assembly x86_64 programs with only a little more effort than programs written in high-level languages.
Even when weaker instruction sets were the norm, like the original x86, experienced Assembly developers did not perceive a great productivity gain when writing code in high-level languages. Randall Hyde wrote in his book “The Art of Assembly Language” that programmers spend only about thirty percent of their time coding. Even if Assembly coding took double the time of a high-level language, the time saved by using a high-level language would not outweigh the benefits of using Assembly.8
While modern Assembly looks easy to write and maintain, it does not address the problem of portability. Of course, this is not an issue for the many projects that target a specific platform.
The most visible advantage is the performance boost: faster applications with very low CPU and memory consumption. At a much higher resolution, the Assembly game ran about 10 times faster than the JavaScript ones, using about an eighth of the memory.
An average Assembly programmer will always generate code that consumes less CPU and memory than code produced from any high-level language. There is a myth, already debunked, that a modern compiler can surpass an Assembly programmer. The fact is that a compiler and a programmer do different tasks. A compiler just bureaucratically translates the high-level language to Assembly, limited by the high-level language's abstractions, and applies a limited set of optimizations. An Assembly programmer is constrained neither by the language abstractions nor by a fixed set of optimizations.
In times of expensive cloud or in-house resources, this can bring huge savings. Just for reference, a Java program can use double the CPU resources of a C program and consume six times more memory than a Pascal program9. Assembly can be even faster with even less memory.
The main evidence that handwritten Assembly is faster than compiler-generated code is that even modern compilers ship several built-in functions written in Assembly. MSVC 2017 has a lot of them, for example for the memcpy function. Recent Delphi versions also contain a lot of handwritten Assembly to get better performance. This applies to almost every compiler, proprietary or open source. If even the compilers do not trust that they can generate better code than handwritten Assembly, why should you?
While the productivity of Assembly is not bad, several studies and specialists point out, on the other side, that the productivity gain of modern languages ranges from small to none, finding that new abstractions bring more new problems than solutions. Myrtveit analyzed C++ and C projects from the well-known ISBSG database and did not find any empirical evidence that C++ improved productivity over C15. Linus Torvalds demonized the C++ language: “C++ is a horrible language”16. Briand made an extensive analysis of studies about OOP advantages and found no evidence of improvement: “...technology adoption is mostly the result of marketing forces, not scientific evidence.”17
Several studies point out that even a small improvement in software speed can lead to a great rise in productivity and user satisfaction. In a study by Thadani, a decrease in response time from 2.2 seconds to 0.8 seconds increased programmer output by 58 percent, and code quality improved by more than a factor of two18.
Flattening the software stack is also an advantage. Modern high-level languages always bring a lot of official and third-party libraries that bloat the software, and many of them also bring their own VM environment. Using full Assembly drives the architecture to be minimal. In this Assembly game we avoid the browser and all its bloat, as well as third-party game engines, interacting directly with OpenGL, OpenAL and the operating system.
But comparing it with JavaScript is unfair. A C game that does not depend on a game engine or other libraries would probably come close to the Assembly performance.
It is important to consider that it is hard to reach a precise measurement of the productivity of a specific programming language, because of the impact of several factors such as the specific complexity of the programs measured or the experience of the programmer with the language. More experiments on this subject could lead us to more precise conclusions. As a final word of wisdom, I must quote Brian Kernighan's article “An elementary C cost model”19:
“Benchmarking is a difficult art, and it is all too easy to read more into a set of numbers than
is really there. Don’t make too much of these numbers, and don’t use them to settle
arguments, or even to start them. It’s much more important to appreciate the approach and
its limitations than to believe these values just because they are printed with two decimal
places.”
11. References
2. https://gitlab.com/RodrigoRobles/trevaskas-2
3. https://gitlab.com/RodrigoRobles/RiverRaidRunner
4. https://gitlab.com/RodrigoRobles/RetroJump
5. https://gitlab.com/RodrigoRobles/mazylife
6. R. C. Martin (2005). "The Single Responsibility Principle". The Clean Code Blog.
7. https://www.ifpug.org/wp-content/uploads/2017/04/IYSM.-Thirty-years-of-IFPUG.-Software-Economics-and-Function-Point-Metrics-Capers-Jones.pdf
9. R. Pereira et al. ‘Energy efficiency across programming languages: how do energy, time, and memory relate?’ (2017).
12. W. M. Taliaffero. ‘Modularity: the key to system growth potential’. Software: Practice and Experience, 1(3):245–257, July 1971.
14. D. P. Delorey et al. ‘Do Programming Languages Affect Productivity? A Case Study
Using Data from Open Source Projects’, FLOSS 07 (2007)
15. I. Myrtveit. “An empirical study of software development productivity in C and C++”, 1999
16. http://harmful.cat-v.org/software/c++/linus
19. J. L. Bentley, B. W. Kernighan and C. J. Van Wyk, ‘An elementary C cost model’, Unix
Review (1991).