Large x86_64 Assembly Programs

Rodrigo Robles

1. Abstract

This paper analyzes the viability of writing large programs in x86_64 Assembly, considering technical limitations, development and maintenance costs, and processor and memory usage.

2. Introduction

Just a few years after the first computers came out, people started to design programming languages, aiming for better ways to write complex programs. But the better abstraction came at the price of lower performance, both in speed and in memory consumption.

Due to hardware limitations, it was common to write complete programs in Assembly until the late eighties. It was hard to get good performance on personal computers with 8-bit processors and 64 KB of memory. In those days Assembly was mandatory to get good results in commercial applications and games.

In the nineties, hardware performance soared, fueled by Moore’s Law. These powerful new machines made less efficient code viable, trading expensive programmer time for cheap hardware. Since then, large programs are no longer written in Assembly.

Hardware continued to evolve, bringing new architectures with a much more powerful Assembly. This is the case of x86_64, with 16 general-purpose 64-bit integer registers, another 16 SIMD registers (128-bit XMM, widened to 256-bit YMM by AVX) and a powerful instruction set. Writing Assembly programs for this architecture is much easier than for the older architectures: the 6502, for example, has only about half a dozen specialized registers, most of them 8-bit, and no floating-point support.

With this many registers in x86_64, it is possible to write large functions using only register variables, much larger than the recommended sizes (Martin suggests about 20 lines for high-level languages [1]; I would suggest 60 for Assembly). Keeping variables in registers boosts performance considerably because it avoids the latency of memory accesses.

All these features make Assembly code much easier to write and allow us to dare a hypothesis: writing Assembly code today is almost as easy as writing code in a high-level language.

3. Experiment

To test this hypothesis, I decided to write a program with almost the same scope as others I had written before. In the last few years I have written three small retro games in JavaScript. The project can be found on GitLab [2]. I always measure the time spent on development, so I have some data for comparison.
Table I. Retro games written in JavaScript

Project                          Time spent
JavaScript Retro Game 1 [3]      78 hours
JavaScript Retro Game 2 [4]      98 hours
JavaScript Retro Game 3 [5]      59 hours
Average                          78 hours

4. Tools

The following tools were used for this experiment:

Table II. Tools used

Tool      Description
NASM      Assembler
SASM      IDE for Assembly
gcc       Used for linking (internally it calls ld)
gedit     Text editor

5. Architecture

To minimize dependencies, the ideal would be to use no external library at all. But on modern Linux it is hard to access video and audio without some basic libraries, so I chose the most basic ones possible: OpenGL/GLUT for graphics and keyboard input, and OpenAL for sound. The program depends only on these libraries. Joystick input is handled directly through operating system calls.
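The paper does not show its joystick code. One common way to read a joystick on Linux without any library is the kernel's joystick device interface: open a device such as /dev/input/js0 and read fixed-size struct js_event records declared in <linux/joystick.h>. The C sketch below assumes that interface; the device path and the helper names open_joystick and poll_joystick are hypothetical. An Assembly version would issue the same open and read system calls directly.

```c
#include <fcntl.h>
#include <linux/joystick.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: opens a joystick device (for example the assumed
   path "/dev/input/js0") in non-blocking mode, so polling it from the game
   loop never stalls a frame. Returns -1 if the device cannot be opened. */
int open_joystick(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/* Hypothetical helper: drains every pending event from an opened device
   and returns how many complete events were read. */
int poll_joystick(int fd)
{
    struct js_event ev;
    int count = 0;

    /* In non-blocking mode read() fails with EAGAIN when the queue is empty,
       which ends the loop. */
    while (read(fd, &ev, sizeof ev) == (ssize_t)sizeof ev) {
        /* JS_EVENT_INIT flags the synthetic events sent right after open(). */
        unsigned type = ev.type & ~JS_EVENT_INIT;
        if (type == JS_EVENT_BUTTON)
            printf("button %u: %d\n", (unsigned)ev.number, ev.value);
        else if (type == JS_EVENT_AXIS)
            printf("axis %u: %d\n", (unsigned)ev.number, ev.value);
        count++;
    }
    return count;
}
```

Each event is only 8 bytes, so in Assembly a single qword in memory is enough as the read buffer.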

To ensure better code quality, a conservative approach was chosen: the code was structured in functions, and no jump was allowed to leave the function it belongs to.

Functions were limited to approximately 60 lines to comply with the Single Responsibility Principle (SRP: a function must have only one responsibility [6]). Each function has a single exit point, which guarantees that its epilogue is always executed.

The calling convention for Linux x86_64 is the System V AMD64 ABI; however, this project deviated from the convention for some functions.

The System V AMD64 ABI states that the first six integer and pointer parameters are passed in the registers RDI, RSI, RDX, RCX, R8 and R9. Additional parameters are passed on the stack, but modern clean code guidelines treat 6 parameters as the acceptable limit anyway. The first eight floating-point parameters are passed in registers XMM0 to XMM7. Integer results are returned in RAX and floating-point results in XMM0. RBX, RBP and R12 to R15 are callee-saved: a function that modifies them must restore them before returning.
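Integer and floating-point parameters are numbered independently. The C function below is a hypothetical example (mixed_args is not from the game); its comment shows which register each parameter would occupy under the System V AMD64 ABI, which is exactly what a hand-written Assembly caller has to reproduce before its call instruction.

```c
/* Hypothetical function with six integer and two floating-point parameters.
   Under the System V AMD64 ABI the two kinds are counted separately, so a
   caller places them as:
     a -> RDI   x -> XMM0   b -> RSI   c -> RDX
     y -> XMM1  d -> RCX    e -> R8    f -> R9
   and the long result comes back in RAX. */
long mixed_args(long a, double x, long b, long c,
                double y, long d, long e, long f)
{
    return a + b + c + d + e + f + (long)(x * y);
}
```

Calling mixed_args(1, 2.0, 2, 3, 3.0, 4, 5, 6) from Assembly therefore means loading 1, 2, 3, 4, 5, 6 into RDI, RSI, RDX, RCX, R8, R9 and 2.0, 3.0 into XMM0, XMM1.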
A real-time game like this has a lot of global state, so the program must rely on many global variables.

External functions are declared via the extern directive:

extern glClearColor

Structs are very useful; they are declared with the struc and endstruc macros:

struc ship
.x resq 1
.y resq 1
.prior resq 1
.next resq 1
.destroyed resq 1
.goalx resq 1
.goaly resq 1
.timedestroyed resq 1
endstruc
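Since every field is reserved with resq 1 (one 8-byte quadword), the offsets NASM assigns are simply multiples of 8. The C struct below is a sketch of the same layout; the field types are guesses (the paper does not give them, and prior/next are assumed to be list links), but any 8-byte type yields the same offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* C view of the NASM "struc ship". Field types are assumptions. */
struct ship {
    double x;                 /* ship.x             = 0  */
    double y;                 /* ship.y             = 8  */
    struct ship *prior;       /* ship.prior         = 16 */
    struct ship *next;        /* ship.next          = 24 */
    int64_t destroyed;        /* ship.destroyed     = 32 */
    double goalx;             /* ship.goalx         = 40 */
    double goaly;             /* ship.goaly         = 48 */
    int64_t timedestroyed;    /* ship.timedestroyed = 56 */
};

/* The offset NASM gives .destroyed, and ship_size, checked at compile time. */
_Static_assert(offsetof(struct ship, destroyed) == 32, "ship.destroyed");
_Static_assert(sizeof(struct ship) == 64, "ship_size");
```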

Example of a typical procedure (a function that returns nothing). In this case there are no parameters either.

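manageshot: sub   rsp, 8              ; prologue: realign the stack to 16 bytes
            movq  xmm0, [shot_y]      ; xmm0 = shot_y (double)
            movq  xmm1, [delta]       ; xmm1 = delta
            mulsd xmm1, [SHOT_SPEED]  ; xmm1 = delta * SHOT_SPEED
            subsd xmm0, xmm1          ; xmm0 = shot_y - delta * SHOT_SPEED
            movq  [shot_y], xmm0      ; store the new position
            add   rsp, 8              ; epilogue
            ret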

Notice that the function starts with the prologue (sub rsp, 8) and ends with the epilogue (add rsp, 8). The ABI requires RSP to be 16-byte aligned at every call instruction, and several x86_64 instructions, such as the aligned SSE moves (movaps, movapd), require 16-byte aligned memory operands. Because every call pushes an 8-byte return address, RSP is misaligned by 8 on entry to each function and must be realigned there. A function that neither calls other functions nor uses such instructions does not need this alignment, but it is good practice to have this prologue/epilogue in every function: a misaligned access raises SIGSEGV, which can confuse the developer, and, even worse, functions called by the misaligned one will raise SIGSEGV themselves, making debugging even harder.

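The aligned-load requirement can be seen from C with SSE2 intrinsics. In the sketch below (function names are illustrative), _mm_load_pd requires a 16-byte aligned address and is typically emitted as movapd; a misaligned address makes the CPU fault, which Linux reports as SIGSEGV.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Nonzero when p meets the 16-byte alignment that aligned SSE loads need. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Sums two pairs of doubles using aligned SSE2 loads: a and b must be
   16-byte aligned, otherwise the program is killed with SIGSEGV. */
double sum_pairs(const double *a, const double *b)
{
    __m128d s = _mm_add_pd(_mm_load_pd(a), _mm_load_pd(b));
    double out[2];
    _mm_storeu_pd(out, s);   /* unaligned store: out needs no alignment */
    return out[0] + out[1];
}
```

A caller would typically declare _Alignas(16) double a[2]. For such local arrays the compiler usually derives the 16-byte aligned stack slot from the assumption that RSP was aligned at the call, which is why a hand-written function that calls compiled code with a misaligned stack can crash inside the callee.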
Here is an example of a function that takes two parameters and returns an integer value:

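;rdi - unsigned integer to convert
;rsi - pointer to the output buffer (at least 21 bytes)
;return (rax) - number of digits written; a line feed follows them
inttostr:   sub  rsp, 8             ; prologue: realign the stack
            mov  rax, rdi           ; rax = dividend
            mov  rdi, rsi           ; rdi = output buffer
            xor  rcx, rcx           ; rcx = number of digits
            mov  r8, 10             ; divisor in r8: rbx is callee-saved and
                                    ; would have to be preserved
.divide:    xor  rdx, rdx           ; div divides rdx:rax, so clear rdx
            div  r8                 ; rax = quotient, rdx = remainder
            add  rdx, '0'           ; remainder -> ASCII digit
            push rdx                ; digits come out least significant first
            inc  rcx
            test rax, rax
            jnz  .divide            ; repeat until the quotient is zero

            xor  rdx, rdx           ; rdx = write index
.reverse:   pop  rax                ; pop the most significant digit first
            mov  [rdi + rdx], al
            inc  rdx
            dec  rcx
            jnz  .reverse

            mov  byte [rdi + rdx], 10   ; LINE FEED
            mov  rax, rdx           ; return the number of digits
            add  rsp, 8             ; epilogue
            ret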
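Following the paper's remark that a C programmer can quickly learn to translate C to Assembly, here is a sketch of the same conversion in C. It mirrors the Assembly: the division loop produces digits least significant first, and a second pass reverses them, as the push/pop pair does. Like the Assembly, it appends a line feed, writes no NUL terminator, and returns the digit count.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts value to decimal digits in buf, appends '\n' (no NUL), and
   returns the number of digits. buf must hold at least 21 bytes:
   20 digits for UINT64_MAX plus the line feed. */
size_t inttostr(uint64_t value, char *buf)
{
    char digits[20];
    size_t count = 0;

    /* Like the div loop: peel off the least significant digit each pass.
       The do/while makes 0 still produce the digit '0'. */
    do {
        digits[count++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    /* Like the push/pop pair: emit the digits in reverse order. */
    for (size_t i = 0; i < count; i++)
        buf[i] = digits[count - 1 - i];

    buf[count] = '\n';
    return count;
}
```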

Calling a function with no parameters requires just one line of Assembly:

call initshipgoal

To call dynamically linked external functions, we need to add the wrt ..plt suffix, which routes the call through the procedure linkage table (PLT):

call alutExit wrt ..plt

Examples of calling functions with parameters:

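mov   rdi, [dq_i_screenwidth]       ; 1st integer parameter
mov   rsi, [dq_i_screenheight]      ; 2nd integer parameter
call  glutInitWindowSize wrt ..plt

cvtss2sd xmm0, dword [dd_one]       ; glClearDepth takes a double (GLclampd),
                                    ; so convert the single-precision 1.0
call  glClearDepth wrt ..plt
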
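A typical loop uses the RCX register as the counter and a cmp/jcc pair to jump back to the start of the loop. (RCX is not preserved across calls, so a loop whose body calls other functions needs a callee-saved register such as R12 instead.) Here is an example: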

create_enemies: sub  rsp, 8                       ; prologue
                xor  rcx, rcx                     ; rcx = enemy index
.each_enemy:    imul rax, rcx, enemy_size         ; rax = index * size of one enemy
                add  rax, enemies                 ; rax = address of enemies[index]
                mov  rsi, rax
                mov  qword [rsi + enemy.kind], 0
                mov  qword [rsi + enemy.destroyed], 1
                mov  qword [rsi + enemy.timedestroyed], 0
                inc  rcx
                cmp  rcx, [ENEMIES_COUNT]
                jne  .each_enemy                  ; jump back until all are set
                add  rsp, 8                       ; epilogue
                ret

This last function also shows the use of structs. The construction STRUCTNAME.FIELD evaluates to the offset of the field inside the struct, and STRUCTNAME_size to the size of the whole struct. If rsi points to the struct, [rsi + enemy.destroyed] addresses its destroyed field.
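In C the same loop is an indexed access into an array of structs; the compiler computes the base + index * size address that the Assembly builds by hand. The sketch below is illustrative only: the paper shows just the kind, destroyed and timedestroyed fields of struc enemy, so the struct is a hypothetical subset, and the value 8 for ENEMIES_COUNT is an assumption (in the game it is read from memory).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of struc enemy: only the fields the paper shows. */
struct enemy {
    int64_t kind;
    int64_t destroyed;
    int64_t timedestroyed;
};

#define ENEMIES_COUNT 8   /* assumed value */

struct enemy enemies[ENEMIES_COUNT];

/* C equivalent of create_enemies: &enemies[i] is the address
   enemies + i * sizeof(struct enemy), the value the Assembly computes by
   multiplying the index by enemy_size and adding the base address. */
void create_enemies(void)
{
    for (size_t i = 0; i < ENEMIES_COUNT; i++) {
        struct enemy *e = &enemies[i];
        e->kind = 0;
        e->destroyed = 1;
        e->timedestroyed = 0;
    }
}
```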

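System calls are made with the syscall instruction. Notice that syscall has a different ABI: the system call number goes in RAX, and the parameters are passed in RDI, RSI, RDX, R10, R8 and R9. The fourth parameter uses R10 instead of RCX because the syscall instruction overwrites RCX (and R11). The result is returned in RAX.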

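loadtime:   sub  rsp, 8                   ; prologue
            xor  rax, rax
            mov  [tv], rax                ; struct timeval is 16 bytes on x86_64:
            mov  [tv + 8], rax            ; tv_sec and tv_usec are 8 bytes each
            mov  [tz], rax                ; struct timezone is 8 bytes (two ints)
            mov  rax, [SYS_GETTIMEOFDAY]  ; system call number (96 on x86_64)
            mov  rdi, tv                  ; 1st parameter: struct timeval *
            mov  rsi, tz                  ; 2nd parameter: struct timezone *
            syscall                       ; rax = 0, or -errno on failure;
                                          ; rcx and r11 are clobbered
            add  rsp, 8                   ; epilogue
            ret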
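For comparison, the same raw system call can be made from C through glibc's generic syscall() wrapper, which loads the number into RAX and the arguments into RDI and RSI just like loadtime. The sketch assumes glibc and passes NULL for the timezone argument, which the gettimeofday man page describes as obsolete; load_time is an illustrative name.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

/* Same gettimeofday system call as loadtime, bypassing the libc
   gettimeofday() function. Unlike the raw instruction, the syscall()
   wrapper returns -1 and sets errno on failure instead of returning
   -errno. Returns 0 on success. */
long load_time(struct timeval *tv)
{
    return syscall(SYS_gettimeofday, tv, NULL);
}
```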

6. Negative scope

● Speed optimizations and SIMD instructions were not used;
● Neither modern OpenGL nor Vulkan was used.

7. Downsides

During the experiment, the following downsides of developing entirely in x86_64 Assembly on Linux were identified:

7.1. Weak tools and community

The most recommended IDE for Assembly on Linux today is SASM. Unfortunately, at some point its integrated debugging stopped working, and it does not handle multi-file programs well, so I soon switched to a generic text editor.

NASM is the most recommended assembler for Linux today. Its debugging support broke in version 2.15, and I couldn’t downgrade because of dependencies, so I had to use a virtual machine with version 2.14 to get any debugging at all.

The NASM forum has little activity, and even less for the x86_64 Linux platform.

Debugging directly in gdb is very primitive. DDD offers a better interface, but it is still primitive and somewhat buggy. Turbo C in the eighties offered a better debugging experience than today’s tools for x86_64 Assembly on Linux.

7.2. No type checking

Of course, Assembly has no type checking, so it is easy to accidentally mix integers, floats and doubles. A good IDE could catch these mistakes, but no such IDE exists yet.
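A concrete example of such a mix-up: glClearDepth expects a double (GLclampd). If a single-precision 1.0 is loaded into XMM0 with movd, which copies 32 bits and zeroes the rest of the register, the callee reads all 64 bits and gets a tiny subnormal number (roughly 5.3e-315) instead of 1.0. The C sketch below (function names are illustrative) rebuilds that bit pattern and contrasts it with a real conversion, which cvtss2sd performs in Assembly.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the 32 bits of a float as the low half of a double whose
   upper 32 bits are zero, which is what a double-expecting callee sees
   after movd. No conversion takes place. */
double float_bits_read_as_double(float f)
{
    uint32_t low;
    memcpy(&low, &f, sizeof low);   /* 1.0f has the bit pattern 0x3F800000 */
    uint64_t bits = low;            /* upper 32 bits are zero */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* The correct conversion, equivalent to cvtss2sd. */
double float_converted_to_double(float f)
{
    return (double)f;
}
```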

7.3. Twisted Intel Architecture

The advantage of multiple instruction sets is also a problem: there are many ways to do the same thing with different instructions, and you need a good knowledge of all the sets to choose the best option for each need. ARM and RISC-V architectures probably allow faster decisions thanks to their simpler design. The same probably applies to the learning curve.

The obligation to manually align the stack for some instructions is annoying for beginners. With a misaligned stack the developer gets a segmentation fault and can lose some time before getting used to this requirement.

7.4. OpenGL

This downside applies to any program using OpenGL and is not related to Assembly itself. OpenGL is a complex state machine and is very tricky to use. I believe most of the problems I had during development were related to OpenGL rather than to Assembly.

8. Upsides

During the experiment, the following upsides of developing entirely in x86_64 Assembly on Linux were identified:

8.1. Fast learning curve

The basics of Assembly are easy to learn. A C programmer can quickly learn how to translate C to Assembly, and soon afterwards will be doing speed optimizations.
8.2. Good productivity

After I learned the basics, the time needed to write a function became close to the time needed to write it in C.

8.3. Readable and clean code

With a little experience, an Assembly programmer can read a function as easily as a C function. It is also perfectly possible to keep functions short to improve readability. The only exceptions are speed-optimized functions, where splitting them into smaller functions could hurt performance.

8.4. Easy maintenance

Most bugs were solved quickly. The only exceptions were a couple of segmentation faults caused by stack misalignment, which took more time due to my inexperience with this architecture.

I needed the debugger for only a few bugs, and after a while I could fix new bugs just by reading the code.

8.5. Performance

As expected, the performance numbers were impressive. On an Intel Core i7-3632QM at 2.2 GHz, running at a resolution of 1366x768, the game reaches 670 FPS while using on average 4% of the CPU and 34 MB of memory. The executable is 90 KB. Memory consumption could probably be reduced by using proper dynamic allocation instead of static memory for textures.

To run one of the JavaScript games used for comparison (at 60 FPS and a much lower resolution of 284x176), Chrome spawned 13 processes, two of them consuming the same 4% of CPU, with a total memory consumption of 276 MB.

9. Discussion and Conclusions

This study analyzed whether it is possible to build a large x86_64 Assembly application on Linux with acceptable effort.

Building a new retro game in pure x86_64 Assembly took 98 hours, 26% above the JavaScript average (98 / 78 ≈ 1.26) and, coincidentally, the same time I spent on one of my three JavaScript games.

This version of the game has a total of 2,146 lines of code. The IFPUG table [7] gives an average of 320 lines of code per function point for Assembly, so this program has 2,146 / 320 ≈ 6.7 FP.

The effort was 98 / 6.7 ≈ 14.63 hours per FP. This is much less than the figure the same document gives for Assembly development (61.18), and even less than its figure for the C language (26.27). It is unlikely that Assembly development can be faster than C development, but these numbers show that x86_64 Assembly is almost as easy to write as a high-level language. Remarkably, these results were achieved without any framework, previous code or libraries: everything was written from scratch.
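The estimate above uses backfiring: converting code size into function points with a fixed lines-per-FP ratio, then dividing the hours by the FP count. A minimal sketch of that arithmetic, with illustrative function names:

```c
/* Backfiring: estimated function points from lines of code and a
   lines-per-FP ratio (320 for Assembly in the IFPUG table used above). */
double backfire_fp(double lines_of_code, double loc_per_fp)
{
    return lines_of_code / loc_per_fp;
}

/* Effort rate: hours spent per function point. */
double hours_per_fp(double hours, double function_points)
{
    return hours / function_points;
}
```

With the unrounded count, hours_per_fp(98, backfire_fp(2146, 320)) gives about 14.61 hours per FP; rounding the FP count to 6.7 first gives the 14.63 quoted above. Either way, the figure is well below the IFPUG values for Assembly and C.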

So the conclusion is that it is possible to write large programs entirely in x86_64 Assembly with only a little more effort than programs written in high-level languages.

Even when only weaker instruction sets such as the original x86 were available, experienced Assembly developers did not perceive a great productivity gain from writing code in high-level languages. Randall Hyde wrote in his book “The Art of Assembly Language” that programmers spend only about thirty percent of their time coding. So even if Assembly coding took twice as long as coding in a high-level language, total development time would grow by only about 30% (0.7 + 2 × 0.3 = 1.3), and the time saved by using a high-level language would not be worth giving up the benefits of Assembly [8].

While modern Assembly looks easy to write and maintain, it does not address the problem of portability. Of course, this is not a problem for the many projects that target a specific platform.

The most visible advantage is the performance boost: faster applications with very low CPU and memory consumption. Even at a much higher resolution, the Assembly game ran at about 10 times the frame rate of the JavaScript game (670 versus 60 FPS) while using about an eighth of the memory (34 MB versus 276 MB).

An average Assembly programmer will always generate code that consumes fewer CPU and memory resources than code written in any other language. The idea that a modern compiler can surpass an Assembly programmer is a myth. In fact, a compiler and a programmer do different tasks. A compiler just bureaucratically translates the high-level language to Assembly, limited by the abstractions of that language, and applies a limited set of optimizations. An Assembly programmer is constrained neither by language abstractions nor by a fixed set of optimizations.

At a time of expensive cloud and in-house resources, this can bring huge savings. For reference, a Java program can use twice the CPU resources of a C program and consume six times the memory of a Pascal program [9]. Assembly can do even better, with even less memory.

The main proof that handwritten Assembly is faster than compiler-generated code is that even modern compilers ship with many runtime functions written in Assembly. MSVC 2017 has a lot of them, for example the memcpy function. Recent Delphi versions also contain a lot of handwritten Assembly to get better performance. This applies to almost every compiler, proprietary or open source. If even compiler vendors do not trust their compilers to generate better code than handwritten Assembly, why should you?

Another important point to consider is that programmer productivity is affected by many other factors that can have a much greater impact than the choice of language. Vyhmeister lists six categories that affect programmer productivity (Financial Constraints, Time Constraints, Software Specifications, Programming Methodology, Corporate Environment and Uncontrollable Environment), which break down into 28 factors [10]. The slightly higher hours per FP of modern Assembly may have a low impact on total productivity once all of these factors are considered.

Brooks states, “Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include.” [11] Of course there is a difference in productivity between languages, but he suggested that this difference is small. He cites multiple authors to support this conclusion, including Taliaffero [12] and Wolverton [13]. Because little data supported this conclusion, Delorey et al. carried out extensive research on open source projects and showed that productivity measured in LOC/time does vary between languages [14]. However, the results show a variation of only 60% from the worst to the best language, and what they actually demonstrate is that higher-abstraction languages need fewer lines of code, which is quite obvious. Moreover, LOC/time is not the best metric available. Measuring productivity in time per function point is a fairer choice, as in the IFPUG data referenced above.

While the productivity of Assembly is not bad, several studies and specialists point out that the productivity gain of modern languages ranges from small to none, arguing that new abstractions bring more new problems than solutions. Myrtveit analyzed C++ and C projects from the well-known ISBSG database and found no empirical evidence that C++ improved productivity over C [15]. Linus Torvalds harshly criticized the language: “C++ is a horrible language” [16]. Briand et al. made an extensive analysis of studies on the advantages of OOP and found no evidence of improvement: “...technology adoption is mostly the result of marketing forces, not scientific evidence.” [17]

Several studies point out that even a small improvement in software speed can lead to a great increase in productivity and user satisfaction. In a study by Thadhani, decreasing the response time from 2.2 seconds to 0.8 seconds increased programmer output by 58 percent, and code quality improved by more than a factor of two [18].

Flattening the software stack is also an advantage. Modern high-level languages always bring a lot of official and third-party libraries that bloat the software, and many of them also bring their own VM environment. Using full Assembly pushes the architecture to be minimal. In this Assembly game we avoid the browser and all its bloat, and third-party game engines too, interacting directly with OpenGL, OpenAL and the operating system.

However, comparing it with JavaScript is unfair. A C game that does not depend on a game engine or other libraries would probably come close to the performance of the Assembly version.

It is important to consider that a precise measurement of the productivity of a specific programming language is hard to achieve, because of the impact of several factors such as the specific complexity of the programs measured or the programmer’s experience with the language. More experiments on this subject could lead to more precise conclusions. As a final word of wisdom, I must quote the article “An Elementary C Cost Model” by Bentley, Kernighan and Van Wyk [19]:

“Benchmarking is a difficult art, and it is all too easy to read more into a set of numbers than
is really there. Don’t make too much of these numbers, and don’t use them to settle
arguments, or even to start them. It’s much more important to appreciate the approach and
its limitations than to believe these values just because they are printed with two decimal
places.”

11. References

1. R. C. Martin, “Clean Code”, First Edition, Chapter 3 (2008).
2. https://gitlab.com/RodrigoRobles/trevaskas-2
3. https://gitlab.com/RodrigoRobles/RiverRaidRunner
4. https://gitlab.com/RodrigoRobles/RetroJump
5. https://gitlab.com/RodrigoRobles/mazylife
6. R. C. Martin, “The Single Responsibility Principle”, The Clean Code Blog (2005).
7. https://www.ifpug.org/wp-content/uploads/2017/04/IYSM.-Thirty-years-of-IFPUG.-Software-Economics-and-Function-Point-Metrics-Capers-Jones.pdf
8. R. Hyde, “The Art of Assembly Language”, First Edition (1996).
9. R. Pereira et al., “Energy efficiency across programming languages: how do energy, time, and memory relate” (2017).
10. R. Vyhmeister, “Programmer Productivity” (1996). Available online at http://www.andrews.edu/~vyhmeisr/papers/progprod.html
11. F. P. Brooks, “The Mythical Man-Month: Essays on Software Engineering”, Addison-Wesley, Boston, MA (1995).
12. W. M. Taliaffero, “Modularity. The key to system growth potential”, IEEE Software, 1(3):245–257 (July 1971).
13. R. W. Wolverton, “The cost of developing large-scale software”, IEEE Transactions on Computers, C-23(6):615–636 (June 1974).
14. D. P. Delorey et al., “Do Programming Languages Affect Productivity? A Case Study Using Data from Open Source Projects”, FLOSS ’07 (2007).
15. I. Myrtveit, “An empirical study of software development productivity in C and C++” (1999).
16. http://harmful.cat-v.org/software/c++/linus
17. L. C. Briand, E. Arisholm, S. Counsell, F. Houdek and P. Thevenod-Fosse, “Empirical studies of object-oriented artifacts, methods, and processes: state of the art and future directions”, Empirical Software Engineering, vol. 4, no. 4, pp. 387–404 (1999).
18. A. J. Thadhani, “Factors Affecting Programmer Productivity During Application Development”, IBM Systems Journal, 23(1):19–35 (1984).
19. J. L. Bentley, B. W. Kernighan and C. J. Van Wyk, “An elementary C cost model”, Unix Review (1991).
