Some Other Nice Things

The document discusses advanced techniques in 3D coding, focusing on frame skipping to ensure consistent performance across different hardware, and optimizing assembly code for Pentium processors. It explains the concept of pairing instructions to enhance execution speed and introduces palette quantization methods for optimizing color representation in graphics. The author, Henri 'RoDeX' Tuhkanen, shares insights on coding practices and techniques for efficient memory handling and instruction pairing.

3DICA v2.22b
The Ultimate 3D Coding Tutorial (C) Ica /Hubris 1996,1997,1998
Over 150k of pure sh...er, 3d coding power !

7. Some Other Nice Things


7.1 Frame skipping

I describe only one technique here. If you don’t like it, try reading the
Midas documents. The idea of frame skipping is to make a program run at the
same speed on all machines. The principle is straightforward: every time the
objects have been updated, one frame is drawn. A 386 spends a lot more time
drawing a frame than a Pentium, so a 386 should skip some frames and thus
draw fewer frames per unit of time. An example: suppose that the rotation
angle of an object should increase by 9 degrees per second. We should then
proceed as follows :

Frame   Angle on a 386   Angle on a Pentium
  1           0                  0
  2           3                  1
  3           6                  2
  4           9                  3
  5           -                  4
  6           -                  5
  7           -                  6
  8           -                  7
  9           -                  8
 10           -                  9

Why ? A Pentium is fast enough to draw 10 fps whereas a 386 manages only
four. Now your object rotates at the same speed on all computers, just more
smoothly on a Pentium than on a 386.
fstart=read_time
loop
    < calculate >
    < draw >
    < flip >
    < do whatever you want >
    fend=read_time
    angle=angle+speed*(fend-fstart)
    fstart=fend
until 1=2

All variables are of course real (floating-point) or fixed-point numbers.
And read_time must be accurate: the basic 18.2 Hz clock is too coarse.
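The update rule in the loop above can be tried out in C. This is only a sketch: the names advance_angle and simulate are made up for illustration, and instead of a real read_time it assumes every frame takes exactly 1/fps seconds.

```c
#include <assert.h>

/* Advance an angle by `speed` (degrees per second) over `dt` seconds.
   This is the line angle=angle+speed*(fend-fstart) from the loop. */
double advance_angle(double angle, double speed, double dt)
{
    assert(dt >= 0.0);
    return angle + speed * dt;
}

/* Pretend to run `seconds` of animation on a machine that draws `fps`
   frames per second, and return the final angle. In a real program dt
   would be measured with an accurate timer, not assumed constant. */
double simulate(double fps, double seconds, double speed)
{
    double angle = 0.0;
    double dt = 1.0 / fps;                      /* time spent per frame */
    long frames = (long)(fps * seconds + 0.5);  /* frames drawn in total */
    long i;

    for (i = 0; i < frames; i++)
        angle = advance_angle(angle, speed, dt);
    return angle;
}
```

With speed = 9, both simulate(4, 1, 9) and simulate(10, 1, 9) end up at about 9 degrees; the slow machine just takes bigger steps.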

Another technique : create an interrupt which updates all variables 70
times per second, and let the routines draw as fast as they can without
exceeding 70 fps. Not bad, this one, either.
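A rough C sketch of the fixed-rate idea follows. Note the swap: instead of a real timer interrupt it polls the elapsed time once per frame and runs however many 1/70 s updates have become due, which gives the same update rate without touching interrupt vectors. The Ticker type, ticker_advance, and the clock_hz parameter (counts per second of whatever timer you read) are assumptions for illustration.

```c
#include <assert.h>

#define TICK_HZ 70   /* variable updates per second, as in the text */

typedef struct {
    long long acc;   /* elapsed counts * TICK_HZ not yet turned into ticks */
    long ticks;      /* total number of updates performed so far */
} Ticker;

/* Account for `elapsed_counts` timer counts (the timer runs at `clock_hz`
   counts per second) and perform every update that is now due. Returns
   how many updates were run for this frame. Integer math keeps the long
   term rate at exactly TICK_HZ updates per second. */
int ticker_advance(Ticker *t, long long elapsed_counts, long long clock_hz)
{
    int ran = 0;

    assert(elapsed_counts >= 0 && clock_hz > 0);
    t->acc += elapsed_counts * TICK_HZ;
    while (t->acc >= clock_hz) {
        t->acc -= clock_hz;
        t->ticks++;          /* < update all variables once here > */
        ran++;
    }
    return ran;
}
```

A machine that needs 250 ms per frame runs 17 or 18 updates per drawn frame, a machine that needs 100 ms runs 7, and both reach 70 updates after one second.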

7.2 Optimizing in Assembly

(
Author : Henri 'RoDeX' Tuhkanen
Email : [email protected]
Groups : CyberVision, Embrace, the Damned, Hard Spoiled, tAAt,
Regeneration, Magic Visions and a couple of others I can’t remember :)
Achievements : 7th prize at asm’96 4k intro compo
A brief portrait : I’m a 19-year-old 3D/gfx coder and I code almost all the
time. I can give away a lot of coding-related material, but mainly algos; I
don’t like rippers. I’m not afraid of being wrong and I want to learn
everything about coding and computers :)

(text debugged by Chem and translated by Ica)
)

Optimizing assembly code became quite complicated when the Pentium came
out. This piece of text was written to clarify Pentium tricks and to show how
we can produce code that can be up to twice as fast as one would think
possible.

With the Pentium, the concepts of pairing and a fast math coprocessor
became known in the PC world. Now I’ll explain how pairing works and how you
can take advantage of it in your own programs.

There are actually two parallel execution pipelines inside a Pentium
processor. Only one of these is complete. This is called the U pipe; the other
(less complete) one is the V pipe. The V pipe can only perform jumps and some
basic commands like mov, add, and lea. Pairing means that both pipes execute
a command simultaneously, during the same clock cycle. Pairing works only in
special circumstances.

Example. We suppose that the situation is neutral at the beginning of each
series of commands. That means that the first command is performed in the U
pipe (this can be achieved by placing an sti before the series).

    mov eax,ebx     ; U
    add ecx,edx     ; V

    mov eax,ebx     ; U
    add eax,2       ; U (doesn't pair -- needs the new result of eax)

So the first thing to get right is the use of registers. Generally: if a
register is written by one command and read by the next, the commands do not
pair. But if a register is only read, the following command pairs even if it
uses the same register (for read or write). This assumes, of course, that the
two commands are able to pair with each other in the first place. Example :

    mov eax,ebx     ; 1 clock; U
    add eax,ecx     ; 1 clock; U
                    ; total 2 clocks

    add ebx,eax     ; 1 clock; U
    mov edx,eax     ; 0 clocks; V
                    ; total 1 clock

In the commands above we’ve supposed that both of the commands can pair
in either pipe. Even though this is the general situation, things are not
always like that.

Example :

    mov ebx,edx     ; 1
    shl eax,1       ; 1 (doesn't pair in V)
                    ; total 2

But :

    shl eax,1       ; 1
    mov ebx,edx     ; 0
                    ; total 1

(shl pairs only in U pipe)

    cmp eax,ebx     ; 1 U
    jne (somewhere) ; 0 (pairs only in V)

The flags are updated so fast that they’re available for use during the
same clock, so the cmp/jne example above works. There are many instructions
which pair only in the U or V pipe or not at all. SHL and ROL pair only in the
U pipe. Jumps pair only in the V pipe, whereas CLI and MUL don't pair at all
(they reserve both pipes). Once you start finding out which instructions pair
with each other, you notice that you begin forming instruction pairs and
moving instructions to places where they pair. Many instructions behave
differently depending on the situation, though. The pairing rules can be found
at least in the Pentium update of HelpPC.
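To experiment with the rules, they can be written down as a toy model in C. This is a simplified sketch, not a full Pentium simulator: it looks only at the pairing class (named as in the legend used in this chapter) and at general registers, ignores flags, memory operands and every special case, and all type and function names are invented here. One rule is added from the general Pentium pairing rules rather than from the text above: two commands that write the same register do not pair either.

```c
#include <assert.h>

/* Pairing classes: UV = pairs in both pipes, NU = only in U,
   NV = only in V, NP = doesn't pair at all. */
typedef enum { UV, NU, NV, NP } PairClass;

/* One bit per general register, used for the read and write sets. */
enum { EAX = 1, EBX = 2, ECX = 4, EDX = 8, ESI = 16, EDI = 32 };

typedef struct {
    PairClass cls;     /* which pipe(s) the command can pair in */
    unsigned  reads;   /* registers the command reads */
    unsigned  writes;  /* registers the command writes */
} Insn;

/* Returns 1 if `second` can run in the V pipe while `first` runs in U. */
int can_pair(Insn first, Insn second)
{
    if (first.cls != UV && first.cls != NU)
        return 0;                       /* first must be pairable in U */
    if (second.cls != UV && second.cls != NV)
        return 0;                       /* second must be pairable in V */
    if (first.writes & (second.reads | second.writes))
        return 0;                       /* second needs first's result */
    return 1;
}

/* The register examples from the text, expressed with this model. */
void pairing_examples(void)
{
    Insn mov_eax_ebx = { UV, EBX, EAX };
    Insn add_ecx_edx = { UV, ECX | EDX, ECX };
    Insn add_eax_2   = { UV, EAX, EAX };
    Insn shl_eax_1   = { NU, EAX, EAX };
    Insn mov_ebx_edx = { UV, EDX, EBX };

    assert(can_pair(mov_eax_ebx, add_ecx_edx));  /* independent registers */
    assert(!can_pair(mov_eax_ebx, add_eax_2));   /* needs the new eax */
    assert(!can_pair(mov_ebx_edx, shl_eax_1));   /* shl cannot go to V */
    assert(can_pair(shl_eax_1, mov_ebx_edx));    /* shl in U, mov in V */
}
```

Because flags are left out, the cmp/jne case is not modelled; the register sets for each command have to be filled in by hand.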

Now the theory should be quite clear, so it’s time to try optimizing a real
piece of code. In the following example we’ll optimize a slow line drawing
inner loop.

• UV : pairs in both pipes
• NU : pairs only in U
• NV : pairs only in V
• NP : doesn’t pair at all
• ? : The speed of memory references depends on many things. We
  suppose that the situation is ideal.

Line draw, inner loop in the mode 320*200*256. When we enter the loop,
the registers are the following :

• eax = start_x*256
• edx = start_y
• [xp] = x_coefficient*256 (dd, 32 bit)
• [yp] = y_coefficient (dd, 32 bit)
• ebx = 0
• ecx = number of pixels to draw

Code :

;                          Clocks Pipe Pairing Comment
@@inner:
  lea edi,[edx+edx*4]    ; 1      U    UV      edi=edx*5
  mov bl,ah              ; 0      V    UV      ebx=ax/256
  shl edi,6              ; 1      U    NU      edi*=64
  add edi,ebx            ; 1      U    UV      edi+=ebx
  add edi,0a0000h        ; 1      U    UV      edi+=screenstart (needs new edi)
  mov [edi],b 10d        ; 1?     U    UV      [edi]=10
  add edx,[yp]           ; 0?     V    UV      edx+=[yp]
  add eax,[xp]           ; 1?     U    UV      eax+=[xp]
  dec ecx                ; 0      V    UV      ecx-=1
  jnz short @@inner      ; 1      U    NV      jump if not zero (unpaired)
                         ; 7?

Normally we use pmode with a flat memory model, which means that we must
add the starting address of the screen memory to edi. The jump is best encoded
as short because the distance to @@inner is less than 128 bytes; this saves
two bytes in the compiled version of the command. Now we can get rid of the
'add edi,ebx' and the 'add edi,0a0000h' by changing the store to
‘mov [edi+ebx+0a0000h],b 10d’. We also arrange the instructions so that they
pair whenever possible.

;                                Clocks Pipe Pairing Comment
@@inner:
  lea edi,[edx+edx*4]          ; 1      U    UV      edi=edx*5
  mov bl,ah                    ; 0      V    UV      ebx=ax/256
  shl edi,6                    ; 1      U    NU      edi*=64
  add edx,[yp]                 ; 0?     V    UV      edx+=[yp]
  mov [edi+ebx+0a0000h],b 10d  ; 1?     U    UV      []=10
  add eax,[xp]                 ; 0?     V    UV      eax+=[xp]
  dec ecx                      ; 1      U    UV      ecx-=1
  jnz short @@inner            ; 0      V    NV      jump
                               ; 4?

The difference in speed is remarkable. The negative side is that the code
gets messier, but that is a cheap price for speed. Admittedly, the clock counts
of the lines where [variables] are added to registers are very questionable
and probably something quite different from zero. In any case, the loop is now
pairing efficiently. I just find it useless to multiply y by 320 every time.
So :

• eax = start_x*256
• edi = start_y*320 + 0a0000h
• [xp] = x_coeff*256
• [yp] = y_coeff*320
• ebx = 0
• ecx = number of pixels to draw

;                        Clocks Pipe Pairing Comment
@@inner:
  mov bl,ah            ; 1      U    UV      ebx=ax/256
  add eax,[xp]         ; 0?     V    UV      eax+=[xp]
  mov [edi+ebx],b 10d  ; 1?     U    UV      []=10
  add edi,[yp]         ; 0?     V    UV      edi+=[yp]
  dec ecx              ; 1      U    UV      ecx-=1
  jnz @@inner          ; 0      V    NV      jump if not zero
                       ; 3?
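For readers who think more easily in C, here is roughly what the final loop computes. It is a sketch under assumptions: the function name, the parameter names and the bounds checks are not from the original, the screen is an ordinary 320*200 byte buffer instead of address 0a0000h, and the integer part of x is taken with a shift (the assembly reads only the 8 bits in ah).

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Plot `count` pixels of a line that advances one row per pixel.
   x_fp      : x position * 256 (fixed point, like eax)
   offset    : start_y * 320 (like edi without the screen address)
   x_step_fp : x increment per pixel * 256 (like [xp])
   row_step  : offset change per pixel, +320 or -320 (like [yp]) */
void draw_line_fixed(uint8_t *screen, int32_t x_fp, int32_t offset,
                     int32_t x_step_fp, int32_t row_step,
                     int count, uint8_t color)
{
    while (count-- > 0) {
        int32_t x = x_fp >> 8;        /* integer part, like mov bl,ah */

        assert(x >= 0 && x < SCREEN_W);
        assert(offset >= 0 && offset < SCREEN_W * SCREEN_H);
        screen[offset + x] = color;   /* like mov [edi+ebx],b 10d */
        x_fp += x_step_fp;            /* like add eax,[xp] */
        offset += row_step;           /* like add edi,[yp] */
    }
}
```

Starting at (10,5) with an x step of 0.5 (x_step_fp = 128) and four pixels, the pixels land on (10,5), (10,6), (11,7) and (11,8).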

Fast ? This is just an illusion, because many other things affect the
speed. A few examples are cache misses, and interrupts, which stop pairing and
often cause a cache miss as well. A cache miss is a situation in which the
needed data is not found in the processor’s internal memory (level 1 cache)
and must be fetched from the external cache (level 2 cache). This usually
costs a couple of extra clock cycles and stops pairing. If the data is not
found even in the level 2 cache, it must be fetched from the actual memory,
resulting in a loss of over 10 clocks. These fetch times depend directly on
the memory the computer is using. In certain cases the differences can be big.
For example, a fetch could take 5 clocks with 60 ns multiaccess EDO memory and
15 clocks with normal memory. With the level 2 cache, the difference between
pipeline burst cache and some other type can be two clocks. So at this stage
we notice that not everything depends on the code :)

Luckily we have a trick left : we can’t control the caches directly, but we
can speed up memory handling by arranging the data. For example, variables
that are used in the same loop should be placed consecutively in memory and
in blocks of 32 bytes. This is because the Pentium moves data between cache
and normal memory in 32-byte blocks, and a block like this is never split. So
it’s worth trying to align the code and the data to 32 bytes. Loops in
particular, and the variables used in them, should occupy as few blocks as
possible. It’s always worth using your imagination when coding critical loops
to get the best possible use out of the level 1 cache.
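In C, the same arrangement can be requested from the compiler. The sketch below assumes a C11 compiler (for alignas and static_assert); the LoopVars structure and its fields are only an example of "variables used in the same loop", not something from the original code.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 32   /* size of a Pentium cache block, from the text */

/* Variables used together in one inner loop, kept in a single block.
   alignas on the first member aligns the whole structure to 32 bytes
   and pads its size up to 32 bytes. */
typedef struct {
    alignas(CACHE_LINE) int32_t xp;   /* x step */
    int32_t yp;                       /* y step */
    int32_t count;                    /* pixels left */
    uint8_t color;                    /* pixel value */
} LoopVars;

/* Fails at compile time if the structure ever outgrows one block. */
static_assert(sizeof(LoopVars) == CACHE_LINE,
              "LoopVars should fill exactly one 32-byte block");

/* Returns 1 if p starts on a 32-byte boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

In assembly the equivalent is an alignment directive before the data or the loop label; how that is spelled depends on the assembler.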

7.3 Palette Quantization

7.3.1 What is it actually ?

Palette quantization is a way of making a 256-color (or why not 16-color :)
mode look as good as possible by optimizing the palette. There are many
quantization techniques, two of which are described here : Local K Mean and
Median Cut.

7.3.2 Local K Mean
[The text was originally written by Sampsa Lehtonen (TexMex
/Gigamess, [email protected]), this is an edited version. Thanks.]

7.3.2.1 An abstract approach

Abstractly, LK works as follows : the colors of the picture to be
quantized are imagined as spheres in a cube-shaped color space (XYZ = RGB).
The more of a color there is, the bigger its sphere. Into the color space we
add palette spheres, which can move freely, unlike the color spheres, which
don’t move at all. The number of palette spheres is the same as the size of
the desired palette (256 colors -> 256 spheres).

We perform the following process : every color sphere pulls the closest
palette sphere. The bigger a color sphere is, the bigger its pulling force.
Now almost all of the palette spheres are pulled by one or more color spheres.
(The new coordinates of a palette sphere are calculated as the average of the
color values of the color spheres pulling it, in other words the sum of the
colors divided by the number of color spheres.) The palette spheres which are
not pulled by any color sphere teleport next to some color sphere. The palette
spheres keep moving in the space like this until their movement slows below a
defined level (trust me, it really slows down). Then the new palette can be
read from the coordinates of the palette spheres.

7.3.2.2 A more technical approach

Let’s use the following example : we have a truecolor picture which
should be changed to 256 colors. First we create a histogram of the picture.
The histogram is indexed by 15-bit color values (5 bits per component; any
other 3*x bit size works, too).

Now we create another table which lists the colors the picture originally
has. We go through the histogram, and for every entry where there is some
color (the value being greater than zero), we put the color value and its
amount into this new table. The table can for example be like this :
typedef struct
{
    unsigned char R,G,B;  // color values
    unsigned long count;  // number of pixels of this color in the pic
} colorListStruct;
colorListStruct colorList[32768];
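The two tables can be built roughly as follows. This is a sketch under assumptions: the picture is taken to be an array of 8-bit R,G,B triplets, each component is cut to its top 5 bits to form the 15-bit index, the color list stores those 5-bit values, and the function names are invented for illustration. The value returned by build_color_list is the number of different colors.

```c
#include <assert.h>
#include <stddef.h>

#define HIST_SIZE 32768   /* 2^15 entries: 5 bits each for R, G and B */

typedef struct {
    unsigned char R, G, B;   /* 5-bit color values (0..31) */
    unsigned long count;     /* number of pixels of this color */
} colorListStruct;

/* Count the pixels of every 15-bit color. `rgb` holds n_pixels triplets
   of 8-bit R, G, B; `histogram` must have HIST_SIZE entries. */
void build_histogram(const unsigned char *rgb, size_t n_pixels,
                     unsigned long *histogram)
{
    size_t i;

    assert(histogram != NULL);
    for (i = 0; i < HIST_SIZE; i++)
        histogram[i] = 0;
    for (i = 0; i < n_pixels; i++) {
        unsigned r = rgb[3 * i] >> 3;       /* keep the top 5 bits */
        unsigned g = rgb[3 * i + 1] >> 3;
        unsigned b = rgb[3 * i + 2] >> 3;
        histogram[(r << 10) | (g << 5) | b]++;
    }
}

/* Copy every used histogram entry into colorList (HIST_SIZE entries
   available). Returns the number of different colors found. */
size_t build_color_list(const unsigned long *histogram,
                        colorListStruct *colorList)
{
    size_t n = 0;
    unsigned i;

    for (i = 0; i < HIST_SIZE; i++) {
        if (histogram[i] > 0) {
            colorList[n].R = (unsigned char)(i >> 10);
            colorList[n].G = (unsigned char)((i >> 5) & 31);
            colorList[n].B = (unsigned char)(i & 31);
            colorList[n].count = histogram[i];
            n++;
        }
    }
    return n;
}
```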

Additionally, the number of different colors is saved into a variable
(colorListCount). Then we create a basic palette :

unsigned long palette[256][3]; // 3: R,G & B

We also need three other variables :

unsigned long colorSum[256][3]; // 256 colors, 3 = R,G & B

(the following one could be attached to colorSum, too)

unsigned long colorCount[256];

and then a variable in which we save the change in the palette :

unsigned long variance;

Now we go through the following steps :

1) Reset colorSum and colorCount (all zeros), and fill palette with the
   colors at the beginning of colorList
2) Go through all colors in colorList (c = 0..colorListCount-1)
   a) take color c from colorList
   b) find the closest color in palette for it (we get a number x=0..255)
   c) add this color into colorSum, for example
        colorSum[x][0] += colorList[c].R;
        colorSum[x][1] += colorList[c].G;
        colorSum[x][2] += colorList[c].B;
   d) increment colorCount at the point x (colorCount[x]++;)
3) variance=0
4) Go through all colors in the basic palette (c = 0..255)
   a) if colorCount[c] > 0, calculate the R, G, and B values with the help
      of colorSum and colorCount (average color) :
        R = colorSum[c][0] / colorCount[c];
        etc.
      else take a random color from colorList :
        R,G & B <- colorList[RANDOM]
   b) calculate the variance :
        temp = abs(R-palette[c][0]); //variance in red
        variance+=temp; //save it
        etc.
   c) save the new color :
        palette[c][0] = R;
        etc.
5) reset colorSum and colorCount
6) if variance > MAX_VARIANCE goto 2 (MAX_VARIANCE is the limit below
   which the palette is considered ready. The smaller the number, the
   slower the process.)
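The steps above can be sketched in C as follows. Treat it as one possible reading, not the original implementation: the function names, the squared-distance color comparison and the max_passes safety limit are assumptions. As in the steps, the count field is not used as a weight; every color in the list pulls equally.

```c
#include <assert.h>
#include <stdlib.h>

#define PAL_SIZE 256

typedef struct {
    unsigned char R, G, B;
    unsigned long count;
} colorListStruct;

/* Step 2b: index of the palette entry closest to (r,g,b). */
int closest_color(unsigned long palette[][3], int pal_size,
                  long r, long g, long b)
{
    int i, best = 0;
    long best_d = -1;

    for (i = 0; i < pal_size; i++) {
        long dr = r - (long)palette[i][0];
        long dg = g - (long)palette[i][1];
        long db = b - (long)palette[i][2];
        long d = dr * dr + dg * dg + db * db;
        if (best_d < 0 || d < best_d) { best_d = d; best = i; }
    }
    return best;
}

/* Steps 2-5: one pass. Returns the variance (total palette movement). */
unsigned long lkm_pass(const colorListStruct *colorList, size_t colorListCount,
                       unsigned long palette[][3], int pal_size)
{
    static unsigned long colorSum[PAL_SIZE][3];
    static unsigned long colorCount[PAL_SIZE];
    unsigned long variance = 0;
    size_t c;
    int p, k;

    assert(pal_size > 0 && pal_size <= PAL_SIZE && colorListCount > 0);
    for (p = 0; p < pal_size; p++) {
        colorCount[p] = 0;
        for (k = 0; k < 3; k++) colorSum[p][k] = 0;
    }
    for (c = 0; c < colorListCount; c++) {
        int x = closest_color(palette, pal_size,
                              colorList[c].R, colorList[c].G, colorList[c].B);
        colorSum[x][0] += colorList[c].R;
        colorSum[x][1] += colorList[c].G;
        colorSum[x][2] += colorList[c].B;
        colorCount[x]++;
    }
    for (p = 0; p < pal_size; p++) {
        unsigned long rgb[3];
        if (colorCount[p] > 0) {             /* average of the pulling colors */
            for (k = 0; k < 3; k++) rgb[k] = colorSum[p][k] / colorCount[p];
        } else {                             /* unused entry: random color */
            const colorListStruct *r = &colorList[(size_t)rand() % colorListCount];
            rgb[0] = r->R; rgb[1] = r->G; rgb[2] = r->B;
        }
        for (k = 0; k < 3; k++) {
            variance += (unsigned long)labs((long)rgb[k] - (long)palette[p][k]);
            palette[p][k] = rgb[k];
        }
    }
    return variance;
}

/* Steps 1 and 6: seed the palette from the start of colorList, then repeat
   passes until the variance drops to max_variance or max_passes is reached.
   Returns the number of passes made. */
int lkm_quantize(const colorListStruct *colorList, size_t colorListCount,
                 unsigned long palette[][3], int pal_size,
                 unsigned long max_variance, int max_passes)
{
    int p, passes = 0;

    for (p = 0; p < pal_size; p++) {
        const colorListStruct *s = &colorList[(size_t)p % colorListCount];
        palette[p][0] = s->R; palette[p][1] = s->G; palette[p][2] = s->B;
    }
    while (passes < max_passes) {
        passes++;
        if (lkm_pass(colorList, colorListCount, palette, pal_size) <= max_variance)
            break;
    }
    return passes;
}
```

The pass limit is there because the random replacement of unused entries can keep the variance above the limit, for example when the picture has fewer colors than the palette.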

7.3.3 Median Cut

Written by Jari Komppa aka Sol/Trauma ([email protected]). Thanks.

Disclaimer : The information here is based on material I’ve read (can’t
remember any document names) and some sources I’ve poked at. As such I
can’t give any pointers to more information nor do I have the math background
for this algorithm.

7.3.3.1 The definition

[Figure: RGB colorspace cube]

The RGB colorspace can be thought of as a 3-dimensional cube. Every
color is a 3D position within this cube. If you take every point of the RGB
cube and calculate the average, you’ll get a mid-gray color.

Since the palette we’re going to use is not going to include every color in
the colorspace, we only use a "subspace".

[Figure: Colorspace with a subspace in gray]

This subspace is defined by finding out the min/max R, G, and B values of
the colors in use. In this example, blue and red use the whole range and green
uses only a part. If we calculate the average of these colors, we can make a
simple b/w image out of the rendered image by checking whether a pixel is
"brighter than" this averaged color, and dotting black and white pixels
accordingly. But what if we want to use more than one color ?

Let’s say we cut the subspace in two.

[Figure: Colorspace with two subspaces]

Now, by averaging each subspace, we have two colors that we know are
near the right colors. At this point the target image will look pretty bad,
but after, say, 16 splits, you can already tell what the image will really
look like, and at 256 levels you can hardly see the difference (well ok,
depends heavily on the image).

The example image may give you a slightly wrong idea of the algorithm.
There will be a gap between the new subspaces, and they will most probably
shrink in other dimensions as well.

7.3.3.2 The algorithm

Basically we do the following:


1. Analyze subspaces
2. Select the largest subspace
3. Sort it by the largest component (R, G or B)
4. Cut it into two subspaces
5. Repeat 1-4 until we have as many subspaces as we want.
6. Average all colors in each subspace into palette

Let’s check each (actually rather simple) step one by one.

1. Analyze subspaces
   Find out the information we need about the subspace. In practice this
   means checking which color component (R, G or B) has the largest
   range (max-min) and what the size of the subspace is. This size may
   mean the number of colors in the subspace, the sum of the values of
   the biggest component, or just the largest component range. You’ll
   need the sum of the values of the biggest component later in any
   case.
2. Select the largest subspace
   This is simple; just check the values from each subspace and find
   out which is the largest.
3. Sort it by the largest component
   This is by far the most power-eating part of this algorithm. You’ll
   need to sort the colors by the component you found in the analyze
   phase.
4. Cut it into two subspaces
   How you do this depends on your way of implementing the whole
   thing. You might cut a linked list into two, or just define that the
   old subspace ends at index N and the new one starts at N+1. The
   tricky thing is to know where to cut.

   You could cut the subspace in half (by leaving as many colors on
   one side as the other, or by leaving as many color intensities on one
   side as the other), but we’ll cut it at the median of the color
   values. In the analyze phase you calculated the sum of all color
   values of the component you sorted the subspace by. Now you’ll need
   to find the median, that is, the position where the running sum of
   values reaches the middle point of the whole sum.

   Example : We have the following values : 1, 5, 7, 9, 10, 11, 17, 21.
   The sum of these is 81. Half of the sum is 40.5. To find the median,
   we start calculating a new sum until it reaches 40.5 :
   1+5+7+9+10=32 is still below it, and adding the sixth value gives
   1+5+7+9+10+11=43. So the first half will have 5 values and the
   second, 3.
5. Repeat 1-4 until we have as many subspaces as we want.
   As we now have one new subspace, we need to analyze it, and
   reanalyze the one it was clipped from. When we have enough
   subspaces for our needs, we go on.
6. Average all colors in each subspace into palette
   Just plain & simple averaging.
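Steps 1 and 4 are the ones with actual arithmetic in them, so here is a small C sketch of both. The Color type and the function names are assumptions made for this example, and the cut rule follows the worked example above: the first half gets the values before the one at which the running sum reaches half of the total.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned char c[3]; } Color;   /* c[0]=R, c[1]=G, c[2]=B */

/* Step 1: return the component (0=R, 1=G, 2=B) with the largest max-min
   range among colors[0..n-1]. If `range` is not NULL, the range itself
   is stored there. */
int largest_component(const Color *colors, size_t n, int *range)
{
    int k, best = 0, best_range = -1;

    assert(n > 0);
    for (k = 0; k < 3; k++) {
        int lo = 255, hi = 0;
        size_t i;
        for (i = 0; i < n; i++) {
            if (colors[i].c[k] < lo) lo = colors[i].c[k];
            if (colors[i].c[k] > hi) hi = colors[i].c[k];
        }
        if (hi - lo > best_range) { best_range = hi - lo; best = k; }
    }
    if (range != NULL) *range = best_range;
    return best;
}

/* Step 4: `sorted` holds the chosen component of every color in the
   subspace, in ascending order. Returns how many values go into the
   first half. */
size_t median_cut_index(const unsigned *sorted, size_t n)
{
    unsigned long total = 0, running = 0;
    size_t i;

    assert(n >= 2);
    for (i = 0; i < n; i++) total += sorted[i];
    for (i = 0; i < n; i++) {
        running += sorted[i];
        if (2 * running >= total) break;   /* reached half of the sum */
    }
    if (i == 0) i = 1;        /* keep at least one value in the first half */
    if (i >= n) i = n - 1;    /* and at least one in the second */
    return i;
}
```

For the values 1, 5, 7, 9, 10, 11, 17, 21 median_cut_index returns 5, which matches the example.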

7.3.3.3 Implementation hints

There are some problems, workarounds, and speed issues.

Many pictures tend to have black in them, and if we are using VGA, we
usually like color 0 to be black so that the borders won’t annoy us too much.
I solved this by separating all black colors from the input values into a
separate list, and forcing color 0 to be black. Otherwise there will not be a
completely black color !

If you wish to remap graphics (sprites, textures, whatever), performing a
nearest-color search for every color you put into the reduction is a bit of a
waste. You could, instead, drag the original color index along with the color
itself (so that it will be sorted with the original color etc.) and then, when
averaging the subspaces, you’d directly know the color indices that should be
of that color.

Remember not to do things repeatedly without a reason. (After analyzing,
store the values and reanalyze only if there is a need; also, sort only if it
really is needed).

The biggest power-consuming part (about 60%+ in my implementation) of
the algorithm is the sort. Killing off duplicates might help, but I didn’t
bother trying that.

My implementation used a singly-linked list for the colors since I wanted
it to be dynamic. There were a couple of problems with this approach :
allocating and freeing memory for 250000+ colors took more time than the color
reduction itself, which was easily solved by allocating enough memory for one
whole colormap at a time. Another problem was sorting, which I solved by
making a list of pointers and re-linking the list after the sort. Making a
static list of colors should be just as easy, and quite probably faster. I
used radix sorting. On a p150 it reduced 640*400 colors into 255 in about 4.5
seconds. Coding it took about one day (most of it wasted debugging).
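The text doesn’t say how the radix sort was written. Since the sort key here is a single 8-bit component, one counting pass (a one-digit radix sort) is enough, so the following C sketch assumes exactly that. It works on a static array rather than a linked list, and the Color type and function name are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned char c[3]; } Color;   /* c[0]=R, c[1]=G, c[2]=B */

/* Sort src[0..n-1] into dst by component k (0=R, 1=G, 2=B), ascending.
   The sort is stable: colors with equal keys keep their original order.
   src and dst must not overlap. */
void sort_by_component(const Color *src, Color *dst, size_t n, int k)
{
    size_t start[256] = {0};
    size_t pos = 0, i;
    int v;

    assert(k >= 0 && k < 3);
    for (i = 0; i < n; i++)          /* count how often each key occurs */
        start[src[i].c[k]]++;
    for (v = 0; v < 256; v++) {      /* turn counts into start positions */
        size_t cnt = start[v];
        start[v] = pos;
        pos += cnt;
    }
    for (i = 0; i < n; i++)          /* place every color in its slot */
        dst[start[src[i].c[k]]++] = src[i];
}
```

The work grows linearly with the number of colors, which matters when a subspace holds a couple of hundred thousand of them.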
