Some Other Nice Things
I’m describing only one technique here. If you don’t like it, try reading the
Midas documents. The idea of frameskip is to make a program run at the same
speed on all machines. The main idea is very straightforward : when the objects
have been updated once, we’ve drawn one frame. A 386 spends a lot more time
drawing a frame than a Pentium. This means that a 386 should skip some frames
and thus draw fewer frames per time unit. An example : suppose the rotation
angle of an object should increase by 9 degrees per second. We should then do
as follows :
Frame   386 angle   Pentium angle
  1         0             0
  2         3             1
  3         6             2
  4         9             3
  5         -             4
  6         -             5
  7         -             6
  8         -             7
  9         -             8
 10         -             9
Why ? A Pentium is fast enough to draw 10 fps where a 386 can only manage
four. Now your object rotates at the same speed on all computers, just more
smoothly on a Pentium than on a 386.
fstart=read_time
loop
    < calculate >
    < draw >
    < flip >
    < do whatever you want >
    fend=read_time
    angle=angle+speed*(fend-fstart)
    fstart=fend
until 1=2
All variables are of course real or fixed-point numbers. And read_time should
be accurate; the basic 18.2 Hz clock won’t do.
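For reference, here is a minimal sketch of the same loop in C, assuming a POSIX
clock_gettime() timer stands in for read_time (any timer with finer resolution
than the 18.2 Hz BIOS tick will do); the frame body is only a placeholder.

#include <stdio.h>
#include <time.h>

static double read_time(void)              /* current time in seconds */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const double speed = 9.0;              /* degrees per second */
    double angle  = 0.0;
    double fstart = read_time();

    for (int frame = 0; frame < 100; frame++) {
        /* < calculate > < draw > < flip > < do whatever you want > */
        double fend = read_time();
        angle += speed * (fend - fstart);  /* scale by the real frame duration */
        fstart = fend;
        printf("frame %3d  angle %8.3f\n", frame, angle);
    }
    return 0;
}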
Author : Henri 'RoDeX' Tuhkanen
Email : [email protected]
Groups : CyberVision, Embrace, the Damned, Hard Spoiled, tAAt,
Regeneration, Magic Visions and a couple of others I can’t remember
Achievements : 7th prize at asm’96 4k intro compo
A brief portrait : I’m a 19-year-old 3D/gfx coder and I code almost all the time.
I can give away a lot of coding-related material, but mainly algos; I don’t like
rippers. I’m not afraid of being wrong and I want to learn everything about
coding and computers.
With the Pentium, the concepts of instruction pairing and a fast math
coprocessor became known in the PC world. Now I’ll explain how pairing works,
and how you can take advantage of it in your own programs.
There are actually two parallel execution pipes inside a Pentium processor.
Only one of them is complete; it is called the U pipe, and the other (less
complete) one is the V pipe. The V pipe can only execute jumps and some basic
instructions like mov, add, and lea. Pairing means these two pipes working
simultaneously on the same clock, and it only happens under special
circumstances.
In the instructions above we’ve assumed that both instructions pair in either
pipe. Even though this is the common case, things are not always like that.
Example :
mov ebx,edx ; 1
shl eax,1 ; 1 (doesn't pair in V)
; total 2
But :
shl eax,1 ; 1
mov ebx,edx ; 0
; total 1
(shl pairs only in U pipe)
cmp eax,ebx ; 1 U
jne (somewhere) ; 0 (pairs only in V)
The flags are updated so quickly that they’re available at the same clock, so
the above example works. There are many instructions which pair only in the U
or V pipe, or not at all. SHL and ROL pair only in the U pipe, jumps pair only
in the V pipe, while CLI and MUL don’t pair at all (they reserve both pipes).
When you start finding out which instructions pair with each other, you notice
that you begin forming instruction pairs and moving instructions to places
where they pair. With many instructions, though, whether they pair depends on
the situation. The pairing rules can be found at least in the Pentium update
of HelpPC.
Now the theory should be quite clear, so it’s time to try optimizing a real
piece of code. In the following example we’ll optimize a slow line-drawing
inner loop.
UV : pairs in both pipes
NU : pairs only in U
NV : pairs only in V
NP : doesn’t pair at all
? : The speed of memory references depends on many things. We
suppose that the situation is ideal.
Code :
@@inner:
Normally we use pmode in flat mode, which means that we must add the starting
address of the screen memory to edi. The loop jump is best made short, because
the distance to @@inner is less than 128 bytes; this saves two bytes in the
assembled instruction. Now we can get rid of 'add edi,ebx' by changing the
store to ‘mov [edi+ebx+0a0000h],b 10d’. We also arrange the instructions so
that they pair whenever possible.
@@inner:
; 4?
The difference in speed is remarkable. The downside is that the code gets
messier, but that’s a cheap price to pay for speed. Well, the clock counts on
the lines where a [variable] is added to a register are very questionable and
probably something quite different from zero. In any case, the loop is now
pairing efficiently. I just find it pointless to multiply y by 320 every time.
So :
eax=start_x*256
edi=start_y*320
[xp]=x_coeff*256
[yp]=y_coeff*320
edi=0a0000h
ebx=0
@@inner:
dec cx ; 1 U UV cx-=1
; 3?
Fast ? This is partly an illusion, because there are many other things which
affect speed; a few examples are cache misses and interrupts, which stop
pairing and often cause a cache miss as well. A cache miss is a situation in
which the needed data is not found in the processor’s internal memory (level 1
cache) and must be brought in from the external cache (level 2 cache). This
usually means a couple of extra clock cycles and a stop in pairing. If the data
is not found in the level 2 cache either, it must be brought from actual
memory, resulting in a loss of over 10 clocks. These fetch times depend
directly on the memory the computer is using, and in certain cases the
differences can be big. For example, a fetch could take 5 clocks with 60 ns
multi-access EDO and 15 clocks with normal memory. With the level 2 cache, the
difference between pipeline burst and some other type can be two clocks. So at
this point we notice that not everything depends on the code.
Luckily we have one trick left : we can’t affect the caches directly, but we
can speed up memory handling by arranging the data. For example, variables that
are used in the same loop should be arranged so that they lie consecutively in
memory, in blocks of 32 bytes. This is because the Pentium moves data between
the cache and normal memory in 32-byte blocks, and such a block is never split.
So it’s worth trying to align the code and the data to 32 bytes. Especially
loops, and the variables used in them, should fit in as few blocks as possible.
It’s always worth using your imagination when coding critical loops to get the
best possible use out of the level 1 cache.
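As a rough modern illustration in C (not something a 1996 DOS compiler would
take), the hot variables of a loop can be packed into one 32-byte, 32-byte
aligned block with the C11 alignas keyword; the struct name and fields below
are made up for this sketch.

#include <stdio.h>
#include <stdint.h>
#include <stdalign.h>

/* All variables touched by the inner loop packed into one 32-byte block,
   so they land in a single cache line instead of several. */
struct line_vars
{
    int32_t xp, yp;        /* fixed-point x/y increments           */
    int32_t x, y;          /* current fixed-point position         */
    int32_t count;         /* pixels left to draw                  */
    int32_t pad[3];        /* pad the block up to exactly 32 bytes */
};

static alignas(32) struct line_vars lv;

int main(void)
{
    printf("size %zu bytes, start address mod 32 = %u\n",
           sizeof lv, (unsigned)((uintptr_t)&lv % 32));
    return 0;
}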
We perform the following process : every color sphere pulls the closest
palette sphere towards itself. The bigger a color sphere is, the bigger its
pulling force. Now nearly all of the palette spheres are being pulled by one
or more color spheres. (The new coordinates of a palette sphere are calculated
as the average of the color values of the color spheres pulling it, in other
words the sum of the colors divided by the number of color spheres.) The
palette spheres which are not pulled by any color sphere teleport near some
color sphere. The palette spheres keep moving around the space like this until
their movement slows below a defined threshold (trust me, it really does slow
down). Now the new palette can be read from the coordinates of the palette
spheres.
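A rough C sketch of one iteration of this pulling process might look like the
following; the struct names, the 256-entry limit and the count-weighted
averaging are my own reading of the description, not the author’s actual code.

#include <math.h>
#include <stdlib.h>

typedef struct { unsigned char R, G, B; unsigned long count; } color;  /* a color sphere   */
typedef struct { float R, G, B; } palSphere;                           /* a palette sphere */

/* One iteration : every color pulls its nearest palette sphere, each pulled
   sphere moves to the count-weighted average of the colors pulling it, and
   spheres nobody pulled teleport next to a random color.  Returns the largest
   movement of any palette sphere (sum of absolute component changes).
   Assumes nPal <= 256. */
static float pull_iteration(const color *col, int nCol, palSphere *pal, int nPal)
{
    double sumR[256] = {0}, sumG[256] = {0}, sumB[256] = {0}, weight[256] = {0};
    float maxMove = 0.0f;

    for (int i = 0; i < nCol; i++) {
        int best = 0;
        float bestDist = 1e30f;
        for (int j = 0; j < nPal; j++) {          /* find the closest palette sphere */
            float dr = col[i].R - pal[j].R;
            float dg = col[i].G - pal[j].G;
            float db = col[i].B - pal[j].B;
            float d  = dr * dr + dg * dg + db * db;
            if (d < bestDist) { bestDist = d; best = j; }
        }
        sumR[best]   += (double)col[i].R * col[i].count;   /* bigger sphere, bigger pull */
        sumG[best]   += (double)col[i].G * col[i].count;
        sumB[best]   += (double)col[i].B * col[i].count;
        weight[best] += col[i].count;
    }

    for (int j = 0; j < nPal; j++) {
        float nR, nG, nB;
        if (weight[j] > 0.0) {                    /* average of the colors pulling it */
            nR = (float)(sumR[j] / weight[j]);
            nG = (float)(sumG[j] / weight[j]);
            nB = (float)(sumB[j] / weight[j]);
        } else {                                  /* pulled by nobody : teleport near a color */
            const color *c = &col[rand() % nCol];
            nR = (float)c->R; nG = (float)c->G; nB = (float)c->B + 1.0f;
        }
        float move = fabsf(nR - pal[j].R) + fabsf(nG - pal[j].G) + fabsf(nB - pal[j].B);
        if (move > maxMove) maxMove = move;
        pal[j].R = nR; pal[j].G = nG; pal[j].B = nB;
    }
    return maxMove;
}

/* Repeat until the movement slows below some threshold, e.g. :
   while (pull_iteration(colors, nColors, palette, 256) > 0.5f) ;  */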
Now we create another table which contains a list of the colors the picture
originally has. So we go through the histogram, and at every point where there
is some color (the value being greater than zero), we put the color value and
its amount into this new table. The table can, for example, look like this :
typedef struct
{
    unsigned char R,G,B;   // color values
    unsigned long count;   // number of pixels with this color in the pic
} colorListStruct;

colorListStruct colorList[32768];
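Filling this table from a 32x32x32 histogram could, as a rough sketch, look
like the following; the histogram array and its 5-bits-per-component packing
are assumptions made for illustration, and colorList is the table declared
above.

static unsigned long histogram[32768];   /* pixel counts, filled while scanning the image */
static int numColors;                    /* number of entries used in colorList           */

static void buildColorList(void)
{
    numColors = 0;
    for (int i = 0; i < 32768; i++) {
        if (histogram[i] > 0) {                            /* this color occurs in the pic */
            colorList[numColors].R     = (i >> 10) & 31;   /* 5 bits per component */
            colorList[numColors].G     = (i >>  5) & 31;
            colorList[numColors].B     =  i        & 31;
            colorList[numColors].count = histogram[i];
            numColors++;
        }
    }
}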
Since the palette we’re going to use is not going to include every color in
the colorspace, we only use a "subspace".
Now, by averaging each subspace, we get two colors that we know are near the
right ones. At this point the target image will look pretty bad, but after,
say, 16 splits, you can already tell what the image will really look like, and
at 256 levels you can hardly see the difference (well ok, that depends heavily
on the image).
The example image may give you a slightly wrong idea of the algorithm. There
will be a gap between the new subspaces, and they will most probably shrink in
the other dimensions as well.
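A rough sketch of the averaging step, working on a slice colorList[first..last]
of the table above and weighting each color by its pixel count (an assumption;
a plain average of the distinct colors would also match the text) :

/* Average one subspace, colorList[first..last], into a single palette color. */
static void averageSubspace(int first, int last,
                            unsigned char *outR, unsigned char *outG,
                            unsigned char *outB)
{
    unsigned long long r = 0, g = 0, b = 0, n = 0;

    for (int i = first; i <= last; i++) {
        r += (unsigned long long)colorList[i].R * colorList[i].count;
        g += (unsigned long long)colorList[i].G * colorList[i].count;
        b += (unsigned long long)colorList[i].B * colorList[i].count;
        n += colorList[i].count;
    }
    if (n == 0) n = 1;                   /* empty subspace : avoid dividing by zero */
    *outR = (unsigned char)(r / n);
    *outG = (unsigned char)(g / n);
    *outB = (unsigned char)(b / n);
}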
1. Analyze subspaces
Find out the information about the subspace that we need. In
practice this means checking which color component (R, G or B) has
the largest range (max-min) and what the size of the subspace is.
This size may mean the number of colors in the subspace, the sum of
the values of the biggest component, or just the largest component
range. You’ll need the sum of the values of the biggest component
later in any case.
2. Select largest subspace
This is simple; just check the values from each subspace and find
out which is the largest.
3. Sort it by the largest component
This is by far the most power-hungry part of the algorithm. You’ll
need to sort the colors by the component you found in the analysis
phase.
4. Cut it in two subspaces
How you do this depends on your way of implementing the whole
thing. You might cut a linked list into two, or just define that the old
subspace ends in index N and the new one starts at N+1. The tricky
thing is to know where to cut.
You could cut the subspace in half (by leaving as many colors on
one side as the other, or by leaving as many color intensities on one
side as the other), but we’ll cut it on the median of the color values.
You calculated the sum of all color values of the component you
sorted the subspace by.
Now you’ll need to find the median, that is, the position where the running
sum of values reaches the midpoint of the whole sum.
Example : We have the following values : 1, 5, 7, 9, 10, 11, 17, 21. The sum
of these is 81. Half of the sum is 40.5. Now, to find the median, we start
accumulating a new sum until it passes 40.5 : 1+5+7+9+10+11=43, so the cut goes
just before the 11, and the first half will have 5 values and the second, 3.
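A small C sketch of this cutting rule (the function name is mine; it just
reproduces the arithmetic above) :

#include <stdio.h>

/* Returns the number of values that go into the first half : we accumulate
   until the running sum would pass half of the total, and cut just before
   the value that crosses it. */
static int median_cut_index(const int *v, int n)
{
    long total = 0, running = 0;
    for (int i = 0; i < n; i++) total += v[i];

    for (int i = 0; i < n; i++) {
        if (2 * (running + v[i]) > total)   /* adding v[i] would cross total/2 */
            return i;
        running += v[i];
    }
    return n;
}

int main(void)
{
    int v[] = { 1, 5, 7, 9, 10, 11, 17, 21 };   /* the example values */
    int cut = median_cut_index(v, 8);
    printf("first half : %d values, second half : %d values\n", cut, 8 - cut);
    return 0;
}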
Since many pictures tend to have black in them, and since with VGA we usually
want color 0 to be black (so the borders won’t annoy us too much), I solved
this by separating all the black colors from the input values into a separate
list and forcing color 0 to be black. Otherwise there will not be a completely
black color !
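A rough sketch of that separation, using the colorListStruct table from above
(the function name is mine) :

/* Remove all pure-black entries from the color list before quantization;
   they are covered by palette entry 0, which is forced to black by hand. */
static int stripBlack(colorListStruct *list, int n)
{
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (list[i].R == 0 && list[i].G == 0 && list[i].B == 0)
            continue;                    /* black pixels are handled by palette entry 0 */
        list[out++] = list[i];           /* keep the non-black colors */
    }
    return out;                          /* new number of colors in the list */
}

/* After quantizing the remaining colors into palette entries 1..255,
   force entry 0 to R=G=B=0 so a completely black color always exists. */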