
The Computation of Transcendental Functions on the IA-64 Architecture

John Harrison, Microprocessor Software Labs, Intel Corporation


Ted Kubaska, Microprocessor Software Labs, Intel Corporation
Shane Story, Microprocessor Software Labs, Intel Corporation
Ping Tak Peter Tang, Microprocessor Software Labs, Intel Corporation

Index words: floating point, mathematical software, transcendental functions


ABSTRACT

The fast and accurate evaluation of transcendental functions (e.g. exp, log, sin, and atan) is vitally important in many fields of scientific computing. Intel provides a software library of these functions that can be called from both the C* and FORTRAN* programming languages. By exploiting some of the key features of the IA-64 floating-point architecture, we have been able to provide double-precision transcendental functions that are highly accurate yet can typically be evaluated in between 50 and 70 clock cycles. In this paper, we discuss some of the design principles and implementation details of these functions.

* All other brands and names are the property of their respective owners.

INTRODUCTION

Transcendental functions can be computed in software by a variety of algorithms. The algorithms that are most suitable for implementation on modern computer architectures usually comprise three steps: reduction, approximation, and reconstruction.

These steps are best illustrated by an example. Consider the calculation of the exponential function exp(x). One may first attempt an evaluation using the familiar Maclaurin series expansion:

exp(x) = 1 + x + x^2/2! + x^3/3! + … + x^k/k! + … .

When x is small, computing a few terms of this series gives a reasonably good approximation to exp(x) up to, for example, IEEE double precision (which is approximately 17 significant decimal digits). However, when x is large, many more terms of the series are needed to satisfy the same accuracy requirement. Increasing the number of terms not only lengthens the calculation, but it also introduces more accumulated rounding error that may degrade the accuracy of the answer.

To solve this problem, we express x as

x = N ln(2)/2^K + r

for some integer K chosen beforehand (more about how to choose K later). If N ln(2)/2^K is made as close to x as possible, then |r| never exceeds ln(2)/2^(K+1). The mathematical identity

exp(x) = exp(N ln(2)/2^K + r) = 2^(N/2^K) exp(r)

shows that the problem is transformed into that of calculating the exp function at an argument whose magnitude is confined. The transformation from x to r is called the reduction step; the calculation of exp(r), usually performed by computing an approximating polynomial, is called the approximation step; and the composition of the final result from exp(r) and the constant related to N and K is called the reconstruction step.

In a more traditional approach [1], K is chosen to be 1, and thus the approximation step requires a polynomial with accuracy good to IEEE double precision over the range |r| ≤ ln(2)/2. This choice of K leads to reconstruction via multiplication of exp(r) by 2^N, which is easily implementable, for example, by scaling the exponent field of a floating-point number. One drawback of this approach is that when |r| is near ln(2)/2, a large number of terms of the Maclaurin series expansion is still needed.

More recently, a framework known as table-driven algorithms [8] suggested the use of K > 1. When, for example, K = 5, the argument r after the reduction step satisfies |r| ≤ ln(2)/64. As a result, a much shorter polynomial can satisfy the same accuracy requirement. The tradeoff is a more complex reconstruction step, requiring multiplication by a constant of the form

2^(N/32) = 2^M 2^(d/32),  d = 0, 1, …, 31,


where N = 32M + d. This constant can be obtained rather easily, provided all 32 possible values of the second factor are computed beforehand and stored in a table (hence the name table-driven). This framework works well for modern machines not only because tables (even large ones) can be accommodated, but also because parallelism, such as the presence of pipelined arithmetic units, allows most of the extra work in the reconstruction step to be carried out while the approximating polynomial is being evaluated. This extra work includes, for example, the calculation of d, the indexing into and fetching from the table of constants, and the multiplication to form 2^(N/32). Consequently, the performance gain due to a shorter polynomial is fully realized.

In practice, however, we do not use the Maclaurin series. Rather, we use the lowest-degree polynomial p(r) whose worst deviation |p(r) − exp(r)| within the reduced range in question is minimized and stays below an acceptable threshold. This polynomial is called the minimax polynomial. Its coefficients can be determined numerically, most commonly by the Remez algorithm [5].

Another fine point is that we may also want to retain some convenient properties of the Maclaurin series, such as the leading coefficient being exactly 1. It is possible to find minimax polynomials even subject to such constraints; some examples using the commercial computer algebra system Maple are given in reference [3].
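To make this framework concrete, here is a minimal C sketch of the K = 5 scheme just described. It is an illustration, not the library code: the polynomial coefficients are truncated Maclaurin terms rather than a true minimax fit, the 32-entry table is filled with library calls for clarity, and the reduction uses a single machine constant (the two-piece reduction actually used is described later in this paper).

```c
#include <math.h>

/* Illustrative table-driven exp with K = 5 (2^K = 32):
   exp(x) = 2^M * 2^(d/32) * exp(r), where N = 32M + d. */
static double exp_table_driven(double x)
{
    static double tbl[32];                  /* tbl[d] = 2^(d/32) */
    static int init = 0;
    if (!init) {
        for (int d = 0; d < 32; d++) tbl[d] = exp2(d / 32.0);
        init = 1;
    }

    const double ln2 = 0x1.62e42fefa39efp-1;

    /* Reduction: N is the nearest integer to x*32/ln(2), so |r| <= ln(2)/64.
       A single constant is used here for brevity; the library splits the
       constant into two pieces for accuracy (see later sections). */
    int N = (int)nearbyint(x * (32.0 / ln2));
    double r = x - N * (ln2 / 32.0);

    /* Approximation: a short polynomial suffices on this small range.
       Truncated series shown for shape; minimax in the real library. */
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24))));

    /* Reconstruction: split N = 32M + d, then scale by 2^M via the
       exponent field and fetch 2^(d/32) from the table. */
    int d = ((N % 32) + 32) % 32;
    int M = (N - d) / 32;
    return ldexp(tbl[d] * p, M);
}
```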
small tables are favored.
DESIGN PRINCIPLES ON THE IA-64 ARCHITECTURE

There are tradeoffs in designing an algorithm following the table-driven approach:

• Different argument reduction methods lead to tradeoffs between the complexity of the reduction and the reconstruction computation.

• Different table sizes lead to tradeoffs between memory requirements (size and latency characteristics) and the complexity of the polynomial computation.

Several key architectural features of IA-64 have a bearing on which choices are made:

• Short floating-point latency: On IA-64, the generic floating-point operation is a "fused multiply add" (fma) that calculates A×B + C per instruction. Not only is the latency of this floating-point operation much shorter than memory references, but this single operation comprises two basic arithmetic operations.

• Extended precision: Because our target is IEEE double precision with 53 significant bits, the native 64-bit precision on IA-64 delivers 11 extra bits of accuracy on basic arithmetic operations.

• Parallelism: Each operation is fully pipelined, and multiple floating-point units are present.

As stated, these architectural features affect our choices of design tradeoffs. We enumerate several key points:

• Argument reduction usually involves a number of serial computation steps that cannot take advantage of parallelism. In contrast, the approximation and reconstruction steps can naturally exploit parallelism. Consequently, the reduction step is often a bottleneck. We should, therefore, favor a simple reduction method even at the price of a more complex reconstruction step.

• Argument reduction usually requires the use of some constants. The short floating-point latency can make the memory latency incurred in loading such constants a significant portion of the total latency. Consequently, any novel reduction techniques that do away with memory latency are welcome.

• Long memory latency has two implications for table size. First, large tables that exceed even the lowest-level cache size should be avoided. Second, even if a table fits in cache, it still takes a number of repeated calls to a transcendental function at different arguments to bring the whole table into cache. Thus, small tables are favored.

• Extended precision and parallelism together have an important implication for the approximation step. Traditionally, the polynomial terms used in core approximations are evaluated in some well specified order so as to minimize the undesirable effect of rounding error accumulation. The availability of extended precision means that the order of evaluation of a polynomial becomes unimportant. When a polynomial can be evaluated in an arbitrary order, parallelism can be fully utilized. The consequence is that even long polynomials can be evaluated in short latency.

Roughly speaking, latency grows logarithmically in the degree of the polynomial. This permissive environment, which suffices for functions returning accurate 53-bit results, should be contrasted with what is required of functions that return accurate 64-bit results. Some functions returning accurate 64-bit results are provided in a special double-extended libm as well as in IA-32 compatibility operations [6]. In both, considerable effort was taken to minimize rounding error. Often, computations were carefully choreographed into a dominant part that was calculated exactly and a smaller part that was subject to rounding error. We frequently stored precomputed values in two pieces to maintain intermediate accuracy beyond the underlying precision of 64 bits. All these costly implementation techniques are unnecessary in our present double-precision context.
We summarize the above as four simple principles:

1. Use a simple reduction scheme, even if such a scheme only works for a subset of the argument domain, provided this subset represents the most common situations.

2. Consider novel reduction methods that avoid memory latency.

3. Use tables of moderate size.

4. Do not fear long polynomials. Instead, work hard at using parallelism to minimize latency.

In the next sections, we show these four principles in action on Merced, the first implementation of the IA-64 architecture.

SIMPLE AND FAST RANGE REDUCTION

A common reduction step involves a calculation of the form

r = x − N ρ.

This includes the forward trigonometric functions sin, cos, and tan, and the exponential function exp, where ρ is of the form π/2^K for the trigonometric functions and of the form ln(2)/2^K for the exponential function. We exploit the fact that the overwhelming majority of arguments will be in a limited range. For example, the evaluation of trigonometric functions like sin for very large arguments is known to be costly. This is because to perform a range reduction accurately by subtracting a multiple of π/2^K, we need to implicitly have a huge number of bits of π available. But for inputs of less than 2^10 in magnitude, the reduction can be performed accurately and efficiently. The overwhelming majority of cases fall within this limited range. Other, more time-consuming procedures are well known and are required when arguments exceed 2^10 in magnitude (see [3] and [6]).

The general difficulty of range reduction implementation is that ρ is not a machine number. If we compute

r = x − N P

where the machine number P approximates π/2^K, then if x is close to a root of the specific trigonometric function, the small error ε = |P − π/2^K|, scaled up by N, constitutes a large relative error in the final result. However, by using number-theoretic arguments, one can see that when reduction is really required for double-precision numbers in the specified range, the result of any of the trigonometric functions sin, cos, and tan cannot be smaller in magnitude than about 2^−60 (see [7]). The worst relative error (which occurs when the result of the trigonometric function is at its smallest, 2^−60, and N is close to 2^(10+K)) is about 2^(70+K) ε. If we store P as two double-extended precision numbers, P_1 + P_2, then we can make ε < 2^(−130−K), sufficient to make the relative error in the final result negligible.

One technique to provide an accurate reduced argument on IA-64 is to apply two successive fma operations:

r0 = x − N P_1;  r = r0 − N P_2.

The first operation introduces no rounding error because of the well known phenomenon of cancellation.
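In C, this two-piece reduction can be sketched with the standard fma function standing in for the IA-64 fma instruction. The split of π/16 into P_1 + P_2 shown here is the usual round-to-nearest split of π scaled by 2^−4; whether the library uses exactly these constants is an assumption of this illustration.

```c
#include <math.h>

/* Two-piece argument reduction r = x - N*(P_1 + P_2), as two fma ops.
   Here rho = pi/16 (the sin/cos case, K = 4): P_1 holds the leading
   bits of rho and P_2 the trailing bits. */
static double reduce_two_fma(double x, int *Nout)
{
    const double P_1 = 0x1.921fb54442d18p-3;   /* pi/16 rounded to double */
    const double P_2 = 0x1.1a62633145c07p-57;  /* pi/16 - P_1, next piece */
    const double PI  = 3.14159265358979323846;

    double N  = nearbyint(x * (16.0 / PI));
    double r0 = fma(-N, P_1, x);   /* no rounding error: high bits cancel */
    double r  = fma(-N, P_2, r0);  /* fold in the lower piece of rho      */

    *Nout = (int)N;
    return r;
}
```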


For sin and cos, we pick K to be 4, so the reconstruction has the form

sin(x) = sin(Nπ/16) cos(r) + cos(Nπ/16) sin(r)

and

cos(x) = cos(Nπ/16) cos(r) − sin(Nπ/16) sin(r).

Periodicity implies that we need only tabulate sin(Nπ/16) and cos(Nπ/16) for N = 0, 1, …, 31.

The case for the exponential function is similar. Here ln(2)/2^K (K is chosen to be 7 in this case) is approximated by two machine numbers P_1 + P_2, and the argument is reduced in a similar fashion.

NOVEL REDUCTION

Some mathematical functions f have the property that

f(u v) = g(f(u), f(v))

where g is a simple function such as the sum or product operator. For example, for the logarithm we have (for positive u and v)

ln(u v) = ln(u) + ln(v)  (g is the sum operator),

while for the cube root we have

(u v)^(1/3) = u^(1/3) v^(1/3)  (g is the product operator).

In such situations, we can perform an argument reduction very quickly using IA-64's basic floating-point reciprocal approximation (frcpa) instruction, which is primarily intended to support floating-point division. According to its definition, frcpa(a) is a floating-point number with 11 significant bits that approximates 1/a using a lookup on the top 8 bits of the (normalized) input number a. This 11-bit floating-point number approximates 1/a to about 8 significant bits of accuracy. The exact values returned are specified in the IA-64 architecture definition. By enumeration of the approximate reciprocal values, one can show that for all input values a,

frcpa(a) = (1/a)(1 − β),  |β| ≤ 2^−8.86.

We can write f(x) as

f(x) = f(x frcpa(x) / frcpa(x)) = g( f(x frcpa(x)), f(1/frcpa(x)) ).

The f(1/frcpa(x)) terms can be stored in precomputed tables, and they can be obtained by an index based on the top 8 bits of x (which uniquely identify the corresponding frcpa(x)).

Because the f's we are considering here have a natural expansion around 1, f(x frcpa(x)) is most naturally approximated by a polynomial evaluated at the argument r = x frcpa(x) − 1. Hence, a single fma constitutes our argument reduction computation, and the value frcpa(x) is obtained without any memory latency.

We apply this strategy to f(x) = ln(x):

ln(x) = ln(1/frcpa(x)) + ln(frcpa(x) x)
      = ln(1/frcpa(x)) + ln(1 + r).

The first value on the right-hand side is obtained from a table, and the second value is computed by a minimax polynomial approximating ln(1+r) on |r| ≤ 2^−8.8. The quantity 2^−8.8 is characteristic of the accuracy of the IA-64 frcpa instruction.
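The structure of this scheme can be imitated in portable C. The exact frcpa values are fixed by the architecture and not reproducible in standard C, so the hypothetical helper approx_recip below stands in for the instruction: it derives an approximate reciprocal from the exponent and top 8 fraction bits of its argument, preserving the shape of the algorithm (single-fma reduction, a 256-entry table) if not the exact rounding of the real instruction.

```c
#include <math.h>

/* Stand-in for IA-64's frcpa: an approximate reciprocal determined
   only by the exponent and top 8 fraction bits of a. Hypothetical
   emulation; the real instruction's values come from the architecture
   definition. Assumes a > 0 and finite. */
static double approx_recip(double a, int *k, int *e)
{
    double m = frexp(a, e);            /* a = m * 2^e, m in [0.5, 1) */
    *k = (int)ldexp(m, 9) - 256;       /* top 8 fraction bits of 2m  */
    double y = 1.0 + *k / 256.0;       /* 2m truncated: y in [1, 2)  */
    return ldexp(1.0 / y, 1 - *e);     /* ~1/a, a function of k, e   */
}

/* ln(x) = ln(1/c) + ln(1 + r), with c = approx_recip(x), r = x*c - 1. */
static double ln_novel_reduction(double x)
{
    int k, e;
    double c = approx_recip(x, &k, &e);
    double r = fma(x, c, -1.0);        /* single-fma reduction */

    /* Table term: 1/c = (1 + k/256) * 2^(e-1) (exact for the real
       frcpa; approximate here due to the rounded division above).
       A real implementation tabulates ln(1 + k/256) for k = 0..255. */
    double t = log1p(k / 256.0) + (e - 1) * 0x1.62e42fefa39efp-1;

    /* Short polynomial for ln(1+r) on |r| <~ 2^-8; truncated series
       shown for shape, minimax coefficients in the real library. */
    double p = r * (1.0 + r * (-0.5 + r * (1.0/3)));
    return t + p;
}
```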
five stages. Thus, two new operations can be issued every
The case for the cube root function cbrt is similar:

x^(1/3) = (1/frcpa(x))^(1/3) (frcpa(x) x)^(1/3)
        = (1/frcpa(x))^(1/3) (1 + r)^(1/3).

The first value on the right-hand side is obtained from a table, and the second value is computed by a minimax polynomial approximating (1+r)^(1/3) on |r| ≤ 2^−8.8.

MODERATE TABLE SIZES

Table 1 gives the number of double-extended table entries used by each function. The trigonometric functions sin and cos share the same table, and the functions tan and atan do not use a table at all.

Function  | Number of Double-Extended Entries
cbrt      | 256 (3072 bytes)
exp       | 24 (288 bytes)
ln        | 256 (3072 bytes)
sin, cos  | 64 (768 bytes)
tan       | none
atan      | none

Table 1: Table sizes used in the algorithms

Table 1 does not include the number of constants for argument reduction, nor does it include the number of coefficients needed for evaluating the polynomial.

OPTIMAL EVALUATION OF POLYNOMIALS

The traditional Horner's rule for evaluating a polynomial is efficient on serial machines. Nevertheless, a general degree-n polynomial requires a latency of n fma's. When more parallelism is available, it is possible to be more efficient by splitting the polynomial into parts, evaluating the parts in parallel, and then combining them. We apply this technique to the polynomial approximation steps of all the functions. The enhanced performance is crucial in the cases of tan and atan, where the polynomials involved are of degrees 15 and 22. Even for the other functions, where the polynomials vary in degree from 4 to 8, the technique contributes a noticeable gain over the straightforward Horner's method. We now describe this technique in more detail.

Merced has two floating-point execution units, so there is certainly some parallelism to be exploited. Even more important, both floating-point units are fully pipelined in five stages. Thus, two new operations can be issued every cycle, even though the results are not available for a further five cycles. This gives much of the same benefit as more parallel execution units. Therefore, as noted by the author of reference [3], one can use more sophisticated techniques for polynomial evaluation intended for highly parallel machines. For example, Estrin's method [2] breaks the evaluation down into a balanced binary tree.

We can easily place a lower bound on the latency with which a polynomial can be computed: if we start with x and the coefficients c_i, then by induction, in n serial fma operations we cannot create a polynomial of degree higher than 2^n, and we can only reach degree 2^n if the term of the highest degree is simply x^(2^n) with unity as its coefficient. For example, in one operation we can reach c_0 + c_1 x or x + x^2, but not x + c_0 x^2. Our goal is to find an actual scheduling that comes as close as possible to this lower bound.
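As an illustration of the contrast, here is Horner's rule next to an Estrin-style evaluation for a generic degree-7 polynomial; the coefficient array is a placeholder. Horner's form chains seven dependent fma's, while the Estrin form builds the pair terms and the powers x^2 and x^4 concurrently, giving a critical path of three dependent operations instead of seven, at the cost of issuing more operations overall.

```c
#include <math.h>

/* Degree-7 polynomial with coefficients c[0..7] (placeholders).
   Horner: 7 serially dependent fma operations. */
static double horner7(double x, const double c[8])
{
    double s = c[7];
    for (int i = 6; i >= 0; i--)
        s = fma(s, x, c[i]);
    return s;
}

/* Estrin-style evaluation: a balanced binary tree. The four pair
   terms and the powers x2, x4 are mutually independent, so pipelined
   fma units can work on them in parallel. */
static double estrin7(double x, const double c[8])
{
    double x2  = x * x;
    double p01 = fma(c[1], x, c[0]);   /* c0 + c1 x               */
    double p23 = fma(c[3], x, c[2]);   /* c2 + c3 x               */
    double p45 = fma(c[5], x, c[4]);   /* c4 + c5 x               */
    double p67 = fma(c[7], x, c[6]);   /* c6 + c7 x               */
    double x4  = x2 * x2;
    double q0  = fma(p23, x2, p01);    /* (c0+c1x) + x^2(c2+c3x)  */
    double q1  = fma(p67, x2, p45);    /* (c4+c5x) + x^2(c6+c7x)  */
    return fma(q1, x4, q0);            /* q0 + x^4 q1             */
}
```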
Simple heuristics based on binary chopping normally give a good evaluation strategy, but it is not always easy to visualize all the possibilities. When the polynomial can be split asymmetrically, or where certain coefficients are special, such as 1 or 0, there are often ways of doing slightly better than one might expect in overall latency, or at least in the number of instructions required to attain that latency (and hence in throughput). Besides, doing the scheduling by hand is tedious. We search automatically for the best scheduling using a program that exhaustively examines all essentially different schedulings.


One simply enters a polynomial, and the program returns the best latency and throughput attainable and lists the main ways of scheduling the operations to attain them.

Even with various intelligent pruning approaches and heuristics, the search space is large. We restrict it somewhat by considering only fma combinations of the form p_1(x) + x^k p_2(x). That is, we do not consider multiplying two polynomials with nontrivial coefficients. Effectively, we allow only solutions that work for arbitrary coefficients, without considering special factorization properties. However, for polynomials where all the coefficients are 1, these results may not be optimal because of the availability of nontrivial factorizations that we have ruled out. For example, we can calculate

1 + x + x^2 + x^3 + x^4 + x^5 + x^6

as

1 + (1 + (x + x^2)) (x + (x^2)(x^2)),

which can be scheduled in 15 cycles. However, if the restriction on fma operations is observed, then 16 cycles is the best attainable.

The optimization program works in two stages. First, all possible evaluation orders using these restricted fma operations are computed. These evaluation orders ignore scheduling, being just "abstract syntax" tree structures indicating the dependencies of subexpressions, with interior nodes representing fma operations of the form p_1(x) + x^k p_2(x). Figure 1 shows such a dependency tree for c_0 + c_1 x + c_2 x^2 + c_3 x^3, decomposed as (c_0 + c_1 x) + x^2 (c_2 + c_3 x).

Figure 1: A dependency tree

However, because of the enormous explosion in the possibilities, we limit the search to the smallest possible tree depth. This tree depth corresponds to the minimum number of serial operations that can possibly be used to evaluate the expression in the order denoted by that particular tree. Consequently, if the tree depth is d, then we cannot possibly do better than 5d cycles for that particular tree. Now, assuming that we can in fact do at least as well as 5d + 4, we are justified in ignoring trees of depth greater than or equal to d + 1, which could not possibly be scheduled in as few cycles. This turns out to be the case for all our examples.

The next stage is to take each tree (in some of the examples below there are as many as 10000 of them) and calculate the optimal scheduling. The optimal scheduling is computed backwards by a fairly naive greedy algorithm, but with a few simple refinements based on stratifying the nodes from the top as well as from the bottom.

Table 2 gives the evaluation strategy found by the program for the polynomial

x + c_2 x^2 + c_3 x^3 + … + c_9 x^9.

It shows that the polynomial can be scheduled in 20 cycles, so we have attained the lower bound. However, if the first term were c_1 x, we would need 21 cycles.

Cycle | FMA Unit 1          | FMA Unit 2
0     | v1 = c2 + x c3      | v2 = x x
3     | v3 = c6 + x c7      | v4 = c8 + x c9
4     | v5 = c4 + x c5      |
5     | v6 = x + v2 v1      | v7 = v2 v2
9     | v8 = v3 + v2 v4     |
10    | v9 = v6 + v7 v5     | v10 = v2 v7
15    | v11 = v9 + v10 v8   |

Table 2: An optimal scheduling
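In straight-line C, the schedule of Table 2 corresponds to the following sequence. The comments give the cycle at which each operation could issue on two five-stage pipelined fma units, as described above.

```c
#include <math.h>

/* x + c2 x^2 + ... + c9 x^9 following the Table 2 schedule.
   Issue cycles assume two fully pipelined 5-cycle fma units. */
static double poly9_scheduled(double x, const double c[10])
{
    double v1  = fma(x, c[3], c[2]);   /* cycle 0, unit 1        */
    double v2  = x * x;                /* cycle 0, unit 2        */
    double v3  = fma(x, c[7], c[6]);   /* cycle 3                */
    double v4  = fma(x, c[9], c[8]);   /* cycle 3                */
    double v5  = fma(x, c[5], c[4]);   /* cycle 4                */
    double v6  = fma(v2, v1, x);       /* cycle 5: x + v2*v1     */
    double v7  = v2 * v2;              /* cycle 5                */
    double v8  = fma(v2, v4, v3);      /* cycle 9                */
    double v9  = fma(v7, v5, v6);      /* cycle 10               */
    double v10 = v2 * v7;              /* cycle 10               */
    return fma(v10, v8, v9);           /* cycle 15, done at 20   */
}
```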
OUTLINE OF ALGORITHMS

We outline each of the seven algorithms discussed here. We concentrate only on the numeric cases and ignore situations such as when the input is outside the function's domain or non-numeric (NaN, for example).

Cbrt

1. Reduction: Given x, compute r = x frcpa(x) − 1.

2. Approximation: Compute a polynomial p(r) of the form p(r) = p_1 r + p_2 r^2 + … + p_6 r^6 that approximates (1+r)^(1/3) − 1.

3. Reconstruction: Compute the result T + T p(r), where T is (1/frcpa(x))^(1/3). This value T is obtained via a tabulation of (1/frcpa(y))^(1/3), where y = 1 + k/256 and k ranges from 0 to 255, and a tabulation of 2^(−j/3), where j ranges from 0 to 2.
particular tree. Consequently, if the tree depth is d then
we cannot possibly do better than 5d cycles for that 1. Reduction: Given x, compute N, the closest integer to
particular tree. Now, assuming that we can in fact do at the value x (128/ln(2)). Then compute r = (x−N
least as well as 5d + 4, we are justified in ignoring trees of P1 )−N P2 . Here P1 +P2 approximates ln(2)/128 (see
a depth greater than or equal to d + 1, which could not previous discussions).
possibly be scheduled in as few cycles. This turns out to
be the case for all our examples.


2. Approximation: Compute a polynomial p(r) of the form p(r) = r + p_1 r^2 + … + p_4 r^5 that approximates exp(r) − 1.

3. Reconstruction: Compute the result T + T p(r), where T is 2^(N/128). This value T is obtained as follows. First, N is expressed as N = 128 M + 16 K + J, where J ranges from 0 to 15 and K ranges from 0 to 7. Clearly 2^(N/128) = 2^M 2^(K/8) 2^(J/128). The first of the three factors can be obtained by scaling the exponent; the remaining two factors are fetched from tables with 8 entries and 16 entries, respectively.

Ln

1. Reduction: Given x, compute r = x frcpa(x) − 1.

2. Approximation: Compute a polynomial p(r) of the form p(r) = p_1 r^2 + … + p_5 r^6 that approximates ln(1+r) − r.

3. Reconstruction: Compute the result T + r + p(r), where T is ln(1/frcpa(x)). This value T is obtained via a tabulation of ln(1/frcpa(y)), where y = 1 + k/256 and k ranges from 0 to 255, and a calculation of the form N ln(2).

Sin and Cos

We first consider the case of sin(x).

1. Reduction: Given x, compute N, the closest integer to the value x (16/π). Then compute r = (x − N P_1) − N P_2. Here P_1 + P_2 approximates π/16 (see the previous discussion).

2. Approximation: Compute two polynomials: p(r) of the form r + p_1 r^3 + … + p_4 r^9 that approximates sin(r), and q(r) of the form q_1 r^2 + q_2 r^4 + … + q_4 r^8 that approximates cos(r) − 1.

3. Reconstruction: Return the result as C p(r) + (S + S q(r)), where C is cos(Nπ/16) and S is sin(Nπ/16), obtained from a table.

The case of cos(x) is almost identical: add 8 to N just after it is first obtained. This works because of the identity cos(x) = sin(x + π/2).
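Putting the pieces together, a sketch of the sin path might look like this in C, building on the two-fma reduction shown earlier. The polynomials and the table are illustrative stand-ins: the series coefficients are truncated Taylor terms rather than the library's minimax fits, and the 32 table values are produced by libm calls.

```c
#include <math.h>

/* Sketch of the sin algorithm: two-fma reduction by pi/16, parallel
   polynomials for sin(r) and cos(r)-1, table-based reconstruction.
   Illustrative only; see the hedges in the text above. */
static double sin_outline(double x)
{
    const double P_1 = 0x1.921fb54442d18p-3;   /* pi/16, leading bits  */
    const double P_2 = 0x1.1a62633145c07p-57;  /* pi/16, trailing bits */
    const double PI  = 3.14159265358979323846;

    double N  = nearbyint(x * (16.0 / PI));
    double r0 = fma(-N, P_1, x);
    double r  = fma(-N, P_2, r0);

    int n = (int)N & 31;                /* table index mod 32        */
    double S = sin(n * PI / 16);        /* tabulated in practice     */
    double C = cos(n * PI / 16);

    double r2 = r * r;
    double p = r * (1.0 + r2 * (-1.0/6 + r2 * (1.0/120)));  /* ~sin(r)   */
    double q = r2 * (-0.5 + r2 * (1.0/24));                 /* ~cos(r)-1 */

    return fma(C, p, fma(S, q, S));     /* C p(r) + (S + S q(r))     */
}
```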

Tan

1. Reduction: Given x, compute N, the closest integer to the value x (2/π). Then compute r = (x − N P_1) − N P_2. Here P_1 + P_2 approximates π/2 (see the previous discussion).

2. Approximation: When N is even, compute a polynomial p(r) = r + r t (p_0 + p_1 t + … + p_15 t^15) that approximates tan(r). When N is odd, compute a polynomial q(r) = (−r)^(−1) + r (q_0 + q_1 t + … + q_10 t^10) that approximates −cot(r). The term t is r^2. We emphasize the fact that parallelism is fully utilized.

3. Reconstruction: If N is even, return p. If N is odd, return q.

Atan

1. Reduction: No reduction is needed.

2. Approximation: If |x| is less than 1, compute a polynomial p(x) = x + x^3 (p_0 + p_1 y + … + p_22 y^22) that approximates atan(x), where y is x^2. If |x| > 1, compute several quantities, fully utilizing parallelism. First, compute q(x) = q_0 + q_1 y + … + q_22 y^22, with y = x^2, that approximates x^45 atan(1/x). Second, compute c^45, where c = frcpa(x). Third, compute another polynomial r(β) = 1 + r_1 β + … + r_10 β^10, where β is the quantity x frcpa(x) − 1 and r(β) approximates the value (1−β)^(−45).

3. Reconstruction: If |x| is less than 1, return p(x). Otherwise, return sign(x) π/2 − c^45 r(β) q(x). This works because c = (1/x)(1−β), so the product c^45 r(β) approximates 1/x^45, and we use the identity atan(x) = sign(x) π/2 − atan(1/x).

SPEED AND ACCURACY

These new double-precision elementary functions are designed to be both fast and accurate. We present the speed of the functions in terms of latency for arguments that fall through the implementation along the path deemed most likely. As far as accuracy is concerned, we report the largest error observed after extensive testing, in terms of units in the last place (ulps). This error measure is standard in this field. Let f be the mathematical function to be implemented and F the actual implementation in double precision. When 2^L < |f(x)| ≤ 2^(L+1), the error in ulps is defined as

|f(x) − F(x)| / 2^(L−52).

Note that the smallest worst-case error that one can possibly attain is 0.5 ulps. Table 3 tabulates the latency and maximum error observed.

Function | Latency (cycles) | Max. Error (ulps)
cbrt     | 60               | 0.51
exp      | 60               | 0.51
ln       | 52               | 0.53
sin      | 70               | 0.51
cos      | 70               | 0.51
tan      | 72               | 0.51
atan     | 66               | 0.51

Table 3: Speed and accuracy of functions


CONCLUSIONS

We have shown how certain key features of the IA-64 architecture can be exploited to design transcendental functions featuring an excellent combination of speed and accuracy. All of these functions performed over twice as fast as ones based on a simple conversion of a library tailored for double-extended precision. In one instance, the ln function described here contributed a two-point increase in a SpecFP benchmark run under simulation.

The features of the IA-64 architecture that are exploited include parallelism and the fused multiply add, as well as less obvious features such as the reciprocal approximation instruction. When abundant resources for parallelism are available, it is not always easy to visualize how to take full advantage of them. We have searched for optimal instruction schedules. Although our search method is sufficient to handle the situations we have faced so far, more sophisticated techniques are needed for more complex situations. First, polynomials of a higher degree may be needed in more advanced algorithms. Second, more general expressions that can be considered as multivariate polynomials are also anticipated. Finally, our current method does not handle the full generality of microarchitectural constraints, which will also vary in future implementations on the IA-64 roadmap. We believe this optimal scheduling problem to be important not only because it yields high-performance implementations, but also because it may offer a quantitative analysis of the balance of microarchitectural parameters. Currently we are considering an integer programming framework to tackle this problem. We welcome other suggestions as well.

REFERENCES

[1] Cody Jr., William J. and Waite, William, Software Manual for the Elementary Functions, Prentice Hall, 1980.

[2] Knuth, D.E., The Art of Computer Programming, vol. 2: Seminumerical Algorithms, Addison-Wesley, 1969.

[3] Muller, J.M., Elementary Functions: Algorithms and Implementation, Birkhäuser, 1997.

[4] Payne, M., "An Argument Reduction Scheme on the DEC VAX," Signum Newsletter, January 1983.

[5] Powell, M.J.D., Approximation Theory and Methods, Cambridge University Press, 1981.

[6] Story, S. and Tang, P.T.P., "New algorithms for improved transcendental functions on IA-64," in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, IEEE Computer Society Press, 1999.

[7] Smith, Roger A., "A Continued-Fraction Analysis of Trigonometric Argument Reduction," IEEE Transactions on Computers, Vol. 44, No. 11, pp. 1348-1351, November 1995.

[8] Tang, P.T.P., "Table-driven implementation of the exponential function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 15, pp. 144-157, 1989.

AUTHORS' BIOGRAPHIES

John Harrison has been with Intel for just over one year. He obtained his Ph.D. degree from Cambridge University in England and is a specialist in formal validation and theorem proving. His e-mail is [email protected].

Ted Kubaska is a senior software engineer with Intel Corporation in Hillsboro, Oregon. He has an M.S. degree in physics from the University of Maine at Orono and an M.S. degree in computer science from the Oregon Graduate Institute. He works in the MSL Numerics Group, where he implements and tests floating-point algorithms. His e-mail is [email protected].

Shane Story has worked on numerical and floating-point related issues since he began working for Intel eight years ago. His e-mail is [email protected].

Ping Tak Peter Tang (his friends call him Peter) joined Intel very recently as an applied mathematician working in the Computational Software Lab of MSL. Peter received his Ph.D. degree in mathematics from the University of California at Berkeley. His interest is in floating-point issues as well as fast and accurate numerical computation methods. Peter has consulted for Intel in the past on such issues as the design of the transcendental algorithms on the Pentium processor, and he contributed a software solution to the Pentium division problem. His e-mail is [email protected].
