
The Computation of Transcendental Functions on the IA-64 Architecture

John Harrison, Microprocessor Software Labs, Intel Corporation


Ted Kubaska, Microprocessor Software Labs, Intel Corporation
Shane Story, Microprocessor Software Labs, Intel Corporation
Ping Tak Peter Tang, Microprocessor Software Labs, Intel Corporation

Index words: floating point, mathematical software, transcendental functions


ABSTRACT

The fast and accurate evaluation of transcendental functions (e.g. exp, log, sin, and atan) is vitally important in many fields of scientific computing. Intel provides a software library of these functions that can be called from both the C* and FORTRAN* programming languages. By exploiting some of the key features of the IA-64 floating-point architecture, we have been able to provide double-precision transcendental functions that are highly accurate yet can typically be evaluated in between 50 and 70 clock cycles. In this paper, we discuss some of the design principles and implementation details of these functions.

* All other brands and names are the property of their respective owners.

INTRODUCTION

Transcendental functions can be computed in software by a variety of algorithms. The algorithms that are most suitable for implementation on modern computer architectures usually comprise three steps: reduction, approximation, and reconstruction.

These steps are best illustrated by an example. Consider the calculation of the exponential function exp(x). One may first attempt an evaluation using the familiar Maclaurin series expansion:

exp(x) = 1 + x + x^2/2! + x^3/3! + … + x^k/k! + … .

When x is small, computing a few terms of this series gives a reasonably good approximation to exp(x) up to, for example, IEEE double precision (which is approximately 17 significant decimal digits). However, when x is large, many more terms of the series are needed to satisfy the same accuracy requirement. Increasing the number of terms not only lengthens the calculation, but it also introduces more accumulated rounding error that may degrade the accuracy of the answer.

To solve this problem, we express x as

x = N ln(2)/2^K + r

for some integer K chosen beforehand (more about how to choose K later). If N ln(2)/2^K is made as close to x as possible, then |r| never exceeds ln(2)/2^(K+1). The mathematical identity

exp(x) = exp(N ln(2)/2^K + r) = 2^(N/2^K) exp(r)

shows that the problem is transformed into that of calculating the exp function at an argument whose magnitude is confined. The transformation from x to r is called the reduction step; the calculation of exp(r), usually performed by computing an approximating polynomial, is called the approximation step; and the composition of the final result from exp(r) and the constant related to N and K is called the reconstruction step.

In a more traditional approach [1], K is chosen to be 1, and thus the approximation step requires a polynomial with accuracy good to IEEE double precision over the range |r| ≤ ln(2)/2. This choice of K leads to reconstruction via multiplication of exp(r) by 2^N, which is easily implementable, for example, by scaling the exponent field of a floating-point number. One drawback of this approach is that when |r| is near ln(2)/2, a large number of terms of the Maclaurin series expansion is still needed.

More recently, a framework known as table-driven algorithms [8] suggested the use of K > 1. When, for example, K = 5, the argument r after the reduction step satisfies |r| ≤ ln(2)/64. As a result, a much shorter polynomial can satisfy the same accuracy requirement. The tradeoff is a more complex reconstruction step, requiring multiplication by a constant of the form

2^(N/32) = 2^M 2^(d/32),  d = 0, 1, …, 31,


where N = 32M + d. This constant can be obtained rather easily, provided all 32 possible values of the second factor are computed beforehand and stored in a table (hence the name table-driven). This framework works well for modern machines not only because tables (even large ones) can be accommodated, but also because parallelism, such as the presence of pipelined arithmetic units, allows most of the extra work in the reconstruction step to be carried out while the approximating polynomial is being evaluated. This extra work includes, for example, the calculation of d, the indexing into and fetching from the table of constants, and the multiplication to form 2^(N/32). Consequently, the performance gain due to a shorter polynomial is fully realized.

In practice, however, we do not use the Maclaurin series. Rather, we use the lowest-degree polynomial p(r) whose worst deviation |p(r) − exp(r)| within the reduced range in question is minimized and stays below an acceptable threshold. This polynomial is called the minimax polynomial. Its coefficients can be determined numerically, most commonly by the Remez algorithm [5].

Another fine point is that we may also want to retain some convenient properties of the Maclaurin series, such as the leading coefficient being exactly 1. It is possible to find minimax polynomials even subject to such constraints; some examples using the commercial computer algebra system Maple are given in reference [3].
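To make this framework concrete, here is a minimal C sketch of the K = 5 scheme just described. It is an illustration, not the library code: the polynomial coefficients are truncated Maclaurin terms rather than a true minimax fit, the 32-entry table is filled with library calls for clarity, and the reduction uses a single machine constant (the two-piece reduction actually used is described later in this paper).

```c
#include <math.h>

/* Illustrative table-driven exp with K = 5 (2^K = 32):
   exp(x) = 2^M * 2^(d/32) * exp(r), where N = 32M + d. */
static double exp_table_driven(double x)
{
    static double tbl[32];                  /* tbl[d] = 2^(d/32) */
    static int init = 0;
    if (!init) {
        for (int d = 0; d < 32; d++) tbl[d] = exp2(d / 32.0);
        init = 1;
    }

    const double ln2 = 0x1.62e42fefa39efp-1;

    /* Reduction: N is the nearest integer to x*32/ln(2), so |r| <= ln(2)/64.
       A single constant is used here for brevity; the library splits the
       constant into two pieces for accuracy (see later sections). */
    int N = (int)nearbyint(x * (32.0 / ln2));
    double r = x - N * (ln2 / 32.0);

    /* Approximation: a short polynomial suffices on this small range.
       Truncated series shown for shape; minimax in the real library. */
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24))));

    /* Reconstruction: split N = 32M + d, then scale by 2^M via the
       exponent field and fetch 2^(d/32) from the table. */
    int d = ((N % 32) + 32) % 32;
    int M = (N - d) / 32;
    return ldexp(tbl[d] * p, M);
}
```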
small tables are favored.
DESIGN PRINCIPLES ON THE IA-64 ARCHITECTURE

There are tradeoffs in designing an algorithm following the table-driven approach:

• Different argument reduction methods lead to tradeoffs between the complexity of the reduction and the reconstruction computation.

• Different table sizes lead to tradeoffs between memory requirements (size and latency characteristics) and the complexity of the polynomial computation.

Several key architectural features of IA-64 have a bearing on which choices are made:

• Short floating-point latency: On IA-64, the generic floating-point operation is a "fused multiply add" (fma) that calculates A×B + C per instruction. Not only is the latency of this floating-point operation much shorter than memory references, but this single operation comprises two basic arithmetic operations.

• Extended precision: Because our target is IEEE double precision with 53 significant bits, the native 64-bit precision on IA-64 delivers 11 extra bits of accuracy on basic arithmetic operations.

• Parallelism: Each operation is fully pipelined, and multiple floating-point units are present.

As stated, these architectural features affect our choices of design tradeoffs. We enumerate several key points:

• Argument reduction usually involves a number of serial computation steps that cannot take advantage of parallelism. In contrast, the approximation and reconstruction steps can naturally exploit parallelism. Consequently, the reduction step is often a bottleneck. We should, therefore, favor a simple reduction method even at the price of a more complex reconstruction step.

• Argument reduction usually requires the use of some constants. The short floating-point latency can make the memory latency incurred in loading such constants a significant portion of the total latency. Consequently, any novel reduction techniques that do away with memory latency are welcome.

• Long memory latency has two implications for table size. First, large tables that exceed even the lowest-level cache size should be avoided. Second, even if a table fits in cache, it still takes a number of repeated calls to a transcendental function at different arguments to bring the whole table into cache. Thus, small tables are favored.

• Extended precision and parallelism together have an important implication for the approximation step. Traditionally, the polynomial terms used in core approximations are evaluated in some well specified order so as to minimize the undesirable effect of rounding error accumulation. The availability of extended precision means that the order of evaluation of a polynomial becomes unimportant. When a polynomial can be evaluated in an arbitrary order, parallelism can be fully utilized. The consequence is that even long polynomials can be evaluated in short latency.

Roughly speaking, latency grows logarithmically in the degree of the polynomial. This permissive environment, which suffices for functions returning accurate 53-bit results, should be contrasted with what is required of functions that return accurate 64-bit results. Some functions returning accurate 64-bit results are provided in a special double-extended libm as well as in IA-32 compatibility operations [6]. In both, considerable effort was taken to minimize rounding error. Often, computations were carefully choreographed into a dominant part that was calculated exactly and a smaller part that was subject to rounding error. We frequently stored precomputed values in two pieces to maintain intermediate accuracy beyond the underlying precision of 64 bits. All these costly implementation techniques are unnecessary in our present double-precision context.
We summarize the above as four simple principles:

1. Use a simple reduction scheme, even if such a scheme only works for a subset of the argument domain, provided this subset represents the most common situations.

2. Consider novel reduction methods that avoid memory latency.

3. Use tables of moderate size.

4. Do not fear long polynomials. Instead, work hard at using parallelism to minimize latency.

In the next sections, we show these four principles in action on Merced, the first implementation of the IA-64 architecture.

SIMPLE AND FAST RANGE REDUCTION

A common reduction step involves a calculation of the form

r = x − N ρ.

This includes the forward trigonometric functions sin, cos, and tan, and the exponential function exp, where ρ is of the form π/2^K for the trigonometric functions and of the form ln(2)/2^K for the exponential function. We exploit the fact that the overwhelming majority of arguments will be in a limited range. For example, the evaluation of trigonometric functions like sin for very large arguments is known to be costly. This is because to perform a range reduction accurately by subtracting a multiple of π/2^K, we need to implicitly have a huge number of bits of π available. But for inputs of less than 2^10 in magnitude, the reduction can be performed accurately and efficiently. The overwhelming majority of cases fall within this limited range. Other, more time-consuming procedures are well known and are required when arguments exceed 2^10 in magnitude (see [3] and [6]).

The general difficulty of range reduction implementation is that ρ is not a machine number. If we compute

r = x − N P

where the machine number P approximates π/2^K, then if x is close to a root of the specific trigonometric function, the small error ε = |P − π/2^K|, scaled up by N, constitutes a large relative error in the final result. However, by using number-theoretic arguments, one can see that when reduction is really required for double-precision numbers in the specified range, the result of any of the trigonometric functions sin, cos, and tan cannot be smaller in magnitude than about 2^−60 (see [7]). The worst relative error (which occurs when the result of the trigonometric function is at its smallest, 2^−60, and N is close to 2^(10+K)) is about 2^(70+K) ε. If we store P as two double-extended precision numbers, P_1 + P_2, then we can make ε < 2^(−130−K), sufficient to make the relative error in the final result negligible.

One technique to provide an accurate reduced argument on IA-64 is to apply two successive fma operations:

r0 = x − N P_1;  r = r0 − N P_2.

The first operation introduces no rounding error because of the well known phenomenon of cancellation.
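In C, this two-piece reduction can be sketched with the standard fma function standing in for the IA-64 fma instruction. The split of π/16 into P_1 + P_2 shown here is the usual round-to-nearest split of π scaled by 2^−4; whether the library uses exactly these constants is an assumption of this illustration.

```c
#include <math.h>

/* Two-piece argument reduction r = x - N*(P_1 + P_2), as two fma ops.
   Here rho = pi/16 (the sin/cos case, K = 4): P_1 holds the leading
   bits of rho and P_2 the trailing bits. */
static double reduce_two_fma(double x, int *Nout)
{
    const double P_1 = 0x1.921fb54442d18p-3;   /* pi/16 rounded to double */
    const double P_2 = 0x1.1a62633145c07p-57;  /* pi/16 - P_1, next piece */
    const double PI  = 3.14159265358979323846;

    double N  = nearbyint(x * (16.0 / PI));
    double r0 = fma(-N, P_1, x);   /* no rounding error: high bits cancel */
    double r  = fma(-N, P_2, r0);  /* fold in the lower piece of rho      */

    *Nout = (int)N;
    return r;
}
```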


For sin and cos, we pick K to be 4, so the reconstruction has the form

sin(x) = sin(Nπ/16) cos(r) + cos(Nπ/16) sin(r)

and

cos(x) = cos(Nπ/16) cos(r) − sin(Nπ/16) sin(r).

Periodicity implies that we need only tabulate sin(Nπ/16) and cos(Nπ/16) for N = 0, 1, …, 31.

The case for the exponential function is similar. Here ln(2)/2^K (K is chosen to be 7 in this case) is approximated by two machine numbers P_1 + P_2, and the argument is reduced in a similar fashion.

NOVEL REDUCTION

Some mathematical functions f have the property that

f(u v) = g(f(u), f(v))

where g is a simple function such as the sum or product operator. For example, for the logarithm we have (for positive u and v)

ln(u v) = ln(u) + ln(v)  (g is the sum operator),

while for the cube root we have

(u v)^(1/3) = u^(1/3) v^(1/3)  (g is the product operator).

In such situations, we can perform an argument reduction very quickly using IA-64's basic floating-point reciprocal approximation (frcpa) instruction, which is primarily intended to support floating-point division. According to its definition, frcpa(a) is a floating-point number with 11 significant bits that approximates 1/a using a lookup on the top 8 bits of the (normalized) input number a. This 11-bit floating-point number approximates 1/a to about 8 significant bits of accuracy. The exact values returned are specified in the IA-64 architecture definition. By enumeration of the approximate reciprocal values, one can show that for all input values a,

frcpa(a) = (1/a)(1 − β),  |β| ≤ 2^−8.86.

We can write f(x) as

f(x) = f(x frcpa(x) / frcpa(x)) = g( f(x frcpa(x)), f(1/frcpa(x)) ).

The f(1/frcpa(x)) terms can be stored in precomputed tables, and they can be obtained by an index based on the top 8 bits of x (which uniquely identify the corresponding frcpa(x)).

Because the f's we are considering here have a natural expansion around 1, f(x frcpa(x)) is most naturally approximated by a polynomial evaluated at the argument r = x frcpa(x) − 1. Hence, a single fma constitutes our argument reduction computation, and the value frcpa(x) is obtained without any memory latency.

We apply this strategy to f(x) = ln(x):

ln(x) = ln(1/frcpa(x)) + ln(frcpa(x) x)
      = ln(1/frcpa(x)) + ln(1 + r).

The first value on the right-hand side is obtained from a table, and the second value is computed by a minimax polynomial approximating ln(1+r) on |r| ≤ 2^−8.8. The quantity 2^−8.8 is characteristic of the accuracy of the IA-64 frcpa instruction.
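The structure of this scheme can be imitated in portable C. The exact frcpa values are fixed by the architecture and not reproducible in standard C, so the hypothetical helper approx_recip below stands in for the instruction: it derives an approximate reciprocal from the exponent and top 8 fraction bits of its argument, preserving the shape of the algorithm (single-fma reduction, a 256-entry table) if not the exact rounding of the real instruction.

```c
#include <math.h>

/* Stand-in for IA-64's frcpa: an approximate reciprocal determined
   only by the exponent and top 8 fraction bits of a. Hypothetical
   emulation; the real instruction's values come from the architecture
   definition. Assumes a > 0 and finite. */
static double approx_recip(double a, int *k, int *e)
{
    double m = frexp(a, e);            /* a = m * 2^e, m in [0.5, 1) */
    *k = (int)ldexp(m, 9) - 256;       /* top 8 fraction bits of 2m  */
    double y = 1.0 + *k / 256.0;       /* 2m truncated: y in [1, 2)  */
    return ldexp(1.0 / y, 1 - *e);     /* ~1/a, a function of k, e   */
}

/* ln(x) = ln(1/c) + ln(1 + r), with c = approx_recip(x), r = x*c - 1. */
static double ln_novel_reduction(double x)
{
    int k, e;
    double c = approx_recip(x, &k, &e);
    double r = fma(x, c, -1.0);        /* single-fma reduction */

    /* Table term: 1/c = (1 + k/256) * 2^(e-1) (exact for the real
       frcpa; approximate here due to the rounded division above).
       A real implementation tabulates ln(1 + k/256) for k = 0..255. */
    double t = log1p(k / 256.0) + (e - 1) * 0x1.62e42fefa39efp-1;

    /* Short polynomial for ln(1+r) on |r| <~ 2^-8; truncated series
       shown for shape, minimax coefficients in the real library. */
    double p = r * (1.0 + r * (-0.5 + r * (1.0/3)));
    return t + p;
}
```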
five stages. Thus, two new operations can be issued every
The case for the cube root function cbrt is similar:

x^(1/3) = (1/frcpa(x))^(1/3) (frcpa(x) x)^(1/3)
        = (1/frcpa(x))^(1/3) (1 + r)^(1/3).

The first value on the right-hand side is obtained from a table, and the second value is computed by a minimax polynomial approximating (1+r)^(1/3) on |r| ≤ 2^−8.8.

MODERATE TABLE SIZES

Table 1 gives the number of double-extended table entries used by each function. The trigonometric functions sin and cos share the same table, and the functions tan and atan do not use a table at all.

Function  | Number of Double-Extended Entries
cbrt      | 256 (3072 bytes)
exp       | 24 (288 bytes)
ln        | 256 (3072 bytes)
sin, cos  | 64 (768 bytes)
tan       | none
atan      | none

Table 1: Table sizes used in the algorithms

Table 1 does not include the number of constants for argument reduction, nor does it include the number of coefficients needed for evaluating the polynomial.

OPTIMAL EVALUATION OF POLYNOMIALS

The traditional Horner's rule for evaluating a polynomial is efficient on serial machines. Nevertheless, a general degree-n polynomial requires a latency of n fma's. When more parallelism is available, it is possible to be more efficient by splitting the polynomial into parts, evaluating the parts in parallel, and then combining them. We apply this technique to the polynomial approximation steps of all the functions. The enhanced performance is crucial in the cases of tan and atan, where the polynomials involved are of degrees 15 and 22. Even for the other functions, where the polynomials vary in degree from 4 to 8, the technique contributes a noticeable gain over the straightforward Horner's method. We now describe this technique in more detail.

Merced has two floating-point execution units, so there is certainly some parallelism to be exploited. Even more important, both floating-point units are fully pipelined in five stages. Thus, two new operations can be issued every cycle, even though the results are not available for a further five cycles. This gives much of the same benefit as more parallel execution units. Therefore, as noted by the author of reference [3], one can use more sophisticated techniques for polynomial evaluation intended for highly parallel machines. For example, Estrin's method [2] breaks the evaluation down into a balanced binary tree.

We can easily place a lower bound on the latency with which a polynomial can be computed: if we start with x and the coefficients c_i, then by induction, in n serial fma operations we cannot create a polynomial of degree higher than 2^n, and we can only reach degree 2^n if the term of the highest degree is simply x^(2^n) with unity as its coefficient. For example, in one operation we can reach c_0 + c_1 x or x + x^2, but not x + c_0 x^2. Our goal is to find an actual scheduling that comes as close as possible to this lower bound.
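As an illustration of the contrast, here is Horner's rule next to an Estrin-style evaluation for a generic degree-7 polynomial; the coefficient array is a placeholder. Horner's form chains seven dependent fma's, while the Estrin form builds the pair terms and the powers x^2 and x^4 concurrently, giving a critical path of three dependent operations instead of seven, at the cost of issuing more operations overall.

```c
#include <math.h>

/* Degree-7 polynomial with coefficients c[0..7] (placeholders).
   Horner: 7 serially dependent fma operations. */
static double horner7(double x, const double c[8])
{
    double s = c[7];
    for (int i = 6; i >= 0; i--)
        s = fma(s, x, c[i]);
    return s;
}

/* Estrin-style evaluation: a balanced binary tree. The four pair
   terms and the powers x2, x4 are mutually independent, so pipelined
   fma units can work on them in parallel. */
static double estrin7(double x, const double c[8])
{
    double x2  = x * x;
    double p01 = fma(c[1], x, c[0]);   /* c0 + c1 x               */
    double p23 = fma(c[3], x, c[2]);   /* c2 + c3 x               */
    double p45 = fma(c[5], x, c[4]);   /* c4 + c5 x               */
    double p67 = fma(c[7], x, c[6]);   /* c6 + c7 x               */
    double x4  = x2 * x2;
    double q0  = fma(p23, x2, p01);    /* (c0+c1x) + x^2(c2+c3x)  */
    double q1  = fma(p67, x2, p45);    /* (c4+c5x) + x^2(c6+c7x)  */
    return fma(q1, x4, q0);            /* q0 + x^4 q1             */
}
```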
Simple heuristics based on binary chopping normally give a good evaluation strategy, but it is not always easy to visualize all the possibilities. When the polynomial can be split asymmetrically, or where certain coefficients are special, such as 1 or 0, there are often ways of doing slightly better than one might expect in overall latency, or at least in the number of instructions required to attain that latency (and hence in throughput). Besides, doing the scheduling by hand is tedious. We search automatically for the best scheduling using a program that exhaustively examines all essentially different schedulings.


One simply enters a polynomial, and the program returns the best latency and throughput attainable and lists the main ways of scheduling the operations to attain them.

Even with various intelligent pruning approaches and heuristics, the search space is large. We restrict it somewhat by considering only fma combinations of the form p_1(x) + x^k p_2(x). That is, we do not consider multiplying two polynomials with nontrivial coefficients. Effectively, we allow only solutions that work for arbitrary coefficients, without considering special factorization properties. However, for polynomials where all the coefficients are 1, these results may not be optimal because of the availability of nontrivial factorizations that we have ruled out. For example, we can calculate

1 + x + x^2 + x^3 + x^4 + x^5 + x^6

as

1 + (1 + (x + x^2)) (x + (x^2)(x^2)),

which can be scheduled in 15 cycles. However, if the restriction on fma operations is observed, then 16 cycles is the best attainable.

The optimization program works in two stages. First, all possible evaluation orders using these restricted fma operations are computed. These evaluation orders ignore scheduling, being just "abstract syntax" tree structures indicating the dependencies of subexpressions, with interior nodes representing fma operations of the form p_1(x) + x^k p_2(x). Figure 1 shows such a dependency tree for c_0 + c_1 x + c_2 x^2 + c_3 x^3, decomposed as (c_0 + c_1 x) + x^2 (c_2 + c_3 x).

Figure 1: A dependency tree

However, because of the enormous explosion in the possibilities, we limit the search to the smallest possible tree depth. This tree depth corresponds to the minimum number of serial operations that can possibly be used to evaluate the expression in the order denoted by that particular tree. Consequently, if the tree depth is d, then we cannot possibly do better than 5d cycles for that particular tree. Now, assuming that we can in fact do at least as well as 5d + 4, we are justified in ignoring trees of depth greater than or equal to d + 1, which could not possibly be scheduled in as few cycles. This turns out to be the case for all our examples.

The next stage is to take each tree (in some of the examples below there are as many as 10000 of them) and calculate the optimal scheduling. The optimal scheduling is computed backwards by a fairly naive greedy algorithm, but with a few simple refinements based on stratifying the nodes from the top as well as from the bottom.

Table 2 gives the evaluation strategy found by the program for the polynomial

x + c_2 x^2 + c_3 x^3 + … + c_9 x^9.

It shows that the polynomial can be scheduled in 20 cycles, so we have attained the lower bound. However, if the first term were c_1 x, we would need 21 cycles.

Cycle | FMA Unit 1          | FMA Unit 2
0     | v1 = c2 + x c3      | v2 = x x
3     | v3 = c6 + x c7      | v4 = c8 + x c9
4     | v5 = c4 + x c5      |
5     | v6 = x + v2 v1      | v7 = v2 v2
9     | v8 = v3 + v2 v4     |
10    | v9 = v6 + v7 v5     | v10 = v2 v7
15    | v11 = v9 + v10 v8   |

Table 2: An optimal scheduling
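In straight-line C, the schedule of Table 2 corresponds to the following sequence. The comments give the cycle at which each operation could issue on two five-stage pipelined fma units, as described above.

```c
#include <math.h>

/* x + c2 x^2 + ... + c9 x^9 following the Table 2 schedule.
   Issue cycles assume two fully pipelined 5-cycle fma units. */
static double poly9_scheduled(double x, const double c[10])
{
    double v1  = fma(x, c[3], c[2]);   /* cycle 0, unit 1        */
    double v2  = x * x;                /* cycle 0, unit 2        */
    double v3  = fma(x, c[7], c[6]);   /* cycle 3                */
    double v4  = fma(x, c[9], c[8]);   /* cycle 3                */
    double v5  = fma(x, c[5], c[4]);   /* cycle 4                */
    double v6  = fma(v2, v1, x);       /* cycle 5: x + v2*v1     */
    double v7  = v2 * v2;              /* cycle 5                */
    double v8  = fma(v2, v4, v3);      /* cycle 9                */
    double v9  = fma(v7, v5, v6);      /* cycle 10               */
    double v10 = v2 * v7;              /* cycle 10               */
    return fma(v10, v8, v9);           /* cycle 15, done at 20   */
}
```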
OUTLINE OF ALGORITHMS

We outline each of the seven algorithms discussed here. We concentrate only on the numeric cases and ignore situations such as when the input is outside the function's domain or non-numeric (NaN, for example).

Cbrt

1. Reduction: Given x, compute r = x frcpa(x) − 1.

2. Approximation: Compute a polynomial p(r) of the form p(r) = p_1 r + p_2 r^2 + … + p_6 r^6 that approximates (1+r)^(1/3) − 1.

3. Reconstruction: Compute the result T + T p(r), where T is (1/frcpa(x))^(1/3). This value T is obtained via a tabulation of (1/frcpa(y))^(1/3), where y = 1 + k/256 and k ranges from 0 to 255, and a tabulation of 2^(−j/3), where j ranges from 0 to 2.
particular tree. Consequently, if the tree depth is d then
we cannot possibly do better than 5d cycles for that 1. Reduction: Given x, compute N, the closest integer to
particular tree. Now, assuming that we can in fact do at the value x (128/ln(2)). Then compute r = (x−N
least as well as 5d + 4, we are justified in ignoring trees of P1 )−N P2 . Here P1 +P2 approximates ln(2)/128 (see
a depth greater than or equal to d + 1, which could not previous discussions).
possibly be scheduled in as few cycles. This turns out to
be the case for all our examples.


2. Approximation: Compute a polynomial p(r) of the form p(r) = r + p_1 r^2 + … + p_4 r^5 that approximates exp(r) − 1.

3. Reconstruction: Compute the result T + T p(r), where T is 2^(N/128). This value T is obtained as follows. First, N is expressed as N = 128 M + 16 K + J, where J ranges from 0 to 15 and K ranges from 0 to 7. Clearly 2^(N/128) = 2^M 2^(K/8) 2^(J/128). The first of the three factors can be obtained by scaling the exponent; the remaining two factors are fetched from tables with 8 entries and 16 entries, respectively.

Ln

1. Reduction: Given x, compute r = x frcpa(x) − 1.

2. Approximation: Compute a polynomial p(r) of the form p(r) = p_1 r^2 + … + p_5 r^6 that approximates ln(1+r) − r.

3. Reconstruction: Compute the result T + r + p(r), where T is ln(1/frcpa(x)). This value T is obtained via a tabulation of ln(1/frcpa(y)), where y = 1 + k/256 and k ranges from 0 to 255, and a calculation of the form N ln(2).

Sin and Cos

We first consider the case of sin(x).

1. Reduction: Given x, compute N, the closest integer to the value x (16/π). Then compute r = (x − N P_1) − N P_2. Here P_1 + P_2 approximates π/16 (see the previous discussion).

2. Approximation: Compute two polynomials: p(r) of the form r + p_1 r^3 + … + p_4 r^9 that approximates sin(r), and q(r) of the form q_1 r^2 + q_2 r^4 + … + q_4 r^8 that approximates cos(r) − 1.

3. Reconstruction: Return the result as C p(r) + (S + S q(r)), where C is cos(Nπ/16) and S is sin(Nπ/16), obtained from a table.

The case of cos(x) is almost identical: add 8 to N just after it is first obtained. This works because of the identity cos(x) = sin(x + π/2).
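Putting the pieces together, a sketch of the sin path might look like this in C, building on the two-fma reduction shown earlier. The polynomials and the table are illustrative stand-ins: the series coefficients are truncated Taylor terms rather than the library's minimax fits, and the 32 table values are produced by libm calls.

```c
#include <math.h>

/* Sketch of the sin algorithm: two-fma reduction by pi/16, parallel
   polynomials for sin(r) and cos(r)-1, table-based reconstruction.
   Illustrative only; see the hedges in the text above. */
static double sin_outline(double x)
{
    const double P_1 = 0x1.921fb54442d18p-3;   /* pi/16, leading bits  */
    const double P_2 = 0x1.1a62633145c07p-57;  /* pi/16, trailing bits */
    const double PI  = 3.14159265358979323846;

    double N  = nearbyint(x * (16.0 / PI));
    double r0 = fma(-N, P_1, x);
    double r  = fma(-N, P_2, r0);

    int n = (int)N & 31;                /* table index mod 32        */
    double S = sin(n * PI / 16);        /* tabulated in practice     */
    double C = cos(n * PI / 16);

    double r2 = r * r;
    double p = r * (1.0 + r2 * (-1.0/6 + r2 * (1.0/120)));  /* ~sin(r)   */
    double q = r2 * (-0.5 + r2 * (1.0/24));                 /* ~cos(r)-1 */

    return fma(C, p, fma(S, q, S));     /* C p(r) + (S + S q(r))     */
}
```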

Tan

1. Reduction: Given x, compute N, the closest integer to the value x (2/π). Then compute r = (x − N P_1) − N P_2. Here P_1 + P_2 approximates π/2 (see the previous discussion).

2. Approximation: When N is even, compute a polynomial p(r) = r + r t (p_0 + p_1 t + … + p_15 t^15) that approximates tan(r). When N is odd, compute a polynomial q(r) = (−r)^(−1) + r (q_0 + q_1 t + … + q_10 t^10) that approximates −cot(r). The term t is r^2. We emphasize the fact that parallelism is fully utilized.

3. Reconstruction: If N is even, return p. If N is odd, return q.

Atan

1. Reduction: No reduction is needed.

2. Approximation: If |x| is less than 1, compute a polynomial p(x) = x + x^3 (p_0 + p_1 y + … + p_22 y^22) that approximates atan(x), where y is x^2. If |x| > 1, compute several quantities, fully utilizing parallelism. First, compute q(x) = q_0 + q_1 y + … + q_22 y^22, with y = x^2, that approximates x^45 atan(1/x). Second, compute c^45, where c = frcpa(x). Third, compute another polynomial r(β) = 1 + r_1 β + … + r_10 β^10, where β is the quantity x frcpa(x) − 1 and r(β) approximates the value (1−β)^(−45).

3. Reconstruction: If |x| is less than 1, return p(x). Otherwise, return sign(x) π/2 − c^45 r(β) q(x). This works because c = (1/x)(1−β), so the product c^45 r(β) approximates 1/x^45, and we use the identity atan(x) = sign(x) π/2 − atan(1/x).

SPEED AND ACCURACY

These new double-precision elementary functions are designed to be both fast and accurate. We present the speed of the functions in terms of latency for arguments that fall through the implementation along the path deemed most likely. As far as accuracy is concerned, we report the largest error observed after extensive testing, in terms of units in the last place (ulps). This error measure is standard in this field. Let f be the mathematical function to be implemented and F the actual implementation in double precision. When 2^L < |f(x)| ≤ 2^(L+1), the error in ulps is defined as

|f(x) − F(x)| / 2^(L−52).

Note that the smallest worst-case error that one can possibly attain is 0.5 ulps. Table 3 tabulates the latency and maximum error observed.

Function | Latency (cycles) | Max. Error (ulps)
cbrt     | 60               | 0.51
exp      | 60               | 0.51
ln       | 52               | 0.53
sin      | 70               | 0.51
cos      | 70               | 0.51
tan      | 72               | 0.51
atan     | 66               | 0.51

Table 3: Speed and accuracy of functions


CONCLUSIONS

We have shown how certain key features of the IA-64 architecture can be exploited to design transcendental functions featuring an excellent combination of speed and accuracy. All of these functions performed over twice as fast as ones based on a simple conversion of a library tailored for double-extended precision. In one instance, the ln function described here contributed a two-point increase in a SpecFP benchmark run under simulation.

The features of the IA-64 architecture that are exploited include parallelism and the fused multiply add, as well as less obvious features such as the reciprocal approximation instruction. When abundant resources for parallelism are available, it is not always easy to visualize how to take full advantage of them. We have searched for optimal instruction schedules. Although our search method is sufficient to handle the situations we have faced so far, more sophisticated techniques are needed for more complex situations. First, polynomials of a higher degree may be needed in more advanced algorithms. Second, more general expressions that can be considered as multivariate polynomials are also anticipated. Finally, our current method does not handle the full generality of microarchitectural constraints, which will also vary in future implementations on the IA-64 roadmap. We believe this optimal scheduling problem to be important not only because it yields high-performance implementations, but also because it may offer a quantitative analysis of the balance of microarchitectural parameters. Currently we are considering an integer programming framework to tackle this problem. We welcome other suggestions as well.

REFERENCES

[1] Cody Jr., William J. and Waite, William, Software Manual for the Elementary Functions, Prentice Hall, 1980.

[2] Knuth, D.E., The Art of Computer Programming, vol. 2: Seminumerical Algorithms, Addison-Wesley, 1969.

[3] Muller, J.M., Elementary Functions: Algorithms and Implementation, Birkhäuser, 1997.

[4] Payne, M., "An Argument Reduction Scheme on the DEC VAX," Signum Newsletter, January 1983.

[5] Powell, M.J.D., Approximation Theory and Methods, Cambridge University Press, 1981.

[6] Story, S. and Tang, P.T.P., "New algorithms for improved transcendental functions on IA-64," in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, IEEE Computer Society Press, 1999.

[7] Smith, Roger A., "A Continued-Fraction Analysis of Trigonometric Argument Reduction," IEEE Transactions on Computers, Vol. 44, No. 11, pp. 1348-1351, November 1995.

[8] Tang, P.T.P., "Table-driven implementation of the exponential function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 15, pp. 144-157, 1989.

AUTHORS' BIOGRAPHIES

John Harrison has been with Intel for just over one year. He obtained his Ph.D. degree from Cambridge University in England and is a specialist in formal validation and theorem proving. His e-mail is [email protected].

Ted Kubaska is a senior software engineer with Intel Corporation in Hillsboro, Oregon. He has an M.S. degree in physics from the University of Maine at Orono and an M.S. degree in computer science from the Oregon Graduate Institute. He works in the MSL Numerics Group, where he implements and tests floating-point algorithms. His e-mail is [email protected].

Shane Story has worked on numerical and floating-point related issues since he began working for Intel eight years ago. His e-mail is [email protected].

Ping Tak Peter Tang (his friends call him Peter) joined Intel very recently as an applied mathematician working in the Computational Software Lab of MSL. Peter received his Ph.D. degree in mathematics from the University of California at Berkeley. His interest is in floating-point issues as well as fast and accurate numerical computation methods. Peter has consulted for Intel in the past on such issues as the design of the transcendental algorithms on the Pentium processor, and he contributed a software solution to the Pentium division problem. His e-mail is [email protected].
