$$
\begin{pmatrix} (f_3+f_2)/2 \\ (f_4+f_1)/2 \end{pmatrix}
= \begin{pmatrix} \phi_0(x_3) & \phi_2(x_3) \\ \phi_0(x_4) & \phi_2(x_4) \end{pmatrix}
\begin{pmatrix} a_0 \\ a_2 \end{pmatrix},
\qquad
\begin{pmatrix} (f_3-f_2)/2 \\ (f_4-f_1)/2 \end{pmatrix}
= \begin{pmatrix} \phi_1(x_3) & \phi_3(x_3) \\ \phi_1(x_4) & \phi_3(x_4) \end{pmatrix}
\begin{pmatrix} a_1 \\ a_3 \end{pmatrix}
$$

plus

$$
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \end{pmatrix}
= \begin{pmatrix}
(f_1+f_4)/2 - (f_4-f_1)/2 \\
(f_2+f_3)/2 - (f_3-f_2)/2 \\
(f_3+f_2)/2 + (f_3-f_2)/2 \\
(f_4+f_1)/2 + (f_4-f_1)/2
\end{pmatrix}
$$

Because the terms of a general Fourier series satisfy a double parity symmetry, we can apply the above reasoning twice to split a 24-point transform (for example) into four 6 × 6 matrix multiplies: one for the even cosines, one for the odd cosines, one for the even sines, and one for the odd sines.
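The parity reduction above is easy to verify numerically. The following is a minimal NumPy sketch, not Boyd's notation: for concreteness it assumes a Chebyshev basis T_n(x) on the grid of Chebyshev roots, so that even-degree terms are symmetric and odd-degree terms antisymmetric about x = 0. It sums the series once by a full N × N matrix multiply and once by two half-size multiplies plus the recombination.

```python
import numpy as np

# Sum a Chebyshev series two ways: one full N x N multiply (MMT) versus
# the parity-split pair of (N/2) x (N/2) multiplies (PMMT).
N = 8                                                  # assumed even
x = np.cos(np.pi * (2 * np.arange(N) + 1) / (2 * N))   # Chebyshev roots; x[N-1-i] = -x[i]
a = np.random.default_rng(0).standard_normal(N)        # coefficients a_0 .. a_{N-1}

# Full MMT: Phi[i, n] = T_n(x_i), one N x N matrix-vector multiply
Phi = np.cos(np.outer(np.arange(N), np.arccos(x))).T
f_full = Phi @ a

# PMMT: T_n(-x) = (-1)^n T_n(x), so even-degree terms are symmetric and
# odd-degree terms antisymmetric about x = 0.  Work on half the grid only.
half = N // 2
th = np.arccos(x[:half])
S = np.cos(np.outer(np.arange(0, N, 2), th)).T @ a[0::2]  # symmetric part
A = np.cos(np.outer(np.arange(1, N, 2), th)).T @ a[1::2]  # antisymmetric part

f_pmmt = np.empty(N)
f_pmmt[:half] = S + A          # f(x_i)  = S(x_i) + A(x_i)
f_pmmt[half:] = (S - A)[::-1]  # f(-x_i) = S(x_i) - A(x_i), grid order reversed
assert np.allclose(f_full, f_pmmt)
```

Each half-size multiply costs (N/2)² operations, so the two together cost N²/2 rather than N²: the promised factor-of-two savings.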

Theorem 28 appeared on pg. 255 of the 1989 edition of this book, but Solomonoff (1992) published a note on the same idea and factor-of-two savings. It is just as well that he did, because later references to the PMMT are sparse and generally refer to Solomonoff's paper. However, parity-in-transforms has a long history.

Runge (1903, 1905, 1913) used this double-parity reduction as the first two steps of his transform, which is a variant of the FFT. Sir Edmund Whittaker had special computing forms printed for "Runge grouping" for his computing lab at Edinburgh (Carse and Urquhart, 1914); this was computer programming before there were digital, electronic computers!

Runge's ideas were forgotten and the FFT was reinvented several times before the rediscovery by Cooley and Tukey (1965).

No doubt the parity transform will be independently rediscovered many times. Halving both execution time and storage is a useful savings.

10.5 Costs of the Fast Fourier Transform: FFT versus PMMT

Table 10.3 lists the cost of various algorithms for one-dimensional transforms. All the FFT variants have a cost which is O(N log₂(N)) while all the Matrix Multiplication Transforms have costs which are O(N²). Except for the complex FFT, the cost of the FFT is always about (5/2)N log₂(N), where N is the number of points which are actually transformed. For an odd cosine transform which computes cos(x), cos(3x), ..., cos([2N − 1]x), i.e., N odd cosines, only N points on x ∈ [0, π/2] are needed. The periodic grid spans x ∈ [−π, π] with a total of 4N points, but it is N, not 4N, that appears in the cost estimate.

The reason that the table has so many entries is that there are several factors which can

reduce the total cost by a factor of two:

• real-valued data

• symmetry with respect to the origin

• symmetry with respect to π/2

• exploiting parity in matrix multiplication transforms


Table 10.3: FFT costs

Note: all costs are expressed in terms of real operations for an N-point transform where N is the number of basis functions of the proper symmetry which are transformed; complex multiplications and additions have been converted to their equivalent costs in real operations. FFT is the Fast Fourier Transform, MMT is the Matrix Multiplication Transform, and PMMT is the MMT split into two smaller matrix-vector multiplies, one for the symmetric part of the function or series and the other for the antisymmetric.

FFT of complex data                                     5N log₂(N)
MMT of complex data                                     8N²
PMMT of complex data                                    4N²
FFT of real-valued data                                 (5/2)N log₂(N)
MMT of real-valued data                                 2N²
PMMT of real-valued data                                N²
FFT cosine, sine or Chebyshev transform                 (5/2)N log₂(N)
MMT cosine, sine or Chebyshev transform                 2N²
PMMT cosine, sine or Chebyshev transform                N²
FFT odd cosine, odd sine or odd Chebyshev trans.        (5/2)N log₂(N)
MMT odd cosine, odd sine or odd Chebyshev trans.        2N²
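The entries of Table 10.3 are simple formulas, and evaluating the real-data rows for a few values of N shows that, by formal operation count alone, the FFT is already competitive at very small N. This makes the timing comparisons discussed below all the more striking, since measured break-even points are much larger. A quick tabulation (the helper names here are made up for illustration):

```python
from math import log2

def fft_real(N):   # FFT of real-valued data, Table 10.3
    return 2.5 * N * log2(N)

def mmt_real(N):   # Matrix Multiplication Transform, real data
    return 2.0 * N * N

def pmmt_real(N):  # Parity MMT, real data
    return float(N * N)

for N in (8, 32, 128, 512):
    print(f"N={N:4d}  FFT={fft_real(N):10.0f}  MMT={mmt_real(N):8.0f}  PMMT={pmmt_real(N):8.0f}")
```

Already at N = 8 the FFT count (60 operations) is below the PMMT count (64), yet Fig. 10.2 shows the MMT winning in measured time until N is in the tens or hundreds: the formal counts omit the indexing overhead discussed in the text.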

To apply the full, complex-valued FFT on 4N points to compute the first N odd cosines, for example, costs eight times more than an efficient odd cosine transform. One factor of two is for real data, another for parity with respect to the origin (a cosine series), and the third for parity with respect to x = π/2 (odd cosine series).

However, these formal operation counts must be interpreted with caution. First, the FFT has a lot of nested operations which create overhead that is not apparent in the count of multiplications and additions. The MMT is just a matrix-vector multiply, which is a simpler algorithm. Second, both the FFT and the matrix-vector multiply are such fundamental algorithms that assembly-language, highly optimized codes to perform them are available for many machines. These are typically faster than a FORTRAN subroutine by a factor of two. Third, the relative efficiency of the algorithms is hardware-dependent.

The net result is that for small and moderate N, the MMT is much more competitive than operation counts indicate (Fig. 10.2), at least for single transforms. The graph also illustrates the strong hardware and software dependence of the relative costs of the algorithms. On the personal computer, the FFT matches the efficiency of the MMT at about N = 48, but on the Cray, not until N = 512, a difference of a factor of 10! Fornberg's figure contains another curve, not reproduced here, that shows the relative efficiency and execution times are strongly dependent on software, too. The FFT in the Cray library is much faster than its FORTRAN equivalent and lowers the point where the FFT and MMT curves meet from N = 512 to N = 128.

Clive Temperton (private communication, 1998) has noted that in large models, one can usually arrange to compute many one-dimensional FFTs simultaneously. For example, in a multi-level weather forecasting model, the transforms of velocities, temperature and so on at each level are independent (i.e., logically in parallel). One can then program so that "the innermost loops of the code step 'across' the transform, so that all the messy indexing is banished to the outermost loops ... [where the cost is amortized over many one-dimensional transforms]. I bet the break-even point between FFT and PMMT is lower than N = 8, perhaps even N = 2!" The algorithmic details are given in Temperton (1983a), where it is shown that the multi-transform strategy is very efficient for both vector (Cray Y-MP) and parallel machines. Indexing and if statements, because they are logical and integer operations, are omitted from the usual floating point operation counts, but are nevertheless

CHAPTER 10. PARTIAL SUMMATION, THE FFT AND MMT


[Figure 10.2 here: log-log plot of CPU time (secs.) versus N, with N running from 8 to 512 and times from 10⁻⁵ to 10⁻¹ seconds; one pair of FFT and Matrix Mult. curves for the PC and another pair for the Cray.]

Figure 10.2: A comparison of the execution times of the Chebyshev (or cosine) transform by FFT versus Matrix Multiplication Transform (MMT). (Lower is better!) In each pair, the execution time of the FFT is the solid curve while the MMT is marked with disks. The top pair of graphs was obtained on an IBM personal computer with an Intel 486DX33 central processor while the lower pair was run on a Cray X-MP. Data from Fig. F-3.1 of Fornberg (1996). Note that the FFT curves are for a single one-dimensional transform. When many transforms are computed simultaneously (logically in parallel, even if not necessarily implemented on parallel hardware), the indexing can be done once and amortized over many transforms, greatly reducing the cost per transform.

a major expense of one-at-a-time FFTs for small to moderate N .

Thus, simultaneous transforms may be much faster than a single transform, but only if coded carefully: off-the-shelf library software is often not optimized to perform many transforms in a single subroutine call using Temperton's inner loop/outer loop strategy. (Exceptions include a routine in the Cray Scientific Subroutine Library and Roland Sweet's VFFTPACK, which is available on the Web.) This reiterates the point made earlier: the relative efficiencies of the MMT and FFT are highly hardware and software dependent. The cost of the FFT can be greatly reduced by machine coding, simultaneous computation of many transforms, exploitation of symmetries (such as foreknowledge that the grid point values are real-valued), and so on.
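The multi-transform strategy is easy to mimic with a modern array library. The sketch below is only an illustration of the idea using NumPy's batched real FFT, not Temperton's code: it transforms every level of a stack of one-dimensional data sets in a single call, so the looping and indexing are handled internally and amortized over all the transforms.

```python
import numpy as np

rng = np.random.default_rng(0)
levels, N = 20, 64                       # e.g. 20 model levels, 64-point transforms
data = rng.standard_normal((levels, N))  # one row per level

# Many independent one-dimensional FFTs in a single call: the transform
# runs along axis=1 and the loop over levels is internal to the library.
batched = np.fft.rfft(data, axis=1)

# Equivalent one-at-a-time version, for comparison
one_at_a_time = np.stack([np.fft.rfft(row) for row in data])
assert np.allclose(batched, one_at_a_time)
```

The batched call and the explicit loop return identical results; the difference is purely in how the per-transform bookkeeping cost is amortized.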

Spectral element models, with typically N ∼ 8, use a Legendre basis and the MMT without tears: for such small N, the FFT would probably be slower than the MMT, so a Chebyshev basis has no advantage over Legendre. Orszag (1980) estimated that using the FFT instead of the MMT only reduced the cost of his 128³ turbulence model by a factor of two.² Still, a factor of two is a significant savings for a supercomputer code. The virtues of the Fast Fourier Transform will continue to improve as the relentless march to larger and larger values of N continues.

² Secret confession: I have never used the FFT for a pseudospectral program in BASIC, only for large codes that needed FORTRAN or MATLAB, and then only sometimes.


10.6 Fast Multipole Methods, the Orszag Transform, and Other

Generalized FFTs

Fast Multipole Methods (FMM) were originally developed for problems in stellar dynamics, such as the motion of N stars in a model of an evolving star cluster. The naive approach is to compute the total force on the j-th star by adding up the forces exerted upon it by each of the other stars individually. The cost is O(N²) operations per time step, which is very expensive when N ∼ 100,000, as would be appropriate for a globular cluster.

The idea behind FMM is illustrated by the schematic (Fig. 10.3) of two interacting star clusters. The thousands of stars in the right cluster can be replaced by a series expansion which converges geometrically. Thus, to calculate the forces on the stars in the left cluster due to stars in the right cluster, we only need to sum a few terms of the multipole series instead of thousands of individual interactions. If the force of the left cluster on the right is calculated similarly, then the cost is reduced by a factor of two, assuming that interactions between stars in the same cluster are still computed individually. By careful partitioning of the cluster and the use of many overlapping multipole expansions in a nested hierarchy, one can reduce the cost to O(N log₂(N)), an enormous saving.
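The lowest term of the multipole series is already useful on its own. The sketch below is a toy with made-up cluster parameters, using the monopole ("Newton") term only and no hierarchy: it compares the direct O(N²) pairwise potential felt by one cluster with the center-of-mass approximation of the other.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two compact, well-separated 2-D "star clusters" (hypothetical parameters)
left  = rng.normal(loc=-10.0, scale=0.5, size=(1000, 2))
right = rng.normal(loc=+10.0, scale=0.5, size=(1000, 2))
m = 1.0  # unit masses, G = 1 for simplicity

# Direct sum: potential at each left-cluster star due to every right-cluster star
diff = left[:, None, :] - right[None, :, :]          # (1000, 1000, 2)
phi_direct = -m * (1.0 / np.linalg.norm(diff, axis=2)).sum(axis=1)

# Monopole approximation: collapse the whole right cluster to its center of
# mass -- the lowest term of the multipole expansion
com = right.mean(axis=0)
phi_mono = -m * right.shape[0] / np.linalg.norm(left - com, axis=1)

rel_err = np.abs(phi_mono - phi_direct) / np.abs(phi_direct)
assert rel_err.max() < 1e-2   # clusters are well separated, so the error is tiny
```

One term replaces a thousand interactions per star, at a relative error set by (cluster size / separation)²; the full FMM refines this with higher-order terms and a nested hierarchy of boxes.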

Boyd (1992c) pointed out that the FMM can also be applied to cardinal function series.

This means that (i) FMM can be used as a fast transform for Hermite, Laguerre, spherical

[Figure 10.3 here: schematic of two interacting star clusters; the right cluster is replaced by the leading terms of its multipole expansion, 1/r + r·r′/r³ + ...]

Figure 10.3: Calculating the gravitational forces exerted by two star clusters (top half) is rather expensive because every star interacts with every other. The multipole expansion replaces many individual stars by a few terms of a series, greatly reducing the cost of computing the pull of the right cluster on the stars in the left. Here r is simply the distance from a star in the left cluster (different for each star) to the center of mass of the right cluster. The lowest term is Newton's approximation: all the mass pulls on external objects as if the whole cluster were shrunk to a point at its center of mass.


Table 10.4: A Selected Bibliography of Generalized FFTs

Note: "FD" is an abbreviation for finite difference.

References                            Comments
Taylor, Hirsh &                       Comparison of FFT with MMT & conjugate gradient
  Nadworny (1981, 1984)                 to estimate penalty for nonstandard (non-FFT) grid
Orszag (1986)                         Fast FFT for almost any function with 3-term recurrence
Greengard & Strain (1991)             Fast Gauss transform (useful with Green's functions, etc.)
Alpert & Rokhlin (1991)               Fast Legendre-to-Chebyshev transform
Süli & Ware (1991, 1992)              Nonequispaced fast transform
                                        for semi-Lagrangian time-marching
Boyd (1992c)                          Showed FMM could be applied to (i) non-Chebyshev basis
                                        (ii) non-equispaced grid for Fourier and Chebyshev
Boyd (1992d)                          Off-grid interpolation through two-step algorithm:
                                        (i) FFT-to-fine-grid (ii) Lagrange interpolation on fine grid
Moore, Healy & Rockmore (1993)        O(N log₂(N)) exact Fourier transform for nonuniform grid
Dutt & Rokhlin (1993, 1995)           FMM nonequispaced FFT
Jakob-Chien & Alpert (1997),          FMM for spherical harmonics
  Jakob (1993)
Anderson & Dahleh (1996)              Taylor series about nearest equispaced point for
                                        off-grid interpolation; also do reverse off-grid interpolation
                                        (from irregular grid to Fourier coefficients) by iteration
Driscoll, Healy & Rockmore (1997)     O(N log₂(N)) transform for any orthogonal polynomials

harmonics and other functions for which the FFT is inapplicable and (ii) for Fourier series

and Chebyshev polynomials, the FMM provides a fast method for off-grid interpolation.

To illustrate the first application, let the grid points x_j be the roots of \phi_N(x) and let C_j(x) denote the corresponding cardinal function. Then

$$
f(x) = \sum_{j=1}^{N} f_j \, C_j(x)
     = \phi_N(x) \sum_{j=1}^{N} \frac{f_j / \phi_{N,x}(x_j)}{x - x_j}
\qquad (10.27)
$$

by virtue of the general construction of cardinal functions given in Chapter 5. However, the lower sum in (10.27) is of the same form as the electrostatic or gravitational potential of a sum of point charges or masses, with f_j / \phi_{N,x}(x_j) playing the role of the charge of the j-th object.
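The point-charge form of (10.27) can be checked directly. The sketch below assumes, for concreteness, \phi_N = T_N, the Chebyshev polynomial, so that the grid points are the Chebyshev roots and the derivative T_N'(x_j) = N sin(Nθ_j)/sin(θ_j) is known in closed form; it evaluates f(x) off-grid by the charge sum and cross-checks the result against ordinary Lagrange interpolation on the same nodes.

```python
import numpy as np

N = 16
theta = np.pi * (2 * np.arange(N) + 1) / (2 * N)
xj = np.cos(theta)                          # roots of T_N(x)
TNp = N * np.sin(N * theta) / np.sin(theta)  # T_N'(x_j) at the roots

rng = np.random.default_rng(2)
fj = rng.standard_normal(N)                  # grid-point values
x = 0.123                                    # an off-grid evaluation point

# Eq. (10.27): f(x) = T_N(x) * sum_j [f_j / T_N'(x_j)] / (x - x_j)
charges = fj / TNp                           # the "point charges"
f_x = np.cos(N * np.arccos(x)) * np.sum(charges / (x - xj))

# Cross-check: the cardinal functions at the roots are the Lagrange polynomials
lagrange = sum(
    fj[i] * np.prod([(x - xj[k]) / (xj[i] - xj[k]) for k in range(N) if k != i])
    for i in range(N)
)
assert np.isclose(f_x, lagrange)
```

Evaluated naively, the charge sum still costs O(N) per point, i.e. O(N²) for N points; the point of Boyd (1992c) is that sums of exactly this 1/(x − x_j) form are what the FMM accelerates.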

It is trivial to evaluate the cardinal function series to compute f(x) on the canonical grid because the j-th cardinal function is zero at all grid points except x = x_j, but the evaluation of the cardinal series for the first derivative costs O(N²) operations. Differentiation of the cardinal series gives

$$
\frac{df}{dx}(x_i) = \phi_{N,x}(x_i) \sum_{j=1}^{N} \frac{f_j / \phi_{N,x}(x_j)}{x_i - x_j}
\; + \; \phi_N(x_i) \sum_{j=1}^{N} \frac{-\, f_j / \phi_{N,x}(x_j)}{(x_i - x_j)^2}
$$

N