

for the forward transform while the reverse transform is
\[
\begin{bmatrix} (f_3+f_2)/2 \\ (f_4+f_1)/2 \end{bmatrix}
=
\begin{bmatrix} \phi_0(x_3) & \phi_2(x_3) \\ \phi_0(x_4) & \phi_2(x_4) \end{bmatrix}
\begin{bmatrix} a_0 \\ a_2 \end{bmatrix},
\qquad
\begin{bmatrix} (f_3-f_2)/2 \\ (f_4-f_1)/2 \end{bmatrix}
=
\begin{bmatrix} \phi_1(x_3) & \phi_3(x_3) \\ \phi_1(x_4) & \phi_3(x_4) \end{bmatrix}
\begin{bmatrix} a_1 \\ a_3 \end{bmatrix}
\]

followed by the recombination

\[
\begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \end{bmatrix}
=
\begin{bmatrix}
(f_1+f_4)/2 - (f_4-f_1)/2 \\
(f_2+f_3)/2 - (f_3-f_2)/2 \\
(f_3+f_2)/2 + (f_3-f_2)/2 \\
(f_4+f_1)/2 + (f_4-f_1)/2
\end{bmatrix}
\]

Because the terms of a general Fourier series satisfy a double parity symmetry, we can
apply the above reasoning twice to split a 24-point transform (for example) into four 6 × 6
matrix multiplies: one for the even cosines, one for the odd cosines, one for even sines, and
one for the odd sines.
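As a concrete, purely illustrative sketch (all names are ours), the following fragment applies the parity split to the reverse (coefficient-to-grid-point) Chebyshev transform: one N × N matrix-vector multiply is replaced by two (N/2) × (N/2) multiplies for the even and odd coefficients, followed by the half-sum/half-difference recombination.

```python
import numpy as np

def chebyshev_grid(N):
    # Chebyshev roots grid, symmetric about x = 0: x[N-1-j] = -x[j]
    return np.cos(np.pi * (2 * np.arange(N) + 1) / (2 * N))

def mmt_synthesis(a):
    """Plain MMT: one N x N matrix-vector multiply."""
    N = len(a)
    x = chebyshev_grid(N)
    Phi = np.cos(np.outer(np.arccos(x), np.arange(N)))  # Phi[j, n] = T_n(x_j)
    return Phi @ a

def pmmt_synthesis(a):
    """Parity MMT: two (N/2) x (N/2) multiplies plus recombination."""
    N = len(a)
    assert N % 2 == 0
    t = np.arccos(chebyshev_grid(N)[: N // 2])          # half-grid, x_j > 0
    Se = np.cos(np.outer(t, np.arange(0, N, 2)))        # even T_n on half-grid
    So = np.cos(np.outer(t, np.arange(1, N, 2)))        # odd  T_n on half-grid
    e = Se @ a[0::2]      # symmetric part     (f_j + f_{N-1-j}) / 2
    o = So @ a[1::2]      # antisymmetric part (f_j - f_{N-1-j}) / 2
    f = np.empty(N)
    f[: N // 2] = e + o           # f_j       = e_j + o_j
    f[N // 2 :] = (e - o)[::-1]   # f_{N-1-j} = e_j - o_j
    return f

a = np.random.default_rng(0).standard_normal(16)
print(np.allclose(mmt_synthesis(a), pmmt_synthesis(a)))  # True
```

The two half-size multiplies cost half as many operations as the full multiply, which is the factor-of-two savings of the theorem.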
Theorem 28 appeared on pg. 255 of the 1989 edition of this book, but Solomonoff (1992)
published a note on the same idea and factor-of-two savings. It is just as well that he did,
because later references to the PMMT are sparse and generally refer to Solomonoff's paper.
However, parity-in-transforms has a long history.
Runge (1903, 1905, 1913) used this double-parity reduction as the first two steps of
his transform, which is a variant of the FFT. Sir Edmund Whittaker had special computing
forms printed for "Runge grouping" for his computing lab at Edinburgh (Carse and
Urquhart, 1914); this was computer programming before there were digital, electronic
computers.
Runge's ideas were forgotten and the FFT was reinvented several times before the
rediscovery by Cooley and Tukey (1965).
No doubt the parity transform will be independently rediscovered many times. Halving
both execution time and storage is a useful savings.

10.5 Costs of the Fast Fourier Transform: FFT versus PMMT
Table 10.3 lists the cost of various algorithms for one-dimensional transforms. All the FFT
variants have a cost which is O(N log2(N)) while all the Matrix Multiplication Transforms
have costs which are O(N²). Except for the complex FFT, the cost of the FFT is always
O((5/2)N log2(N)) where N is the number of points which are actually transformed. For
an odd cosine transform which computes cos(x), cos(3x), ..., cos([2N − 1]x), i.e., N odd
cosines, only N points on x ∈ [0, π/2] are needed. The periodic grid spans x ∈ [−π, π] with
a total of 4N points, but it is N, not 4N, that appears in the cost estimate.
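That only N points are needed can be checked directly: on the grid x_j = π(2j + 1)/(4N), j = 0, ..., N − 1, the N odd cosines are discretely orthogonal, so the N × N collocation matrix is orthogonal up to a factor of N/2. A sketch (our own check; this grid is the standard DCT-IV grid):

```python
import numpy as np

# Discrete orthogonality of the N odd cosines on N points of (0, pi/2).
N = 8
j = np.arange(N)
x = np.pi * (2 * j + 1) / (4 * N)        # N points, all inside (0, pi/2)
C = np.cos(np.outer(x, 2 * j + 1))       # C[j, n] = cos((2n+1) x_j)
print(np.allclose(C @ C.T, (N / 2) * np.eye(N)))  # True
```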
The reason that the table has so many entries is that there are several factors which can
reduce the total cost by a factor of two:

• real-valued data

• symmetry with respect to the origin

• symmetry with respect to π/2

• exploiting parity in matrix multiplication transforms
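To see where the formal break-even points fall, the operation counts for real-valued data can be tabulated directly (our own arithmetic; the PMMT entry assumes the parity split halves the 2N² cost of the plain MMT):

```python
import math

# Formal operation counts for real-valued data, in real operations.
def fft_ops(N):
    return 2.5 * N * math.log2(N)

def mmt_ops(N):
    return 2.0 * N * N

def pmmt_ops(N):
    return 1.0 * N * N   # parity split: two half-size matrix multiplies

for N in (8, 16, 32, 64, 128, 256):
    print(N, round(fft_ops(N)), round(mmt_ops(N)), round(pmmt_ops(N)))
```

By these formal counts the FFT overtakes even the PMMT already around N ≈ 8; the measured crossovers discussed below are far higher, which is precisely the point of the caveats that follow.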

Table 10.3: FFT costs
Note: all costs are expressed in terms of real operations for an N-point transform where N is the number of basis functions of the proper symmetry which are transformed; complex multiplications and additions have been converted to their equivalent costs in real operations. FFT is the Fast Fourier Transform, MMT is the Matrix Multiplication Transform and PMMT is the MMT split into two smaller matrix-vector multiplies, one for the symmetric part of the function or series and the other for the antisymmetric.

Algorithm                                             Cost
FFT of complex data                                   5N log2(N)
MMT of complex data                                   8N²
PMMT of complex data                                  4N²
FFT of real-valued data                               (5/2)N log2(N)
MMT of real-valued data                               2N²
PMMT of real-valued data                              N²
FFT cosine, sine or Chebyshev transform               (5/2)N log2(N)
MMT cosine, sine or Chebyshev transform               2N²
PMMT cosine, sine or Chebyshev transform              N²
FFT odd cosine, odd sine or odd Chebyshev trans.      (5/2)N log2(N)
MMT odd cosine, odd sine or odd Chebyshev trans.      2N²

To apply the full, complex-valued FFT on 4N points to compute the first N odd cosines,
for example, costs eight times more than an efficient odd cosine transform. One factor of
two is for real data, another for parity with respect to the origin (a cosine series), and the
third for parity with respect to x = π/2 (odd cosine series).
However, these formal operation counts must be interpreted with caution. First, the
FFT has a lot of nested operations which create overhead that is not apparent in the count of
multiplications and additions. The MMT is just a matrix-vector multiply, which is a simpler
algorithm. Second, both the FFT and the matrix-vector multiply are such fundamental
algorithms that assembly-language, highly optimized codes to perform them are available
for many machines. These are typically faster than a FORTRAN subroutine by a factor of
two. Third, the relative efficiency of the algorithms is hardware-dependent.
The net result is that for small and moderate N , the MMT is much more competitive
than operation counts indicate (Fig. 10.2), at least for single transforms. The graph also
illustrates the strong hardware and software dependence of the relative costs of the
algorithms. On the personal computer, the FFT matches the efficiency of the MMT at about
N = 48, but on the Cray, not until N = 512, a difference of a factor of 10! Fornberg's figure contains
another curve, not reproduced here, that shows the relative ef¬ciency and execution times
are strongly dependent on software, too. The FFT in the Cray library is much faster than
its FORTRAN equivalent and lowers the point where the FFT and MMT curves meet from
N = 512 to N = 128.
Clive Temperton (private communication, 1998) has noted that in large models, one
can usually arrange to compute many one-dimensional FFTs simultaneously. For example,
in a multi-level weather forecasting model, the transforms of velocities, temperature and
so on at each level are independent (i.e., logically in parallel). One can then program
so that "the innermost loops of the code step 'across' the transform, so that all the messy
indexing is banished to the outermost loops ... [where the cost is amortized over many one-
dimensional transforms]. I bet the break-even point between FFT and PMMT is lower than
N = 8, perhaps even N = 2!" The algorithmic details are given in Temperton (1983a), where
it is shown that the multi-transform strategy is very efficient for both vector (Cray-YMP)
and parallel machines. Indexing and if statements, because they are logical and integer
operations, are omitted from the usual floating-point operation counts, but are nevertheless

[Plot: CPU time (secs.) versus N, N = 8 to 512; in each pair the solid curve is the FFT and the disk-marked curve is labeled "Matrix Mult."]
Figure 10.2: A comparison of the execution times of the Chebyshev (or cosine) transform
by FFT versus Matrix Multiplication Transform (MMT). (Lower is better!) In each pair, the
execution time of the FFT is the solid curve while the MMT is marked with disks. The top
pair of graphs were obtained on an IBM personal computer with an Intel 486DX33 central
processor while the lower pair was run on a Cray X-MP. Data from Fig. F-3.1 of Fornberg
(1996). Note that the FFT curves are for a single one-dimensional transform. When many
transforms are computed simultaneously (logically in parallel, even if not necessarily im-
plemented on parallel hardware), the indexing can be done once and amortized over many
transforms, greatly reducing the cost per transform.

a major expense of one-at-a-time FFTs for small to moderate N .
Thus, simultaneous transforms may be much faster than a single transform, but only
if coded carefully; off-the-shelf library software is often not optimized to perform many
transforms in a single subroutine call using Temperton's inner loop/outer loop strategy.
(Exceptions include a routine in the Cray Scientific Subroutine Library and Roland Sweet's
VFFTPACK, which is available on the Web.) This reiterates the point made earlier: the
relative efficiencies of the MMT and FFT are highly hardware and software dependent.
The cost of the FFT can be greatly reduced by machine coding, simultaneous computation
of many transforms, exploitation of symmetries (such as foreknowledge that the grid point
values are real-valued) and so on.
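The many-transforms-at-once idea is easiest to see in a modern array language. As a hedged sketch (our own construction; numpy's batched FFT call plays the role of Temperton's hand-coded inner loop/outer loop arrangement):

```python
import time

import numpy as np

# One library call that transforms every row amortizes indexing and setup
# over the whole batch, versus calling the FFT once per 1-D transform.
rng = np.random.default_rng(1)
levels, N = 4096, 64                      # e.g. many model levels and fields
data = rng.standard_normal((levels, N))

t0 = time.perf_counter()
one_at_a_time = np.array([np.fft.rfft(row) for row in data])
t1 = time.perf_counter()
batched = np.fft.rfft(data, axis=1)       # all 4096 transforms in one call
t2 = time.perf_counter()

print(np.allclose(one_at_a_time, batched))              # True
print(f"loop {t1 - t0:.4f}s vs batched {t2 - t1:.4f}s")
```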
Spectral element models, with typically N ∼ 8, use a Legendre basis and the MMT
without tears: for such small N, the FFT would probably be slower than the MMT, so a
Chebyshev basis has no advantage over Legendre. Orszag (1980) estimated that using the
FFT instead of the MMT only reduced the cost of his 128³ turbulence model by a factor of two.²
Still, a factor of two is a significant savings for a supercomputer code. The virtues of the
Fast Fourier Transform will continue to improve as the relentless march to larger and larger
values of N continues.

²Secret confession: I have never used the FFT for a pseudospectral program in BASIC, only for large codes that
needed FORTRAN or MATLAB, and then only sometimes.

10.6 Fast Multipole Methods, the Orszag Transform, and Other
Generalized FFTs
Fast Multipole Methods (FMM) were originally developed for problems in stellar dynam-
ics, such as the motion of N stars in a model of an evolving star cluster. The naive approach
is to compute the total force on the j-th star by adding up the forces exerted upon it by each
of the other stars individually. The cost is O(N 2 ) operations per time step, which is very
expensive when N ∼ 100,000, as would be appropriate for a globular cluster.
The idea behind FMM is illustrated by the schematic (Fig. 10.3) of two interacting star
clusters. The thousands of stars in the right cluster can be replaced by a series expansion
which converges geometrically. Thus, to calculate the forces on the stars in the left cluster
due to stars in the right cluster, we only need to sum a few terms of the multipole series
instead of thousands of individual interactions. If the force of the left cluster on the right is
calculated similarly, then the cost is reduced by a factor of two, assuming that interactions
between stars in the same cluster are still computed individually. By careful partitioning
of the cluster and the use of many overlapping multipole expansions in a nested hierarchy,
one can reduce the cost to O(N log2(N)), an enormous saving.
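A minimal numerical check of this idea (our own toy construction, not from the text): replace a distant cluster of unit-mass stars by the leading term of its multipole series about the center of mass.

```python
import numpy as np

rng = np.random.default_rng(2)
stars = 0.1 * rng.standard_normal((1000, 3))   # "right" cluster near origin
obs = np.array([10.0, 0.0, 0.0])               # a star in the "left" cluster

# Direct sum: one term per star (O(N) here, O(N^2) for all pairwise forces).
direct = np.sum(1.0 / np.linalg.norm(obs - stars, axis=1))

# Monopole (Newton's) term about the center of mass; the dipole term
# r . r'/r^3 vanishes identically for equal masses expanded about the COM.
com = stars.mean(axis=0)
mono = len(stars) / np.linalg.norm(obs - com)
print(abs(direct - mono) / direct)             # small quadrupole-sized error
```

One term reproduces a thousand-term sum to a relative error set by the quadrupole, roughly (cluster radius / distance)².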
Boyd (1992c) pointed out that the FMM can also be applied to cardinal function series.
This means that (i) FMM can be used as a fast transform for Hermite, Laguerre, spherical


[Figure annotation: multipole series 1/r + r · r′/r³ + · · ·]
Figure 10.3: Calculating the gravitational forces exerted by two star clusters (top half) is
rather expensive because every star interacts with every other. The multipole expansion
replaces many individual stars by a few terms of a series, greatly reducing the cost of
computing the pull of the right cluster on the stars in the left. 1/r is simply the distance
from a star in the left cluster (different for each star) to the center of mass of the right cluster.
The lowest term is Newton's approximation: all the mass pulls on external objects as if the
whole cluster were shrunk to a point at its center of mass.

Table 10.4: A Selected Bibliography of Generalized FFTs
Note: "FD" is an abbreviation for finite difference.

References                            Comments
Taylor, Hirsh &                       Comparison of FFT with MMT & conjugate gradient
Nadworny (1981, 1984)                 to estimate penalty for nonstandard (non-FFT) grid
Orszag (1986)                         Fast FFT for almost any function with 3-term recurrence
Greengard & Strain (1991)             Fast Gauss transform (useful with Green's functions, etc.)
Alpert & Rokhlin (1991)               Fast Legendre-to-Chebyshev transform
Suli & Ware (1991, 1992)              Nonequispaced fast transform
                                      for semi-Lagrangian time-marching
Boyd (1992c)                          Showed FMM could be applied to (i) non-Chebyshev basis,
                                      (ii) non-equispaced grid for Fourier and Chebyshev
Boyd (1992d)                          Off-grid interpolation through two-step algorithm:
                                      (i) FFT-to-fine-grid, (ii) Lagrange interpolation on fine grid
Moore & Healy & Rockmore (1993)       O(N log2(N)) exact Fourier transform for nonuniform grid
Dutt & Rokhlin (1993, 1995)           FMM nonequispaced FFT
Jakob-Chien & Alpert (1997),          FMM for spherical harmonics
Jakob (1993)
Anderson & Dahleh (1996)              Taylor series about nearest equispaced point for
                                      off-grid interpolation. Also do reverse off-grid interpolation
                                      (from irregular grid to Fourier coefficients) by iteration
Driscoll & Healy & Rockmore (1997)    O(N log2(N)) transform for any orthogonal polynomials

harmonics and other functions for which the FFT is inapplicable and (ii) for Fourier series
and Chebyshev polynomials, the FMM provides a fast method for off-grid interpolation.
To illustrate the first application, let the grid points xj be the roots of φN(x) and let
Cj(x) denote the corresponding cardinal function. Then

\[
f(x) = \sum_{j=1}^{N} f_j \, C_j(x)
     = \phi_N(x) \sum_{j=1}^{N} \frac{f_j / \phi_{N,x}(x_j)}{x - x_j}
\tag{10.27}
\]

by virtue of the general construction of cardinal functions given in Chapter 5. However,
the lower sum in (10.27) is of the same form as the electrostatic or gravitational potential of
a sum of point charges or masses with (fj /φN,x (xj )) playing the role of the charge of the
j-th object.
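A sketch of this "point charge" structure for the special case φN = T_N, the degree-N Chebyshev polynomial on its roots grid (our own illustration; the sum is evaluated directly here, i.e. the O(N²) work that the FMM would accelerate):

```python
import numpy as np

N = 16
j = np.arange(N)
theta = np.pi * (2 * j + 1) / (2 * N)
xj = np.cos(theta)                        # roots of T_N
dphi = N * (-1.0) ** j / np.sin(theta)    # T_N'(x_j) = N (-1)^j / sin(theta_j)

f = np.exp(xj)                            # grid values of a smooth function
q = f / dphi                              # "charges" f_j / phi_{N,x}(x_j)

x = 0.3                                   # an off-grid evaluation point
approx = np.cos(N * np.arccos(x)) * np.sum(q / (x - xj))
print(abs(approx - np.exp(x)))            # spectrally small interpolation error
```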
It is trivial to evaluate the cardinal function series to compute f(x) on the canonical grid
because the j-th cardinal function is zero at all grid points except x = xj, but the evaluation
of the cardinal series for the first derivative costs O(N²) operations. Differentiation of the
cardinal series gives

\[
\frac{df}{dx}(x_i) = \phi_{N,x}(x_i) \sum_{j=1}^{N} \frac{f_j / \phi_{N,x}(x_j)}{x_i - x_j}
\;-\; \phi_N(x_i) \sum_{j=1}^{N} \frac{f_j / \phi_{N,x}(x_j)}{(x_i - x_j)^2}
\]

