

For the RLW equation, the resolution of the apparent paradox is that the differential op-
erator acting on ut drastically slows the phase speed of short waves, allowing a much longer
time step than for KdV without computational instability. The quasi-geostrophic equation
was invented for similar reasons and used for the first computer weather forecasts in 1950.
It continued in use until 1965 when the first CDC6600s made it possible to forecast with
the “primitive equations”.
However, modern semi-implicit algorithms allow as long a time step for the primitive
equations as for the quasi-geostrophic equation. The boundary value problem of the semi-
implicit code is just as cheap to solve as that for quasi-geostrophy. The difference is that
the primitive equations model allows more accurate forecasts because it does not filter out
low frequency gravity waves and Kelvin waves, as quasi-geostrophy does.
Similarly, the KdV equation has been exorcised of its former frightfulness by a combina-
tion of fast workstations, which can solve one-dimensional problems quickly even with a
tiny time step, and semi-implicit time-marching algorithms, which allow it to be integrated
as quickly as its understudy, the RLW equation.
One parenthetical note of caution: when the boundary conditions are not periodic and
the operator which is inverted is of lower order than the rest of the equation, some spatial
discretizations of implicitly-implicit equations can generate modes with infinite eigenvalues
due to imposing the full set of boundary conditions on the matrix that discretizes the
lower order operator. For such dangerous “L-lower-order-than-G” problems, there are
simple remedies which are described in the section on the “Spurious Eigenvalue” problem
in Chapter 7.
“Implicitly-implicit” equations, that is, PDEs which require the solution of a boundary
value problem even for explicit time-marching, are now falling from favor. Implicit and
semi-implicit algorithms, which will be discussed at length in later chapters, can do the
same even better. Smart algorithms have replaced clever approximation.
Chapter 10

Partial Summation, the Fast Fourier and Matrix
Multiplication Transforms and Their Cousins

“The derivative calculation [by Matrix Multiplication Transform] is the single
most computationally intensive step [in my spectral element code]; as a result,
a five-fold reduction in Navier-Stokes solution time is obtained when standard
vectorized FORTRAN code is substituted with library matrix-matrix product
routines on the Cray X-MP. Similar performance improvements are obtained by
employing hand-coded matrix-matrix products on the iPSC/VX and iPSC/860.”

Paul F. Fischer in an unpublished report (1991)

10.1 Introduction
Spectral time-marching algorithms are prohibitively expensive for large N unless the pro-
gram jumps back and forth between the cardinal function basis and the Chebyshev polyno-
mial basis at every timestep. The transforms between one representation and the other are
usually the rate-determining parts of a spectral calculation. It is therefore vital to perform
these transforms efficiently.
The Fast Fourier Transform (FFT) is the most efficient algorithm for one-dimensional
transformations. However, the biggest savings are realized by performing multidimen-
sional transforms by “partial summation”. Instead of summing a series, one point at a
time, by a single DO or FOR/NEXT loop running over the entire two-dimensional basis,
the transform is factored into a nested sequence of one-dimensional transforms. Partial
summation and the FFT are closely related because both are based on a “divide and con-
quer” strategy of breaking a sum into smaller parts, evaluating each separately, and then
recombining the results.
Both these wonder-transforms impose restrictions. The FFT is applicable only to Fourier
series and Chebyshev polynomials. Partial summation requires that both the basis set and
the grids be tensor products.


We shall also describe alternatives to the FFT. The Matrix Multiplication Transform
(MMT) is simply Gaussian quadrature (to interpolate) and direct summation (to evaluate).
It turns out that the MMT is surprisingly competitive with the FFT for small to moderate N.
By exploiting parity, the cost of the MMT can be halved. This is very important for the non-
FFT basis sets, that is, all basis sets except Fourier series, Chebyshev polynomials, and
the images of Fourier and Chebyshev under a change-of-coordinate.
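As a concrete illustration, the following is a minimal sketch of the MMT for a Chebyshev series on the Chebyshev-Gauss (roots) grid, where the discrete orthogonality of the polynomials makes the quadrature exact; the size N = 8 and the sample function exp are arbitrary choices for demonstration:

```python
import numpy as np

N = 8
j = np.arange(N)
x = np.cos(np.pi * (j + 0.5) / N)      # Chebyshev-Gauss (roots) grid
n = np.arange(N)
T = np.cos(np.outer(np.arccos(x), n))  # T[i, k] = T_k(x_i): evaluation matrix

# Grid-to-spectral matrix: discrete Chebyshev transform by Gauss quadrature
C = (2.0 / N) * T.T
C[0, :] *= 0.5                         # the n = 0 coefficient gets half weight

f = np.exp(x)                          # sample the function on the grid
a = C @ f                              # MMT: coefficients by one matrix-vector product
f_back = T @ a                         # MMT: grid values by direct summation

# The two matrices are exact inverses on this grid
assert np.allclose(C @ T, np.eye(N))
assert np.allclose(f_back, f)
```

Each direction of the transform is a single matrix-vector product, O(N²) per transform. Exploiting the parity of T_n(x) in x would let each product use only half the grid points, which is the source of the factor-of-two savings mentioned above.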
Fast algorithms for non-FFT basis sets have been extensively studied for two different
reasons. The first is to develop an efficient transform for large N. For example, as the
resolution of weather forecasting models increases (the 2000 vintage has N = 512),
the cost of the MMT transform for Associated Legendre functions becomes an increasingly
large fraction of the total cost of the forecast.
The second reason is to interpolate and sum on a “non-canonical” grid. Even with
Fourier and Chebyshev series, the FFT is useless if the grid points are spaced irregularly.
Unfortunately, semi-Lagrangian time-marching schemes, which have wonderful front-resolving
and stability properties, require just such an “off-grid” interpolation at every time step.
We shall therefore describe non-FFT fast transforms at some length.

10.2 Partial Summation in Two or More Dimensions
Orszag (1980) saved a factor of 10,000 (!!) in the running time of his turbulence code
CENTICUBE (128 × 128 × 128 degrees of freedom) merely by evaluating the multidimensional
spectral transforms through partial summation. We will illustrate the idea through a two-
dimensional example.
It is important to note that partial summation applies only when both the basis func-
tions and grid are tensor products of one-dimensional functions and grids, respectively,
as will be assumed throughout the rest of this section. Non-tensor product bases of low
polynomial order are widely (and successfully) employed for finite elements on triangular
subdomains. However, the cost of summations and interpolations rises so rapidly that
triangular spectral elements are often implemented by mapping the physical triangle to a
computational square and applying a tensor product basis on the square. There is only one
reason for the mapping: to gain the enormous cost reductions possible, when N is large,
by partial summation.
To evaluate the sum
f(x, y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} a_{mn} \phi_m(x) \psi_n(y)   (10.1)

at an arbitrary point as a double DO loop, a total of MN multiplications and MN additions
are needed even if the values of the basis functions have been previously computed and
stored¹.
Since there are MN points on the collocation grid, we would seem to require a total
of O(M²N²) operations to perform a two-dimensional transform from series coefficients
to grid point values. Thus, if M and N are the same order-of-magnitude, the operation
count for each such transform increases as the fourth power of the number of degrees of
freedom in x, and we have to do this at least twice per time step. A finite difference
method, in contrast, requires only O(MN) operations per time step. Good grief! With
M = N = 32, the pseudospectral method would seem to be 1,000 times more expensive
than a finite difference method with the same number of grid points.
¹When the basis functions are orthogonal polynomials or trigonometric functions, we can exploit three-term
recurrence relations to evaluate the double sum in O(MN) operations without prior computation of the basis
functions, but this still gives a cost of O(MN) per point.

Table 10.1: Partial Summation in Two Dimensions
Goal: To evaluate f(x, y) = \sum_{m=1}^{M} \sum_{n=1}^{N} a_{mn} \phi_m(x) \psi_n(y) at each point of the tensor prod-
uct grid (x_i, y_j), given the spectral coefficients a_{mn}.
Cost ∼ O(MN² + M²N) operations,
O(3MN + M² + N²) storage

% Comment: F(i,j) = f(x_i, y_j), A(m,n) = a_{mn}, PHI(i,m) = \phi_m(x_i),
%          PSI(j,n) = \psi_n(y_j), B(j,m) = \alpha_m^{(j)}.
% B and F must be initialized to zero.
for m=1 to Mmax
   for j=1 to Nmax
      for n=1 to Nmax
         B(j,m) = B(j,m) + PSI(j,n)*A(m,n)
End triple loop in n, j, m
for j=1 to Nmax
   for i=1 to Mmax
      for m=1 to Mmax
         F(i,j) = F(i,j) + PHI(i,m)*B(j,m)
End triple loop in m, i, j

Note: The reverse grid-to-spectral transform may be evaluated
via partial summation in the same way.

Suppose, however, that we rearrange (10.1) as

f(x, y) = \sum_{m=0}^{M-1} \phi_m(x) \left\{ \sum_{n=0}^{N-1} a_{mn} \psi_n(y) \right\}   (10.2)

Let the points of the collocation grid be given by the tensor product (x_i, y_j) where i =
0, \ldots, M-1 and j = 0, \ldots, N-1. A key to the partial summation method is to evaluate
the function on the full two-dimensional grid by first pursuing the more limited goal of
evaluating f(x, y) along a single line parallel to the x-axis which passes through the M
grid points for which y = y_j. Define the “line functions” via
f_j(x) \equiv f(x, y_j)   [“Line Functions”]   (10.3)

It follows from (10.2) that

f_j(x) = \sum_{m=0}^{M-1} \alpha_m^{(j)} \phi_m(x)   (10.4)

where the constants \alpha_m^{(j)} are simply the values of the braces {} in (10.2) at y = y_j:

\alpha_m^{(j)} = \sum_{n=0}^{N-1} a_{mn} \psi_n(y_j),   m = 0, \ldots, M-1;  j = 0, \ldots, N-1   (10.5)

There are MN coefficients \alpha_m^{(j)}, and each is a sum over N terms as in (10.5), so the ex-
pense of computing the spectral coefficients of the auxiliary functions f_j(x) is O(MN²).
Each auxiliary function describes how f(x, y) varies with respect to x on a particular grid
line, so we can evaluate f(x, y) everywhere on the grid by evaluating each of the N aux-
iliary functions at each of the M grid points in x. However, the auxiliary functions are one-
dimensional, so each can be evaluated at a single point in only O(M) operations. The cost
for the sums in (10.4) over the whole grid is therefore O(M²N).
Conclusion: by defining auxiliary “line” functions that represent the original function
for a particular y-value on the grid, we have reduced the cost from

O(M^2 N^2) \; [\text{direct sum}] \quad \longrightarrow \quad O(M N^2) + O(M^2 N) \; [\text{partial sums}],   (10.6)

a savings of a factor of O(N). Pseudocode for the transform is given in Table 10.1.
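The two triple loops of the pseudocode in Table 10.1 collapse into two matrix-matrix products. A minimal NumPy sketch, with an illustrative Chebyshev tensor-product basis on an arbitrary grid (any tensor-product basis and grid would do), checked against direct summation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 5
A = rng.standard_normal((M, N))   # spectral coefficients a_mn
x = np.linspace(-1.0, 1.0, M)     # illustrative tensor-product grid
y = np.linspace(-1.0, 1.0, N)
# Basis values: PHI[i, m] = T_m(x_i), PSI[j, n] = T_n(y_j) (Chebyshev, for example)
PHI = np.cos(np.outer(np.arange(M), np.arccos(x))).T
PSI = np.cos(np.outer(np.arange(N), np.arccos(y))).T

# Partial summation: first the "line function" coefficients (10.5), O(M N^2),
# then evaluation of each line function on the x-grid (10.4), O(M^2 N).
B = A @ PSI.T      # B[m, j] = alpha_m^(j)
F = PHI @ B        # F[i, j] = f(x_i, y_j)

# Direct double summation for comparison: O(M^2 N^2)
F_direct = np.einsum('im,jn,mn->ij', PHI, PSI, A)
assert np.allclose(F, F_direct)
```

The savings appear automatically whenever the nested sum is phrased as matrix multiplication, which is also why library matrix-matrix product routines (as in the chapter epigraph) pay off so handsomely.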
In three dimensions, the reward is even greater. Let

f(x, y, z) = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} a_{lmn} \phi_l(x) \phi_m(y) \phi_n(z)   (10.7)

Direct summation costs O(L²M²N²). The partial summations must now be performed in
two stages. First, define the two-dimensional auxiliary functions

f_k(x, y) \equiv f(x, y, z_k),  k = 0, \ldots, N-1   [“Plane functions”]   (10.8)
         = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \alpha_{lm}^{(k)} \phi_l(x) \phi_m(y)

where

\alpha_{lm}^{(k)} = \sum_{n=0}^{N-1} a_{lmn} \phi_n(z_k),   l = 0, \ldots, L-1;  m = 0, \ldots, M-1;  k = 0, \ldots, N-1   (10.9)

Since there are N auxiliary “plane” functions, each with LM coefficients which must be
evaluated by summing over N grid points in z, the cost of forming the f_k(x, y) is O(LMN²).
To evaluate these two-dimensional auxiliary functions, use (10.3)–(10.5), that is, define

f_{jk}(x) \equiv f(x, y_j, z_k),  j = 0, \ldots, M-1;  k = 0, \ldots, N-1   [“Line functions”]   (10.10)
          = \sum_{l=0}^{L-1} \beta_l^{(jk)} \phi_l(x)

where

\beta_l^{(jk)} = \sum_{m=0}^{M-1} \alpha_{lm}^{(k)} \phi_m(y_j),   l = 0, \ldots, L-1;  j = 0, \ldots, M-1;  k = 0, \ldots, N-1   (10.11)

Since there are MN one-dimensional “line” functions, and each has L coefficients which
must be evaluated by summing over M terms as in (10.11), the expense of setting up the
f_{jk}(x) is O(LM²N). Evaluating these MN “line” functions at each of the L grid points
in x requires summing over L terms per point, so the cost of performing the sums in
(10.10) is O(L²MN).
Conclusion: in three dimensions,

O(L^2 M^2 N^2) \quad \longrightarrow \quad O(L M N^2) + O(L M^2 N) + O(L^2 M N)   (10.12)

When L = M = N, partial summation reduces the cost by a factor of N²/3.
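The three-stage sweep can be sketched in NumPy, with randomly generated coefficient and basis-value matrices to emphasize that the argument is independent of the basis set; each `tensordot` performs one of the stages (10.9), (10.11), (10.10):

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, N = 5, 6, 7
A = rng.standard_normal((L, M, N))   # a_lmn
PHIx = rng.standard_normal((L, L))   # PHIx[i, l] = phi_l(x_i): generic basis values
PHIy = rng.standard_normal((M, M))   # PHIy[j, m] = phi_m(y_j)
PHIz = rng.standard_normal((N, N))   # PHIz[k, n] = phi_n(z_k)

# Stage 1: "plane function" coefficients (10.9), contract over n: O(L M N^2)
alpha = np.tensordot(A, PHIz, axes=([2], [1]))     # alpha[l, m, k]
# Stage 2: "line function" coefficients (10.11), contract over m: O(L M^2 N)
beta = np.tensordot(alpha, PHIy, axes=([1], [1]))  # beta[l, k, j]
# Stage 3: evaluate each line function (10.10), contract over l: O(L^2 M N)
F = np.tensordot(beta, PHIx, axes=([0], [1]))      # F[k, j, i]
F = F.transpose(2, 1, 0)                           # F[i, j, k] = f(x_i, y_j, z_k)

# Direct sextuple summation, O(L^2 M^2 N^2), for comparison
F_direct = np.einsum('il,jm,kn,lmn->ijk', PHIx, PHIy, PHIz, A)
assert np.allclose(F, F_direct)
```

With L = M = N, the three contractions cost 3N⁴ operations versus N⁶ for the direct sum, which is the factor-of-N²/3 savings claimed above.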
We have deliberately left the basis set unspecified because this argument is general, and
independent of the basis set. Even if the Fast Fourier Transform is inapplicable, we can still
apply partial summation.
Unfortunately, there is still a “transform” penalty: if we have N degrees of freedom
in each dimension, the cost of the spectral-to-grid transform will be a factor of N greater
than the total number of grid points, N^d, where d is the dimension. A pseudospectral
time-marching scheme, alas, will require at least one spectral-to-grid transform and one
grid-to-spectral transform per time step. In contrast, an explicit finite difference method
has a per-time-step cost which is directly proportional to the number of grid points.
For small to moderate N, the smaller N of the pseudospectral method (to achieve a
given error) may outweigh the higher cost per degree of freedom. Unfortunately, the ra-
tio N_{pseudospectral}/N_{finite difference} for a given accuracy is independent of N [typically

