erator acting on u_t drastically slows the phase speed of short waves, allowing a much longer time step than for KdV without computational instability. The quasi-geostrophic equation was invented for similar reasons and used for the first computer weather forecasts in 1950. It continued in use until 1965, when the first CDC 6600s made it possible to forecast with the "primitive equations".

However, modern semi-implicit algorithms allow as long a time step for the primitive equations as for the quasi-geostrophic equation. The boundary value problem of the semi-implicit code is just as cheap to solve as that for quasi-geostrophy. The difference is that the primitive equations model allows more accurate forecasts because it does not filter out low frequency gravity waves and Kelvin waves, as quasi-geostrophy does.

Similarly, the KdV equation has been exorcised of its former frightfulness by a combina-

tion of fast workstations, which can solve one-dimensional problems quickly even with a

tiny time step, and semi-implicit time-marching algorithms, which allow it to be integrated

as quickly as its understudy, the RLW equation.

One parenthetical note of caution: when the boundary conditions are not periodic and the operator which is inverted is of lower order than the rest of the equation, some spatial discretizations of implicitly-implicit equations can generate modes with infinite eigenvalues due to imposing the full set of boundary conditions on the matrix that discretizes the lower order operator. For such dangerous "L-lower-order-than-G" problems, there are simple remedies which are described in the section on the "Spurious Eigenvalue" problem in Chapter 7.

“Implicitly-implicit” equations, that is, PDEs which require the solution of a boundary

value problem even for explicit time-marching, are now falling from favor. Implicit and

semi-implicit algorithms, which will be discussed at length in later chapters, can do the

same even better. Smart algorithms have replaced clever approximation.

Chapter 10

Partial Summation, the Fast Fourier and Matrix Multiplication Transforms and Their Cousins

"The derivative calculation [by Matrix Multiplication Transform] is the single most computationally intensive step [in my spectral element code]; as a result, a five-fold reduction in Navier-Stokes solution time is obtained when standard vectorized FORTRAN code is substituted with library matrix-matrix product routines on the Cray X-MP. Similar performance improvements are obtained by employing hand coded matrix-matrix products on the iPSC/VX and iPSC/860 hypercubes."

-- Paul F. Fischer in an unpublished report (1991)

10.1 Introduction

Spectral time-marching algorithms are prohibitively expensive for large N unless the program jumps back and forth between the cardinal function basis and the Chebyshev polynomial basis at every timestep. The transforms between one representation and the other are usually the rate-determining parts of a spectral calculation. It is therefore vital to perform these transforms efficiently.

The Fast Fourier Transform (FFT) is the most efficient algorithm for one-dimensional transformations. However, the biggest savings are realized by performing multidimensional transforms by "partial summation". Instead of summing a series, one point at a time, by a single DO or FOR/NEXT loop running over the entire two-dimensional basis, the transform is factored into a nested sequence of one-dimensional transforms. Partial summation and the FFT are closely related because both are based on a "divide and conquer" strategy of breaking a sum into smaller parts, evaluating each separately, and then recombining.
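To make the "divide and conquer" idea concrete, here is a minimal recursive radix-2 FFT sketch in Python. This is a pedagogical illustration of the splitting, not the optimized in-place form used by production libraries; the function name is ours. Each length-n transform is split into transforms of the even- and odd-indexed samples, which are then recombined with twiddle factors.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT: split the length-n transform into
    transforms of the even- and odd-indexed samples, solve each
    recursively, then recombine with twiddle factors."""
    n = len(a)
    if n == 1:
        return list(a)
    assert n % 2 == 0, "length must be a power of two"
    even = fft(a[0::2])   # DFT of even-indexed samples
    odd = fft(a[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

For n a power of two, the recursion costs O(n log n) operations, versus O(n^2) for a direct DFT loop.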

Both these wonder-transforms impose restrictions. The FFT is applicable only to Fourier series and Chebyshev polynomials. Partial summation requires that both the basis set and the grid be tensor products.


We shall also describe alternatives to the FFT. The Matrix Multiplication Transform (MMT) is simply Gaussian quadrature (to interpolate) and direct summation (to evaluate). It turns out that the MMT is surprisingly competitive with the FFT for small to moderate N. By exploiting parity, the cost of the MMT can be halved. This is very important for the non-FFT basis sets, that is, all basis sets except Fourier series, Chebyshev polynomials, and the images of Fourier and Chebyshev under a change-of-coordinate.
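The parity savings can be sketched as follows, assuming a Chebyshev basis on the symmetric Chebyshev-Gauss grid (the function name and grid choice are ours, for illustration). Since T_k(-x) = (-1)^k T_k(x), the even-degree and odd-degree partial sums need only be formed on half the grid; the values at the mirror points follow by a sign flip.

```python
import math

def chebyshev_mmt_parity(a, M):
    """Sum a Chebyshev series f(x) = sum_k a[k] T_k(x) on the M-point
    Chebyshev-Gauss grid x_i = cos(pi*(i+1/2)/M), which is symmetric:
    x_{M-1-i} = -x_i.  Because T_k(-x) = (-1)**k T_k(x), the even-k and
    odd-k partial sums are formed on only half the grid, roughly halving
    the multiply count of the direct MMT."""
    assert M % 2 == 0, "an even number of symmetric grid points assumed"
    x = [math.cos(math.pi * (i + 0.5) / M) for i in range(M)]
    f = [0.0] * M
    for i in range(M // 2):                       # half the grid only
        theta = math.acos(x[i])
        even = sum(a[k] * math.cos(k * theta) for k in range(0, len(a), 2))
        odd = sum(a[k] * math.cos(k * theta) for k in range(1, len(a), 2))
        f[i] = even + odd                         # f(x_i)
        f[M - 1 - i] = even - odd                 # f(-x_i)
    return x, f
```

The same trick works for any basis whose functions have definite parity on a symmetric grid, such as the Associated Legendre functions mentioned below.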

Fast algorithms for non-FFT basis sets have been extensively studied for two different reasons. The first is to develop an efficient transform for large N. For example, as the resolution of weather forecasting models increases (the 2000 vintage has N = 512), the cost of the MMT transform for Associated Legendre functions becomes an increasingly large fraction of the total cost of the forecast.

The second reason is to interpolate and sum on a "non-canonical" grid. Even with Fourier and Chebyshev series, the FFT is useless if the grid points are spaced irregularly. Unfortunately, semi-Lagrangian time-marching schemes, which have wonderful front-resolving and stability properties, require just such an "off-grid" interpolation at every time step. We shall therefore describe non-FFT fast transforms at some length.

10.2 Partial Summation in Two or More Dimensions

Orszag (1980) saved a factor of 10,000 (!!) in the running time of his turbulence code CENTICUBE (128 x 128 x 128 degrees of freedom) merely by evaluating the multidimensional spectral transforms through partial summation. We will illustrate the idea through a two-dimensional example.

It is important to note that partial summation applies only when both the basis functions and grid are tensor products of one-dimensional functions and grids, respectively, as will be assumed throughout the rest of this section. Non-tensor product bases of low polynomial order are widely (and successfully) employed for finite elements on triangular subdomains. However, the cost of summations and interpolations rises so rapidly that triangular spectral elements are often implemented by mapping the physical triangle to a computational square and applying a tensor product basis on the square. There is only one reason for the mapping: to gain the enormous cost reductions possible, when N is large, by partial summation.

To evaluate the sum

f(x, y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} a_{mn} \phi_m(x) \psi_n(y)        (10.1)

at an arbitrary point as a double DO loop, a total of MN multiplications and MN additions are needed even if the values of the basis functions have been previously computed and stored.[1]

Since there are MN points on the collocation grid, we would seem to require a total of O(M^2 N^2) operations to perform a two-dimensional transform from series coefficients to grid point values. Thus, if M and N are the same order-of-magnitude, the operation count for each such transform increases as the fourth power of the number of degrees of freedom in x, and we have to do this at least twice per time step. A finite difference method, in contrast, requires only O(MN) operations per time step. Good grief! With M = N = 32, the pseudospectral method would seem to be 1,000 times more expensive than a finite difference method with the same number of grid points.
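The direct double sum counted above might be sketched as the following quadruple loop (illustrative only; the function name is ours, and the tables of basis function values are assumed precomputed):

```python
def direct_transform_2d(a, phi, psi):
    """Direct double-sum evaluation of
        f(x_i, y_j) = sum_m sum_n a[m][n] * phi_m(x_i) * psi_n(y_j)
    at every point of the M x N grid: O(M*N) multiplications per point,
    hence O(M**2 * N**2) for the full transform.
    phi[i][m] = phi_m(x_i) and psi[j][n] = psi_n(y_j) are assumed to be
    precomputed tables of basis function values."""
    M, N = len(phi), len(psi)
    f = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            s = 0.0
            for m in range(M):
                for n in range(N):
                    s += a[m][n] * phi[i][m] * psi[j][n]
            f[i][j] = s
    return f
```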

[1] When the basis functions are orthogonal polynomials or trigonometric functions, we can exploit three-term recurrence relations to evaluate the double sum in O(MN) operations without prior computation of the basis functions, but this still gives a cost of O(MN) per point.
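The three-term-recurrence trick of the footnote can be illustrated, for the particular case of a Chebyshev series (our choice of example; the same idea applies to any orthogonal polynomial basis), by Clenshaw's algorithm:

```python
def clenshaw_chebyshev(a, x):
    """Evaluate sum_{k=0}^{K-1} a[k] T_k(x) by Clenshaw's backward
    recurrence, built on the three-term relation
    T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x); no table of basis function
    values is needed, and the cost is O(K) per point."""
    b1 = b2 = 0.0                      # b_{k+1} and b_{k+2}
    for k in range(len(a) - 1, 0, -1):
        b1, b2 = 2.0 * x * b1 - b2 + a[k], b1
    return x * b1 - b2 + a[0]
```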


Table 10.1: Partial Summation in Two Dimensions

Goal: To evaluate f(x, y) = \sum_{m=1}^{M} \sum_{n=1}^{N} a_{mn} \phi_m(x) \psi_n(y) at each point of the tensor product grid (x_i, y_j), given the spectral coefficients a_{mn}.

Cost ~ O(MN^2 + M^2 N) operations, O(3MN + M^2 + N^2) storage

% Comment: F(i,j) = f(x_i, y_j), A(m,n) = a_{mn}, PHI(i,m) = \phi_m(x_i),
% PSI(j,n) = \psi_n(y_j), B(j,m) = \alpha_m^{(j)}

for m=1 to Mmax
    for j=1 to Nmax
        B(j,m) = 0
        for n=1 to Nmax
            B(j,m) = B(j,m) + PSI(j,n)*A(m,n)
End triple loop in n, j, m

for j=1 to Nmax
    for i=1 to Mmax
        F(i,j) = 0
        for m=1 to Mmax
            F(i,j) = F(i,j) + PHI(i,m)*B(j,m)
End triple loop in m, i, j

Note: The reverse grid-to-spectral transform may be evaluated via partial summation in the same way.
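The FORTRAN-style pseudocode of Table 10.1 can be transcribed into a runnable Python sketch (array roles follow the table; zero-based indexing and a possibly rectangular M not equal to N are allowed):

```python
def partial_summation_2d(a, phi, psi):
    """Two-stage (partial-summation) spectral-to-grid transform of
        f(x_i, y_j) = sum_m sum_n a[m][n] * phi_m(x_i) * psi_n(y_j).
    Stage 1: B[j][m] = sum_n psi[j][n] * a[m][n]   -- O(M*N**2) work
    Stage 2: f[i][j] = sum_m phi[i][m] * B[j][m]   -- O(M**2*N) work
    phi[i][m] = phi_m(x_i) and psi[j][n] = psi_n(y_j) are precomputed
    tables of basis function values."""
    M, N = len(phi), len(psi)
    # Stage 1: sum over n for every (j, m) pair ("line function" coefficients)
    B = [[sum(psi[j][n] * a[m][n] for n in range(N)) for m in range(M)]
         for j in range(N)]
    # Stage 2: sum over m for every (i, j) grid point
    f = [[sum(phi[i][m] * B[j][m] for m in range(M)) for j in range(N)]
         for i in range(M)]
    return f
```

The result agrees with the direct quadruple-loop sum, but at a cost of O(MN^2 + M^2 N) instead of O(M^2 N^2).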

Suppose, however, that we rearrange (10.1) as

f(x, y) = \sum_{m=0}^{M-1} \phi_m(x) \left\{ \sum_{n=0}^{N-1} a_{mn} \psi_n(y) \right\}        (10.2)

Let the points of the collocation grid be given by the tensor product (x_i, y_j) where i = 0, ..., M-1 and j = 0, ..., N-1. A key to the partial summation method is to evaluate the function on the full two-dimensional grid by first pursuing the more limited goal of evaluating f(x, y) along a single line parallel to the x-axis which passes through the M grid points for which y = y_j. Define the "line functions" via

f_j(x) \equiv f(x, y_j)        ["Line Functions"]        (10.3)

It follows from (10.2) that

f_j(x) = \sum_{m=0}^{M-1} \alpha_m^{(j)} \phi_m(x)        (10.4)

where the constants \alpha_m^{(j)} are simply the value of the braces {} in (10.2) at y = y_j:

\alpha_m^{(j)} \equiv \sum_{n=0}^{N-1} a_{mn} \psi_n(y_j),   m = 0, ..., M-1;  j = 0, ..., N-1        (10.5)

There are MN coefficients \alpha_m^{(j)}, and each is a sum over N terms as in (10.5), so the expense of computing the spectral coefficients of the auxiliary functions f_j(x) is O(MN^2).


Each auxiliary function describes how f(x, y) varies with respect to x on a particular grid line, so we can evaluate f(x, y) everywhere on the grid by evaluating each of the N auxiliary functions at each of M grid points in x. However, the auxiliary functions are one-dimensional, so each can be evaluated at a single point in only O(M) operations. The cost for the sums in (10.4) over the whole grid is therefore O(M^2 N).

Conclusion: by defining auxiliary "line" functions that represent the original function for a particular y-value on the grid, we have reduced the cost from

O(M^2 N^2) [direct sum]  \longrightarrow  O(MN^2) + O(M^2 N) [partial sums],        (10.6)

a savings of a factor of O(N). Pseudocode for the transform is given in Table 10.1.

In three dimensions, the reward is even greater. Let

f(x, y, z) = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} a_{lmn} \phi_l(x) \phi_m(y) \phi_n(z)        (10.7)

Direct summation costs O(L^2 M^2 N^2). The partial summations must now be performed in two stages. First, define the two-dimensional auxiliary functions

f_k(x, y) \equiv f(x, y, z_k) = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \alpha_{lm}^{(k)} \phi_l(x) \phi_m(y),   k = 0, ..., N-1   ["Plane functions"]        (10.8)

where

\alpha_{lm}^{(k)} = \sum_{n=0}^{N-1} a_{lmn} \phi_n(z_k),   l = 0, ..., L-1;  m = 0, ..., M-1;  k = 0, ..., N-1        (10.9)

Since there are N auxiliary "plane" functions, each with LM coefficients which must be evaluated by summing over N grid points in z, the cost of forming the f_k(x, y) is O(LMN^2).

To evaluate these two-dimensional auxiliary functions, use (10.3)-(10.5), that is, define

f_{jk}(x) \equiv f(x, y_j, z_k) = \sum_{l=0}^{L-1} \beta_l^{(jk)} \phi_l(x),   j = 0, ..., M-1;  k = 0, ..., N-1   ["Line functions"]        (10.10)

where

\beta_l^{(jk)} = \sum_{m=0}^{M-1} \alpha_{lm}^{(k)} \phi_m(y_j),   l = 0, ..., L-1;  j = 0, ..., M-1;  k = 0, ..., N-1        (10.11)

Since there are MN one-dimensional "line" functions, and each has L coefficients which must be evaluated by summing over M terms as in (10.11), the expense of setting up the f_{jk}(x) is O(LM^2 N). The cost of evaluating these MN "line" functions at each of the L points on each grid line requires summing over L terms, so the cost of performing the sums in (10.10) is O(L^2 MN).

Conclusion: in three dimensions,

O(L^2 M^2 N^2)  \longrightarrow  O(LMN^2) + O(LM^2 N) + O(L^2 MN)        (10.12)


When L = M = N, partial summation reduces the cost by a factor of O(N^2/3).
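The three-stage algorithm of (10.7)-(10.11) can be sketched in the same illustrative Python style (the function name is ours; basis values are assumed tabulated):

```python
def partial_summation_3d(a, phix, phiy, phiz):
    """Three-stage partial summation for
        f(x_i,y_j,z_k) = sum_{l,m,n} a[l][m][n] phi_l(x_i) phi_m(y_j) phi_n(z_k).
    Stage 1, "plane" coefficients (eq. 10.9):
        alpha[k][l][m] = sum_n a[l][m][n]*phiz[k][n]    -- O(L*M*N**2)
    Stage 2, "line" coefficients (eq. 10.11):
        beta[j][k][l] = sum_m alpha[k][l][m]*phiy[j][m] -- O(L*M**2*N)
    Stage 3, evaluation of the line functions (eq. 10.10):
        f[i][j][k] = sum_l beta[j][k][l]*phix[i][l]     -- O(L**2*M*N)"""
    L, M, N = len(phix), len(phiy), len(phiz)
    alpha = [[[sum(a[l][m][n] * phiz[k][n] for n in range(N))
               for m in range(M)] for l in range(L)] for k in range(N)]
    beta = [[[sum(alpha[k][l][m] * phiy[j][m] for m in range(M))
              for l in range(L)] for k in range(N)] for j in range(M)]
    return [[[sum(beta[j][k][l] * phix[i][l] for l in range(L))
              for k in range(N)] for j in range(M)] for i in range(L)]
```

The total work is O(LMN^2 + LM^2N + L^2MN), versus O(L^2 M^2 N^2) for the direct sextuple loop.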

We have deliberately left the basis set unspecified because this argument is general, and independent of the basis set. Even if the Fast Fourier Transform is inapplicable, we can still apply partial summation.

Unfortunately, there is still a "transform" penalty: if we have N degrees of freedom in each dimension, the cost of the spectral-to-grid transform will be a factor of N greater than the total number of grid points, N^d, where d is the dimension. A pseudospectral time-marching scheme, alas, will require at least one spectral-to-grid transform and one grid-to-spectral transform per time step. In contrast, an explicit finite difference method has a per-time-step cost which is directly proportional to the number of grid points.

For small to moderate N, the smaller N of the pseudospectral method (to achieve a given error) may outweigh the higher cost per-degree-of-freedom. Unfortunately, the ratio N_pseudospectral/N_finite-difference for a given accuracy is independent of N [typically