to the pseudospectral grid. Aliasing is not completely eliminated by semi-Lagrangian advection, but it is greatly weakened, and the inherent dissipation is sufficient to suppress aliasing instability. Nevertheless, we shall describe de-aliased models in what follows.
With a total of 2N + 1 degrees of freedom in longitude, we need at least 2N + 1 grid points in λ, but it is customary to use 3N + 1 points so as to eliminate aliasing according to the 3/2's rule. In colatitude, each spherical harmonic in a triangular truncation is a polynomial in cos(θ) [or such a polynomial multiplied by sin(θ)] of degree at most N. For m = ±N, the degree of the polynomial is furnished entirely by the sin^|m|(θ) factor; for m = 0, it is the degree of the Gegenbauer polynomial, but for all m, the degree is at most N. In expanding the quadratically nonlinear terms, we must evaluate integrals of the product of three harmonics, so the total degree of the integrands is never larger than 3N. It follows that we need a minimum of (3/2)N grid points in colatitude to evaluate these integrals exactly and thus again avoid aliasing.
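This grid-point count can be checked numerically: Gauss-Legendre quadrature with K nodes is exact for polynomial integrands of degree up to 2K − 1, so exactness for a degree-3N integrand requires K ≥ (3N + 1)/2, i.e. roughly (3/2)N points. A minimal NumPy sketch (the value N = 10 is an arbitrary illustration, not from the text):

```python
import numpy as np

N = 10                                  # illustrative truncation limit
K = (3 * N + 2) // 2                    # smallest K with 2K - 1 >= 3N

x, w = np.polynomial.legendre.leggauss(K)        # minimal de-aliased grid
xf, wf = np.polynomial.legendre.leggauss(4 * N)  # much finer reference grid

# The triple product P_N * P_N * P_N has degree 3N; K Gauss nodes suffice
# to integrate it exactly, so the coarse and fine quadratures must agree.
P = np.polynomial.legendre.Legendre.basis(N)
coarse = np.sum(w * P(x) ** 3)
fine = np.sum(wf * P(xf) ** 3)
assert np.isclose(coarse, fine)
print(K)   # 16 colatitude points suffice for N = 10
```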

18.12.2 Substitutes and Accelerators for the MMT
Spherical harmonics have one great fault: slow transforms for Associated Legendre functions in latitude. This vice has inspired a quest for alternatives which has lasted more than three decades and counting. In longitude, the spherical harmonics are just a Fourier series. All grid-to-coefficient and coefficient-to-grid transforms for spherical harmonics (and alternatives to them) employ a Fourier series in longitude and the Fast Fourier Transform in this coordinate. It is only the transform in latitude which is troublesome because the FFT is inapplicable to Associated Legendre series.
The reason that the Legendre transform is important is that spectral algorithms require repeated jumps from grid point values to coefficients and back again at each time step. At the end of the twentieth century, spherical harmonic weather forecasting models use the Matrix Multiplication Transform (MMT) to jump from Legendre coefficients to grid point values and back again. For each zonal wavenumber m, this multiplication of a vector by a dense matrix requires O(N^2) operations. It follows that the cost is very large compared to the O(N log_2(N)) expense of an FFT. For many years, there has been a feeling that eventually N would grow so large that spherical harmonics models would become prohibitively expensive, forcing a switch to finite elements or other non-spectral algorithms.
Reality is different from expectation. The highest resolution forecasting code at the time of writing (1998) is the T639 model of the European Centre for Medium-Range Weather Forecasts (ECMWF), which is being tested but will not be operational for another couple of years. Although N = 639, the Legendre transforms still consume only 19% of the running time of the model. How is this possible? The short answer is that even though there is no algorithm as nifty as the FFT for Legendre series, there are several tricks for accelerating the MMT which have so far saved spherical harmonics from obsolescence:
1. Parity Matrix Multiplication Transform (PMMT)
2. The Vectorizability and Parallelizability of Matrix Multiplication
3. Reduced Polar Grid
4. Complicated Physics and Chemistry in Weather Forecasting Models
5. Schuster-Dilts Triangular Matrix Acceleration
6. Generalized Fast Fourier Transforms

18.12.3 Parity and Legendre Transforms
Because the Associated Legendre functions are all either symmetric or antisymmetric with respect to the equator, a transformation from grid point values to N spectral coefficients (or the reverse) can always be split into two subproblems of size N/2. Thus, after first applying the FFT to transform from grid point values in latitude and longitude to the grid point values of zonal wavenumber m, a set of grid point values f_m(θ) can be combined into components symmetric and antisymmetric with respect to the equator (where colatitude θ = π/2 − latitude) by taking sums and differences:

f_m^S(θ) ≡ (1/2) [f_m(θ) + f_m(π − θ)],   f_m^A(θ) ≡ (1/2) [f_m(θ) − f_m(π − θ)],   θ ∈ [0, π/2]   (18.39)
Each collection of grid point values of definite parity can then be transformed by a matrix multiplication into the corresponding even or odd Legendre coefficients. Each rectangular transformation matrix has only half the rows and half the columns of the original, non-parity-exploiting transformation, but there are two smaller transformations to be taken. If we ignore the O(N) cost of taking sums and differences compared to the O(N^2) cost of the matrix-vector multiplications of the transformations themselves, parity saves exactly a factor of two. Symmetry is discussed in greater detail in Chapter 8.
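The sums and differences above, and the bookkeeping that recovers the original grid values, can be sketched in a few lines of NumPy (the grid values are hypothetical stand-ins, on a colatitude grid symmetric about the equator so that points k and K−1−k form the pair θ, π − θ):

```python
import numpy as np

K = 8
f = np.random.default_rng(0).standard_normal(K)   # stand-in values f_m(theta_k)

half = K // 2
fS = 0.5 * (f[:half] + f[::-1][:half])   # symmetric part on the northern half
fA = 0.5 * (f[:half] - f[::-1][:half])   # antisymmetric part

# Recombining recovers the original grid values on both hemispheres:
assert np.allclose(np.concatenate([fS + fA, (fS - fA)[::-1]]), f)

# Two (K/2 x K/2) matvecs replace one K x K matvec: exactly a factor-of-two saving.
assert 2 * half**2 == K**2 // 2
```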

18.12.4 Hurrah for Matrix/Vector Multiplication
The multiplication of a vector by a matrix is one of the most fundamental operations of linear algebra. Because of this, it is an operation which is often coded as a blazingly fast assembly language program in software libraries. An assembly language routine will typically speed up the matrix/vector multiplication by a factor of two.
Furthermore, supercomputers with vector-processing hardware, such as the Cray 1 and its arithmurgical descendants, process the matrix/vector multiplication very efficiently. Indeed, it is not an exaggeration to say that the vector hardware is built primarily to do this one operation as efficiently as possible. Because this operation requires no conditional statements, branching, or any other overhead, Legendre transforms can be computed on vector hardware at very close to the machine's optimum floating point execution rate. In contrast, the sparse matrix operations of finite differences require a lot of starting-and-stopping: one cannot exploit the holes in a sparse matrix without working very hard to inform the hardware of where the holes are. Because of this, when finite difference and spherical harmonics codes are compared in terms of flops (that is, floating point operations per second), the spherical harmonics program usually has a much higher flop rate, much closer to the machine's theoretical optimum, than a finite difference code. Thus, although the spherical harmonic code may require a much larger number of floating point operations for a given N than a finite difference or finite element code, the wall-clock "grind time" may be about the same.
Similarly, the multiplication of a dense matrix by a vector is a speedy process on a massively parallel machine. A matrix/vector multiply is actually just a set of independent vector-vector multiplications, one for each row. Furthermore, the MMT matrix elements are computed in a preprocessing step. It follows that on a massively parallel machine, it is easy to subdivide the matrix/vector multiply among processors by assigning one or more rows to each processor. To initiate the parallel computation, it is only necessary to broadcast the vector of grid point values to each processor simultaneously, and all floating point operations can then be executed in parallel. Again, because the matrix/vector multiplication is the fundamental operation of linear algebra, there are very good assembly language routines and carefully designed algorithms for each type of hardware.
The conclusion is that in high performance computing, Legendre transforms-by-matrix are much more efficient and less costly than operation counts would suggest. In a computer science course, one may only be asked to add up multiplications and additions, but in the real world, it is important to look at the clock.
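The MMT for a single zonal wavenumber is nothing more than a dense matrix/vector product, so any optimized linear algebra kernel applies directly. A schematic sketch, in which the matrix entries are random stand-ins for the precomputed Legendre-function values and quadrature weights (NumPy's `@` dispatches to an optimized BLAS routine, the modern analogue of the hand-tuned assembly matvecs described above):

```python
import numpy as np

N, K = 64, 96                       # illustrative truncation and grid size
rng = np.random.default_rng(1)
L = rng.standard_normal((N, K))     # stand-in transform matrix for wavenumber m
f = rng.standard_normal(K)          # stand-in grid values of f_m(theta)

coeffs = L @ f                      # O(N*K) flops, executed at near-peak rate
assert coeffs.shape == (N,)
```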

18.12.5 Reduced Grid and Other Tricks
Hortal and Simmons (1991) and Courtier and Naughton (1994) showed that one could reduce the cost of a spectral model by systematically deleting points from the tensor product grid near the poles with little loss of accuracy. The reductions are not large (perhaps 30%), but to paraphrase the late U. S. Senator Dirksen: "A billion multiplications here, a billion additions there; pretty soon you're talking about real savings."
Swarztrauber (1996) has carefully compared a large number of different formulations of spherical harmonic methods to show that by proper organization, one can minimize the number of Legendre transforms that are needed on each step of the algorithm.

18.12.6 Schuster-Dilts Triangular Matrix Acceleration
Schuster (1903) discovered that the dense matrix/vector multiplication of the usual Gaussian-quadrature PMMT could be replaced by multiplication by a triangular matrix to reduce the cost by a factor of two. Dilts (1985) independently reinvented this technique. The first part of the algorithm is a two-dimensional FFT to transform from a grid which is evenly spaced in both latitude and longitude to the coefficients of a two-dimensional Fourier series. The second step is to multiply the Fourier coefficients by an upper triangular matrix to convert the Fourier coefficients to spherical harmonic coefficients.
Because of the familiar identity T_j(cos(θ)) ≡ cos(jθ), the Fourier series in latitude can equally well be regarded as a Chebyshev series in the variable x ≡ cos(θ), where θ is colatitude. This interpretation is conceptually useful because it shows that the function of the triangular matrix is to convert from one polynomial basis (Chebyshev) to another polynomial basis (associated Legendre). This conversion is done one zonal wavenumber at a time. To simplify the discussion, we shall illustrate only the case of zonal wavenumber m = 0 (ordinary Legendre polynomials).
One important point is that the transformation respects parity. Thus, the even degree Fourier coefficients in latitude transform only to even degree spherical harmonics, and similarly odd transforms only to odd.
First, split the θ-dependent coefficient of the m = 0 Fourier series into components symmetric and antisymmetric with respect to the equator as in Eq. (18.39). Let the symmetric part have the Chebyshev polynomial expansion

f_m^S(x) = Σ_j b_{2j} T_{2j}(x)
         = b_0 + b_2 (2x^2 − 1) + b_4 (8x^4 − 8x^2 + 1) + b_6 (32x^6 − 48x^4 + 18x^2 − 1)

which is equivalent to a cosine series in θ, where x = cos(θ) and θ is colatitude. (For simplicity, we take only four terms, but the method can be extended to arbitrary order.)
Our goal is to convert this to the corresponding Legendre series:

f_m^S(x) = Σ_j a_{2j} P_{2j}(x)
         = a_0 + a_2 ((3/2)x^2 − 1/2) + a_4 ((35/8)x^4 − (15/4)x^2 + 3/8)
           + a_6 ((231/16)x^6 − (315/16)x^4 + (105/16)x^2 − 5/16)          (18.41)

The expansion of the j-th Chebyshev polynomial in terms of Legendre polynomials has the coefficients (T_j, P_k)/(P_k, P_k), where (p, q) denotes the unweighted integral of p(x) q(x) from −1 to 1. Substituting this into the Chebyshev series and collecting Legendre terms, one obtains

a = M b

where a and b are column vectors containing the Legendre and Chebyshev coefficients, respectively, and where the elements of the square matrix M are

M_ij ≡ [∫_{-1}^{1} T_{2j-2}(x) P_{2i-2}(x) dx] / [∫_{-1}^{1} P_{2i-2}(x) P_{2i-2}(x) dx],   i, j = 1, 2, ...   (18.43)

Because the expansion of a polynomial of degree j requires only the Legendre polynomials
of degree j and lower, the matrix M is upper triangular.
For example, if the expansions are truncated at sixth degree, only P_6 in the Legendre expansion and T_6 in the Chebyshev series contain a term proportional to x^6. It follows that 32 x^6 b_6 = (231/16) x^6 a_6, or in other words M_44 = 512/231. Similarly, only two terms in each series contain terms proportional to x^4. This implies 8 b_4 − 48 b_6 = (35/8) a_4 − (315/16) a_6. Since a_6 is already known, this can be rewritten as a_4 = (8/35){8 b_4 − (48/11) b_6}. Similarly, a_2 can be computed as a weighted sum of b_2, b_4 and b_6, and a_0 as a weighted sum of four coefficients. The necessary multipliers are the upper 4 × 4 block of the matrix M:

[ a_0 ]   [ 1    -1/3   -1/15    -1/35    ] [ b_0 ]
[ a_2 ] = [ 0     4/3   -16/21   -4/21    ] [ b_2 ]
[ a_4 ]   [ 0     0      64/35   -384/385 ] [ b_4 ]
[ a_6 ]   [ 0     0      0        512/231 ] [ b_6 ]
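The conversion matrix can be generated directly from Eq. (18.43) by Gaussian quadrature. A minimal NumPy sketch (the truncation n = 4 and the node count are illustrative choices) reproduces the upper triangular structure and the M_44 = 512/231 entry derived above:

```python
import numpy as np
from numpy.polynomial import chebyshev, legendre

# M[i, j] = (T_{2j}, P_{2i}) / (P_{2i}, P_{2i}) with the unweighted inner
# product on [-1, 1], evaluated by Gauss-Legendre quadrature (32 nodes are
# exact for the low-degree polynomial products involved here).
n = 4
x, w = np.polynomial.legendre.leggauss(32)
M = np.zeros((n, n))
for i in range(n):
    P = legendre.Legendre.basis(2 * i)(x)
    norm = np.sum(w * P * P)                 # (P_{2i}, P_{2i}) = 2/(4i + 1)
    for j in range(n):
        T = chebyshev.Chebyshev.basis(2 * j)(x)
        M[i, j] = np.sum(w * T * P) / norm

assert np.allclose(np.triu(M), M)            # upper triangular, as claimed
assert np.isclose(M[3, 3], 512 / 231)        # the M_44 entry from the text
assert np.isclose(M[0, 1], -1 / 3)           # first row: 1, -1/3, -1/15, -1/35
```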

Similar matrices give the transform from the Chebyshev (latitudinal Fourier) series for the antisymmetric m = 0 Legendre polynomials and for the associated Legendre functions of all higher m. The saving is only about a factor of two; note that the cost of the FFT in latitude is negligible compared to the cost of the triangular matrix/vector multiply in the limit N → ∞. However, this is an adequate reward for implementing a very simple algorithm. (Healy, Rockmore, Kostelec and Moore, 1998, amusingly call this the "semi-naive" algorithm precisely because it is so simple.)
Curiously, this algorithm has never been used in an operational forecasting or climate code. One reason is that one must store the elements of a different triangular matrix for each equatorial parity and each m, which adds up to a lot of storage. A more significant reason is that the Schuster-Dilts algorithm requires grid point values at evenly spaced points in θ, which is a Chebyshev grid in x = cos(θ). The usual PMMT algorithm uses points on the Legendre grid. Because a typical forecasting or climate code is O(50,000) lines of code, it is not trivial to rewrite a model to accept a different grid, even though the two grids are very similar and most of the chemistry and hydrologic cycle, etc., are evaluated point-by-point.
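The closeness of the two grids is easy to quantify: the Schuster-Dilts algorithm wants colatitudes evenly spaced in θ, i.e. the Chebyshev roots in x = cos(θ), while the PMMT uses Gauss-Legendre nodes. A quick NumPy comparison (K = 16 is an illustrative grid size):

```python
import numpy as np

K = 16
theta = (np.arange(K) + 0.5) * np.pi / K              # evenly spaced colatitudes
cheb = np.cos(theta)                                  # Chebyshev roots, descending in x
gauss = np.polynomial.legendre.leggauss(K)[0][::-1]   # Legendre nodes, descending

# The two point sets nearly coincide, so the obstacle to adopting the
# Schuster-Dilts grid is code plumbing, not accuracy:
print(np.max(np.abs(cheb - gauss)))                   # small for moderate K
```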

18.12.7 Generalized FFT: Multipoles and All That
As reviewed in Chapter 10, generalizations of the Fast Fourier Transform have been developed which replace the O(N^2) cost of a Legendre transform (for a given zonal wavenumber) by a cost proportional to O(N log_2(N)) operations, the same as for the Fast Fourier Transform. Orszag's (1986) method exploits the three-term recurrence relation satisfied by the Associated Legendre Functions. Boyd (1992c) and Dutt & Rokhlin (1993, 1995) independently observed that Fast Multipole Methods could also be applied. Recently, Rockmore et al. (1998) have developed a very sophisticated algorithm specifically for spherical harmonics.
The bad news is that these algorithms have a huge proportionality constant compared to the FFT. As a result, none of the generalized FFTs has ever been used in an operational climate or weather forecasting model. However, as forecasting resolutions climb above T400, these algorithms may become faster than the PMMT, and perhaps in time replace it.

18.12.8 Summary
The relative slowness of the PMMT (Parity-Exploiting Matrix Multiplication Transform) for transforming the latitudinal basis functions is the great source of inefficiency in spherical harmonics algorithms. However, clever tricks, as catalogued above, have kept the spherical harmonics from obsolescence.

Table 18.5: Legendre Transforms Bibliography

References Comments

