One of the things that C2HS is lacking is a good tutorial. So I’m going to write one (or try to, anyway).
To make this as useful as possible, I’d like to base a large part of the tutorial on a realistic case study of producing Haskell bindings to a C library. My current plan is to break the tutorial into three parts: the basics, the case study and “everything else”, for C2HS features that don’t get covered in the first two parts. To make this even more useful, I’d like to base the case study on a C library that someone actually cares about and wants Haskell bindings for.
The requirements for the case study C library are:
1. There shouldn’t already be Haskell bindings for it – I don’t want to duplicate work.
2. The C library should be “medium-sized”: big enough to be realistic, not so big that it takes forever to write bindings.
3. The C library should be of medium complexity. By this, I mean that it should have a range of different kinds of C functions, structures and things that need to be made accessible from Haskell. It shouldn’t be completely trivial, and it should require a little thought to come up with good bindings. On the other hand, it shouldn’t be so unusual that the normal ways of using C2HS don’t work.
4. Ideally it should be something that more than one person might want to use.
5. It needs to be a library that’s available for Linux. I don’t have a Mac and I’m not that keen on doing something that’s Windows-only.
Requirements #2 and #3 are kind of squishy, but it should be fairly clear what’s appropriate and what’s not: any C library for which you think development of Haskell bindings would make a good C2HS tutorial case study is fair game.
If you have a library you think would be a good fit for this, drop me an email, leave a comment here or give me a shout on IRC (I’m usually on #haskell as iross or iross_ or something like that).
This is going to be the last substantive post of this series (which is probably as much of a relief to you as it is to me…). In this article, we’re going to look at phase space partitioning for our dimension-reduced $Z_{500}$ PCA data and we’re going to calculate Markov transition matrices for our partitions to try to pick out consistent non-diffusive transitions in atmospheric flow regimes.
We need to divide the phase space we’re working in (the unit sphere parameterised by $\theta$ and $\phi$) into a partition of equal-sized components, to which we’ll assign each data point. We’ll produce partitions by dividing the unit sphere into bands in the $\theta$ direction, then splitting those bands in the $\phi$ direction as required. The following figures show the four partitions we’re going to use here^{1}:
In each case, the “compartments” of the partition are each of the same area on the unit sphere. For Partitions 1 and 2, we find the angle $\alpha$ of the boundary of the “polar” components by solving the equation
$\int_0^{\alpha} \sin \theta \, d\theta \int_0^{2\pi} \, d\phi = \frac{4\pi}{C},$
where $C$ is the number of components in the partition. For partition 1, with $C=4$, this gives $\alpha_1 = \pi/3$ and for partition 2, with $C=6$, $\alpha_2 = \cos^{-1} (2/3)$.
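Since the left-hand side of the integral is just $2\pi(1 - \cos\alpha)$, the boundary angle has the closed form $\alpha = \cos^{-1}(1 - 2/C)$, which is easy to sanity-check. Here’s a minimal sketch (the name `capAngle` is mine, not from the analysis code):

```haskell
-- Boundary angle of the "polar" component for a C-component partition:
-- the polar cap [0, alpha] has area 2*pi*(1 - cos alpha), which we set
-- equal to 4*pi/C, the area of a single component of the partition.
capAngle :: Int -> Double
capAngle c = acos (1 - 2 / fromIntegral c)

-- capAngle 4 gives pi/3 and capAngle 6 gives acos (2/3), as in the text.
```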
Assigning points in our time series on the unit sphere to partitions is then done by this code (as usual, the code is in a Gist):
```haskell
-- Partition component: theta range, phi range.
data Component = C { thmin :: Double, thmax :: Double
                   , phmin :: Double, phmax :: Double
                   } deriving Show

-- A partition is a list of components that cover the unit sphere.
type Partition = [Component]

-- Angle for 1-4-1 partition.
th4 :: Double
th4 = acos $ 2.0 / 3.0

-- Partitions.
partitions :: [Partition]
partitions = [ [ C 0        (pi/3)   0  (2*pi)
               , C (pi/3)   (2*pi/3) 0  pi
               , C (pi/3)   (2*pi/3) pi (2*pi)
               , C (2*pi/3) pi       0  (2*pi) ]
             , [ C 0        th4      0        (2*pi)
               , C th4      (pi-th4) 0        (pi/2)
               , C th4      (pi-th4) (pi/2)   pi
               , C th4      (pi-th4) pi       (3*pi/2)
               , C th4      (pi-th4) (3*pi/2) (2*pi)
               , C (pi-th4) pi       0        (2*pi) ]
             , [ C 0      (pi/2) 0  pi
               , C 0      (pi/2) pi (2*pi)
               , C (pi/2) pi     0  pi
               , C (pi/2) pi     pi (2*pi) ]
             , [ C 0      (pi/2) (pi/4)   (5*pi/4)
               , C 0      (pi/2) (5*pi/4) (pi/4)
               , C (pi/2) pi     (pi/4)   (5*pi/4)
               , C (pi/2) pi     (5*pi/4) (pi/4) ] ]

npartitions :: Int
npartitions = length partitions

-- Convert list of (theta, phi) coordinates to partition component
-- numbers for a given partition.
convert :: Partition -> [(Double, Double)] -> [Int]
convert part pts = map (convOne part) pts
  where convOne comps (th, ph) = 1 + length (takeWhile not $ map isin comps)
          where isin (C thmin thmax ph1 ph2) =
                  if ph1 < ph2
                  then th >= thmin && th < thmax && ph >= ph1 && ph < ph2
                  else th >= thmin && th < thmax && (ph >= ph1 || ph < ph2)
```
The only thing we need to be careful about is dealing with partitions that extend across the zero of $\phi$.
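To see why the wrap-around case matters, here’s the membership test from `isin` distilled into a standalone predicate (the name `inComp` and the bare-tuple representation are mine, for illustration only):

```haskell
-- Point-in-component test for a component given as
-- (thmin, thmax, phmin, phmax). A phi range with phmin > phmax is
-- taken to wrap through phi = 0, as in the partitions above.
inComp :: (Double, Double, Double, Double) -> (Double, Double) -> Bool
inComp (thmin, thmax, ph1, ph2) (th, ph)
  | ph1 < ph2 = thOK && ph >= ph1 && ph < ph2
  | otherwise = thOK && (ph >= ph1 || ph < ph2)
  where thOK = th >= thmin && th < thmax

-- A component covering phi in [5*pi/4, pi/4) contains phi = 0 even
-- though 0 < 5*pi/4, which the naive range test would get wrong.
```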
What we’re doing here is really another kind of dimensionality reduction. We’ve gone from our original spatial maps of $Z_{500}$ to a continuous reduced dimensionality representation via PCA, truncation of the PCA basis and projection to the unit sphere, and we’re now reducing further to a discrete representation – each $Z_{500}$ map in our original time series data is represented by a single integer label giving the partition component in which it lies.
We can now use this discrete data to construct empirical Markov transition matrices.
Once we’ve generated the partition time series described in the previous section, calculating the empirical Markov transition matrices is fairly straightforward. We need to be careful to avoid counting transitions from the end of one winter to the beginning of the next, but apart from that little wrinkle, it’s just a matter of counting how many times there’s a transition from partition component $j$ to partition component $i$, which we call $T_{ij}$. We also need to make sure that we consider the same number, $N_k$, of points from each of the partition components. The listing below shows the function we use to do this – the function takes as arguments the size of the partition and the time series of partition components as a vector, and returns the transition count matrix $\mathbf{T}$ and $N_k$, the number of points in each partition used to calculate the transitions:
```haskell
transMatrix :: Int -> SV.Vector Int -> (M, Int)
transMatrix n pm = (accum (konst 0.0 (n, n)) (+) $ zip steps (repeat 1.0), ns)
  where allSteps = [ ((pm SV.! (i + 1)) - 1, (pm SV.! i) - 1)
                   | i <- [0..SV.length pm - 2], (i + 1) `mod` 21 /= 0 ]
        steps0 = map (\k -> filter (\(i, j) -> i == k) allSteps) [0..n-1]
        ns = minimum $ map length steps0
        steps = concat $ map (take ns) steps0
```
Once we have $\mathbf{T}$, the Markov matrix is calculated as $\mathbf{M} = N_k^{-1} \mathbf{T}$ and the symmetric and antisymmetric components of $\mathbf{M}$ are calculated in the obvious way:
```haskell
splitMarkovMatrix :: M -> (M, M)
splitMarkovMatrix mm = (a, s)
  where s = scale 0.5 $ mm + tr mm
        a = scale 0.5 $ mm - tr mm
```
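The same decomposition of $\mathbf{M}$ into symmetric and antisymmetric parts can be checked on a small example without `hmatrix`, using plain lists (this is just an illustration of the split, not the library code):

```haskell
import Data.List (transpose)

-- Symmetric and antisymmetric parts of a square matrix given as a
-- list of rows: S = (M + M^T)/2 and A = (M - M^T)/2, so M = S + A.
splitParts :: [[Double]] -> ([[Double]], [[Double]])
splitParts m = (comb (+), comb (-))
  where mt = transpose m
        comb op = zipWith (zipWith (\x y -> (x `op` y) / 2)) m mt
```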
We can then calculate the $\mathbf{M}^A + |\mathbf{M}^A|$ matrix that recovers the non-diffusive part of the system dynamics. One thing we need to consider is the statistical significance of the resulting components in the $\mathbf{M}^A + |\mathbf{M}^A|$ matrix: these components need to be sufficiently large compared to the “natural” variation due to the diffusive dynamics in the system for us to consider them not to have occurred by chance. The statistical significance calculations aren’t complicated, but I’ll just present the results here without going into the details (you can either just figure out what’s going on directly from the code or you can read about it in Crommelin (2004)).
Let’s look at the results for the four partitions we showed earlier. In each case, we’ll show the $\mathbf{T}$ Markov transition count matrix and the $\mathbf{M}^A + |\mathbf{M}^A|$ “non-diffusive dynamics” matrix. We’ll annotate the entries in this matrix to show their statistical significance: $\underline{\underline{\mathbf{> 95\%}}}$, $\underline{\mathbf{95\%-90\%}}$, $\mathbf{90\%-85\%}$, $\underline{85\%-80\%}$, $80\%-75\%$, $\mathit{<75\%}$.
For partition #1, we find:
$\mathbf{T} = \begin{pmatrix} 145 & 67 & 63 & 34 \\ 77 & 110 & 62 & 60 \\ 62 & 32 & 125 & 90 \\ 24 & 73 & 70 & 142 \\ \end{pmatrix} \qquad \mathbf{M}^A + |\mathbf{M}^A| = \frac{1}{100} \begin{pmatrix} 0 & 0 & \mathit{0.3} & \mathit{3.2} \\ \mathit{3.2} & 0 & \underline{\underline{\mathbf{9.7}}} & 0 \\ 0 & 0 & 0 & 6.5 \\ 0 & \mathit{4.2} & 0 & 0 \\ \end{pmatrix}$
For partition #2:
$\mathbf{T} = \begin{pmatrix} 77 & 22 & 30 & 37 & 14 & 14 \\ 27 & 66 & 23 & 7 & 42 & 29 \\ 26 & 21 & 77 & 33 & 9 & 28 \\ 20 & 10 & 30 & 66 & 29 & 39 \\ 33 & 19 & 10 & 29 & 65 & 38 \\ 7 & 32 & 24 & 24 & 37 & 70 \\ \end{pmatrix}$
$\mathbf{M}^A + |\mathbf{M}^A| = \frac{1}{100} \begin{pmatrix} 0 & 0 & \mathit{2.1} & \mathbf{8.8} & 0 & \mathit{3.6} \\ \mathit{2.6} & 0 & \mathit{1.0} & 0 & \underline{\underline{\mathbf{11.9}}} & 0 \\ 0 & 0 & 0 & \mathit{1.5} & 0 & \mathit{2.1} \\ 0 & \mathit{1.5} & 0 & 0 & 0 & \underline{7.7} \\ \underline{\mathbf{9.8}} & 0 & \mathit{0.5} & 0 & 0 & \mathit{0.5} \\ 0 & \mathit{1.5} & 0 & 0 & 0 & 0 \\ \end{pmatrix}$
For partition #3:
$\mathbf{T} = \begin{pmatrix} 159 & 71 & 63 & 26 \\ 67 & 142 & 33 & 77 \\ 59 & 46 & 133 & 81 \\ 27 & 64 & 78 & 150 \\ \end{pmatrix} \qquad \mathbf{M}^A + |\mathbf{M}^A| = \frac{1}{100} \begin{pmatrix} 0 & \mathit{1.3} & \mathit{1.3} & 0 \\ 0 & 0 & 0 & \mathit{4.1} \\ 0 & \mathit{4.1} & 0 & \mathit{0.9} \\ \mathit{0.3} & 0 & 0 & 0 \\ \end{pmatrix}$
And for partition #4:
$\mathbf{T} = \begin{pmatrix} 160 & 53 & 68 & 27 \\ 75 & 135 & 36 & 62 \\ 56 & 43 & 133 & 76 \\ 19 & 70 & 50 & 169 \\ \end{pmatrix} \qquad \mathbf{M}^A + |\mathbf{M}^A| = \frac{1}{100} \begin{pmatrix} 0 & 0 & \mathit{3.9} & \mathit{2.6} \\ \mathbf{7.1} & 0 & 0 & 0 \\ 0 & \mathit{2.3} & 0 & \underline{\mathbf{8.4}} \\ 0 & \mathit{2.6} & 0 & 0 \\ \end{pmatrix}$
So, what conclusions can we draw from these results? First, the results we get here are rather different from those in Crommelin’s paper. This isn’t all that surprising – as we’ve followed along with the analysis in the paper, our results have become more and more different, mostly because the later parts of the analysis are more dependent on smaller details in the data, and we’re using a longer time series of data than Crommelin did. The plots below represent the contents of the $\mathbf{M}^A + |\mathbf{M}^A|$ matrices for each partition in a graphical form that makes it easier to see what’s going on. In these figures, the thickness and darkness of the arrows show the statistical significance of the transitions.
We’re only going to be able to draw relatively weak conclusions from these results. Let’s take a look at the apparent dynamics for partitions #1 and #2 shown above. In both cases, there is a highly significant flow from the right hand side of the plot to the left, presumably mostly representing transitions from the higher probability density regions on the right (around $\theta=\pi/2$, $\phi=7\pi/4$) to those on the left (around $\theta=3\pi/4$, $\phi=3\pi/8$). In addition, there are less significant flows from the upper hemisphere of the unit sphere to the lower, more significant for partition #2 than for #1, with the flow apparently preferentially going via partition component number 4 for partition #2. Looking back at the “all data” spherical PDF with some labelled bumps, we see that the flow from the right hand side of the PDF to the left is probably something like a transition from bump 4 (more like a blocking pattern) to bump 2 (more like a normal flow).
I’ll freely admit that this isn’t terribly convincing, and for partitions #3 and #4, the situation is less clear.
For me, one of the lessons to take away from this is that even though we started with quite a lot of data (daily $Z_{500}$ maps for 66 years), the progressive steps of dimensionality reduction that we’ve used to try to elucidate what’s going on in the data result in less and less data on which we do the later steps of our analysis, making it more and more difficult to get statistically significant (or even superficially convincing) results. It’s certainly not the case that the results in Crommelin (2004) are just a statistical accident – there really is observed persistence of atmospheric flow patterns and pretty clear evidence that there are consistent transitions between different flow regimes. It’s just that those might be quite hard to see via this kind of analysis. Why the results that we see here are less consistent than those in Crommelin’s analysis is hard to determine. Perhaps it’s just because we have more data and there was more variability in climate in the additional later part of the $Z_{500}$ time series. Or I might have made a mistake somewhere along the way!
It’s difficult to tell, but if I was doing this analysis “for real”, rather than just as an exercise to play with data analysis in Haskell, I’d probably do two additional things:
Use a truncated version of the data set to attempt to replicate the results from Crommelin (2004) as closely as possible. This would give better confidence that I’ve not made a mistake.
Randomly generate partitions of the unit sphere for calculating the Markov transition matrices and use some sort of bootstrapping to get a better idea of how robust the “significant” transitions really are. (Generating random partitions of the sphere would be kind of interesting – I’d probably sample a bunch of random points uniformly on the sphere, then use some kind of spring-based relaxation to spread the points out and use the Voronoi polygons around the relaxed points as the components of the partition.)
However, I think that that’s quite enough about atmospheric flow regimes for now…
I was originally planning to do some more work to demonstrate the independence of the results we’re going to get to the choice of partition in a more sophisticated way, but my notes are up to about 80 pages already, so I think these simpler fixed partition Markov matrix calculations will be the last thing I do on this!↩
I took over the day-to-day support for C2HS about 18 months ago and have now finally cleaned up all the issues on the GitHub issue tracker. It took a lot longer than I was expecting, mostly due to pesky “real work” getting in the way. Now seems like a good time to announce the 0.25.1 “Snowmelt” release of C2HS and to summarise some of the more interesting new C2HS features.
When I first started working on C2HS, I kept breaking things and getting emails letting me know that such-and-such a package no longer worked. That got boring pretty quickly, so I wrote a Shelly-driven regression suite to build a range of packages that use C2HS to check for breakages. This now runs on Travis CI so that whenever a C2HS change is pushed to GitHub, as well as the main C2HS test suite, a bunch of C2HS-dependent packages are built. This has been pretty handy for avoiding some stupid mistakes.
Thanks to work contributed by Philipp Balzarek, the treatment of the mapping between C `enum` values and Haskell `Enum` types is now much better than it was. The C `enum`/Haskell `Enum` association is kind of an awkward fit, since the C and Haskell worlds make really quite different assumptions about what an “enumerated” type is, and the coincidence of names is less meaningful than you might hope. We might have to do some more work on that in the future: I’ve been thinking about whether it would be good to have a `CEnum` class in `Foreign.C.Types` to capture just the features of C `enum`s that can be mapped to Haskell types in a sensible way.
You can now say things like:
```haskell
#include <stdio.h>

{#pointer *FILE as File foreign finalizer fclose newtype#}

{#fun fopen as ^ {`String', `String'} -> `File'#}
{#fun fileno as ^ {`File'} -> `Int'#}

main :: IO ()
main = do
  f <- fopen "tst.txt" "w"
  ...
```
and the file `f` will be cleaned up by a call to `fclose` via the Haskell garbage collector. This encapsulates a very common use case for handling pointers to C structures allocated by library functions. Previously there was no direct way to associate finalizers with foreign pointers in C2HS, but now it’s easy.
C2HS has a new `const` hook for directly accessing the value of C preprocessor constants – you can just say `{#const FOO#}` to use the value of a constant `FOO` defined in a C header in Haskell code.
I’ve implemented a couple of special mechanisms for argument marshalling that were requested. The first of these is a little esoteric, but an example should make it clear. A common pattern in some C libraries is to have code that looks like this:
```c
typedef struct {
  int a;
  float b;
  char dummy;
} oid;

void func(oid *obj, int aval, float bval);
int oid_a(oid *obj);
float oid_b(oid *obj);
```
Here the function `func` takes a pointer to an `oid` structure and fills in the values in the structure, and the other functions take `oid` pointers and do various things with them. Dealing with functions like `func` through the Haskell FFI is tedious because you need to allocate space for an `oid` structure, marshall a pointer to the allocated space and so on. Now though, the C2HS code
```haskell
{#pointer *oid as Oid foreign newtype#}

{#fun func as ^ {+, `Int', `Float'} -> `Oid'#}
```
generates Haskell code like this:
```haskell
newtype Oid = Oid (ForeignPtr Oid)

withOid :: Oid -> (Ptr Oid -> IO b) -> IO b
withOid (Oid fptr) = withForeignPtr fptr

func :: Int -> Float -> IO Oid
func a2 a3 =
  mallocForeignPtrBytes 12 >>= \a1'' -> withForeignPtr a1'' $ \a1' ->
  let {a2' = fromIntegral a2} in
  let {a3' = realToFrac a3} in
  func'_ a1' a2' a3' >>
  return (Oid a1'')
```
This allocates the right amount of space using the fast `mallocForeignPtrBytes` function and deals with all the marshalling for you. The special `+` parameter in the C2HS function hook definition triggers this (admittedly rather specialised) case.
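To make the generated code concrete, here’s a hand-written analogue of the same allocate-and-fill pattern using only `Foreign` from base, with a Haskell function standing in for the C `func` (the two-field `Int32` layout and all names here are invented for illustration, not taken from C2HS output):

```haskell
import Foreign

-- Stand-in for a C function that fills in a two-field structure
-- through a pointer.
fillPair :: Ptr Int32 -> Int32 -> Int32 -> IO ()
fillPair p a b = poke p a >> pokeElemOff p 1 b

-- The '+'-style wrapper: allocate space for the "struct", let the
-- "C function" fill it in, then hand back the ForeignPtr.
makePair :: Int32 -> Int32 -> IO (ForeignPtr Int32)
makePair a b = do
  fp <- mallocForeignPtrBytes 8     -- 2 * sizeOf (undefined :: Int32)
  withForeignPtr fp $ \p -> fillPair p a b
  return fp
```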
The second kind of “special” argument marshalling is more general. A lot of C libraries include functions where small structures are passed “bare”, i.e. not as pointers. The Haskell FFI doesn’t include a means to marshal arguments of this type, which makes using libraries of this kind painful, with a lot of boilerplate marshalling code needed (just the kind of thing C2HS is supposed to eliminate!). The solution I came up with for C2HS is to add an argument annotation for function hooks that says that a structure pointer should really be passed as a bare structure. In such cases, C2HS then generates an additional C wrapper function to marshal between structure pointer and bare structure arguments. An example will make this clear. Suppose you have some code in a C header:
```c
typedef struct {
  int x;
  int y;
} coord_t;

coord_t *make_coord(int x, int y);
void free_coord(coord_t *coord);
int coord_x(coord_t c, int dummy);
```
Here, the `coord_x` function takes a bare `coord_t` structure as a parameter. To bind to these functions in C2HS code, we write this:
```haskell
{#pointer *coord_t as CoordPtr foreign finalizer free_coord newtype#}

{#fun pure make_coord as makeCoord {`Int', `Int'} -> `CoordPtr'#}
{#fun pure coord_x as coordX {%`CoordPtr', `Int'} -> `Int'#}
```
Here, the `%` annotation on the `CoordPtr` argument to the `coordX` function hook tells C2HS that this argument needs to be marshalled as a bare structure. C2HS then generates Haskell code as usual, but also an extra `.chs.c` file containing wrapper functions. This C code needs to be compiled and linked to the Haskell code.
This is kind of new and isn’t yet really supported by released versions of Cabal. I’ve made some Cabal changes to support this, which have been merged and will hopefully go into the next or next but one Cabal release. When that’s done, the handling of the C wrapper code will be transparent – Cabal will know that C2HS has generated these extra C files and will add them to the “C sources” list for whatever it’s building.
Previously, variadic C functions weren’t supported in C2HS at all. Now though, you can do fun things like this:
#include <stdio.h>
{#fun variadic printf[int] as printi {`String', `Int'} > `()'#}
{#fun variadic printf[int, int] as printi2 {`String', `Int', `Int'} > `()'#}
{#fun variadic printf[const char *] as prints {`String', `String'} > `()'#}
You need to give distinct names for the Haskell functions to be bound to different calling sequences of the underlying C function, and because there’s no other way of finding them out, you need to specify explicit types for the arguments you want to pass in the place of C’s `...` variadic argument container (that’s what the C types in the square brackets are). Once you do that, you can call `printf` and friends to your heart’s content. (The user who wanted this feature wanted to use it for calling Unix `ioctl`…)
A big benefit of C2HS is that it tries quite hard to manage the associations between C and Haskell types and the marshalling of arguments between C and Haskell. To that end, we have a lot of default marshallers that allow you very quickly to write FFI bindings. However, we can’t cover every case. There were a few long-standing issues (imported from the original Trac issue tracker when I moved the project to GitHub) asking for default marshalling for various C standard or “standardish” `typedef`s. I held off on trying to fix those problems for a long time, mostly because I thought that fixing them one at a time as special cases would be a little futile and would just devolve into endless additions of “just one more” case.
In the end, I implemented a general scheme to allow users to explicitly associate C `typedef` names with Haskell types and to define default marshallers between them. As an example, using this facility, you can write code to marshal Haskell `String` values to and from C wide character strings like this:
```haskell
#include <wchar.h>

{#typedef wchar_t CWchar#}

{#default in `String' [wchar_t *] withCWString* #}
{#default out `String' [wchar_t *] peekCWString* #}

{#fun wcscmp {`String', `String'} -> `Int'#}
{#fun wcscat {`String', `String'} -> `String'#}
```
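The two marshallers named in the `default` hooks are ordinary functions from base’s `Foreign.C.String`, so you can check the round trip they perform independently of C2HS (the `roundTrip` name is mine):

```haskell
import Foreign.C.String (peekCWString, withCWString)

-- Marshal a String out to a NUL-terminated C wide string and read it
-- back, as the in/out marshallers in the default hooks would do.
roundTrip :: String -> IO String
roundTrip s = withCWString s peekCWString
```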
I think that’s kind of fun…
As well as the features described above, there’s a lot more that’s been done over the last 18 months: better handling of structure tags and `typedef`s; better cross-platform support (OS X, FreeBSD and Windows); lots more default marshallers; support for parameterised pointer types; some vague gestures in the direction of “backwards compatibility” (basically just a `C2HS_MIN_VERSION` macro); and just in the last couple of days, some changes to deal with marshalling of C `bool` values (really C99 `_Bool`) which aren’t supported directly by the Haskell FFI (so again require some wrapper code and some other tricks).
As well as myself and Manuel Chakravarty, the original author of C2HS, the following people have contributed to C2HS development over the last 18 months (real names where known, GitHub handles otherwise):
Many thanks to all of them, and many thanks also to Benedikt Huber, who maintains the `language-c` package on which C2HS is critically dependent!
All of the work I’ve done on C2HS has been driven purely by user demand, based on issues I imported from the original Trac issue tracker and then on things that people have asked for on GitHub. (Think of it as a sort of call-by-need exploration of the C2HS design space.) I’m now anticipating that since I’ve raised my head above the parapet by touting all these shiny new features, I can expect a new stream of bug reports to come in…
One potential remaining large task is to “sort out” the Haskell C language libraries, of which there are now at least three, all with different pros and cons. The `language-c` library used in C2HS has some analysis capabilities that aren’t present in the other libraries, but the other libraries (notably Geoffrey Mainland’s `language-c-quote` and Manuel’s `language-c-inline`) support more recent dialects of C. Many of the issues with C2HS on OS X stem from modern C features that occur in some of the OS X headers that the `language-c` package just doesn’t recognise. Using one of the other C language packages might alleviate some of those problems. To do that though, some unholy mushing-together of `language-c` and one of these other packages has to happen, in order to bring the analysis capabilities of `language-c` to the other package. That doesn’t look like much fun at all, so I might ignore the problem and hope it goes away.
I guess longer term the question is whether tools like C2HS really have a future. There are better approaches to FFI programming being developed by research groups (Manuel’s is one of them: this talk is pretty interesting) so maybe we should just wait until they’re ready for prime time. On the other hand, quite a lot of people seem to use C2HS, and it is pretty convenient.
One C2HS design decision I’ve recently had to modify a little is that C2HS tries to use only information available via the “official” Haskell FFI. Unfortunately, there are situations where that just isn’t enough. The recent changes to marshal C99 `_Bool` values are a case in point. In order to determine offsets into structures containing `_Bool` members, you need to know how big a `_Bool` is. Types that are marshalled by the Haskell FFI are all instances of `Storable`, so you can just use the `sizeOf` method from `Storable` for this. However, the Haskell FFI doesn’t know anything about `_Bool`, so you end up having to “query” the C compiler for the information by generating a little C test program that you compile and run. (You can find out which C compiler to use from the output of `ghc --info`, which C2HS thus needs to run first.) This is all pretty nasty, but there’s no obvious other way to do it.
This makes me think, since I’m having to do this anyway, that it might be worth reorganising some of C2HS’s structure member offset calculation code to use the same sort of “query the C compiler” approach. There are some cases (e.g. structures within structures) where it’s just not possible to reliably calculate structure member offsets from the size and alignment information available through the Haskell FFI – the C compiler is free to insert padding between structure members, and you can’t work out just by looking when a particular compiler is going to do that. Generating little C test programs and compiling and running them allows you to get the relevant information “straight from the horse’s mouth”… (I don’t know whether this idea really has legs, but it’s one thing I’m thinking about.)
This is going to be the oldest of old hat for the cool Haskell kids who invent existential higher-kinded polymorphic whatsits before breakfast, but it amused me, and it’s the first time I’ve used some of these more interesting language extensions for something “real”.
I have a Haskell library called `hnetcdf` for reading and writing NetCDF files. NetCDF is a format for gridded data that’s very widely used in climate science, meteorology and oceanography. A NetCDF file contains a number of gridded data sets, along with associated information describing the coordinate axes for the data. For example, in a climate application, you might have air temperature or humidity on a latitude/longitude/height grid.
So far, so simple. There are C and Fortran libraries for reading and writing NetCDF files and the interfaces are pretty straightforward. Writing a basic Haskell binding for this stuff isn’t very complicated, but one thing is a little tricky, which is the choice of Haskell type to represent the gridded data.
In Haskell, we have a number of different array abstractions that are in common use – you can think of flattening your array data into a vector, using a Repa array, using a `hmatrix` matrix, or a number of other possibilities. I wanted to support a sort of “store polymorphism” over these different options, so you’d be able to use the same approach to read data directly into a Repa array or a `hmatrix` matrix.
To do this, I wrote an `NcStore` class, whose first version looked something like this:
```haskell
class NcStore s where
  toForeignPtr :: Storable e => s e -> ForeignPtr e
  fromForeignPtr :: Storable e => ForeignPtr e -> [Int] -> s e
  smap :: (Storable a, Storable b) => (a -> b) -> s a -> s b
```
It’s basically just a way of getting data in and out of a “store”, in the form of a foreign pointer that can be used to pass data to the NetCDF C functions, plus a mapping method. This thing can’t be a functor because of the `Storable` constraints on the types to be stored (which we need so that we can pass these things to C functions).

That works fine for vectors from `Data.Vector.Storable`:
```haskell
instance NcStore Vector where
  toForeignPtr = fst . unsafeToForeignPtr0
  fromForeignPtr p s = unsafeFromForeignPtr0 p (Prelude.product s)
  smap = map
```
and for Repa foreign arrays:
```haskell
import Data.Array.Repa
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.Repr.ForeignPtr as RF
import Data.Array.Repa.Repr.ForeignPtr (F)

instance Shape sh => NcStore (Array F sh) where
  toForeignPtr = RF.toForeignPtr
  fromForeignPtr p s = RF.fromForeignPtr (shapeOfList $ reverse s) p
  smap f s = computeS $ R.map f s
```
However, there’s a problem if we try to write an instance of `NcStore` for `hmatrix` matrices. Most `hmatrix` functions require that the values stored in a `hmatrix` matrix are instances of the `hmatrix` `Element` class. While it’s completely trivial to make types instances of this class (you just write `instance Element Blah` and you’re good), you still need to propagate the `Element` constraint through your code. In particular, I needed to use the `hmatrix` `flatten` function to turn a matrix into a vector of values in row-major order for passing to the NetCDF C API. The `flatten` function has type signature
```haskell
flatten :: Element t => Matrix t -> Vector t
```
so that `Element` constraint somehow has to get into `NcStore`, but only for cases when the “store” is a `hmatrix` matrix.

At this point, all the real Haskell programmers are asking what the big deal is. You just switch on the `ConstraintKinds` and `TypeFamilies` extensions and rewrite `NcStore` like this:
```haskell
class NcStore s where
  type NcStoreExtraCon s a :: Constraint
  type NcStoreExtraCon s a = ()
  toForeignPtr :: (Storable e, NcStoreExtraCon s e) =>
                  s e -> ForeignPtr e
  fromForeignPtr :: (Storable e, NcStoreExtraCon s e) =>
                    ForeignPtr e -> [Int] -> s e
  smap :: (Storable a, Storable b, NcStoreExtraCon s a, NcStoreExtraCon s b) =>
          (a -> b) -> s a -> s b
```
Here, I’ve added an associated type called `NcStoreExtraCon s a`, which is a constraint, I’ve given a default for this (of `()`, which is a “do nothing” empty constraint), and I’ve added the relevant constraint to each of the methods of `NcStore`. The `NcStore` instances for storable `Vector`s and Repa arrays look the same as before, but the instance for `hmatrix` matrices now looks like this:
```haskell
instance NcStore HMatrix where
  type NcStoreExtraCon HMatrix a = C.Element a
  toForeignPtr (HMatrix m) = fst3 $ unsafeToForeignPtr $ C.flatten m
  fromForeignPtr p s =
    let c = last s
        d = product s
    in HMatrix $ matrixFromVector RowMajor (d `div` c) (last s) $
       unsafeFromForeignPtr p 0 (Prelude.product s)
  smap f (HMatrix m) = HMatrix $ C.mapMatrix f m
```
I’ve just added the `Element` constraint on the type of values contained in the “store” to the instance, and I can then use any `hmatrix` function that requires this constraint without any trouble: you can see the use of `flatten` in the `toForeignPtr` method definition.
The problem I had here is really just an instance of what’s come to be called the “restricted monad” problem. This is where you have a type class, possibly with constraints, and you want to write instances of the class where you impose additional constraints. The classic case is making `Set` a monad: `Set` requires its elements to be members of `Ord`, but `Monad` is fully polymorphic, and so there’s no way to make an instance of something like `Ord a => Monad (Set a)`.
There’s even a package on Hackage called `rmonad` that uses just this “constraint kinds + associated types” approach to allow you to write “restricted monads” of this kind. So this appears to be a well-known method, but it was fun to rediscover it. The ability to combine these two language extensions in this (to me) quite unexpected way is really rather satisfying!
The analysis of preferred flow regimes in the previous article is all very well, and in its way quite illuminating, but it was an entirely static analysis – we didn’t make any use of the fact that the original $Z_{500}$ data we used was a time series, so we couldn’t gain any information about transitions between different states of atmospheric flow. We’ll attempt to remedy that situation now.
What sort of approach can we use to look at the dynamics of changes in patterns of $Z_{500}$? Our $(\theta, \phi)$ parameterisation of flow patterns seems like a good start, but we need some way to model transitions between different flow states, i.e. between different points on the $(\theta, \phi)$ sphere. Each of our original $Z_{500}$ maps corresponds to a point on this sphere, so we might hope that we can come up with a way of looking at trajectories of points in $(\theta, \phi)$ space that will give us some insight into the dynamics of atmospheric flow.
Since atmospheric flow clearly has some stochastic element to it, a natural approach to take is to try to use some sort of Markov process to model transitions between flow states. Let me give a very quick overview of how we’re going to do this before getting into the details. In brief, we partition our $(\theta, \phi)$ phase space into $P$ components, assign each $Z_{500}$ pattern in our time series to a component of the partition, then count transitions between partition components. In this way, we can construct a matrix $M$ with
$M_{ij} = \frac{N_{i \to j}}{N_{\mathrm{tot}}}$
where $N_{i \to j}$ is the number of transitions from partition $i$ to partition $j$ and $N_{\mathrm{tot}}$ is the total number of transitions. We can then use this Markov matrix to answer some questions about the type of dynamics that we have in our data – splitting the Markov matrix into its symmetric and antisymmetric components allows us to look at diffusive (or irreversible) and non-diffusive (or conservative) dynamics respectively.
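To make the procedure concrete, here is a small, self-contained sketch of the transition counting and the symmetric/antisymmetric split. It uses plain Haskell lists and invented names (transitionMatrix, symPart, antiPart) rather than the hmatrix types used in the real code:

```haskell
import Data.List (transpose)

-- Count transitions between partition components in a sequence of
-- component labels 0..p-1, normalising by the total number of
-- transitions, as in the definition of M above.
transitionMatrix :: Int -> [Int] -> [[Double]]
transitionMatrix p labels =
  let pairs = zip labels (tail labels)
      ntot = fromIntegral (length pairs)
      count i j = fromIntegral (length (filter (== (i, j)) pairs))
  in [ [ count i j / ntot | j <- [0 .. p-1] ] | i <- [0 .. p-1] ]

-- Symmetric part (M + M^T)/2 captures diffusive (irreversible)
-- behaviour; antisymmetric part (M - M^T)/2 captures non-diffusive
-- (conservative) transitions.
symPart, antiPart :: [[Double]] -> [[Double]]
symPart m  = zipWith (zipWith (\a b -> (a + b) / 2)) m (transpose m)
antiPart m = zipWith (zipWith (\a b -> (a - b) / 2)) m (transpose m)
```

For a perfectly reversible sequence of states, the antisymmetric part is identically zero; any non-zero entries flag preferred transition directions.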
Before trying to apply these ideas to our $Z_{500}$ data, we’ll look (in the next article) at a very simple Markov matrix calculation by hand to get some understanding of what these concepts really mean. Before that though, we need to take a look at the temporal structure of the $Z_{500}$ data – in particular, if we’re going to model transitions between flow states by a Markov process, we really want uncorrelated samples from the flow, and our daily $Z_{500}$ data is clearly correlated, so we need to do something about that.
Let’s look at the autocorrelation properties of the PCA projected component time series from our original $Z_{500}$ data. We use the autocorrelation
function in the statistics
package to calculate and save the autocorrelation for these PCA projected time series. There is one slight wrinkle – because we have multiple winters of data, we want to calculate autocorrelation functions for each winter and average them. We do not want to treat all the data as a single continuous time series, because if we do we’ll be treating the jump from the end of one winter to the beginning of the next as “just another day”, which would be quite wrong. We’ll need to pay attention to this point when we calculate Markov transition matrices too. Here’s the code to calculate the autocorrelation:
npcs, nday, nyear :: Int
npcs = 10
nday = 151
nyear = 66

main :: IO ()
main = do
  -- Open projected points data file for input.
  Right innc <- openFile $ workdir </> "z500pca.nc"
  let Just ntime = ncDimLength <$> ncDim innc "time"
  let (Just projvar) = ncVar innc "proj"
  Right (HMatrix projsin) <-
    getA innc projvar [0, 0] [ntime, npcs] :: HMatrixRet CDouble

  -- Split projections into one-year segments.
  let projsconv = cmap realToFrac projsin :: Matrix Double
      lens = replicate nyear nday
      projs = map (takesV lens) $ toColumns projsconv

  -- Calculate autocorrelation for one-year segments and average.
  let vsums :: [Vector Double] -> Vector Double
      vsums = foldl1 (SV.zipWith (+))
      fst3 (x, _, _) = x
      doone :: [Vector Double] -> Vector Double
      doone ps = SV.map (/ (fromIntegral nyear)) $
                 vsums $ map (fst3 . autocorrelation) ps
      autocorrs = fromColumns $ map doone projs

  -- Generate output file.
  let outpcdim = NcDim "pc" npcs False
      outpcvar = NcVar "pc" NcInt [outpcdim] M.empty
      outlagdim = NcDim "lag" (nday - 1) False
      outlagvar = NcVar "lag" NcInt [outlagdim] M.empty
      outautovar = NcVar "autocorr" NcDouble [outpcdim, outlagdim] M.empty
      outncinfo =
        emptyNcInfo (workdir </> "autocorrelation.nc") #
        addNcDim outpcdim # addNcDim outlagdim #
        addNcVar outpcvar # addNcVar outlagvar #
        addNcVar outautovar
  flip (withCreateFile outncinfo) (putStrLn . ("ERROR: " ++) . show) $
    \outnc -> do
      -- Write coordinate variable values.
      put outnc outpcvar $
        (SV.fromList [0 .. fromIntegral npcs - 1] :: SV.Vector CInt)
      put outnc outlagvar $
        (SV.fromList [0 .. fromIntegral nday - 2] :: SV.Vector CInt)
      put outnc outautovar $ HMatrix $
        (cmap realToFrac autocorrs :: Matrix CDouble)
      return ()
We read in the component time series as a hmatrix
matrix, split the matrix into columns (the individual component time series), then split each time series into year-long segments. Then we use the autocorrelation
function on each segment of each time series (dropping the confidence limit values that the autocorrelation
function returns, since we’re not so interested in those here) and average across segments of each time series. The result is an autocorrelation function (for lags from zero to $\mathtt{nday}-2$) for each PCA component. We write those to a NetCDF file for further processing.
The plot below shows the autocorrelation functions for the first three PCA projected component time series. The important thing to notice here is that there is significant autocorrelation in each of the PCA projected component time series out to lags of 5–10 days (the horizontal line on the plot is at a correlation of $e^{-1}$). This makes sense – even at the bottom of the atmosphere, where temporal variability tends to be less structured than at 500 mb, we expect the weather tomorrow to be reasonably similar to the weather today.
It appears that there is pretty strong correlation in the $Z_{500}$ data at short timescales, which would be an obstacle to performing the kind of Markov matrix analysis we’re going to do next. To get around this, we’re going to average our data over non-overlapping 7-day windows (seven days seems like a good compromise between throwing lots of data away and reducing the autocorrelation to a low enough level) and work with those 7-day means instead of the unprocessed PCA projected component time series. This does mean that we now need to rerun all of our spherical PDF analysis for the 7-day mean data, but that’s not much of a problem because everything is nicely scripted and it’s easy to rerun it all.
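A minimal sketch of this windowing step might look like the following (windowMeans is an invented name, and the real code operates on the PCA projected component time series, winter by winter, rather than on a bare list):

```haskell
-- Average a daily series over non-overlapping windows of length w
-- (w = 7 for the 7-day means), discarding any incomplete trailing
-- window rather than averaging a short one.
windowMeans :: Int -> [Double] -> [Double]
windowMeans w xs = case splitAt w xs of
  (chunk, rest)
    | length chunk == w -> sum chunk / fromIntegral w : windowMeans w rest
    | otherwise         -> []
```

Applying this winter by winter avoids averaging across the jump between winters, for the same reason as in the autocorrelation calculation above.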
The figures below show the same plots as we had earlier for all the PCA projected component time series, except this time we’re looking at the 7-day means of the projected component time series, to ensure that we have data without significant temporal autocorrelation.
The first figure tab (“Projected points”) shows the individual 7-day mean data points, plotted using $(\theta, \phi)$ polar coordinates. Comparing with the corresponding plot for all the data in the earlier article, we can see (obviously!) that there’s less data here, but also that it’s not really any easier to spot clumping in the data points than it was when we used all the data. It again makes sense to do KDE to find a smooth approximation to the probability density of our atmospheric flow patterns.
The “Spherical PDF” tab shows the spherical PDF of 7-day mean PCA components (parametrised by spherical polar coordinates $\theta$ and $\phi$) calculated by kernel density estimation: darker colours show regions of greater probability density. Two “bumps” are labelled for further consideration. Compared to the “all data” PDF, the kernel density estimate of the probability density for the 7-day mean data is more concentrated, with more of the probability mass appearing in the two labelled bumps on the plot. (Recall that the “all data” PDF had four “bumps” that we picked out to look at – here we only really have two clear bumps.)
We can determine the statistical significance of those bumps in exactly the same way as we did before. The “Significance” tab above shows the results. As you’d expect, both of the labelled bumps are highly significant. However, notice that the significance scale here extends only to 99% significance, while that for the “all data” case extends to 99.9%. The reduced significance levels are simply a result of having fewer data points – we have 1386 7-day mean points as compared to 9966 “all data” points, which means that we have more sampling variability in the null hypothesis PDFs that we use to generate the histograms used for the significance calculation. That increased sampling variability translates into less certainty that our “real data” couldn’t have occurred by chance, given the assumptions of the null hypothesis. Still, 99% confidence isn’t too bad!
Finally, we can plot the spatial patterns of atmospheric flow corresponding to the labelled bumps in the PDF, just as we did for the “all data” case. The “Bump patterns” tab shows the patterns for the two most prominent bumps in the 7-day means PDF. As before, the two flow patterns seem to distinguish quite clearly between “normal” zonal flow (in this case, pattern #2) and blocking flow (pattern #1).
Now that we’ve dealt with this autocorrelation problem, we’re ready to start thinking about how we model transitions between different flow states. In the next article, we’ll use a simple lowdimensional example to explain what we’re going to do.
The spherical PDF we constructed by kernel density estimation in the article before last appeared to have “bumps”, i.e. it’s not uniform in $\theta$ and $\phi$. We’d like to interpret these bumps as preferred regimes of atmospheric flow, but before we do that, we need to decide whether these bumps are significant. There is a huge amount of confusion that surrounds this idea of significance, mostly caused by blind use of “standard recipes” in common data analysis cases. Here, we have some data analysis that’s anything but standard, and that will rather paradoxically make it much easier to understand what we really mean by significance.
So what do we mean by “significance”? A phenomenon is significant if it is unlikely to have occurred by chance. Right away, this definition raises two questions. First, chance implies some sort of probability, a continuous quantity, which leads to the idea of different levels of significance. Second, if we are going to be thinking about probabilities, we are going to need to talk about a distribution for those probabilities. The basic idea is thus to compare our data to a distribution that we explicitly decide based on a null hypothesis. A bump in our PDF will be called significant if it is unlikely to have occurred in data generated under the assumptions in our null hypothesis.
So, what’s a good null hypothesis in this case? We’re trying to determine whether these bumps are a significant deviation from “nothing happening”. In this case, “nothing happening” would mean that the data points we use to generate the PDF are distributed uniformly over the unit sphere parametrised by $\theta$ and $\phi$, i.e. that no point in $(\theta, \phi)$ space is any more or less likely to occur than any other. We’ll talk more about how we make use of this idea (and how we sample from our “uniform on the sphere” null hypothesis distribution) below.
We thus want some way of comparing the PDF generated by doing KDE on our data points to PDFs generated by doing KDE on “fake” data points sampled from our null hypothesis distribution. We’re going to follow a sampling-based approach: we will generate “fake” data sets, do exactly the same analysis we did on our real data points to produce “fake” PDFs, then compare these “fake” PDFs to our real one (in a way that will be demonstrated below).
There are a couple of important things to note here, which I usually think of under the heading of “minimisation of cleverness”. First, we do exactly the same analysis on our “fake” data as we do on our “real” data. That means that we can treat arbitrarily complex chains of data analysis without drama: here, we’re doing KDE, which is quite complicated from a statistical perspective, but we don’t really need to think about that, because the fact that we treat the fake and real data sets identically means that we’re comparing like with like and the complexity just “divides out”. Second, because we’re simulating, in the sense that we generate fake data based on a null hypothesis on the data and run it through whatever data analysis steps we’re doing, we make any assumptions that go into our null hypothesis perfectly explicit. If we assume Gaussian data, we can see that, because we’ll be sampling from a Gaussian to generate our fake data. If, as in this case, our null hypothesis distribution is something else, we’ll see that perfectly clearly because we sample directly from that distribution to generate our fake data.
This is a huge contrast to the usual case for “normal” statistics, where one chooses some standard test ($t$-test, Mann-Whitney test, Kolmogorov-Smirnov test, and so on) and turns a crank to produce a test statistic. In this case, the assumptions inherent in the form of the test are hidden – a good statistician will know what those assumptions are and will understand the consequences of them, but a bad statistician (I am a bad statistician) won’t and will almost certainly end up applying tests outside of the regime where they are appropriate. You see this all the time in published literature: people use tests that are based on the assumption of Gaussianity on data that clearly isn’t Gaussian, people use tests that assume particular variance structures that clearly aren’t correct, and so on. Of course, there’s a very good reason for this. The kind of sampling-based strategy we’re going to use here needs a lot of computing power. Before compute power was cheap, standard tests were all that you could do. Old habits die hard, and it’s also easier to teach a small set of standard tests than to educate students in how to design their own sampling-based tests. But we have oodles of compute power, we have a very nonstandard situation, and so a sampling-based approach allows us to sidestep all the hard thinking we would have to do to be good statisticians in this sense, which is what I meant by “minimisation of cleverness”.
So, we’re going to do sampling-based significance testing here. It is shockingly easy to do and, if you’ve been exposed to the confusion of “traditional” significance testing, shockingly easy to understand.
In order to test the significance of the bumps we see in our spherical PDF, we need some way of sampling points from our null hypothesis distribution, i.e. from a probability distribution that is uniform on the unit sphere. The simplest way to do this is to sample points from a spherically symmetric three-dimensional probability distribution then project those points onto the unit sphere. The most suitable three-dimensional distribution, at least from the point of view of convenience, is a three-dimensional Gaussian distribution with zero mean and unit covariance matrix. This is particularly convenient because if we sample points $\mathbf{u} = (x, y, z)$ from this distribution, each of the coordinates $x$, $y$ and $z$ is individually distributed as a standard Gaussian, i.e. $x \sim \mathcal{N}(0, 1)$, $y \sim \mathcal{N}(0, 1)$, $z \sim \mathcal{N}(0, 1)$. To generate $N$ random points uniformly distributed on the unit sphere, we can thus just generate $3N$ standard random deviates, partition them into 3-vectors and normalise each vector. Haskell code to do this sampling using the mwc-random
package is shown below – here, nData
is the number of points we want to sample, and the randPt
function generates a single normalised $(x, y, z)$ point as a Haskell 3-tuple (as usual, the code is in a Gist; this is from the makeunifpdfsample.hs program):
-- Random data point generation.
gen <- create
let randPt gen = do
      unnorm <- SV.replicateM 3 (standard gen)
      let mag = sqrt $ unnorm `dot` unnorm
          norm = scale (1.0 / mag) unnorm
      return (norm ! 0, norm ! 1, norm ! 2)

-- Create random data points, flatten to vector and allocate on
-- device.
dataPts <- replicateM nData (randPt gen)
If we sample the same number of points from this distribution that we have in our real data and then use the same KDE approach that we used for the real data to generate an empirical PDF on the unit sphere, what do we get? Here’s what one distribution generated by this procedure looks like, using the same colour scale as the “real data” distribution in the earlier article to aid in comparison (darker colours show regions of greater probability density):
We can see that our sample from the null hypothesis distribution also has “bumps”, although they seem to be less prominent than the bumps in the PDF for our real data. Why do we see bumps here? Our null hypothesis distribution is uniform, so why is the simulated empirical PDF bumpy? The answer, of course, is sampling variation. If we sample 9966 points on the unit sphere, we are going to get some clustering of points (leading to bumps in the KDE-derived distribution) just by chance. Those chance concentrations of points are what lead to the bumps in the plot above.
What we ultimately want to do then is to answer the question: how likely is it that the bumps in the distribution of our real data could have arisen by chance, assuming that our real data arose from a process matching our null hypothesis?
The way we’re going to answer the question posed in the last section is purely empirically. We’re going to generate empirical distributions (histograms) of the possible values of the null hypothesis distribution to get a picture of the sampling variability that is possible, then we’re going to look at the values of our “real data” distribution and calculate the proportion of the null hypothesis distribution values less than the real data distribution values. This will give us the probability that our real data distribution could have arisen by chance if the data really came from the null hypothesis distribution.
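At a single grid point, the comparison described above boils down to something like this minimal sketch (pointSignificance is an invented name; the real code works with per-point histograms of the null realisations, as shown further on):

```haskell
-- The significance of the observed PDF value at one grid point is the
-- fraction of null-hypothesis realisations whose value at that point
-- is smaller than the observed value.
pointSignificance :: [Double] -> Double -> Double
pointSignificance nulls observed =
  fromIntegral (length (filter (< observed) nulls))
    / fromIntegral (length nulls)
```

So an observed value larger than 95% of the null-hypothesis values gets a significance of 0.95 at that point.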
In words, it sounds complicated. In reality and in code, it’s not. First, we generate a large number of realisations of the null hypothesis distribution, by sampling points on the unit sphere and using KDE to produce PDFs from those point distributions in exactly the same way that we did for our real data, as shown here (code from the makehist.hs program):
-- Generate PDF realisations.
pdfs <- forM [1..nrealisations] $ \r -> do
  putStrLn $ "REALISATION: " ++ show r
  -- Create random data points.
  dataPts <- SV.concat <$> replicateM nData (randPt gen)
  SV.unsafeWith dataPts $ \p -> CUDA.pokeArray (3 * nData) p dDataPts
  -- Calculate kernel values for each grid point/data point
  -- combination and accumulate into grid.
  CUDA.launchKernel fun gridSize blockSize 0 Nothing
    [CUDA.IArg (fromIntegral nData), CUDA.VArg dDataPts, CUDA.VArg dPdf]
  CUDA.sync
  res <- SVM.new (ntheta * nphi)
  SVM.unsafeWith res $ \p -> CUDA.peekArray (ntheta * nphi) dPdf p
  unnormv <- SV.unsafeFreeze res
  let unnorm = reshape nphi unnormv
  -- Normalise.
  let int = dtheta * dphi * sum (zipWith doone sinths (toRows unnorm))
  return $ cmap (realToFrac . (/ int)) unnorm :: IO (Matrix Double)
(This is really the key aspect of this sampling-based approach: we perform exactly the same data analysis on the test data sampled from the null hypothesis distribution that we perform on our real data.) We generate 10,000 realisations of the null hypothesis distribution (stored in the pdfs
value), in this case using CUDA to do the actual KDE calculation, so that it doesn’t take too long.
Then, for each spatial point on our unit sphere, i.e. each point in the $(\theta, \phi)$ grid that we’re using, we collect all the values of our null hypothesis distribution – this is the samples
value in this code:
-- Convert to per-point samples and generate per-point histograms.
let samples = [SV.generate nrealisations (\s -> (pdfs !! s) ! i ! j) :: V |
               i <- [0 .. ntheta-1], j <- [0 .. nphi-1]]
    (rngmin, rngmax) = range nbins $ SV.concat samples
    hists = map (histogram_ nbins rngmin rngmax) samples :: [V]
    step i = rngmin + d * fromIntegral i
    d = (rngmax - rngmin) / fromIntegral nbins
    bins = SV.generate nbins step
For each point on our grid on the unit sphere, we then calculate a histogram of the samples from the 10,000 empirical PDF realisations, using the histogram_
function from the statistics
package. We use the same bins for all the histograms to make life easier in the next step.
There’s one thing that’s worth commenting on here. You might think that we’re doing excess work here. Our null hypothesis distribution is spherically symmetric, so shouldn’t the histograms be the same for all points on the unit sphere? Well, should they? Or might the exact distribution of samples depend on $\theta$, since the $(\theta, \phi)$ grid cells will be smaller at the poles of our unit sphere than at the equator? Well, to be honest, I don’t know. And I don’t really care. By taking the approach I’m showing you here, I don’t need to worry about that question, because I’m generating independent histograms for each grid cell on the unit sphere, so my analysis is immune to any effects related to grid cell size. Furthermore, this approach also enables me to change my null hypothesis if I want to, without changing any of the other data analysis code. What if I decide that this spherically symmetric null hypothesis is too weak? What if I want to test my real data against the hypothesis that there is a spherically symmetric background distribution of points on my unit sphere, plus a couple of bumps (of specified amplitude and extent) representing what I think are the most prominent patterns of atmospheric flow? That’s quite a complicated null hypothesis, but as long as I can define it clearly and sample from it, I can use exactly the same data analysis process that I’m showing you here to evaluate the significance of my real data compared to that null hypothesis. (And sampling from a complicated distribution is usually easier than doing anything else with it. In this case, I might say what proportion of the time I expect to be in each of my bump or background regimes, for the background I can sample uniformly on the sphere and for the bumps I can sample from a Kent distribution^{1}.)
Once we have the histograms for each grid point on the unit sphere, we can calculate the significance of the values of the real data distribution (this is from the makesignificance.hs program – I split these things up to make checking what was going on during development easier):
-- Split histogram values for later processing.
let nhistvals = SV.length histvals
    oneslice i = SV.slice i nbin histvals
    histvecs = map oneslice [0, nbin .. nhistvals - nbin]
    hists = A.listArray ((0, 0), (ntheta-1, nphi-1)) histvecs
    nrealisations = SV.sum $ hists A.! (0, 0)

-- Calculate significance values.
let doone :: CDouble -> CDouble -> CDouble
    doone dith diph =
      let ith = truncate dith ; iph = truncate diph
          pdfval = pdf ! ith ! iph
          hist = hists A.! (ith, iph)
          pdfbin0 = truncate $ (pdfval - minbin) / binwidth
          pdfbin = pdfbin0 `max` 0 `min` (nbin - 1)
      in (SV.sum $ SV.take pdfbin hist) / nrealisations
    sig = build (ntheta, nphi) doone :: Matrix CDouble
We read the histograms into the histvals
value from an intermediate NetCDF file and build an array of histograms indexed by grid cell indexes in the $\theta$ and $\phi$ directions. Then, for each grid cell, we determine which histogram bin the relevant value from the real data distribution falls into and sum the histogram values from the corresponding histogram from all the bins smaller than the real data value bin. Dividing this sum by the total number of null hypothesis distribution realisations used to construct the histograms gives us the fraction of null hypothesis distribution values for this grid cell that are smaller than the actual value from the real data distribution.
For instance, if the real data distribution value is greater than 95% of the values generated by the null hypothesis distribution simulation, then we say that we have a significance level of 95% at that point on the unit sphere. We can plot these significance levels in the same way that we’ve been plotting the spherical PDFs. Here’s what those significance levels look like, choosing contour levels for the plot to highlight the most significant regions, i.e. the regions least likely to have occurred by chance if the null hypothesis is true:
In particular, we see that each of the three bumps picked out with labels in the “real data” PDF plot in the earlier article are among the most significant regions of the PDF according to this analysis, being larger than 99.9% of values generated from the null hypothesis uniform distribution.
It’s sort of traditional to try to use some other language to talk about these kinds of results, giving specific terminological meanings to the words “significance levels” and “$p$-values”, but I prefer to keep away from that because, as was the case for the terminology surrounding PCA, the “conventional” choices of words are often confusing, either because no-one can agree on what the conventions are (as for PCA) or the whole basis for setting up the conventions is confusing. In the case of hypothesis testing, there are still papers being published in statistical journals arguing about what significance and $p$-values and hypothesis testing really mean, nearly 100 years after these ideas were first outlined by Ronald Fisher and others. I’ve never been sure enough about what all this means to be comfortable using the standard terminology, but the sampling-based approach we’ve used here makes it much harder to get confused – we can say “our results are larger than 99.9% of results that could be encountered as a result of sampling variability under the assumptions of our null hypothesis”, which seems quite unambiguous (if a little wordy!).
In the next article we’ll take a quick look at what these “bumps” in our “real data” PDF represent in terms of atmospheric flow.
The Haskell kernel density estimation code in the last article does work, but it’s distressingly slow. Timing with the Unix time
command (not all that accurate, but it gives a good idea of orders of magnitude) reveals that this program takes about 6.3 seconds to run. For a one-off, that’s not too bad, but in the next article, we’re going to want to run this type of KDE calculation thousands of times, in order to generate empirical distributions of null hypothesis PDF values for significance testing. So we need something faster.
It’s quite possible that the Haskell code here could be made quite a bit faster. I didn’t spend a lot of time thinking about optimising it. I originally tried a slightly different approach using a mutable matrix to accumulate the kernel values across the data points, but this turned out to be slower than the very simple code shown in the previous article (something like ten times slower!). It’s clear though that this calculation is very amenable to parallelisation – the unnormalised PDF value at each point in the $(\theta, \phi)$ grid can be calculated independently of any other grid point, and the calculation for each grid point accesses all of the data points in a very regular way.
So, let’s parallelise. I have an NVIDIA graphics card in my machine here, so it’s very tempting to do something with CUDA. If I was a real Haskell programmer, I’d use Accelerate, which is an embedded DSL for data parallel array processing that has a CUDA backend. Unfortunately, a few experiments revealed that it would take me a while to learn how to use Accelerate, which has some restrictions on how you can structure algorithms that didn’t quite fit with the way I was trying to do things. So I gave up on that.
However, I’ve been writing quite a lot of CUDA C++ recently, so I decided to use the simple FFI bindings to the CUDA runtime API in the Haskell cuda
package, and to write the CUDA code itself in C++. If you’re at all familiar with CUDA C++, running kernels from Haskell turns out to be really pretty easy.
I’m not going to get into a long description of CUDA itself here. You can read all about it on the NVIDIA website or there are a couple of MOOCs that cover parallel programming with CUDA^{1}.
Here’s the CUDA code for the KDE calculation:
#include <cuda.h>
#include <cmath>
// Use a regular 2.5 deg. x 2.5 deg. theta/phi grid on unit sphere.
const int Nphi = 2 * 360 / 5, Ntheta = 2 * 180 / 5;
// Integration steps.
const double dphi = 2.0 * M_PI / Nphi, dtheta = M_PI / Ntheta;
// Density kernel bandwidth.
const double bandwidth = M_PI / 6.0;
extern "C"
__global__ void make_pdf_kernel
(unsigned int D, const double * __restrict__ d_data, double *d_pdf)
{
unsigned int c = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int r = blockIdx.y * blockDim.y + threadIdx.y;
double th = (0.5 + r) * dtheta, ph = c * dphi;
double gx = sin(th) * cos(ph), gy = sin(th) * sin(ph), gz = cos(th);
if (r >= Ntheta || c >= Nphi) return;
double sum = 0.0;
for (unsigned int i = 0; i < D; ++i) {
double dx = d_data[3 * i], dy = d_data[3 * i + 1], dz = d_data[3 * i + 2];
double u = acos(dx * gx + dy * gy + dz * gz) / bandwidth;
if (u < 1) sum += 1 - u * u;
}
d_pdf[r * Nphi + c] = sum;
}
I’ll just say a couple of things about it:
We launch CUDA threads in two-dimensional blocks set up to cover the whole of the $(\theta, \phi)$ grid: the row and column in the grid are calculated from the CUDA block and thread indexes in the first two lines of the kernel.
The main part of the computation is completely straightforward: each thread determines the coordinates of the grid point it’s working on, then loops over all the data points (which are stored flattened into the d_data
vector) accumulating values of the KDE Epanechnikov kernel, finally writing the result out to the d_pdf
vector (also a two-dimensional array flattened into a vector in row-major order).
We do almost nothing to optimise the CUDA code: this is just about the simplest CUDA implementation of this algorithm that’s possible, and it took about five minutes to write. The only “clever” thing is the declaration of the input data point array as const double * __restrict__ d_data
. This little bit of magic tells the CUDA C++ compiler that the d_data
array will not be aliased, contains data that will not be modified during the execution of the CUDA kernel, and is accessed in a consistent pattern across all threads running the kernel. The upshot of this is that the compiler can generate code that causes the GPU to cache this data very aggressively (actually, on more recent GPUs, in a very fast L1 cache). Since every thread accesses all of the data points stored in d_data
, this can lead to a large reduction in accesses to global GPU memory. Much optimisation of GPU code comes down to figuring out ways to reduce global memory bandwidth, since global memory is much slower than the registers and local and shared memory in the multiprocessors in the GPU. The usual approach to this sort of thing is to load chunks of data into shared memory (“shared” in this context means shared between GPU threads within a single thread block), which is faster than global memory. This requires explicitly managing this loading though, which isn’t always very convenient. In contrast, telling the CUDA compiler that it can cache d_data
in this way has more or less the same effect, with next to no effort.
The CUDA code is compiled to an intermediate format called PTX using the NVIDIA C++ compiler, nvcc
:
nvcc -O2 -ptx -arch=compute_50 -code=sm_50 make_pdf.cu
The -arch
and -code
options tell the compiler to produce code for NVIDIA devices with “compute capability” 5.0, which is what the GPU in my machine has.
Calling the CUDA code from Haskell isn’t too hard. Here, I show just the parts that are different from the previous Haskellonly approach:
main :: IO ()
main = do
  -- CUDA initialisation.
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev []
  ptx <- B.readFile "make_pdf.ptx"
  CUDA.JITResult time log mdl <- CUDA.loadDataEx ptx []
  fun <- CUDA.getFun mdl "make_pdf_kernel"

  ... code omitted ...

  -- Convert projections to 3D points, flatten to vector and allocate
  -- on device.
  let nData = rows projsin
      dataPts = SV.concat $ map projToPt $ toRows $ cmap realToFrac projsin
  dDataPts <- CUDA.mallocArray $ 3 * nData
  SV.unsafeWith dataPts $ \p -> CUDA.pokeArray (3 * nData) p dDataPts

  -- Calculate kernel values for each grid point/data point
  -- combination and accumulate into grid.
  dPdf <- CUDA.mallocArray $ ntheta * nphi :: IO (CUDA.DevicePtr Double)
  let tx = 32 ; ty = 16
      blockSize = (tx, ty, 1)
      gridSize = ((nphi - 1) `div` tx + 1, (ntheta - 1) `div` ty + 1, 1)
  CUDA.launchKernel fun gridSize blockSize 0 Nothing
    [CUDA.IArg (fromIntegral nData), CUDA.VArg dDataPts, CUDA.VArg dPdf]
  CUDA.sync
  res <- SVM.new (ntheta * nphi)
  SVM.unsafeWith res $ \p -> CUDA.peekArray (ntheta * nphi) dPdf p
  unnormv <- SV.unsafeFreeze res
  let unnorm = reshape nphi unnormv
First, we need to initialise the CUDA API and load our compiled module. The PTX format is just-in-time compiled to binary code for the installed GPU during this step, and we output some information about the compilation process.
Allocating memory on the GPU is done using the cuda package’s versions of functions like mallocArray, and the pokeArray function is used to transfer data from the host (i.e. the CPU) to the GPU memory. The CUDA kernel is then run using the launchKernel function, which takes arguments specifying the layout of threads and thread blocks to use (the values shown in the code above were the best I found from doing a couple of quick timing experiments), as well as the parameters to pass to the kernel function. Once the kernel invocation is finished, the results can be retrieved using the peekArray function.
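The gridSize calculation in the code above uses the standard ceiling-division idiom to make sure there are enough blocks to cover every grid point. Pulled out as a small helper (ceilDiv is a hypothetical name, not from the article’s code), it looks like this:

```haskell
-- How many blocks of size b are needed to cover n items? This is the
-- "(n - 1) `div` b + 1" idiom used for gridSize above: it rounds the
-- division up, so the last (possibly partial) block is still launched.
ceilDiv :: Int -> Int -> Int
ceilDiv n b = (n - 1) `div` b + 1
```

Threads in the final partial block that fall outside the grid are expected to do nothing, which is why CUDA kernels written this way usually start with a bounds check.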
This is all obviously a little bit grungy and it would definitely be nicer from an aesthetic point of view to use something like Accelerate, but if you already know CUDA C++ and are familiar with the CUDA programming model, it’s really not that hard to do this sort of mixed language CPU/GPU programming.
And is it fast? Oh yeah. Again, using the very unsophisticated approach of measuring elapsed time using the Unix time utility, we find that the CUDA version of the KDE code runs in about 0.4 seconds on my machine. That’s more than ten times faster than the Haskell-only version, on a not very beefy GPU. And as I’ve said, I’ve not put any real effort into optimising the CUDA code. I’m sure it could be made to go faster. But for our purposes, this is good enough. In the next article, we’ll want to run this code 10,000 times (you’ll see why), which will take a little over an hour with the CUDA code, rather than nearly 17 hours with the Haskell-only code. For writing library code, or for more performance-critical applications, you would obviously put a lot more effort into optimisation (think about all the FFT stuff from earlier articles!), but for this kind of one-off data analysis task, there’s no benefit to spending more time. It’s easy to set off an analysis job, go for lunch and have the results ready when you come back.
The Udacity course is pretty good, as is the Coursera Heterogeneous Parallel Programming course.
Up to this point, all the analysis that we’ve done has been what might be called “normal”, or “pedestrian” (or even “boring”). In climate data analysis, you almost always need to do some sort of spatial and temporal subsetting and you very often do some sort of anomaly processing. And everyone does PCA! So there’s not really been anything to get excited about yet.
Now that we have our PCA-transformed $Z_{500}$ anomalies though, we can start to do some more interesting things. In this article, we’re going to look at how we can use the new representation of atmospheric flow patterns offered by the PCA eigenpatterns to reduce the dimensionality of our data, making it much easier to handle. We’ll then look at our data in an interesting geometrical way that allows us to focus on the patterns of flow while ignoring the strengths of different flows, i.e. we’ll be treating strong and weak blocking events as being the same, and strong and weak “normal” flow patterns as being the same. This simplification of things will allow us to do some statistics with our data to get an idea of whether there are statistically significant (in a sense we’ll define) flow patterns visible in our data.
First let’s think about how we might use the PCA-transformed data we generated in the previous article – we know that the PCA eigenpatterns are the $Z_{500}$ anomaly patterns that explain the biggest fraction of the total $Z_{500}$ anomaly variance: the first PCA eigenpattern is the pattern with the biggest variance, the second is the pattern orthogonal to the first with the biggest variance, the third is the pattern orthogonal to the first two patterns with the biggest variance, and so on.
In order to go from the 72 × 15 = 1080-dimensional $Z_{500}$ anomaly data to something that’s easier to handle, both in terms of manipulation and in terms of visualisation, we’re going to take the seemingly radical step of discarding everything but the first three PCA eigenpatterns. Why three? Well, for one thing, three dimensions is about all we can hope to visualise. The first three PCA eigenpatterns respectively explain 8.86%, 7.46% and 6.27% of the total $Z_{500}$ anomaly variance, for a total of 22.59%. That doesn’t sound like a very large fraction of the overall variance, but it’s important to remember that we’re interested in large-scale features of the flow here, and most of the variance in the later PCA eigenpatterns is small-scale “weather”, which we’d like to suppress from our analysis anyway.
Let’s make it quite clear what we’re doing here. At each time step, we have a 72 × 15 map of $Z_{500}$ anomaly values. We transform this map into the PCA basis, which doesn’t lose any information, but then we truncate the vector of PCA projected components, retaining only the first three. So instead of having 72 × 15 = 1080 numbers (i.e. the individual grid-point $Z_{500}$ anomaly values), we have just three numbers, the first three PCA projected components for the time step. We can thus think of our reduced-dimensionality $Z_{500}$ anomaly “map” as a point in three-dimensional space.
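To make the truncation step concrete, here is a little sketch using plain Haskell lists and a hypothetical name (the real code uses hmatrix types): we project the flattened anomaly map onto the first three unit eigenvectors and keep only those three projected components.

```haskell
-- Hedged sketch: reduce a flattened anomaly map x to three numbers by
-- projecting onto the first three (assumed unit-length) eigenvectors.
-- `truncateToPCA3` is an illustrative name, not from the article's code.
truncateToPCA3 :: [[Double]] -> [Double] -> [Double]
truncateToPCA3 eigvecs x = [ sum (zipWith (*) e x) | e <- take 3 eigvecs ]
```

Because the eigenvectors are orthonormal, each projected component is just a dot product; the information in the remaining 1077 components is simply discarded.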
Three dimensions is still a little tricky to visualise, so we’re going to do something slightly sneaky. We’re going to take the data points in the three-dimensional space spanned by the first three PCA components, and we’re going to project those three-dimensional points onto the unit sphere. This yields two-dimensional points that we can represent by standard polar coordinates – if we think of a coordinate system where the $x$-, $y$- and $z$-axes lie respectively along the directions of the $e_1$, $e_2$ and $e_3$ components in our three-dimensional space, then the standard colatitude $\theta$ and longitude $\phi$ are suitable coordinates to represent our data points. Because there’s real potential for confusion here, I’m going to be careful from now on to talk about coordinates on this “PCA projection sphere” only as $\theta$ or $\phi$, reserving the words “latitude” and “longitude” for spatial points on the real Earth in the original $Z_{500}$ data.
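In code, the conversion from a three-dimensional point to these spherical coordinates might look like the following sketch (illustrative names, not taken from the article’s code):

```haskell
-- Hedged sketch: convert a three-dimensional PCA point to
-- (colatitude, longitude) on the projection sphere. theta is measured
-- from the e3 axis; phi is the angle in the e1-e2 plane.
toSphere :: (Double, Double, Double) -> (Double, Double)
toSphere (x, y, z) = (theta, phi)
  where
    r     = sqrt (x * x + y * y + z * z)
    theta = acos (z / r)   -- colatitude in [0, pi]
    phi   = atan2 y x      -- longitude in (-pi, pi]
```

Projecting onto the unit sphere just means forgetting $r$: two points along the same ray from the origin (a strong and a weak version of the same flow pattern) map to the same $(\theta, \phi)$.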
The following figures show how this works. Figure 1 shows our three-dimensional points, colour coded by distance from the origin. In fact, we normalise each of the PC components by the standard deviation of the whole set of PC component values – recall that we are using normalised PCA eigenpatterns here, so that the PCA projected component time series carry the units of $Z_{500}$. That means that for the purposes of looking at patterns of $Z_{500}$ variability, it makes sense to normalise the projected component time series somehow. Figure 2 shows the three-dimensional points with the unit sphere in this three-dimensional space, and Figure 3 shows each of the original three-dimensional points projected onto this unit sphere. We can then represent each $Z_{500}$ pattern as a point on this sphere, parameterised by $\theta$ and $\phi$, the angular components of the usual spherical coordinates in this three-dimensional space.
Original data points.
Data points with projection sphere.
Points projected onto unit sphere.
Once we’ve projected our PCA data onto the unit sphere as described above, we can look at the distribution of data points in terms of the polar coordinates $\theta$ and $\phi$:
Note that we’re doing two kinds of projection here: first we’re transforming the original $Z_{500}$ data (a time series of spatial maps) to the PCA basis (a time series of PCA projected components, along with the PCA eigenpatterns), then we’re taking the first three PCA components, normalising them and projecting to the unit sphere in the space spanned by the first three PCA eigenpatterns. Each point in the plot above thus represents a single spatial configuration of $Z_{500}$ at a single point in time.
Looking at this plot, it’s not all that clear whether there exist spatial patterns of $Z_{500}$ that are more common than others. There’s definitely some clustering of points in some areas of the plot, but it’s quite hard to assess because of the distortion introduced by plotting the spherical points on this kind of rectangular plot. What we’d really like is a continuous probability distribution where we can see regions of higher density of $Z_{500}$ points as “bumps” in the distribution.
We’re going to use a method called kernel density estimation (KDE) to get at such a continuous distribution. As well as being better for plotting and identifying interesting patterns in the data, this will also turn out to give us a convenient route to use to determine the statistical significance of the “bumps” that we find.
We’ll start by looking at how KDE works in a simple one-dimensional case. Then we’ll write some Haskell code to do KDE on the two-dimensional sphere onto which we’ve projected our data. This is a slightly unusual use of KDE, but it turns out not to be much harder to do than the more “normal” cases.
The basic idea of KDE is pretty simple to explain. Suppose we have a sample of one-dimensional data points $\{ x_i \}$ for $i = 1, \dots, N$. We can think of this sample as defining a probability distribution for $x$ as
$p(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i) \qquad (1)$
where $\delta(x)$ is the usual Dirac $\delta$-function. What $(1)$ is saying is that our data defines a PDF with a little probability mass (of weight $1/N$) concentrated at each data point. In kernel density estimation, all that we do is to replace the function $\delta(x)$ by a kernel function $K(x)$, where we require that
$\int_{-\infty}^\infty K(x) \, dx = 1.$
The intuition here is pretty clear: the $\delta$-functions in $(1)$ imply that our knowledge of the $x_i$ is perfect, so replacing the $\delta$-functions with a “smeared out” kernel $K(x)$ represents a lack of perfect knowledge of the $x_i$. This has the effect of smoothing the “spikey” $\delta$-function to give something closer to a continuous probability density function.
This is (of course!) a gross simplification. Density estimation is a complicated subject – if you’re interested in the details, the (very interesting) book by Bernard Silverman is required reading.
So, we’re going to estimate a probability density function as
$p(x) = \frac{1}{N} \sum_{i=1}^N K(x - x_i), \qquad (2)$
which raises the fundamental question of what to use for the kernel, $K(x)$? This is more or less the whole content of the field of density estimation. We’re not going to spend a lot of time talking about the possible choices here because that would take us a bit off track, but what we have to choose basically comes down to two factors: what shape is the kernel and how “wide” is it?
A natural choice for the kernel in one-dimensional problems might be a Gaussian PDF, so that we would estimate our PDF as
$p(x) = \frac{1}{N} \sum_{i=1}^N \phi(x - x_i, \sigma^2),$
where $\phi(\mu, \sigma^2)$ is a Gaussian PDF with mean $\mu$ and standard deviation $\sigma$. Here, the standard deviation measures how “wide” the kernel is. In general, people talk about the “bandwidth” of a kernel, and usually write a kernel as something like $K(x; h)$, where $h$ is the bandwidth (which means something different for different types of kernel, but is generally supposed to be some sort of measure of how spread out the kernel is). In fact, in most cases, it turns out to be better (for various reasons: see Silverman’s book for the details) to choose a kernel with compact support. We’re going to use something called the Epanechnikov kernel:
$K(u) = \frac{3}{4} (1 - u^2) \, \mathbf{1}_{\{u \leq 1\}}, \qquad (3)$
where $u = x / h$ and $\mathbf{1}_{\{u \leq 1\}}$ is the indicator function for $u \leq 1$, i.e. a function that is one for all points where $u \leq 1$ and zero for all other points (this just ensures that our kernel has compact support and is everywhere nonnegative).
This figure shows how KDE works in practice in a simple one-dimensional case: We randomly sample ten points from the range $[1, 9]$ (red impulses in the figure) and use an Epanechnikov kernel with bandwidth $h = 2$ centred around each of the sample points. Summing the contributions from each of the kernels gives the thick black curve as an estimate of the probability density function from which the sample points were drawn.
The Haskell code to generate the data from which the figure is drawn looks like this (as usual, the code is in a Gist):
module Main where

import Prelude hiding (enumFromThenTo, length, map, mapM_, replicate, zipWith)
import Data.Vector hiding ((++))
import System.Random
import System.IO

-- Number of sample points.
n :: Int
n = 10

-- Kernel bandwidth.
h :: Double
h = 2.0

-- Ranges for data generation and output.
xgenmin, xgenmax, xmin, xmax :: Double
xgenmin = 1 ; xgenmax = 9
xmin = xgenmin - h ; xmax = xgenmax + h

-- Output step.
dx :: Double
dx = 0.01

main :: IO ()
main = do
  -- Generate sample points.
  samplexs <- replicateM n $ randomRIO (xgenmin, xgenmax)
  withFile "xs.dat" WriteMode $ \h -> forM_ samplexs (hPutStrLn h . show)
  let outxs = enumFromThenTo xmin (xmin + dx) xmax
      -- Calculate values for a single kernel.
      doone n h xs x0 = map (/ (fromIntegral n)) $ map (kernel h x0) xs
      -- Kernels for all sample points.
      kernels = map (doone n h outxs) samplexs
      -- Combined density estimate.
      pdf = foldl1' (zipWith (+)) kernels
      pr h x s = hPutStrLn h $ (show x) ++ " " ++ (show s)
      kpr h k = do
        zipWithM_ (pr h) outxs k
        hPutStrLn h ""
  withFile "kernels.dat" WriteMode $ \h -> mapM_ (kpr h) kernels
  withFile "kde.dat" WriteMode $ \h -> zipWithM_ (pr h) outxs pdf

-- Epanechnikov kernel.
kernel :: Double -> Double -> Double -> Double
kernel h x0 x
  | abs u <= 1 = 0.75 * (1 - u^2)
  | otherwise  = 0
  where u = (x - x0) / h
The density estimation part of the code is basically a direct transcription of $(2)$ and $(3)$ into Haskell. We have to choose a resolution at which we want to output samples from the PDF (the value dx in the code) and we have to use that to generate $x$ values at which to output the PDF (outxs in the code), but once we’ve done that, it’s just a matter of calculating the values of kernels centred on our random sample points for each of the output points, then combining the kernels to get the final density estimate. We’re going to do essentially the same thing in our spherical PDF case, although we have to think a little bit about the geometry of the problem, since we’re going to need a two-dimensional version of the Epanechnikov kernel on the unit sphere to replace $(3)$. That’s just a detail though, and conceptually the calculation we’re going to do is identical to the one-dimensional example.
The code to calculate our spherical PDF using kernel density estimation is in the linked file. We follow essentially the same approach that we used in the one-dimensional case described above, except that we use a two-dimensional Epanechnikov kernel defined in terms of angular distances between points on the unit sphere. We calculate PDF values on a grid of points $g_{ij}$, where $i$ and $j$ label the $\theta$ and $\phi$ coordinate directions of our $N_{\theta} \times N_{\phi}$ grid on the unit sphere, so that $i = 1, \dots, N_{\theta}$, $j = 1, \dots, N_{\phi}$. Given a set of data points $x_k$, $k = 1, \dots, N$, we define the angular distance between a grid point and a data point as
$\delta_{ij,k} = \cos^{-1} (\hat{g}_{ij} \cdot \hat{x}_k),$
where $\hat{a}$ is a unit vector in the direction of a vector $a$ (this is the distance function in the code).
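As a rough sketch of what the distance function has to compute (using plain 3-tuples and hypothetical names rather than the hmatrix vectors the real code uses):

```haskell
type P3 = (Double, Double, Double)

dot3 :: P3 -> P3 -> Double
dot3 (x1, y1, z1) (x2, y2, z2) = x1 * x2 + y1 * y2 + z1 * z2

norm3 :: P3 -> Double
norm3 p = sqrt (dot3 p p)

-- Angular distance: the angle between the unit vectors along g and x.
-- The clamp guards against floating-point rounding pushing the cosine
-- fractionally outside [-1, 1], which would make acos return NaN.
angDist :: P3 -> P3 -> Double
angDist g x = acos (max (-1) (min 1 (dot3 g x / (norm3 g * norm3 x))))
```

Note that the angular distance only depends on the directions of the two vectors, which is exactly what we want after projecting everything onto the unit sphere.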
We can then define an angular Epanechnikov kernel as
$K(\delta) = A (1 - u^2) \, \mathbf{1}_{\{u \leq 1\}},$
where $u = \delta / h$ for a bandwidth $h$ (which is an angular bandwidth here) and where $A$ is a normalisation constant. The inband function in the code calculates an unnormalised version of this kernel for all grid points closer to a given data point than the specified bandwidth. We accumulate these unnormalised kernel values for all grid points using the hmatrix build function.
To make life a little simpler, we deal with normalisation of the spherical PDF after we accumulate all of the kernel values, calculating an overall normalisation constant from the unnormalised PDF $p_u(\theta, \phi)$ as
$C = \int_{S^2} p_u(\theta, \phi) \, \sin \theta \, d\theta d\phi$
(the value int in the code) and dividing all of the accumulated kernel values by this normalisation constant to give the final normalised spherical PDF (called norm in the code).
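As a sketch of how this normalisation integral might be approximated on the $(\theta, \phi)$ grid (hypothetical names, and a simple midpoint Riemann sum; the real code’s int calculation may differ in detail), note the $\sin \theta$ area element that weights each grid cell:

```haskell
-- Hedged sketch: approximate C = integral of pu * sin(theta) dtheta dphi
-- over the sphere by a midpoint Riemann sum on an ntheta x nphi grid.
-- pu is a list of rows, one per theta value, each with nphi entries.
normConst :: Int -> Int -> [[Double]] -> Double
normConst ntheta nphi pu =
  sum [ pu !! i !! j * sin (theta i) * dth * dph
      | i <- [0 .. ntheta - 1], j <- [0 .. nphi - 1] ]
  where
    dth = pi / fromIntegral ntheta        -- theta cell width
    dph = 2 * pi / fromIntegral nphi      -- phi cell width
    theta i = (fromIntegral i + 0.5) * dth  -- midpoint of theta cell i
```

A quick sanity check: for a constant unnormalised PDF of 1 the sum should come out close to the surface area of the unit sphere, $4\pi$.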
Most of the rest of the code is then concerned with writing the results to a NetCDF file for further processing.
The spherical PDF that we get from this KDE calculation is shown in the next plot, parametrised by spherical polar coordinates $\theta$ and $\phi$: darker colours show regions of greater probability density.
We can now see quite clearly that the distribution of spatial patterns of $Z_{500}$ in $(\theta, \phi)$ space does appear to be nonuniform, with some preferred regions and some less preferred regions. In the next article but one, we’ll address the question of the statistical significance of these apparent “bumps” in the PDF, before we go on to look at what sort of flow regimes the preferred regions of $(\theta, \phi)$ space represent. (Four of the more prominent “bumps” are labelled in the figure for reference later on.)
Before that though, in the next article we’ll do some optimisation of the spherical PDF KDE code so that it’s fast enough to use for the sampling-based significance testing approach we’re going to follow.
Although the basics of the “project onto eigenvectors of the covariance matrix” prescription do hold just the same in the case of spatiotemporal data as in the simple two-dimensional example we looked at in the earlier article, there are a number of things we need to think about when we come to look at PCA for spatiotemporal data. Specifically, we need to think about data organisation, the interpretation of the output of the PCA calculation, and the interpretation of PCA as a change of basis in a spatiotemporal setting. Let’s start by looking at data organisation.
The $Z_{500}$ anomaly data we want to analyse has 66 × 151 = 9966 days of data, each of which has 72 × 15 = 1080 spatial points. In our earlier two-dimensional PCA example, we performed PCA on a collection of two-dimensional data points. For the $Z_{500}$ data, it’s pretty clear that the “collection of points” covers the time steps, and each “data point” is a 72 × 15 grid of $Z_{500}$ values. We can think of each of those grids as a 1080-dimensional vector, just by flattening all the grid values into a single row, giving us a sequence of “data points” as vectors in $\mathbb{R}^{1080}$ that we can treat in the same kind of way as we did the two-dimensional data points in the earlier example. Our input data thus ends up being a set of 9966 1080-dimensional vectors, instead of 500 two-dimensional vectors (as for the mussel data). If we do PCA on this collection of 1080-dimensional vectors, the PCA eigenvectors will have the same shape as the input data vectors, so we can interpret them as spatial patterns, just by inverting the flattening we did to get from spatial maps of $Z_{500}$ to vectors – as long as we interpret each entry in the eigenvectors as the same spatial point as the corresponding entry in the input data vectors, everything works seamlessly. The transformation goes like this:
pattern → vector → PCA → eigenvector → eigenpattern
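The flattening and its inverse are trivial, but worth making explicit. A sketch with plain lists (hypothetical names; row-major order, so each inner list is one latitude row):

```haskell
-- Hedged sketch: flatten a map (list of rows) into a single vector,
-- and reshape a flat vector back into rows of ncols entries. As long
-- as the same flattening is used everywhere, eigenvectors can be
-- turned back into eigenpatterns with unflattenGrid.
flattenGrid :: [[Double]] -> [Double]
flattenGrid = concat

unflattenGrid :: Int -> [Double] -> [[Double]]
unflattenGrid _ [] = []
unflattenGrid ncols xs = let (row, rest) = splitAt ncols xs
                         in row : unflattenGrid ncols rest
```

Applied to the real data, flattenGrid would take a 72 × 15 map to a 1080-vector and unflattenGrid would invert it exactly.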
So we have an interpretation of the PCA eigenvectors (which we’ll henceforth call “PCA eigenpatterns” to emphasise that they’re spatial patterns of variability) in this spatiotemporal data case. What about the PCA eigenvalues? These have exactly the same interpretation as in the twodimensional case: they measure the variance “explained” by each of the PCA eigenpatterns. And finally, the PCA projected components tell us how much of each PCA eigenpattern is present in each of the input data vectors. Since our input data has one spatial grid per time step, the projections give us one time series for each of the PCA eigenvectors, i.e. one time series of PCA projected components per spatial point in the input. (In one way, it’s kind of obvious that we need this number of values to reproduce the input data perfectly – I’ll say a little more about this when we think about what “basis” means in this setting.)
The PCA calculation works just the same as it did for the two-dimensional case: starting with our 1080-dimensional data, we centre the data, calculate the covariance matrix (which in this case is a 1080 × 1080 matrix, the diagonal entries of which measure the variances at each spatial point and the off-diagonal entries of which measure the covariances between each pair of spatial points), perform an eigendecomposition of the covariance matrix, then project each of the input data points onto each of the eigenvectors of the covariance matrix.
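The centring and covariance steps can be sketched with plain lists like this (hypothetical names; the real pca function uses hmatrix and is far more efficient):

```haskell
-- Hedged sketch: mean vector of a collection of flattened data vectors.
meanVec :: [[Double]] -> [Double]
meanVec xs = map (/ fromIntegral (length xs)) (foldr1 (zipWith (+)) xs)

-- Hedged sketch: sample covariance matrix. Entry (i, j) is the
-- covariance between components i and j across the centred data.
covMat :: [[Double]] -> [[Double]]
covMat xs = [ [ sum [ c !! i * c !! j | c <- centred ] / n1
              | j <- idx ]
            | i <- idx ]
  where
    m       = meanVec xs
    centred = map (\x -> zipWith (-) x m) xs   -- subtract the mean
    n1      = fromIntegral (length xs - 1)     -- unbiased normalisation
    idx     = [0 .. length m - 1]
```

For the $Z_{500}$ data the diagonal of this matrix holds the per-grid-point variances and the off-diagonal entries the covariances between pairs of grid points, exactly as described above.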
We’ve talked about PCA as being nothing more than a change of basis, in the two-dimensional case from the “normal” Euclidean basis (with unit basis vectors pointing along the $x$- and $y$-coordinate axes) to another orthonormal basis whose basis vectors are the PCA eigenvectors. How does this work in the spatiotemporal setting? This is probably the point that confuses most people in going from the simple two-dimensional example to the $N$-dimensional spatiotemporal case, so I’m going to labour the point a bit to make things as clear as possible.
First, what’s the “normal” basis here? Each time step of our input data specifies a $Z_{500}$ value at each point in space – we have one number in our data vector for each point in our grid. In the two-dimensional case, we had one number for each of the mussel shell measurements we took (length and width). For the $Z_{500}$ data, the 1080 data values are the $Z_{500}$ values measured at each of the spatial points. In the mussel shell case, the basis vectors pointed in the $x$-axis direction (for shell length) and the $y$-axis direction (for the shell width). For the $Z_{500}$ case, we somehow need basis vectors that point in each of the “grid point directions”, one for each of the 1080 grid points. What do these look like? Imagine a spatial grid of the same shape (i.e. 72 × 15) as the $Z_{500}$ data, where all the grid values are zero, except for one point, which has a grid value of one. That is a basis vector pointing in the “direction” of the grid point with the nonzero data value. We’re going to call this the “grid” basis for brevity. We can represent the $Z_{500}$ value at any spatial point $(i, j)$ as
$Z_{500}(i, j) = \sum_{k = 1}^{1080} \phi_k e_k(i, j)$
where $e_k(i, j)$ is zero unless $k = 15(i - 1) + j$, in which case it’s one (i.e. it’s exactly the basis element we just described, where we’re numbering the basis elements in row-major order) and $\phi_k$ is a “component” in the expansion of the $Z_{500}$ field using this grid basis. Now obviously here, because of the basis we’re using, we can see immediately that $\phi_{15(i-1)+j} = Z_{500}(i, j)$, but this expansion holds for any orthonormal basis, so we can transform to a basis where the basis vectors are the PCA eigenvectors, just as for the two-dimensional case. If we call these eigenvectors $\tilde{e}_k(i, j)$, then
$Z_{500}(i, j) = \sum_{k = 1}^{1080} \tilde{\phi}_k \tilde{e}_k(i, j),$
where the $\tilde{\phi}_k$ are the components in the PCA eigenvector basis. Now though, the $\tilde{e}_k(i, j)$ aren’t just the “zero everywhere except at one point” grid basis vectors, but they can have nonzero values anywhere.
Compare this to the case for the two-dimensional example, where we started with data in a basis that had separate measurements for shell length and shell width, then transformed to the PCA basis where the length and width measurements were “mixed up” into a sort of “size” measurement and a sort of “aspect ratio” measurement. The same thing is happening here: instead of looking at the $Z_{500}$ data in terms of the variations at individual grid points (which is what we see in the grid basis), we’re going to be able to look at variations in terms of coherent spatial patterns that span many grid points. And because of the way that PCA works, those patterns are the “most important”, in the sense that they are the orthogonal (which in this case means uncorrelated) patterns that explain most of the total variance in the $Z_{500}$ data.
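To see that the grid basis expansion really is just a rewriting of the data, here’s a tiny sketch with plain lists and hypothetical names: summing $\phi_k e_k$ over the grid basis vectors reconstructs the original field exactly.

```haskell
-- Hedged sketch: the k-th grid basis vector in an n-dimensional space
-- is 1 at flat index k and 0 everywhere else.
gridBasis :: Int -> Int -> [Double]
gridBasis n k = [ if m == k then 1 else 0 | m <- [0 .. n - 1] ]

-- Summing phi_k * e_k over all k gives back the original flat field,
-- since in the grid basis the components are the field values.
reconstruct :: [Double] -> [Double]
reconstruct phis = foldr1 (zipWith (+))
                     [ map (phi *) (gridBasis n k)
                     | (k, phi) <- zip [0 ..] phis ]
  where n = length phis
```

The same expansion with the PCA eigenvectors in place of gridBasis also reconstructs the field exactly; the difference is only in which basis the components live.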
As I’ve already mentioned, I’m going to try to be consistent in terms of the terminology I use: I’m only ever going to talk about “PCA eigenvalues”, “PCA eigenpatterns”, and “PCA projected components” (or “PCA projected component time series”). Given the number of discussions I’ve been involved in in the past where people have been talking past each other just because one person means one thing by “principal component” and the other means something else, I’d much rather pay the price of a little verbosity to avoid that kind of confusion.
The PCA calculation for the $Z_{500}$ data can be done quite easily in Haskell. We’ll show in this section how it’s done, and we’ll use the code to address a couple of remaining issues with how spatiotemporal PCA works (specifically, area scaling for data in latitude/longitude coordinates and the relative scaling of PCA eigenpatterns and projected components).
There are three main steps to the PCA calculation: first we need to centre our data and calculate the covariance matrix, then we need to do the eigendecomposition of the covariance matrix, and finally we need to project our original data onto the PCA eigenvectors. We need to think a little about the data volumes involved in these steps. Our $Z_{500}$ data has 1080 spatial points, so the covariance matrix will be a 1080 × 1080 matrix, i.e. it will have 1,166,400 entries. This isn’t really a problem, and performing an eigendecomposition of a matrix of this size is pretty quick. What can be more of a problem is the size of the input data itself – although we only have 1080 spatial points, we could in principle have a large number of time samples, enough that we might not want to read the whole of the data set into memory at once for the covariance matrix calculation. We’re going to demonstrate two approaches here: in the first “online” calculation, we’ll just read all the data at once and assume that we have enough memory; in the second “offline” approach, we’ll only ever read a single time step of $Z_{500}$ data at a time into memory. Note that in both cases, we’re going to calculate the full covariance matrix in memory and do a direct eigendecomposition using SVD. There are offline approaches for calculating the covariance and there are iterative methods that allow you to calculate some eigenvectors of a matrix without doing a full eigendecomposition, but we’re not going to worry about that here.
As usual, the code is in a Gist.
For the online calculation, the PCA calculation itself is identical to our two-dimensional test case and we reuse the pca function from the earlier post. The only thing we need to do is to read the data in as a matrix to pass to the pca function. In fact, there is one extra thing we need to do before passing the $Z_{500}$ anomaly data to the pca function. Because the $Z_{500}$ data is sampled on a regular latitude/longitude grid, grid points near the North pole correspond to much smaller areas of the earth than grid points closer to the equator. In order to compensate for this, we scale the $Z_{500}$ anomaly data values by the square root of the cosine of the latitude – this leads to covariance matrix values that scale as the cosine of the latitude, which gives the correct area weighting. The listing below shows how we do this. First we read the NetCDF data, then we use the hmatrix build function to construct a suitably scaled data matrix:
Right z500short <- get innc z500var :: RepaRet3 CShort

-- Convert anomaly data to a matrix of floating point values,
-- scaling by square root of cos of latitude.
let latscale = SV.map (\lt -> realToFrac $ sqrt $ cos (lt / 180.0 * pi)) lat
    z500 = build (ntime, nspace)
           (\t s -> let it = truncate t :: Int
                        is = truncate s :: Int
                        (ilat, ilon) = divMod is nlon
                        i = Repa.Z Repa.:. it Repa.:. ilat Repa.:. ilon
                    in (latscale ! ilat) *
                       (fromIntegral $ z500short Repa.! i)) :: Matrix Double
Once we have the scaled $Z_{500}$ anomaly data in a matrix, we call the pca function, which does both the covariance matrix calculation and the PCA eigendecomposition and projection, then write the results to a NetCDF file. We end up with a NetCDF file containing 1080 PCA eigenpatterns, each with 72 × 15 data points on our latitude/longitude grid, and PCA projected component time series each with 9966 time steps.
One very important thing to note here is the relative scaling of the PCA eigenpatterns and the PCA projected component time series. In the two-dimensional mussel shell example, there was no confusion about the fact that the PCA eigenvectors as we presented them were unit vectors, and the PCA projected components had the units of length measured along those unit vectors. Here, in the spatiotemporal case, there is much potential for confusion (and the range of conventions in the climate science literature doesn’t do anything to help alleviate that confusion). To make things very clear: here, the PCA eigenvectors are still unit vectors and the PCA projected component time series have the units of $Z_{500}$!
The reason for the potential confusion is that people quite reasonably like to draw maps of the PCA eigenpatterns, but they also like to think of these maps as being spatial patterns of $Z_{500}$ variation, not just as basis vectors. This opens the door to all sorts of more or less reputable approaches to scaling the PCA eigenpatterns and projected components. One well-known book on statistical analysis in climate research suggests that people should scale their PCA eigenpatterns by the standard deviation of the corresponding PCA projected component time series and the values of the PCA projected component time series should be divided by their standard deviation. The result of this is that the maps of the PCA eigenpatterns look like $Z_{500}$ maps and all of the PCA projected component time series have standard deviation of one. People then talk about the PCA eigenpatterns as showing a “typical ± 1 SD” event.
Here, we’re going to deal with this issue by continuing to be very explicit about what we’re doing. In all cases, our PCA eigenpatterns will be unit vectors, i.e. the things we get back from the pca function, without any scaling. That means that the units in our data live on the PCA projected component time series, not on the PCA eigenpatterns. When we want to look at a map of a PCA eigenpattern in a way that makes it look like a “typical” $Z_{500}$ deviation from the mean (which is a useful thing to do), we will say something like “This plot shows the first PCA eigenpattern scaled by the standard deviation of the first PCA projected component time series.” Just to be extra explicit!
The “online” PCA calculation didn’t require any extra work, apart from some type conversions and the area scaling we had to do. But what if we have too much data to read everything into memory in one go? Here, I’ll show you how to do a sort of “offline” PCA calculation. By “offline”, I mean an approach that only ever reads a single time step of data from the input at a time, and only ever writes a single time step of the PCA projected component time series to the output at a time.
Because we’re going to be interleaving calculation and I/O, we’re going to need to make our PCA function monadic. Here’s the main offline PCA function:
pcaM :: Int -> (Int -> IO V) -> (Int -> V -> IO ()) -> IO (V, M)
pcaM nrow readrow writeproj = do
  (mean, cov) <- meanCovM nrow readrow
  let (_, evals, evecCols) = svd cov
      evecs = fromRows $ map evecSigns $ toColumns evecCols
      evecSigns ev = let maxelidx = maxIndex $ cmap abs ev
                         sign = signum (ev ! maxelidx)
                     in cmap (sign *) ev
      varexp = scale (1.0 / sumElements evals) evals
      project x = evecs #> (x - mean)
  forM_ [0..nrow-1] $ \r -> readrow r >>= writeproj r . project
  return (varexp, evecs)
It makes use of a couple of convenience type synonyms:
type V = Vector Double
type M = Matrix Double
The pcaM
function takes as arguments the number of data rows to process (in our case, the number of time steps), an IO
action to read a single row of data (given the zerobased row index), and an IO
action to write a single row of PCA projected component time series data. As with the “normal” pca
function, the pcaM
function returns the PCA eigenvalues and PCA eigenvectors as its result.
Most of the pcaM
function is the same as the pca
function. There are only two real differences. First, the calculation of the mean and covariance of the data uses the meanCovM
function that we’ll look at in a moment. Second, the writing of the PCA projected component time series output is done by a monadic loop that uses the IO
actions passed to pcaM
to alternately read, project and write out rows of data (the pca
function just returned the PCA projected component time series to its caller in one go).
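To make the calling convention concrete, here is a toy driver (my own sketch, not part of the analysis code; it assumes the pcaM function and type synonyms above are in scope) that serves rows from an in-memory list and collects the projected output in an IORef. In the real offline setting, the two IO actions would read from and write to files instead:

```haskell
import Data.IORef
import Numeric.LinearAlgebra.HMatrix

-- Hypothetical driver for pcaM: four 3-dimensional "time steps".
testDrive :: IO ()
testDrive = do
  let rows = [ vector [1, 2, 1], vector [2, 4, 0]
             , vector [3, 5, 1], vector [4, 8, 0] ]
  out <- newIORef []
  (varexp, _evecs) <- pcaM (length rows)
                           (\i -> return (rows !! i))          -- "read" row i
                           (\_ v -> modifyIORef out (++ [v]))  -- "write" one projected row
  projs <- readIORef out
  print varexp      -- fractions of variance explained
  mapM_ print projs -- projected component values, one vector per time step
```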
Most of the real differences from the pca
function lie in the calculation of the mean and covariance of the input data:
meanCovM :: Int -> (Int -> IO V) -> IO (V, M)
meanCovM nrow readrow = do
  -- Accumulate values for mean calculation.
  refrow <- readrow 0
  let maddone acc i = do
        row <- readrow i
        return $! acc + row
  mtotal <- foldM maddone refrow [1..nrow-1]
  -- Calculate sample mean.
  let mean = scale (1.0 / fromIntegral nrow) mtotal
  -- Accumulate differences from mean for covariance calculation.
  let refdiff = refrow - mean
      caddone acc i = do
        row <- readrow i
        let diff = row - mean
        return $! acc + (diff `outer` diff)
  ctotal <- foldM caddone (refdiff `outer` refdiff) [1..nrow-1]
  -- Calculate sample covariance.
  let cov = scale (1.0 / fromIntegral (nrow - 1)) ctotal
  return (mean, cov)
Since we don’t want to read more than a single row of input data at a time, we need to explicitly accumulate data for the mean and covariance calculations. That means making two passes over the input data file, reading a row at a time – the maddone
and caddone
helper functions accumulate a single row of data for the mean and covariance calculations. The accumulator for the mean is pretty obvious, but that for the covariance probably deserves a bit of comment. It uses the hmatrix
outer
function to calculate $(x_i - \bar{x}) (x_i - \bar{x})^T$ (where $x_i$ is the $i$th data row (as a column vector) and $\bar{x}$ is the data mean), which is the appropriate contribution to the covariance matrix for each individual data row.
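In equation form, the two accumulation passes compute

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \mathbf{C} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}) (x_i - \bar{x})^T,$

with each call to maddone adding one term of the first sum and each call to caddone adding one outer-product term of the second.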
Overall, the offline PCA calculation makes three passes over the input data file (one for the mean, one for the covariance, one to project the input data onto the PCA eigenvectors), reading a single data row at a time. That makes it pretty slow, certainly far slower than the online calculation, which reads all of the data into memory in one go, then does all the mean, covariance and projection calculations in memory, and finally writes out the PCA projected components in one go. However, if you have enough data that you can’t do an online calculation, this is the way to go. You can obviously imagine ways to make this more efficient, probably by reading batches of data rows at a time. You’d still need to do three passes over the data, but batching the reads would make things a bit quicker.
There are three things we can look at that come out of the PCA analysis of the $Z_{500}$ anomaly data: the PCA eigenvalues (best expressed as “fraction of variance explained”), the PCA eigenpatterns and the PCA projected component time series.
First, let’s look at the eigenvalues. This plot shows the fraction of variance explained for the first 100 PCA eigenvalues of the $Z_{500}$ anomaly data, both individually (blue) and cumulatively (orange):
The eigenvalues are ordered in decreasing order of magnitude in what’s usually called a “scree plot”. The reason for the name is pretty obvious, since the eigenvalues fall off quickly in magnitude, giving the graph the look of a cliff face with a talus slope at its foot. We often look for a “knee” in a plot like this to get some idea of how many PCA eigenpatterns we need to consider to capture a good fraction of the total variance in the data we’re looking at. Here we can see that just ten of the PCA eigenpatterns explain about half of the total variance in the $Z_{500}$ anomaly data (which is a set of 1080-dimensional vectors, remember). However, there’s not all that much of a “knee” in the scree plot here, which is pretty typical for climate and meteorological data – we often see a gradual falloff in PCA eigenvalue magnitude rather than a discrete set of larger magnitude eigenvalues that we can identify as “the important ones”.
We can get some idea of what’s going on with this gradual falloff by looking at the PCA eigenpatterns. As mentioned in the previous section, there is a question about how we scale these for display. To be completely explicit about things, here we’re going to plot PCA eigenpatterns scaled by the standard deviation of the corresponding PCA projected component time series. This gives us “typical one standard deviation” patterns that we can plot with units of geopotential height. These are usually easier to interpret than the “unit vector” PCA eigenpatterns that come out of the PCA calculation.
Here are the first six PCA eigenpatterns for the $Z_{500}$ anomaly data (you can click on these images to see larger versions; the numbers in parentheses show the fraction of total $Z_{500}$ anomaly variance explained by each PCA eigenpattern):
For comparison, here are eigenpatterns $10, 20, \dots, 60$:
The first thing to note about these figures is that the spatial scales of variation for the PCA eigenpatterns corresponding to smaller eigenvalues (i.e. smaller explained variance fractions) are also smaller – for the most extreme case, compare the dipolar circumpolar spatial pattern for the first eigenpattern (first plot in the first group of plots) to the fine-scale spatial features for the 60th eigenpattern (last plot in the second group). This is what we often see when we do PCA on atmospheric data. The larger spatial scales capture most of the variability in the data so are represented by the first few eigenpatterns, while smaller scale spatial variability is represented by later eigenpatterns. Intuitively, this is probably related to the power-law scaling in the turbulent cascade of energy from large (planetary) scales to small scales (where dissipation by thermal diffusion occurs) in the atmosphere^{1}.
The next thing we can look at, at least in the first few patterns in the first group of plots, are some of the actual patterns of variability these things represent. The first PCA eigenpattern, for example, represents a dipole in $Z_{500}$ anomaly variability with poles in the North Atlantic just south of Greenland and over mainland western Europe. If you look back at the blocking $Z_{500}$ anomaly plots in an earlier post, you can kind of convince yourself that this first PCA eigenpattern looks a little like some instances of a blocking pattern over the North Atlantic. Similarly, the second PCA eigenpattern is mostly a dipole between the North Pacific and North America (with some weaker associated variability over the Atlantic), so we might expect this somehow to be related to blocking episodes in the Pacific sector.
This is all necessarily a bit vague, because these patterns represent only part of the variability in the data, with each individual pattern representing only a quite small fraction of the variability (8.86% for the first eigenpattern, 7.46% for the second, 6.27% for the third). At any particular point in time, the pattern of $Z_{500}$ anomalies in the atmosphere will be made up of contributions from these patterns plus many others. What we hope though is that we can tease out some interesting characteristics of the atmospheric flow by considering just a subset of these PCA eigenpatterns. Sometimes this is really easy and obvious – if you perform PCA and find that there are two leading eigenpatterns that explain 80% of the variance in your data, then you can quite straightforwardly press ahead with analysing only those two patterns of variability, safe in the knowledge that you’re capturing most of what’s going on in your data. In our case, we’re going to try to get some sense of what’s going on by looking at only the first three PCA eigenpatterns (we’ll see why three in the next article). The first three eigenpatterns explain only 22.59% of the total variance in our $Z_{500}$ anomaly data, so this isn’t obviously a smart thing to do. It does turn out to work and to be quite educational though!
The last component of the output from the PCA procedure is the time series of PCA projected component values. Here we have one time series (of 9966 days) for each of the 1080 PCA eigenpatterns that we produced. At each time step, the actual $Z_{500}$ anomaly field can be recovered by adding up all the PCA eigenpatterns, each weighted by the corresponding projected component. You can look at plots of these time series, but they’re not in themselves all that enlightening. I’ll say some more about them in the next article, where we need to think about the autocorrelation properties of these time series.
(As a side note, I’d comment that the PCA eigenpatterns shown above match up pretty well with those in Crommelin’s paper, which is reassuring. The approach we’re taking here, of duplicating the analysis done in an existing paper, is actually a very good way to go about developing new data analysis code – you can see quite quickly if you screw things up as you’re going along by comparing your results with what’s in the paper. Since I’m just making up all the Haskell stuff here as I go along, this is pretty handy!)
But don’t make too much of that, not in any kind of quantitative sense anyway – there’s certainly no obvious power law scaling in the explained variance of the PCA eigenpatterns as a function of eigenvalue index, unless you look at the data with very power-law-tinted spectacles! I’m planning to look at another paper at some point in the future that will serve as a good vehicle for exploring this question of when and where we can see power law behaviour in observational data.↩
The preprocessing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do – it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a lot of data: 66 years of 181day winters, each day of which has $72 \times 15$ $Z_{500}$ values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called dimensionality reduction, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.
I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to sidestep the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, principal components analysis (PCA). We’ll take a look in detail at what kind of “interesting” and “important” PCA gives us a little later.
PCA is, in principle, quite a simple method, but it causes many people endless problems. There are some very good reasons for this:
PCA is in some sense nothing more than a generic change of basis operation (with the basis we change to chosen in a special way). The result of this is that a lot of the terminology used about PCA is also very generic, and hence very confusing (words like “basis”, “component”, “eigenvector”, “projection” and so on could mean more or less anything in this context!).
PCA is used in nearly every field where multivariate data is analysed, and is the archetypical “unsupervised learning” method. This means that it has been invented, reinvented, discovered and rediscovered many times, under many different names. Some other names for it are: empirical orthogonal function (EOF) analysis, the Karhunen-Loève decomposition, proper orthogonal decomposition (POD), and there are many others. Each of these different fields also uses different terms for the different outputs from PCA. This is very confusing: some people talk about principal components, some about empirical orthogonal functions and principal component time series, some about basis functions, and so on. Here, we’re going to try to be very clear and careful about the names that we use for things to try to alleviate some of the confusion.
There is a bit of a conceptual leap that’s necessary to go from very basic examples of using PCA to using PCA to analyse the kind of spatiotemporal data we have here. I used to say something like: “Well, there’s a nice two-dimensional example, and it works just the same in 100 dimensions, so let’s just apply it to our atmospheric data!” A perfectly reasonable response to that is: “WHAT?! Are you an idiot?”. Here, we’re going to take that conceptual leap slowly, and describe exactly how the “change of basis” view of PCA works for spatiotemporal data.
There are some aspects of the scaling of the different outputs from PCA that are really confusing. In simple terms, PCA breaks your data down into two parts, and you could choose to put the units of your data on either one of those parts, normalising the other part. Which one you put the units on isn’t always an obvious choice and it’s really easy to screw things up if you do it wrong. We’ll look at this carefully here.
So, there’s quite a bit to cover in the next couple of articles. In this article, we will: explain the basic idea of PCA with a very simple (twodimensional!) example; give a recipe for how to perform PCA on a data set; talk about why PCA works from an algebraic standpoint; talk about how to do these calculations in Haskell. Then in the next article, we will: describe exactly how we do PCA on spatiotemporal data; demonstrate how to perform PCA on the $Z_{500}$ anomaly data; show how to visualise the $Z_{500}$ PCA results and save them for later use. What we will end up with from this stage of our analysis is a set of “important” spatial patterns (we’ll see what “important” means for PCA) and time series of how strong each of those spatial patterns is at a particular point in time. The clever thing about this decomposition is that we can restrict our attention to the few most “important” patterns and discard all the rest of the variability in the data. That makes the subsequent exploration of the data much simpler.
We’re going to take our first look at PCA using a very simple example. It might not be immediately obvious how the technique we’re going to develop here will be applicable to the spatiotemporal $Z_{500}$ data we really want to analyse, but we’ll get to that a little later, after we’ve seen how PCA works in this simple example and we’ve done a little algebra to get a clearer understanding of just why the “recipe” we’re going to use works the way that it does.
Suppose we go to the seaside and measure the shells of mussels^{1}. We’ll measure the length and width of each shell and record the data for each mussel as a two-dimensional (length, width) vector. There will be variation in the sizes and shapes of the mussels, some longer, some shorter, some fatter, some skinnier. We might end up with data that looks something like what’s shown below, where there’s a spread of length in the shells around a mean of about 5 cm, a spread in the width of shells around a mean of about 3 cm, and there’s a clear correlation between shell length and width (see Figure 1 below). Just from eyeballing this picture, it seems apparent that maybe measuring shell length and width might not be the best way to represent this data – it looks as though it could be better to think of some combination of length and width as measuring the overall “size” of a mussel, and some other combination of length and width as measuring the “fatness” or “skinniness” of a mussel. We’ll see how a principal components analysis of this data extracts these two combinations in a clear way.
The code for this post is available in a Gist. The Gist contains a Cabal file as well as the Haskell source, to make it easy to build. Just do something like this to build and run the code in a sandbox:
git clone https://gist.github.com/d39bf143ffc482ea3700.git pca2d
cd pca2d
cabal sandbox init
cabal install
./.cabal-sandbox/bin/pca2d
Just for a slight change, I’m going to produce all the plots in this section using Haskell, specifically using the Chart
library. We’ll use the hmatrix
library for linear algebra, so the imports we end up needing are:
import Control.Monad
import Numeric.LinearAlgebra.HMatrix
import Graphics.Rendering.Chart.Easy hiding (Matrix, Vector, (|>), scale)
import Graphics.Rendering.Chart.Backend.Cairo
There are some name overlaps between the monadic plot interface provided by the Graphics.Rendering.Chart.Easy
module and hmatrix
, so we just hide the overlapping ones.
We generate 500 synthetic data points:
-- Number of test data points.
n :: Int
n = 500

-- Mean, standard deviation and correlation for two dimensions of test
-- data.
meanx, meany, sdx, sdy, rho :: Double
meanx = 5.0 ; meany = 3.0 ; sdx = 1.2 ; sdy = 0.6 ; rho = 0.75

-- Generate test data.
generateTestData :: Matrix Double
generateTestData =
  let seed = 1023
      mean = 2 |> [ meanx, meany ]
      cov = matrix 2 [ sdx^2       , rho*sdx*sdy
                     , rho*sdx*sdy , sdy^2 ]
      samples = gaussianSample seed n mean cov
  in fromRows $ filter ((> 0) . minElement) $ toRows samples
The mussel shell length and width values are generated from a two-dimensional Gaussian distribution, where we specify mean and standard deviation for both shell length and width, and the correlation between the length and width (as the usual Pearson correlation coefficient). Given this information, we can generate samples from the Gaussian distribution using hmatrix
’s gaussianSample
function. (If we didn’t have this function, we would calculate the Cholesky decomposition of the covariance matrix we wanted, generate samples from a pair of standard one-dimensional Gaussian distributions and multiply two-dimensional vectors of these samples by one of the Cholesky factors of the covariance matrix – this is just what the gaussianSample
function does for us.) We do a little filtering in generateTestData
to make sure that we don’t generate any negative values^{2}.
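For the curious, the by-hand route just described might look roughly like this (a sketch only, with names of my own choosing; note that the exact signature of chol has shifted a little between hmatrix versions):

```haskell
import Numeric.LinearAlgebra

-- Sketch: sample n points from N(mean, cov) without gaussianSample.
-- randn fills a matrix with IID standard normal values; multiplying by
-- the upper Cholesky factor u (where cov == tr u <> u) gives rows with
-- the required covariance, and we then add the mean back on.
gaussianByHand :: Int -> Vector Double -> Matrix Double -> IO (Matrix Double)
gaussianByHand n mean cov = do
  z <- randn n (size mean)
  let u = chol (trustSym cov)
  return $ fromRows $ map (+ mean) $ toRows (z <> u)
```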
The main program that drives the generation of the plots we’ll look at below is:
main :: IO ()
main = do
  let dat = generateTestData
      (varexp, evecs, projs) = pca dat
      (mean, cov) = meanCov dat
      cdat = fromRows $ map (subtract mean) $ toRows dat
  forM_ [(PNG, "png"), (PDF, "pdf"), (SVG, "svg")] $ \(ptype, suffix) -> do
    doPlot ptype suffix dat evecs projs 0
    doPlot ptype suffix cdat evecs projs 1
    doPlot ptype suffix cdat evecs projs 2
    doPlot ptype suffix cdat evecs projs 3
  putStrLn $ "FRACTIONAL VARIANCE EXPLAINED: " ++ show varexp
and you can see the doPlot
function that generates the individual plots in the Gist. I won’t say a great deal about the plotting code, except to observe that the new monadic API to the Chart
library makes generating this kind of simple plot in Haskell no harder than it would be using Gnuplot or something similar. The plot code produces one of four plots depending on an integer parameter, which ranges from zero (the first plot above) to three. Because we’re using the Cairo backend to the Chart
library, we can generate image output in any of the formats that Cairo supports – here we generate PDF (to insert into LaTeX documents), SVG (to insert into web pages) and PNG (for a quick look while we’re playing with the code).
The main program above is pretty simple: generate test data, do the PCA calculation (by calling the pca
function, which we’ll look at in detail in a minute), do a little bit of data transformation to help with plotting, then call the doPlot
function for each of the plots we want. Here are the plots we produce, which we’ll refer to below as we work through the PCA calculation:
Synthetic mussel shell test data for twodimensional PCA example.
Centred synthetic mussel shell test data for twodimensional PCA example.
PCA eigenvectors for twodimensional PCA example.
Data projection onto PCA eigenvectors for twodimensional PCA example.
Let’s now run through the “recipe” for performing PCA, looking at the figures above in parallel with the code for the pca
function:
 1  pca :: Matrix Double -> (V, M, M)
 2  pca xs =
 3    let (mean, cov) = meanCov xs
 4        (_, evals, evecCols) = svd cov
 5        evecs = fromRows $ map evecSigns $ toColumns evecCols
 6        evecSigns ev = let maxelidx = maxIndex $ cmap abs ev
 7                           sign = signum (ev ! maxelidx)
 8                       in cmap (sign *) ev
 9        varexp = scale (1.0 / sumElements evals) evals
10        project x = evecs #> (x - mean)
11    in (varexp, evecs, fromRows $ map project $ toRows xs)
We’ll look at just why this recipe works in the next section, but for the moment, let’s just see what happens:
We start with our original mussel shell data (Figure 1 above).
We calculate the mean and covariance of our data (line 3 of the pca
function listing). PCA analyses the deviations of our data from the mean, so we effectively look at “centred” data, as shown in Figure 2, where we’ve just removed the mean from each coordinate in our data. The mean and covariance calculation is conveniently done using hmatrix
’s meanCov
function.
Then we calculate the eigendecomposition of the covariance matrix. Because the covariance matrix is a real symmetric matrix, by construction, we know that the eigenvectors will form a complete set that we can use as a basis to represent our data. (We’re going to blithely ignore all questions of possible degeneracy here – for real data, “almost surely” means always!) Here, we do the eigendecomposition using a singular value decomposition (line 4 in the listing of the pca
function). The singular values give us the eigenvalues and the right singular vectors give us the eigenvectors. The choice here to use SVD (via hmatrix
’s svd
function) rather than some other means of calculating an eigendecomposition is based primarily on the perhaps slightly prejudiced idea that SVD has the best and most stable implementations – here, hmatrix
calls out to LAPACK to do this sort of thing, so there’s probably not much to choose, since the other eigendecomposition implementations in LAPACK are also good, but my prejudice in favour of SVD remains! If you want some better justification for why SVD is “the” matrix eigendecomposition, take a look at this very interesting historical review of the development of SVD: G. W. Stewart (1993). On the early history of the singularvalue decomposition. SIAM Rev. 35(4), 551566.
We do a little manipulation of the directions of the eigenvectors (lines 5-8 in the listing), flipping the signs of them to make the largest components point in the positive direction – this is mostly just to make the eigenvectors look good for plotting. The eigenvectors are shown in Figure 3: we’ll call them $\mathbf{e}_1$ (the one pointing to the upper right) and $\mathbf{e}_2$ (the one pointing to the upper left). Note that these are unit vectors. We’ll talk about this again when we look at using PCA for spatiotemporal data.
Once we have unit eigenvectors, we can project our (centred) data points onto these eigenvectors (lines 10 and 11 of the listing: the project
function centres a data point by taking off the mean, then projects onto each eigenvector using hmatrix
’s matrixvector product operator #>
). Figure 4 shows in schematic form how this works – we pick out one data point in green and draw lines parallel and orthogonal to the eigenvectors showing how we project the data point onto the eigenvectors. Doing this for each data point is effectively just a change of basis: instead of representing our centred data value by measurements along the $x$ and $y$axes, we represent it by measurements in the directions of $\mathbf{e}_1$ and $\mathbf{e}_2$. We’ll talk more about this below as well.
Finally, the eigenvalues from the eigendecomposition of the covariance matrix tell us something about how much of the total variance in our input data is “explained” by the projections onto each of the eigenvectors. I’ve put the word “explained” in quotes because I don’t think it’s a very good word to use, but it’s what everyone says. Really, we’re just saying how much of the data variance lies in the direction of each eigenvector. Just as you can calculate the variance of the mussel length and width individually, you can calculate the variance of the projections onto the eigenvectors. The eigenvalues from the PCA eigendecomposition tell you how much variance there is in each direction, and we calculate the “fraction of variance explained” for each eigenvector and return it from the pca
function.
So, the pca
function returns three things: eigenvalues (actually fractional explained variance calculated from the eigenvalues) and eigenvectors from the PCA eigendecomposition, plus projections of each of the (centred) data points onto each of the eigenvectors. The terminology for all these different things is very variable between different fields. We’re going to sidestep the question of what these things are called by always explicitly referring to PCA eigenvectors (or, later on when we’re dealing with spatiotemporal data, PCA eigenpatterns), PCA explained variance fractions and PCA projected components. These terms are a bit awkward, but there’s no chance of getting confused this way. We could choose terminology from one of the fields where PCA is commonly used, but that could be confusing for people working in other fields, since the terminology in a lot of cases is not very well chosen.
Together, the PCA eigenvectors and PCA projected components constitute nothing more than a change of orthonormal basis for representing our input data – the PCA output contains exactly the same information as the input data. (Remember that the PCA eigenvectors are returned as unit vectors from the pca
function, so we really are just looking at a simple change of basis.) So it may seem as though we haven’t really done anything much interesting with our data. The interesting thing comes from the fact that we can order the PCA eigenvectors in decreasing order of the explained variance fraction. If we find that data projected onto the first three (say) eigenvectors explains 80% of the total variance in our data, then we may be justified in considering only those three components. In this way, PCA can be used as a dimensionality reduction method, allowing us to use lowdimensional data analysis and visualisation techniques to deal with input data that has high dimensionality.
This is exactly what we’re going to do with the $Z_{500}$ data: we’re going to perform PCA, and take only the leading PCA eigenvectors and components, throwing some information away. The way that PCA works guarantees that the set of orthogonal patterns we keep are the “best” patterns in terms of explaining the variance in our data. We’ll have more to say about this in the next section when we look at why our centre/calculate covariance/eigendecomposition recipe works.
In the last section, we presented a “recipe” for PCA (at least for twodimensional data): centre the data; calculate the covariance matrix; calculate the eigendecomposition of the covariance matrix; project your centred data points onto the eigenvectors. The eigenvalues give you a measure of the proportion of the variance in your data in the direction of the corresponding eigenvector. And the projection of the data points onto the PCA eigenvectors is just a change of basis, from whatever original basis your data was measured in (mussel shell length and width as the two components of each data point in the example) to a basis with the PCA eigenvectors as basis vectors.
So why does this work? Obviously, you can use whatever basis you like to describe your data, but why is the PCA eigenbasis useful and interesting? I’ll explain this quite quickly, since it’s mostly fairly basic linear algebra, and you can read about it in more detail in more or less any linear algebra textbook^{3}.
To start with, let’s review some facts about eigenvectors and eigenvalues. For a matrix $\mathbf{A}$, an eigenvector $\mathbf{u}$ and its associated eigenvalue $\lambda$ satisfy
$\mathbf{A} \mathbf{u} = \lambda \mathbf{u}.$
The first thing to note is that any scalar multiple of $\mathbf{u}$ is also an eigenvector, so an eigenvector really refers to a “direction”, not to a specific vector with a fixed magnitude. If we multiply both sides of this by $\mathbf{u}^T$ and rearrange a little, we get
$\lambda = \frac{\mathbf{u}^T \mathbf{A} \mathbf{u}}{\mathbf{u}^T \mathbf{u}}.$
The denominator of the fraction on the right hand side is just the squared length of the vector $\mathbf{u}$. Now, we can find the largest eigenvalue $\lambda_1$ and corresponding eigenvector $\mathbf{u}_1$ by solving the optimisation problem
$\mathbf{u}_1 = \underset{\mathbf{u}^T \mathbf{u} = 1}{\mathrm{arg\,max}} \; \mathbf{u}^T \mathbf{A} \mathbf{u},$
where for convenience, we’ve restricted the optimisation to find a unit eigenvector, and we find $\lambda_1$ directly from the fact that $\mathbf{A} \mathbf{u}_1 = \lambda_1 \mathbf{u}_1$.
We can find the next largest (in magnitude) eigenvalue and corresponding eigenvector of the matrix $\mathbf{A}$ by projecting the rows of $\mathbf{A}$ into the subspace orthogonal to $\mathbf{u}_1$ to give a new matrix $\mathbf{A}_1$ and solving the optimisation problem
$\mathbf{u}_2 = \underset{\mathbf{u}^T \mathbf{u} = 1}{\mathrm{arg\,max}} \; \mathbf{u}^T \mathbf{A}_1 \mathbf{u},$
finding the second largest eigenvalue $\lambda_2$ from $\mathbf{A}_1 \mathbf{u}_2 = \lambda_2 \mathbf{u}_2$. Further eigenvectors and eigenvalues can be found in order of decreasing eigenvalue magnitude by projecting into subspaces orthogonal to all the eigenvectors found so far and solving further optimisation problems.
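A tiny worked example (my own, for concreteness): for

$\mathbf{A} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},$

the unit vector maximising $\mathbf{u}^T \mathbf{A} \mathbf{u}$ is $\mathbf{u}_1 = \tfrac{1}{\sqrt{2}} (1, 1)^T$ with $\lambda_1 = 3$ (check: $\mathbf{A} \mathbf{u}_1 = 3 \mathbf{u}_1$); maximising again in the orthogonal subspace gives $\mathbf{u}_2 = \tfrac{1}{\sqrt{2}} (1, -1)^T$ with $\lambda_2 = 1$.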
This link between this type of optimisation problem and the eigenvectors and eigenvalues of a matrix is the key to understanding why PCA works the way that it does. Suppose that we have centred our ($K$dimensional) data, and that we call the $N$ centred data vectors $\mathbf{x}_i$, $i = 1, 2, \dots N$. If we now construct an $N \times K$ matrix $\mathbf{X}$ whose rows are the $\mathbf{x}_i$, then the sample covariance of the data is
$\mathbf{C} = \frac{1}{N  1} \mathbf{X}^T \mathbf{X}.$
Now, given a direction represented as a unit vector $\mathbf{u}$, we can calculate the data variance in that direction as $\|\mathbf{X} \mathbf{u}\|^2 / (N - 1)$, so that if we want to know the direction in which the data has the greatest variance, we solve an optimisation problem of the form
$\mathbf{u}_1 = \underset{\mathbf{u}^T \mathbf{u} = 1}{\mathrm{arg\,max}} \; \frac{1}{N-1} (\mathbf{X} \mathbf{u})^T \mathbf{X} \mathbf{u} = \underset{\mathbf{u}^T \mathbf{u} = 1}{\mathrm{arg\,max}} \; \mathbf{u}^T \mathbf{C} \mathbf{u}.$
But this optimisation problem is just the eigendecomposition optimisation problem for the covariance matrix $\mathbf{C}$. This demonstrates that we can find the directions of maximum variance in our data by looking at the eigendecomposition of the covariance matrix $\mathbf{C}$ in decreasing order of eigenvalue magnitude.
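To make this concrete, here is a small numerical check (again sketched in Python with NumPy, on synthetic data): we build a centred data matrix $\mathbf{X}$, form $\mathbf{C} = \frac{1}{N-1}\mathbf{X}^T\mathbf{X}$, and confirm that the leading eigenvector of $\mathbf{C}$ really does maximise the directional variance $\mathbf{u}^T \mathbf{C} \mathbf{u}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data with an obvious long axis, then centred.
N = 500
raw = rng.standard_normal((N, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = raw - raw.mean(axis=0)

# Sample covariance matrix C = X^T X / (N - 1).
C = X.T @ X / (N - 1)

# Eigendecomposition of the symmetric covariance matrix:
# np.linalg.eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(C)
u1 = evecs[:, -1]                 # leading eigenvector

def var_along(u):
    """Data variance in unit direction u: u^T C u = ||X u||^2 / (N-1)."""
    return u @ C @ u
```

Sampling many random unit directions and comparing `var_along` against `var_along(u1)` shows that no direction beats the leading eigenvector, whose directional variance equals the largest eigenvalue.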
There are a couple of things to add to this. First, the covariance matrix is, by construction, a real symmetric matrix, so its eigenvectors form a complete basis – this means that we really can perform a change of basis from our original data to the PCA basis with no loss of information. Second, because the eigenvectors of the covariance matrix are orthogonal, the projections of our data items onto the eigenvector directions (what we’re going to call the PCA projected components) are uncorrelated. We’ll see some consequences of this when we look at performing PCA on the $Z_{500}$ data. Finally, and related to this point, it’s worth noting that PCA is a linear operation – the projected components are linearly uncorrelated, but that doesn’t mean that there can’t be some nonlinear relationship between them. There are generalisations of PCA to deal with this case, but we won’t be talking about them for the purposes of this analysis.
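The claim that the projected components are uncorrelated is also easy to verify numerically. In the sketch below (Python/NumPy, synthetic data), projecting centred data onto the full PCA basis produces components whose sample covariance matrix is diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Centre some synthetic, deliberately correlated 3-D data.
raw = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))
X = raw - raw.mean(axis=0)

# PCA basis: eigenvectors of the sample covariance matrix.
C = X.T @ X / (X.shape[0] - 1)
_, E = np.linalg.eigh(C)

# Projected components: each data row expressed in the PCA basis.
Y = X @ E

# Their sample covariance is diagonal, i.e. the components are
# linearly uncorrelated (though not necessarily independent).
CY = Y.T @ Y / (Y.shape[0] - 1)
off_diag = CY - np.diag(np.diag(CY))
```

This is just the change-of-basis identity $\mathbf{E}^T \mathbf{C} \mathbf{E} = \mathbf{\Lambda}$ seen from the data side; as noted above, it says nothing about possible nonlinear relationships between the components.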
Everything we’ve done here is pretty straightforward, but you might be wondering why we would want to change to this PCA basis at all. What’s the point? As I noted above, but is worth reiterating, the most common use for PCA, and the way that we’re going to use it with the $Z_{500}$ data, is as a dimensionality reduction method. For the $Z_{500}$ data, we have, for each day we’re looking at, $72 \times 15 = 1080$ spatial points, which is a lot of data to look at and analyse. What we usually do is to perform PCA, then ignore all but the first few leading PCA eigenvectors and projected components. Because of the way the optimisation problems described above are set up, we can guarantee that the leading $m$ PCA eigenvectors span the $m$-dimensional subspace of the original data space containing the most data variance, and we can thus convince ourselves that we aren’t missing interesting features of our data by taking only those leading components. We’ll see how this works in some detail when we do the PCA analysis of the $Z_{500}$ data, but in the mussel measurement case, this would correspond to thinking just of the projection of the mussel length and width data along the leading $\mathbf{e}_1$ eigendirection, so reducing the measurements to a single “size” parameter that neglects the variation in fatness or skinniness of the mussels. (This two-dimensional case is a bit artificial. Things will make more sense when we look at the 1080-dimensional case for the $Z_{500}$ data.)
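The dimensionality reduction step itself is just a truncated change of basis. The sketch below (Python/NumPy, with made-up data loosely mimicking the situation where a few components capture most of the variability) keeps only the leading $m$ eigenvectors and projects each high-dimensional sample down to $m$ numbers, recording the fraction of total variance retained:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 100 samples in 10 dimensions, with most of the
# variance concentrated in a 2-D subspace plus a little noise.
basis = rng.standard_normal((2, 10))
signal = rng.standard_normal((100, 2)) * np.array([5.0, 3.0])
noise = 0.1 * rng.standard_normal((100, 10))
raw = signal @ basis + noise
X = raw - raw.mean(axis=0)

# Eigendecomposition of the covariance matrix (ascending eigenvalues).
C = X.T @ X / (X.shape[0] - 1)
evals, evecs = np.linalg.eigh(C)

# Keep only the leading m = 2 eigenvectors and project the data onto
# them: each 10-D sample is reduced to just 2 projected components.
m = 2
E = evecs[:, -m:]
reduced = X @ E                    # shape (100, 2)

# Fraction of total variance retained by the m leading components.
retained = evals[-m:].sum() / evals.sum()
```

Because the leading $m$ eigenvectors span the maximum-variance $m$-dimensional subspace, `retained` is as large as any $m$-dimensional linear projection can achieve; for the $Z_{500}$ data the same computation will tell us how many of the 1080 dimensions are actually worth keeping.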
Well, not really, since I live in the mountains of Austria and there aren’t too many mussels around here, so I’ll generate some synthetic data!↩
Obviously, a Gaussian distribution is not right for quantities like lengths that are known to be positive, but here we’re just generating some data for illustrative purposes, so we don’t care all that much. If we were trying to model this kind of data though, we’d have to be more careful.↩
I like Gilbert Strang’s Linear Algebra and its Applications, although I’ve heard from some people that they think it’s a bit hard for a first textbook on the subject – if you’ve had any exposure to this stuff before though, it’s good.↩