#### March 30, 2015

I took over the day-to-day support for C2HS about 18 months ago and have now finally cleaned up all the issues on the GitHub issue tracker. It took a *lot* longer than I was expecting, mostly due to pesky “real work” getting in the way. Now seems like a good time to announce the 0.25.1 “Snowmelt” release of C2HS and to summarise some of the more interesting new C2HS features.

#### March 8, 2015

This is going to be the oldest of old hat for the cool Haskell kids who invent existential higher-kinded polymorphic whatsits before breakfast, but it amused me, and it’s the first time I’ve used some of these more interesting language extensions for something “real”.

#### February 12, 2015

(There’s no code in this post, just some examples to explain what we’re going to do next.)

Suppose we define the state of the system whose evolution we want to study by a probability vector $\mathbf{p}(t)$ – at any moment in time, we have a probability distribution over a finite partition of the state space of the system (so that if we partition the state space into $N$ components, then $\mathbf{p}(t) \in \mathbb{R}^N$). Evolution of the system as a Markov chain is then defined by the evolution rule

$\mathbf{p}(t + \Delta{}t) = \mathbf{M} \mathbf{p}(t), \qquad (1)$

where $\mathbf{M} \in \mathbb{R}^{N \times N}$ is a *Markov matrix*. This approach to modelling the evolution of probability densities has the benefit both of being simple to understand and to implement (in terms of estimating the matrix $\mathbf{M}$ from data) and, as we’ll see, of allowing us to distinguish between random “diffusive” evolution and conservative “non-diffusive” dynamics.
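
To fix ideas, here is a made-up two-component example (the numbers are purely illustrative):

$\mathbf{M} = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}, \qquad \mathbf{p}(0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad \Longrightarrow \quad \mathbf{p}(\Delta{}t) = \mathbf{M} \mathbf{p}(0) = \begin{pmatrix} 0.9 \\ 0.1 \end{pmatrix}.$

Each column of $\mathbf{M}$ sums to one, so total probability is conserved: here, all of the probability starts in the first component and gradually leaks into the second under repeated application of $\mathbf{M}$.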

We’ll see how this works by examining a very simple example.

#### February 9, 2015

The analysis of preferred flow regimes in the previous article is all very well, and in its way quite illuminating, but it was an entirely *static* analysis – we didn’t make any use of the fact that the original $Z_{500}$ data we used was a time series, so we couldn’t gain any information about transitions between different states of atmospheric flow. We’ll attempt to remedy that situation now.

What sort of approach can we use to look at the dynamics of changes in patterns of $Z_{500}$? Our $(\theta, \phi)$ parameterisation of flow patterns seems like a good start, but we need some way to model transitions between different flow states, i.e. between different points on the $(\theta, \phi)$ sphere. Each of our original $Z_{500}$ maps corresponds to a point on this sphere, so we might hope that we can come up with a way of looking at trajectories of points in $(\theta, \phi)$ space that will give us some insight into the dynamics of atmospheric flow.

#### February 6, 2015

A quick post today to round off the “static” part of our atmospheric flow analysis.

Now that we’ve satisfied ourselves that the bumps in the spherical PDF in article 8 of this series are significant (in the narrowly defined sense of the word “significant” that we’ve discussed), we might ask what sort of atmospheric flow regimes these bumps correspond to. Since each point on our unit sphere is really a point in the three-dimensional space spanned by the first three $Z_{500}$ PCA eigenpatterns that we calculated earlier, we can construct composite maps to look at the spatial patterns of flow for each bump just by combining the first three PCA eigenpatterns in proportions given by the “$(x, y, z)$” coordinates of points on the unit sphere.
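
Concretely, “combining the eigenpatterns in proportions given by the coordinates” is just a linear combination. A minimal sketch of what that might look like, assuming the eigenpatterns are available as flattened hmatrix vectors of grid-point values (the names here are illustrative, not the article’s actual code):

```haskell
import Numeric.LinearAlgebra

-- Composite Z500 map for a point (x, y, z) on the unit sphere: a
-- linear combination of the first three PCA eigenpatterns, each
-- flattened to a vector of grid-point values.
composite :: (Double, Double, Double)
          -> Vector Double -> Vector Double -> Vector Double
          -> Vector Double
composite (x, y, z) e1 e2 e3 = scale x e1 + scale y e2 + scale z e3
```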

#### February 2, 2015

The spherical PDF we constructed by kernel density estimation in the article before last appeared to have “bumps”, i.e. it’s not uniform in $\theta$ and $\phi$. We’d like to interpret these bumps as preferred regimes of atmospheric flow, but before we do that, we need to decide whether these bumps are *significant*. There is a huge amount of confusion that surrounds this idea of significance, mostly caused by blind use of “standard recipes” in common data analysis cases. Here, we have some data analysis that’s anything but standard, and that will rather paradoxically make it much easier to understand what we really mean by significance.

#### January 30, 2015

The Haskell kernel density estimation code in the last article does work, but it’s distressingly slow. Timing with the Unix `time` command (not all that accurate, but it gives a good idea of orders of magnitude) reveals that this program takes about 6.3 seconds to run. For a one-off, that’s not too bad, but in the next article, we’re going to want to run this type of KDE calculation thousands of times, in order to generate empirical distributions of null hypothesis PDF values for significance testing. So we need something faster.

#### January 27, 2015

Up to this point, all the analysis that we’ve done has been what might be called “normal”, or “pedestrian” (or even “boring”). In climate data analysis, you almost always need to do some sort of spatial and temporal subsetting and you very often do some sort of anomaly processing. And *everyone* does PCA! So there’s not really been anything to get excited about yet.

Now that we have our PCA-transformed $Z_{500}$ anomalies though, we can start to do some more interesting things. In this article, we’re going to look at how we can use the new representation of atmospheric flow patterns offered by the PCA eigenpatterns to reduce the dimensionality of our data, making it much easier to handle. We’ll then look at our data in an interesting geometrical way that allows us to focus on the *patterns* of flow while ignoring the strengths of different flows, i.e. we’ll be treating strong and weak blocking events as being the same, and strong and weak “normal” flow patterns as being the same. This simplification of things will allow us to do some statistics with our data to get an idea of whether there are statistically significant (in a sense we’ll define) flow patterns visible in our data.
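
As a rough sketch of that geometrical idea, assuming each day’s flow pattern has already been reduced to its projections $(x, y, z)$ onto the first three eigenpatterns (the function name and angle conventions below are illustrative assumptions):

```haskell
-- Reduce a day's flow pattern, represented by its projections
-- (x, y, z) onto the first three PCA eigenpatterns, to a point
-- (theta, phi) on the unit sphere, discarding the overall strength
-- of the flow.
toSphere :: (Double, Double, Double) -> (Double, Double)
toSphere (x, y, z) = (theta, phi)
  where r     = sqrt (x * x + y * y + z * z)
        theta = acos (z / r)   -- polar angle, 0 <= theta <= pi
        phi   = atan2 y x      -- azimuthal angle, -pi < phi <= pi
```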

#### October 12, 2014

Although the basics of the “project onto eigenvectors of the covariance matrix” prescription do hold just the same in the case of spatio-temporal data as in the simple two-dimensional example we looked at in the earlier article, there are a number of things we need to think about when we come to look at PCA for spatio-temporal data. Specifically, we need to think about data organisation, the interpretation of the output of the PCA calculation, and the interpretation of PCA as a change of basis in a spatio-temporal setting. Let’s start by looking at data organisation.
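
For concreteness, here is a sketch of one common organisation (each time step’s spatial field flattened into one row of a data matrix), together with one way of recovering the eigenpatterns from it by SVD; the hmatrix usage and names are assumptions, not the code developed in the article:

```haskell
import Numeric.LinearAlgebra

-- Spatio-temporal data organisation: each time step's spatial field
-- is flattened into one row of the data matrix (nTime x nSpace).
dataMatrix :: [Vector Double] -> Matrix Double
dataMatrix = fromRows

-- Remove the time mean at each spatial point, then take an SVD: the
-- columns of the right singular matrix are the PCA eigenpatterns.
eigenpatterns :: Matrix Double -> Matrix Double
eigenpatterns x = v
  where nt        = fromIntegral (rows x)
        colMeans  = scale (1 / nt) (konst 1 (rows x) <# x)
        xc        = x - fromRows (replicate (rows x) colMeans)
        (_, _, v) = svd xc
```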

#### September 18, 2014

The pre-processing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do – it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a *lot* of data: 66 years of 181-day winters, each day of which has $72 \times 15$ $Z_{500}$ values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called *dimensionality reduction*, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.

I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to side-step the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, *principal components analysis* (PCA). We’ll take a detailed look at what kind of “interesting” and “important” PCA gives us a little later.
