Non-diffusive atmospheric flow #7: PCA for spatio-temporal data

Although the basic “project onto eigenvectors of the covariance matrix” prescription holds just the same for spatio-temporal data as it did in the simple two-dimensional example we looked at in the earlier article, there are a number of things we need to think about when we come to look at PCA for spatio-temporal data. Specifically, we need to think about data organisation, the interpretation of the output of the PCA calculation, and the interpretation of PCA as a change of basis in a spatio-temporal setting. Let’s start by looking at data organisation.
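As a reminder of what that prescription amounts to, here is a sketch in matrix form. The layout is an assumption (one row per time step, one column per spatial point, each column with its time mean removed), and the symbols $X$, $N$, $p$, $C$, $e_k$, $\lambda_k$ and $a_k$ are introduced here purely for illustration:

$$X \in \mathbb{R}^{N \times p}, \qquad C = \frac{1}{N-1} X^{T} X, \qquad C e_k = \lambda_k e_k, \qquad a_k = X e_k$$

Under this convention each eigenvector $e_k$ is a spatial pattern with $p$ entries, and $a_k$ is the length-$N$ time series saying how strongly that pattern appears at each time step.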

Non-diffusive atmospheric flow #6: principal components analysis

The pre-processing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do – it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a lot of data: 66 years of 181-day winters, each day of which has $72×15$ ${Z}_{500}$ values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called dimensionality reduction, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.

I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to side-step the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, principal components analysis (PCA). We’ll take a look in detail at what kind of “interesting” and “important” PCA gives us a little later.
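To put a rough number on “a lot of data”, the figures quoted above multiply out as follows (simple arithmetic, nothing more):

$$66 \times 181 = 11\,946 \ \text{daily maps}, \qquad 72 \times 15 = 1080 \ \text{values per map}, \qquad 11\,946 \times 1080 = 12\,901\,680 \ \text{values in total}$$

So if each daily map is treated as a single observation (an assumption about the layout at this point), we have nearly twelve thousand observations, each living in a 1080-dimensional space.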

Non-diffusive atmospheric flow #5: pre-processing

Note: there are a couple of earlier articles that I didn’t tag as “haskell” so they didn’t appear in Planet Haskell. They don’t contain any Haskell code, but they cover some background material that’s useful to know (#3 talks about reanalysis data and what ${Z}_{500}$ is, and #4 displays some of the characteristics of the data we’re going to be using). If you find terms here that are unfamiliar, they might be explained in one of these earlier articles.

The code for this post is available in a Gist.

Update: I missed a bit out of the pre-processing calculation here first time round. I’ve updated this post to reflect this now. Specifically, I forgot to do the running mean smoothing of the mean annual cycle in the anomaly calculation – doesn’t make much difference to the final results, but it’s worth doing just for the data manipulation practice…

Before we can get into the “main analysis”, we need to do some pre-processing of the ${Z}_{500}$ data. In particular, we are interested in large-scale spatial structures, so we want to subsample the data spatially. We are also going to look only at the Northern Hemisphere winter, so we need to extract temporal subsets for each winter season. (The reason for this is that winter is the season where we see the most interesting changes between persistent flow regimes. And we look at the Northern Hemisphere because it’s where more people live, so it’s more familiar to more people.) Finally, we want to look at variability about the seasonal cycle, so we are going to calculate “anomalies” around the seasonal cycle.

We’ll do the spatial and temporal subsetting as one pre-processing step and then do the anomaly calculation separately, just for simplicity.
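To give a feel for the anomaly calculation, here is a minimal sketch for a single grid point using plain lists. This is not the code from the Gist: the function names (`meanCycle`, `runningMean`, `anomalies`), the list-of-years representation, the half-width parameter `w`, and the way the smoothing window is truncated at the ends of the series are all assumptions made for illustration.

```haskell
import Data.List (transpose)

-- | Mean of a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Mean annual cycle: average each day across all years.
-- The input holds one list of daily values per year, all the same length.
meanCycle :: [[Double]] -> [Double]
meanCycle = map mean . transpose

-- | Centred running mean with half-width w (hypothetical parameter).
-- Assumption: the window is simply truncated at both ends of the series.
runningMean :: Int -> [Double] -> [Double]
runningMean w xs = [ mean (window i) | i <- [0 .. n - 1] ]
  where
    n = length xs
    window i = take (hi - lo + 1) (drop lo xs)
      where
        lo = max 0 (i - w)
        hi = min (n - 1) (i + w)

-- | Anomalies: subtract the smoothed mean annual cycle from every year.
anomalies :: Int -> [[Double]] -> [[Double]]
anomalies w years = map (\yr -> zipWith (-) yr cyc) years
  where
    cyc = runningMean w (meanCycle years)
```

For example, in GHCi `anomalies 0 [[1,2,3],[3,4,5]]` evaluates to `[[-1.0,-1.0,-1.0],[1.0,1.0,1.0]]`: with no smoothing the mean cycle is `[2,3,4]`, and each year is expressed as its departure from that cycle. A real calculation would apply the same idea at every grid point, and would probably use unboxed arrays rather than lists.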

Non-diffusive atmospheric flow #4: exploring Z500

In the last article, I talked a little about geopotential height and the ${Z}_{500}$ data we’re going to use for this analysis. Earlier, I talked about how to read data from the NetCDF files that the NCEP reanalysis data comes in. Now we’re going to take a look at some of the features in the data set to get some idea of what we might see in our analysis. In order to do this, we’re going to have to produce some plots. As I’ve said before, I tend not to be very dogmatic about what software to use for plotting – for simple things (scatter plots, line plots, and so on) there are lots of tools that will do the job (including some Haskell tools, like the Chart library), but for more complex things, it tends to be much more efficient to use specialised tools. For example, for 3-D plotting, something like Paraview or Mayavi is a good choice. Here, we’re mostly going to be looking at geospatial data, i.e. maps, and for this there aren’t really any good Haskell tools. Instead, we’re going to use something called NCL (NCAR Command Language). This isn’t by any stretch of the imagination a pretty language from a computer science point of view, but it has a lot of specialised features for plotting climate and meteorological data and is pretty perfect for the needs of this task (the sea level pressure and ${Z}_{500}$ plots in the last post were made using NCL). I’m not going to talk about the NCL scripts used to produce the plots here, but I might write about NCL a bit more later since it’s a very good tool for this sort of thing.

Non-diffusive atmospheric flow #3: reanalysis data and Z500

In this article, we’re going to look at some of the details of the data that we’re going to be using in our study of non-diffusive flow in the atmosphere. This is still all background material, so there’s no Haskell code here!

Non-diffusive atmospheric flow #2: outline & plan

As I said in the last article, the next bit of this data analysis series is going to attempt to use Haskell to reproduce the analysis in the paper: D. T. Crommelin (2004). Observed nondiffusive dynamics in large-scale atmospheric flow. J. Atmos. Sci. 61(19), 2384–2396. Before we can do this, we need to cover some background, which I’m going to do in this and the next couple of articles. There won’t be any Haskell code in any of these three articles, so I’m not tagging them as “Haskell” so that they don’t end up on Planet Haskell, annoying category theorists who have no interest in atmospheric dynamics. I’ll refer to these background articles from the later “codey” articles as needed.

Haskell data analysis: Reading NetCDF files

I never really intended the FFT stuff to go on for as long as it did, since that sort of thing wasn’t really what I was planning as the focus for this Data Analysis in Haskell series. The FFT was intended primarily as a “warm-up” exercise. After fourteen blog articles and about 10,000 words, everyone ought to be sufficiently warmed up now…

Instead of trying to lay out any kind of fundamental principles for data analysis before we get going, I’m just going to dive into a real example. I’ll talk about generalities as we go along when we have some context in which to place them.

All of the analysis described in this next series of articles closely follows that in the paper: D. T. Crommelin (2004). Observed nondiffusive dynamics in large-scale atmospheric flow. J. Atmos. Sci. 61(19), 2384–2396. We’re going to replicate most of the data analysis and visualisation from this paper, maybe adding a few interesting extras towards the end.

It’s going to take a couple of articles to lay out some of the background to this problem, but I want to start here with something very practical and not specific to this particular problem. We’re going to look at how to gain access to meteorological and climate data stored in the NetCDF file format from Haskell. This will be useful not only for the low-frequency atmospheric variability problem we’re going to look at, but for other things in the future too.

Link Round-up

Here’s a mixed bag of interesting links, some sciencey, some mathsy, some miscellany:

1. Network Rail Virtual Archives: OK, this might not, at first sight, sound like something interesting, but it really is. This site has original Victorian-era engineering drawings for a whole range of British railway infrastructure. Bridges, viaducts, stations, tunnels. All rendered in lovely 19th Century penmanship. The Forth Bridge is particularly nice.

2. open.NASA: A couple of years ago, NASA started a project to open-source code and data from their Earth observing and planetary missions. open.NASA is the gateway to these resources. I’ve not had a chance to look at it in huge detail yet, but there is a lot of stuff there. The list of projects on the code.NASA part looks particularly entertaining.

3. Game of Primes: Giganotosaurus is a science fiction site that publishes one (longish) short story each month. They’re often very good, and this one was particularly striking – it’s quite beautifully done, full of mystery, and feels like it could be a part of something much larger and deeper.

4. Surprising connections in mathematics: This one is a bit more technical, from the Math Overflow Q&A website. A lot of the connections people mention are very technical, but some are more accessible, for instance the link between algebra and geometry developed by Descartes and others in the 17th Century. This is something we learn about in school, and something that we don’t think about too much because it seems “obvious”. Only obvious in retrospect, of course, since it took hundreds of years for the connection to be discovered!

5. De Bruijn grids and tilings: Another technical one, but very interesting. Aperiodic tilings of the plane, like Penrose tilings, are slightly mysterious. This article gives a really clear description of one systematic method for generating such tilings. It’s a very odd and intriguing little bit of mathematics.

6. Atul Gawande on end-of-life care: Atul Gawande is one of my favourite writers on medical and ethical issues. This article is quite long, but well worth a read.

Command and Control

by Eric Schlosser

My reading list recently has been chock-full of light-hearted and mood-lifting material: some Irvine Welsh novels (always guaranteed to shed a gentle light on all that’s best about the human condition), a long book about clinical depression, M. R. Carey’s interesting sort-of-zombie apocalypse/extreme mycology novel, The Girl With All The Gifts, de Becker’s The Gift Of Fear, a book all about fear and violence, and Piper Kerman’s prison memoir, Orange Is The New Black (which did spoil the mood a little having a few sparks of hope in among the gloom).

Among all this bleakness and blackness, Command and Control somehow manages to stand out as a particularly grim monument to human folly and our collective crimes against all sense and reason. It’s a book about nuclear weapons, so it never really had much chance of being too jolly, but even so, Schlosser’s decision to focus in parallel on US nuclear doctrine and nuclear weapons safety makes for some horrifying reading. It’s something of a mystery how we made it through the Cold War without either a “hot” war or at least some sort of unintended detonation of a nuclear weapon.

Many Books & Their Reviews #2

Second round of “many books”…

Site content copyright © 2011-2013 Ian Ross       Powered by Hakyll