Using a worked example, this tutorial explains how to write a Python program to generate a time series of NDVI from Sentinel-2. The aim is to impart a fundamental understanding of the process, so the program will be simple rather than scalable, fault tolerant or cost optimized.
Using Sentinel-2 imagery, we will generate an NDVI time series for a single point located in the Kolstad Lake Waterfowl Production Area, a small nature preserve in the middle of North America. The point is 45.5°N, 95.5°W. We will use the atmospherically corrected images available from the Sentinel-2 L2A collection.
Sentinel-2 L2A is 10m resolution, meaning a single image pixel represents a 10m x 10m square. A single image is large and the set of images needed to generate a time series will be tens or hundreds of times larger still. So one key challenge of deriving a time series will be minimizing the volume of data that needs to move over networks.
Sentinel-2 imagery is stored in Cloud Optimized Geotiff (COG) and available on AWS, Azure, GCP, and Copernicus. For this tutorial, we’ll use AWS, where the data is accessible from S3 buckets. Note that, in a production environment, you can leverage the “requester-pays” feature on Amazon allowing you to pay (a small) read-request fee to ensure you are not subjected to any rate-limits.
Three libraries will need to be imported. We’ll discuss each as it becomes necessary in the program:
To begin, we need to generate a list of all Sentinel-2 L2A images that contain our desired location. To do that we query a repository of L2A metadata stored on S3 and accessible via an API developed by Element 84, an AWS partner. The metadata will be in STAC (Spatiotemporal Asset Catalog) format, so we’ll use the PySTAC client to interface with the Element 84 end point:
Items is a collection of 1518 python dictionaries, each containing a myriad of metadata about a single Sentinel-2LA image. In fact, each dictionary is in the order of megabytes in size, so batch processing should be considered when querying numerous points. (Note: Items was length 1518 on 2023-05-27. It will be longer when you run this program today.)
From here we will begin processing, but note we will do so lazily; that means our actions will actually just be logged in the metadata for later processing. By deferring actual processing, we can ultimately do only those computations that affect the final output we want thus minimizing the computational and network loads.
The first thing we will do is transform the collection of STAC metadata into a three dimensional array. One dimension will be time (and will be of length 1518), the two other dimensions will be latitude and longitude. The elements will be the red and near-infrared (NIR) sensor readings.
The stackstac library is designed to do precisely this. You specify a region of interest in terms of latitude and longitude and the assets you wish to extract from the imagery (in our case, red and NIR sensor readings). Given these specifications and a collection of STAC metadata, then the stackstac.stack() function will return a lazy, three dimensional xarray.Dataset (time, lat,long) where each element is a tuple of “assets” from the imagery:
To specify the region of interest we defined a tiny area that encompasses the point we are interested in. Note, in addition to red and NIR, we request SCL (scene classification layer) which will be useful when we do q/a later in the process. The epsg argument instructs the function to use the WGS84 latitude/longitude reference system for the output. (The actual coordinate reference system used by the data is more complex and can change between images, hence the need to reproject everything to a common system). Finally, the `gdal_env` argument allows us to manage retries implicitly. (Given the volume of packages that will have to move over the internet, package loss is inevitable.)
The return value, sentinel_stack, is an xarray.DataSet:
Notice the length of the time dimension, 1518, corresponds exactly to the number of images we had to work with. And for each time point, we have a 12 x 8 grid. For each grid point, we will have measurements for NIR, red, and SCL.
Now we can proceed to, still lazily, do the actual NDVI calculations while dropping redundant metadata:
We now have NDVI for every element in our xarray.Dataset; i.e. for each of 1518 time points, we have 96 NDVI values (12 latitudes x 8 longitudes). Somewhere in that 12 x 8 grid is a singular point (45.5°N, 95.5°W) for which we wish to measure NDVI.
Recall that each pixel represents the average NDVI for an entire 10m x 10m area. To get the NDVI at some exact point within a pixel, you can interpolate using the NDVI for the pixel that contains the point and some number of surrounding pixels. xarray.Dataset.interp() will do this. For simplicity we will dispense with sophisticated interpolation and simply use the “nearest” (in this case the encompassing) pixel as the interpolated value for the actual point:
sentinel_point is a 1 dimensional xarray.Dataset ( across time) with 1518 elements, each of which is a tuple (NDVI,scl):
For the sake of making this demo run quickly, we can (optionally) shorten sentinel_point to 8 months. (In the analysis that follows, the data is NOT truncated.)
In production of course you can’t just shorten your time window to speed up processing. Instead computation should be moved closer to storage; i.e. the program should be run in the same cloud as the data. Parallelization (via threading or multiple machines) is of course an obvious optimization as well.
Up to this point, everything we did was lazy, only affecting metadata and task graphs. Now we'll actually load our selected subset of data.
sentinel_point is the same Dataset as before, but the values are now actually loaded and calculated. At this point we can transform the Dataset to a simple dataframe, which will be easier to work with for the finals steps:
Here’s NDVI as a function of time:
The plot looks good in that NDVI seems to change gradually with the season. The numerous outliers are, primarily, an artifact of cloud cover. This is where we can use the SCL data to filter out observations that were tainted by clouds. If visibility is sufficient, the Sentinel-2 sensors will classify the area we are studying as either “vegetation” (4) or “bare soil” (5). Since we know the area in question is one of those two things, then, if SCL as reported is something other than 4 or 5, then we can conclude that cloud cover is a problem and filter out that day’s NDVI value. (There is always a non-zero probability of false SCL readings, so this method is not perfect but is nonetheless highly effective.)
Now we have an NDVI time series! From here any of various interpolation/smoothing strategies can be used to estimate values for dates between observations and to remove remaining outliers.