With the release of the ESA WorldCereal system, we now have (for the first time ever) the precise location of essentially every winter cereal, spring cereal and maize farm in the world.
To put the precision in perspective, the best datasets available to date were at 10km and ~5km resolutions via SPAM2010 and GEOGLAM, respectively. WorldCereal is 10m resolution – that’s 1000 times more precise. In fact, this is even better than the 30m resolution that the USDA has on offer with the USDA CDL (Cropland Data Layer).
WorldCereal is monumental because it is the missing link that can enable global monitoring of the entire world cereal situation. That said, the key to unlocking that potential is to transform WorldCereals from its cumbersome native format (.tif / geotiff rasters) to point samples in Parquet or GeoParquet.
Once that is done, there is no limit to what data scientists can do with this treasure trove. For example, measuring crop health for any (or all!) of the world’s cereal farms is imminently doable. (Use NDVI as your metric and you can do this with our API in a single Python program.)
To that end then, this tutorial shows you exactly how to convert WorldCereals into points samples in Parquet or GeoParquet. It will show how to:
Parquet and GeoParquet enable maximum interoperability with data science workflows and the modern stack, including cloud storage like AWS, GCP, and Azure, cloud data warehouses like Snowflake, BigQuery, RedShift, and DataBricks, and data science / machine learning platforms like pandas, numpy, scikit-learn, and more.
The ESA WorldCereal dataset consists of 10m resolution global crop type maps (classification masks) covering maize, winter cereals and spring cereals. For maize and winter cereals, there are 106 .tif (raster) files corresponding to the 106 Agricultural Ecological Zones (AEZs) across the globe. Spring cereals are only covered for 21 of the 106 AEZs. Find the article (Van Tricht et al., 2023, in pre-print) describing the full methodology here. Note, while not covered in this tutorial, the WorldCereal dataset also has classification mask products for seasonal irrigated cropland and temporary crop extent which can be processed in the same manner as the 3 covered here.
Note: this tutorial is also available as a single python notebook. You can download it on Github below:
If you don’t want to generate your own locations by running this tutorial, you can download the output at the following links:
The output datasets consist of 6,827,917 generated point locations and have the following information for each: Crop, Country (Admin 0), State (Admin 1), Region (World Bank), Subregion, Continent, and aez_id. See below:
To begin you need to pull the WorldCereals datasets:
Download all of the above, unzip and move the contents to a folder called ‘data’ in your working directory. (The full WorldCereals dataset is available here. For smaller, user-specified regions it can also be accessed through the OpenEO API.)
There are a number of libraries that need to be imported. The first group is required for generating the point locations from the raster geotiff files (.tif). The second group is required for tagging the point locations with various administrative parameters.
The .tif raster files are very large so we first define a function to read them in chunks to avoid any memory-related issues.
Next we introduce a boolean constant called “demo_mode”. In demo-mode, running the program will generate Maize farm locations for just 3 of the 106 AEZs (covering Mexico, Guatemala, El Salvador, Honduras, and North-Western Nicaragua). It should take less than 10min to run for these AEZs. If you wish to run the code on different AEZs covering different regions than specified, modify the aez_id_list.
With demo-mode set to False, running the program will generate farm locations for maize, winter cereals, and spring cereals for all 106 AEZs (where available). Note running the notebook in full-mode can take tens of hours (probably between 10 and 20).
Now we can proceed with point generation. The code below will iterate through each of the raster .tif files with rasterio for the list of specified AEZs, reading and processing each the .tif files in chunks. For each chunk, we calculate the total number of pixels classified as being a specific crop and then randomly select 0.01% of these pixels in which we will randomly generate points. This method allows us to generate points that will be scaled by the density of the crops in the original .tif files. Note that the outputs we show in this tutorial were produced with demo-mode set to False.
Executing the above generates 7M points (for all 3 crop types across all applicable AEZs):
The next step is to tag each of the points by the administrative boundaries they fall inside. This makes the dataset versatile for multiple use cases.
We use the Natural Earth Features country and state-level administrative datasets to tag the farm locations generated above. For simplicity, we will access the datasets through Cartopy, but the full list of datasets may also be accessed here. Note that you can of course tag locations using any boundary file that is appropriate for your use case.
The DataFrame which originally contained only points locations and the corresponding crops is now tagged for the various administrative boundaries. Inspecting it visually:
Because AEZ regions sometimes overlap each other, our random sampling methodology will yield excess points in areas of overlap. This can actually be seen visually in the image above. The density of points in these areas of overlap is around 40% higher than what it should be. There is an easy fix for this: delete 40% of the points that fall in regions where AEZ overlap occurs. To do this, use the WorldCereal_AEZ.geojson file containing the AEZ footprint polygons.
By removing the excess points we have reduced the size of output by nearly 400,000 points:
Upon visual inspection, we can see that the excess points that were generated in areas of AEZ overlap have been eliminated:
During the generation process, points can show up along the edges of some of the AEZs in regions that do not correspond to farms. We manually drew rectangular polygons around all areas where these points occurred, and saved the polygons in geojson format in a list.
We use the list of polygons to filter out any points that may have been generated within these known problem areas.
We’ll do some final formatting to the dataset we’ve generated. Specifically, we reorder the columns into something that makes a little bit more sense and capitalize the ‘crop’ column.
Finally, we will save the locations dataset in both cloud-native (Parquet & GeoParquet) and standard tabular (CSV) formats.
For CSV, we will export both the full dataset we generated, and a random 3M subset of the points.
The ESA WorldCereal dataset will have massive implications for our ability to monitor crops no matter where they are in the world. With WorldCereal as point locations in the cloud-native Parquet and GeoParquet formats, these datasets are immediately unlocked for data science workflows and those wishing to do time series analytics with remote sensing data. From building more accurate crop yield models than has ever been possible, to constructing higher-resolution crop calendars globally, point sampling WorldCereals opens up all kinds of analysis possibilities. .