In our recent study, we investigate the predictive power of NDVI metrics (max, mean, cumulative) for estimating corn yield in the US Corn Belt.
Introduction
Agriculture plays a critical role in the United States economy, and the Corn Belt region is at the heart of this industry. Accurate and timely predictions of crop yield are essential for farmers, agribusinesses, and policymakers to make informed decisions that optimize resources, minimize losses, and ensure food security. In our last case study, we examined the historical Normalized Difference Vegetation Index (NDVI) patterns for the top four Corn Belt states in the US. Building on that research, this case study aims to explore the potential of NDVI-based metrics as predictors of corn crop yield across the entire Corn Belt region.
Questions
To what extent do simple NDVI-based metrics (max, mean, cumulative NDVI across growing season) correlate with corn crop yield variations across the Corn Belt states?
How do the relationships between these metrics and corn crop yields differ across the various Corn Belt states and sub-regions?
Extracting Locations from United States CDL
We once again used the United States Cropland Data Layer (CDL) to extract information about corn farm locations. For this analysis, however, we extracted a larger number (33,000) of corn farm locations from the GeoTIFF map.
from rasterio.warp import transform
united_states = rasterio.open('data/us/2022_30m_cdls/2022_30m_cdls.tif')
band = united_states.read(1)
filtered_band = np.where(band == 1)
sample_indices = np.random.choice(len(filtered_band[0]), size=33000, replace=False)
# Get the row and column indices
rows = filtered_band[0][sample_indices]
cols = filtered_band[1][sample_indices]
# Transform the row and column indices to the original CRS (EPSG:5070)
x_coords, y_coords = rasterio.transform.xy(united_states.transform, rows, cols)
# Perform the coordinate transformation from EPSG:5070 to EPSG:4326
longitude, latitude = transform(united_states.crs, 'EPSG:4326', x_coords, y_coords)
data = {'lat': latitude,
'lon': longitude}
df = pd.DataFrame(data)
Visualize corn farm locations
As always, we did a sanity check by plotting the locations to be used in the query:
# create figure
fig = plt.figure(figsize=(8,5), facecolor='none', dpi=96)
ax = fig.add_subplot(111, projection=ccrs.Mercator())
ax.set_extent([-107.23, -80, 31.57, 50.46], ccrs.PlateCarree()) # Set extent for corn belt states
# plot the corn farm locations
ax.scatter(df['lon'], df['lat'], s=1, transform=ccrs.Geodetic(), zorder=2, edgecolor='#ffffff', linewidth=0.25)
# add a title and legend
plt.title('Corn Farm Locations Across the Corn Belt States')
Querying Data from Streambatch
We then used the corn farm locations to query the Streambatch API for Sentinel-2 NDVI data.
# Prep for query
points = np.vstack([df.lon.values, df.lat.values]).T.tolist()
ndvi_request = {
'variable': ['ndvi.sentinel2'],
'space': points,
'time': {
'start':'2018-03-01',
'end':'2022-12-31',
}
}
# Make the request to the API
response = requests.post('streambatch.io/async', json=ndvi_request, headers=api_header)
# Parse the response to get the access URL and query ID for the result data
result_data = json.loads(response.content)
access_url = result_data.get('access_url')
# Read the data from the access URL into a pandas DataFrame
ndvi_data = pd.read_parquet(access_url, storage_options={"anon": True})
Analysis
First, we filtered the dataset to only include data for the growing season months (April - November). We then calculated the NDVI-based metrics of interest (max NDVI, mean NDVI, and cumulative NDVI across the growing season) for each farm location:
# Create a 'year' column from the 'time' column
ndvi_data['year'] = ndvi_data['time'].dt.year
# Filter the DataFrame for growing season months (April to November)
growing_season = ndvi_data[(ndvi_data['time'].dt.month >= 4) & (ndvi_data['time'].dt.month <= 11)]
# Group the data by unique 'lat', 'lon', and 'year' pairs
grouped = growing_season.groupby(['lat', 'lon', 'year'])
# Calculate the max and mean NDVI for each group
result = grouped['ndvi.sentinel2'].agg(['max', 'mean']).reset_index()
# Calculate the cumulative NDVI for each group
cumulative_ndvi = grouped['ndvi.sentinel2'].sum().reset_index().rename(columns={'ndvi.sentinel2': 'cumulative_ndvi'})
# Save the results in a new DataFrame
ndvi_stats = pd.DataFrame(result)
# Merge the cumulative_ndvi DataFrame with the ndvi_stats DataFrame
ndvi_stats = ndvi_stats.merge(cumulative_ndvi, on=['lat', 'lon', 'year'])
ndvi_stats.head()
Next, we grabbed the county-level boundary information from the Agricultural Census shapefile using it to tag the corn farm locations.
county_boundaries = gpd.read_file('data/us/CoGenAll17_WGS84WMAS/CoUS17_WGS84WMAS.shp')
# Convert the ndvi_stats DataFrame to a GeoDataFrame
geometry = [Point(xy) for xy in zip(ndvi_stats['lon'], ndvi_stats['lat'])]
ndvi_stats_geo = gpd.GeoDataFrame(ndvi_stats, geometry=geometry)
# Set the CRS of the ndvi_stats_geo GeoDataFrame to EPSG:4326 (WGS 84)
ndvi_stats_geo.crs = "EPSG:4326"
# Transform the CRS of ndvi_stats_geo to match the CRS of county_boundaries
ndvi_stats_geo = ndvi_stats_geo.to_crs(county_boundaries.crs)
# Perform a spatial join
joined = gpd.sjoin(ndvi_stats_geo, county_boundaries, op='within', how='left')
# Copy the required columns to the ndvi_stats DataFrame
ndvi_stats['county_name'] = joined['atlas_caps']
ndvi_stats['county_number'] = joined['cntyn']
ndvi_stats['state_county_code'] = joined['atlas_stco']
ndvi_stats['state_code'] = ndvi_stats['state_county_code'].str[:2].astype(int)
ndvi_stats.head()
We then merged the dataset with annual, county-level yield data that we sourced from the USDA NASS Quickstats data portal for all Corn Belt states. We also calculated the average NDVI metrics and renamed any columns that will be used later for visualization purposes.
yield_stats = pd.read_csv('data/us/US_corn_yield_county_level_2018-2022.csv')
ndvi_yield_stats = ndvi_stats.merge(yield_stats[['year', 'state_code', 'county_number', 'Value']],
on=['year', 'state_code', 'county_number'],
how='left')
# Group the merged_data DataFrame by 'state_county_code', 'county_name', 'county_number', 'state_code', 'Value', and 'Year'
grouped_data = ndvi_yield_stats.groupby(['state_county_code', 'county_name', 'county_number', 'state_code', 'Value', 'year'])
# Calculate the average of 'max', 'mean', and 'cumulative_ndvi' columns for each group
averages = grouped_data[['max', 'mean', 'cumulative_ndvi']].mean().reset_index()
# Save the results in a new DataFrame
average_ndvi_stats = pd.DataFrame(averages)
# Create a mapping between 'State ANSI' and 'State' from the 'yield_stats' DataFrame
state_mapping = dict(zip(yield_stats['state_code'].astype(int), yield_stats['State']))
# Map the 'state_code' column of 'average_ndvi_stats' to the state names
average_ndvi_stats['state_name'] = average_ndvi_stats['state_code'].map(state_mapping)
# Rename the columns of the DataFrame
average_ndvi_stats = average_ndvi_stats.rename(columns={'Value': 'Yield (BU/Acre)', 'max': 'Max Growing Season NDVI', 'mean': 'Mean Growing Season NDVI', 'cumulative_ndvi': 'Cumulative Growing Season NDVI', 'state_name': 'State Name'})
average_ndvi_stats.head()
Visualizing Overall Relationships
Next we visualized the relationships with scatterplots for each NDVI metric (max, mean, cumulative) against corn yield at the county level to visually explore their relationships.
def plot_scatter(ax, x, y, title, xlabel, ylabel):
sns.scatterplot(data=average_ndvi_stats, x=x, y=y, ax=ax)
ax.set_title(title)
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
metric_list = [
('Max Growing Season NDVI', 'Max NDVI'),
('Mean Growing Season NDVI', 'Mean NDVI'),
('Cumulative Growing Season NDVI', 'Cumulative NDVI')
]
for idx, (metric, label) in enumerate(metric_list):
plot_scatter(axs[idx], metric, 'Yield (BU/Acre)', f"{label} vs Yield - Growing Season (Apr-Nov)", label, "Yield (BU/Acre)")
plt.tight_layout()
Next, we calculated the Spearman correlation coefficient and p-value for each NDVI metric and county-level corn yield, allowing us to quantify the strength and statistical significance of the relationships.
Given the predictive ability of max growing season NDVI for yield, we next wanted to determine the variability of this key metric between states. We started by visualizing the relationship between max NDVI and yield for all states individually. We kept the max NDVI vs Yield plot from above in the first grid square for comparison.
states = average_ndvi_stats['state_name'].unique()
fig, axs = plt.subplots(2, 7, figsize=(28, 10), sharex=True, sharey=True)
# Overall max NDVI vs yield for all states in the first square
axs[0, 0].scatter(average_ndvi_stats['Max Growing Season NDVI'], average_ndvi_stats['Yield (BU/Acre)'])
axs[0, 0].set_title('Overall - Max NDVI vs Yield')
axs[0, 0].set_xlabel('Max NDVI')
axs[0, 0].set_ylabel('Yield (BU/Acre)')
for idx, state in enumerate(states):
row, col = divmod(idx + 1, 7)
state_data = average_ndvi_stats[average_ndvi_stats['state_name'] == state]
axs[row, col].scatter(state_data['Max Growing Season NDVI'], state_data['Yield (BU/Acre)'])
axs[row, col].set_title(f'{state} - Max NDVI vs Yield')
axs[row, col].set_xlabel('Max NDVI')
axs[row, col].set_ylabel('Yield (BU/Acre)')
plt.tight_layout()
plt.show()
Next, we calculated the correlation for each state individually, creating a new dataframe and sorting it by correlation value.
from scipy.stats import spearmanr
# Function to calculate the Spearman correlation and p-value for a given state
def calculate_spearman(data, state):
state_data = data[data['state_name'] == state]
spearman_corr, spearman_p = spearmanr(state_data['Max Growing Season NDVI'], state_data['Yield (BU/Acre)'])
return {'state_name': state, 'Spearman Correlation': spearman_corr, 'Spearman p-value': spearman_p}
# Calculate the Spearman correlation and p-value for each state
spearman_results = [calculate_spearman(average_ndvi_stats, state) for state in states]
# Create a new DataFrame to store the results
spearman_df = pd.DataFrame(spearman_results)
# Sort by Spearman correlation
spearman_df_sorted = spearman_df.sort_values(by='Spearman Correlation', ascending=False)
spearman_df_sorted
Finally, we plotted the distributions of max NDVI for each state individually.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 8))
sns.boxplot(data=average_ndvi_stats, x='state_name', y='Max Growing Season NDVI')
plt.xticks(rotation=90)
plt.title("Distribution of Maximum Growing Season NDVI by State")
plt.xlabel("State")
plt.ylabel("Max Growing Season NDVI")
plt.show()
Results
Max NDVI during the growing season was found to be the best predictor (correlation = 0.63) of crop yield for corn farms across all corn belt states. This result is consistent with previous studies (e.g., Johnson et al, 2021;Roznik et al, 2022), which demonstrated the strong relationship between max NDVI values and crop yield for corn. The max NDVI typically represents the peak of crop growth, capturing the highest level of crop biomass and photosynthetic activity during the growing season.
Mean NDVI (correlation = 0.31) and cumulative NDVI (correlation = 0.18) during the growing season were also found to correlate with overall yield, albeit to a lesser extent. These metrics capture the average and total vegetation activity throughout the growing season, respectively. According to existing research (e.g., Lobell et al., 2003; Sakamoto et al., 2005), these metrics can be indicative of the general health and productivity of crops but may not be as strongly associated with crop yield as max NDVI due to their sensitivity to transient weather events and other environmental factors.
We found considerable variation in the predictive power of NDVI for different Corn Belt states, with North Dakota (0.73), South Dakota (0.69), and Nebraska (0.64) exhibiting the highest correlation values. Additionally, we observed notable variability in max NDVI distributions among these states, particularly in Kansas, North Dakota, and South Dakota. This variability may enhance the overall predictive strength of max NDVI for yield. In contrast, states like Illinois and Iowa displayed much lower variability in their max NDVI during the growing season. This observation is consistent with previous research, indicating that regions with more uniform agricultural landscapes and stable growing conditions tend to exhibit less variability in NDVI values (Wardlow et al., 2007). Interestingly, there appears to be some overlap between the states with low variability in their max NDVI distributions and the top states overall by production, suggesting a potential connection between consistent growing conditions and high agricultural productivity.
Conclusions
In conclusion, our analysis supports existing literature by demonstrating the strong predictive power of max NDVI during the growing season for corn crop yields across the corn belt states. While mean and cumulative NDVI metrics also showed correlations with overall yield, their predictive capabilities were less robust compared to max NDVI. Additionally, states with lower variability in their max NDVI distributions, such as Illinois and Iowa, are also among the top corn-producing states. By better understanding the relationship between NDVI metrics and crop yield, stakeholders in the agricultural industry can make more informed decisions and optimize resource use, ultimately leading to increased productivity and sustainability.