Demystifying MERRA-2: A Step-by-Step Python Guide to Downloading and Processing Climate Data
If you’ve ever tried to download NASA’s MERRA-2 climate data, you know it’s rarely a straightforward “click and save” process. Between navigating Earthdata credentials, understanding GEOS-5 native grid coordinates, and parsing complex NetCDF files, the learning curve can be steep.
In this tutorial, we will walk through a robust Python script designed to automatically download wind speed vectors (U and V) across multiple heights (2m, 10m, and 50m), calculate the true wind speeds, and aggregate the data into ready-to-use CSV files and time-series plots.
Credits & Acknowledgments: A special thanks to Emily Laiken’s merradownload GitHub repository, which provided the foundational multi-processing logic and GEOS-5 grid translation functions adapted for this tutorial.
Step 1: The Setup and Imports
First, we need to import our libraries. We will use xarray and pandas for handling the multi-dimensional climate data, and matplotlib for graphing.
Note: This script relies on a custom opendap_download module to handle the multiprocessing download requests. Make sure you have that available in your environment!
Code
# Importsimport osimport reimport numpy as npimport pandas as pdimport xarray as xrimport matplotlib.pyplot as pltfrom datetime import datetime, timedeltafrom calendar import monthrange# Custom module for NASA OPeNDAP downloadsfrom opendap_download.multi_processing_download import DownloadManager
Step 2: The Control Center (Understanding the Inputs)
This is the most important part of the script. This is where you tell the code what you want, where you want it, and when you want it. Let’s break down the key parts you can customize:
1. Choosing Your Location(s)
The locs variable is a list of locations. You can add as many as you want! Just follow the format ('Name', Latitude, Longitude). * Example for one location:locs = [('durham', 54.7753, -1.5849)] * Example for multiple:locs = [('durham', 54.7753, -1.5849), ('london', 51.5072, -0.1276), ('paris', 48.8566, 2.3522)]
2. Understanding Databases (database_id and database_name)
MERRA-2 doesn’t store all its data in one giant file. It splits data into different “Collections” (Databases) based on what the data represents. In our default code, we use M2T1NXSLV (Time-averaged, 1-Hourly, Single-Level Diagnostics). This database contains surface-level wind, temperature, and pressure. Want different data? Here is how you find it: * Radiation/Solar Data: You would change the ID to M2T1NXRAD and name to tavg1_2d_rad_Nx. * Tip: You can find a full list of MERRA-2 databases by searching for the “MERRA-2 File Specification Document” online.
3. Selecting Parameters and Heights (variables_to_download)
Once you choose a database, you must specify the exact variables you want. In our default code, we are asking for Wind Vectors (U is East-West wind, V is North-South wind) at different heights: 2 meters, 10 meters, and 50 meters. * Want just 10m wind? Change it to ['U10M', 'V10M'] * Want surface temperature instead of wind? Change it to ['T2M'] (2-meter temperature).
Code
####### THE CONTROL CENTER: INPUTS - CHANGE THESE #########username ='YOUR_USERNAME'# Your Earthdata Loginpassword ='YOUR_PASSWORD'# Your Earthdata Password# WHEN do you want the data?years = [2017] # WHERE do you want the data? Add as many (Name, Lat, Lon) tuples as you want!locs = [('durham', 54.7753, -1.5849)] # WHAT DATABASE do you want to pull from?database_name ='tavg1_2d_slv_Nx'# Single-level diagnostics (Wind, Temp, Pressure)database_id ='M2T1NXSLV'# WHAT VARIABLES / HEIGHTS do you want from that database?# U = Eastward wind vector, V = Northward wind vector. # Numbers represent height (2m, 10m, 50m)variables_to_download = ['U2M', 'V2M', 'U10M', 'V10M', 'U50M', 'V50M']aggregator ='mean'####### CONSTANTS - DO NOT CHANGE BELOW THIS LINE #######lat_coords = np.arange(0, 361, dtype=int)lon_coords = np.arange(0, 576, dtype=int)database_url ='https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/'+ database_id +'.5.12.4/'NUMBER_OF_CONNECTIONS =5
Step 3: The Translation Engine (Helper Functions)
NASA’s GEOS-5 grid doesn’t use standard decimal latitudes and longitudes. The functions below translate your standard GPS coordinates into the specific indices NASA’s servers require. You do not need to edit anything here.
Code
####### HELPER FUNCTIONS #########def translate_lat_to_geos5_native(latitude):return ((latitude +90) /0.5)def translate_lon_to_geos5_native(longitude):return ((longitude +180) /0.625)def find_closest_coordinate(calc_coord, coord_array): index = np.abs(coord_array-calc_coord).argmin()return coord_array[index]def translate_year_to_file_number(year):if year >=1980and year <1992: return'100'elif year >=1992and year <2001: return'200'elif year >=2001and year <2011: return'300'elif year >=2011: return'400'else: raiseException('The specified year is out of range.')def generate_url_params(parameter, time_para, lat_para, lon_para): parameter =list(map(lambda x: x + time_para, parameter)) parameter =list(map(lambda x: x + lat_para, parameter)) parameter =list(map(lambda x: x + lon_para, parameter))return','.join(parameter)def generate_download_links(download_years, base_url, dataset_name, url_params): urls = []for y in download_years: y_str =str(y) file_num = translate_year_to_file_number(y)for m inrange(1,13): m_str =str(m).zfill(2) _, nr_of_days = monthrange(y, m)for d inrange(1,nr_of_days+1): d_str =str(d).zfill(2) file_name ='MERRA2_{num}.{name}.{y}{m}{d}.nc4'.format( num=file_num, name=dataset_name, y=y_str, m=m_str, d=d_str) query ='{base}{y}/{m}/{name}.nc4?{params}'.format( base=base_url, y=y_str, m=m_str, name=file_name, params=url_params) urls.append(query)return urls
Step 4: Executing the Download
This loop takes the locations and variables you defined in Step 2, calculates the bounding boxes, and securely downloads the subsets from NASA’s OPeNDAP servers using multi-processing.
Code
print('DOWNLOADING DATA FROM MERRA')for loc, lat, lon in locs:print('Downloading data for '+ loc) lat_coord = translate_lat_to_geos5_native(lat) lon_coord = translate_lon_to_geos5_native(lon) lat_closest = find_closest_coordinate(lat_coord, lat_coords) lon_closest = find_closest_coordinate(lon_coord, lon_coords) requested_lat ='[{lat}:1:{lat}]'.format(lat=lat_closest) requested_lon ='[{lon}:1:{lon}]'.format(lon=lon_closest)# We use the variables_to_download list you defined in the inputs! parameter = generate_url_params(variables_to_download, '[0:1:23]', requested_lat, requested_lon) generated_URL = generate_download_links(years, database_url, database_name, parameter) download_manager = DownloadManager() download_manager.set_username_and_password(username, password) download_manager.download_path ='WindSpeed/'+ loc download_manager.download_urls = generated_URL download_manager.start_download(NUMBER_OF_CONNECTIONS)
Step 5: Data Cleaning and Merging
MERRA-2 provides wind data as separate U and V vectors. This code block calculates the true “Wind Speed” magnitude using the Pythagorean theorem ( \(\sqrt{U^2 + V^2}\) ).
⚠️ IMPORTANT WARNING: If you changed your inputs in Step 2 to download something other than U and V wind vectors (for example, if you downloaded Temperature T2M), you will need to modify the math below. You wouldn’t use the Pythagorean theorem on Temperature! You would simply extract the df_hourly['T2M'] column.
Code
######### OPEN, CLEAN, MERGE, AND WRITE CSVS ##########def extract_date(data_set):if'HDF5_GLOBAL.Filename'in data_set.attrs: f_name = data_set.attrs['HDF5_GLOBAL.Filename']elif'Filename'in data_set.attrs: f_name = data_set.attrs['Filename']else: raiseAttributeError('The attribute name has changed again!') exp =r'(?<=\.)[^\.]*(?=\.nc4)' res = re.search(exp, f_name).group(0) y, m, d = res[0:4], res[4:6], res[6:8] date_str = ('%s-%s-%s'% (y, m, d)) data_set = data_set.assign(date=date_str) return data_setprint('CLEANING AND MERGING DATA')for loc, lat, lon in locs:print('Cleaning and merging data for '+ loc) dfs = [] file_path ='WindSpeed/'+ loc os.makedirs(file_path, exist_ok=True)forfilein os.listdir(file_path):if'.nc4'infile:try:with xr.open_mfdataset(file_path +'/'+file, preprocess=extract_date) as df: dfs.append(df.to_dataframe())exceptExceptionas e:print('Issue with file '+file+' | Error: '+str(e))iflen(dfs) ==0:print("No data downloaded!")continue df_hourly = pd.concat(dfs) df_hourly = df_hourly.reset_index()# --- IF YOU CHANGED VARIABLES, UPDATE THIS SECTION ---# Calculate actual wind speed for all three heights df_hourly['WindSpeed_2m'] = np.sqrt(df_hourly['U2M']**2+ df_hourly['V2M']**2) df_hourly['WindSpeed_10m'] = np.sqrt(df_hourly['U10M']**2+ df_hourly['V10M']**2) df_hourly['WindSpeed_50m'] = np.sqrt(df_hourly['U50M']**2+ df_hourly['V50M']**2) df_hourly['time'] = df_hourly['time'].dt.time df_hourly['date'] = pd.to_datetime(df_hourly['date'])# Extract the date, time, and all three wind speeds df_output = df_hourly[['date', 'time', 'WindSpeed_2m', 'WindSpeed_10m', 'WindSpeed_50m']] df_output.to_csv(file_path +'/'+ loc +'_hourly.csv', index=False)# Define how to aggregate all three columns agg_dict = {'WindSpeed_2m': aggregator, 'WindSpeed_10m': aggregator, 'WindSpeed_50m': aggregator}# ----------------------------------------------------- df_daily = df_output.groupby('date').agg(agg_dict) df_daily['date'] = df_daily.index df_daily.to_csv(file_path +'/'+ loc +'_daily.csv', index=False) df_weekly = df_daily.copy() df_weekly['Week'] = pd.to_datetime(df_weekly['date']).apply(lambda x: x.isocalendar()[1]) df_weekly['Year'] = pd.to_datetime(df_weekly['date']).apply(lambda x: x.year) df_weekly = df_weekly.groupby(['Year', 'Week']).agg(agg_dict).reset_index() df_weekly.to_csv(file_path +'/'+ loc +'_weekly.csv', index=False)
Step 6: Visualizing the Data
Finally, we plot our aggregated daily data.
Note: Just like Step 5, if you changed the requested parameters from Wind Speed to something else, you will need to update the column names being plotted here!
Code
######### PLOT THE DATA ##########print('PLOTTING DATA')for loc, lat, lon in locs: file_path ='WindSpeed/'+ loc csv_file = file_path +'/'+ loc +'_daily.csv'ifnot os.path.exists(csv_file):print(f"Skipping plot for {loc} - data file not found.")continue# Load the daily averaged data we just created df_plot = pd.read_csv(csv_file) df_plot['date'] = pd.to_datetime(df_plot['date']) plt.figure(figsize=(14, 6))# Plot the three lines plt.plot(df_plot['date'], df_plot['WindSpeed_50m'], label='50m Wind Speed', color='#1f77b4', linewidth=1.5, alpha=0.9) plt.plot(df_plot['date'], df_plot['WindSpeed_10m'], label='10m Wind Speed', color='#ff7f0e', linewidth=1.5, alpha=0.8) plt.plot(df_plot['date'], df_plot['WindSpeed_2m'], label='2m Wind Speed', color='#2ca02c', linewidth=1.5, alpha=0.7) plt.title(f'Daily Mean Wind Profile for {years[0]} - {loc.capitalize()}', fontsize=14, fontweight='bold') plt.xlabel('Date', fontsize=12) plt.ylabel('Wind Speed (m/s)', fontsize=12) plt.legend(loc='upper right', framealpha=1) plt.grid(True, linestyle='--', alpha=0.6) plt.tight_layout() plot_file = file_path +'/'+ loc +'_wind_profile_plot.png' plt.savefig(plot_file, dpi=300)print(f'Saved graph to: {plot_file}') plt.show()print('FINISHED')