← All posts

Demystifying MERRA-2: A Step-by-Step Python Guide to Downloading and Processing Climate Data

python
hindcast-data
Summary for listing page.
Author

Majid Bastankhah

Published

April 29, 2026

Demystifying MERRA-2: A Step-by-Step Python Guide to Downloading and Processing Climate Data

If you’ve ever tried to download NASA’s MERRA-2 climate data, you know it’s rarely a straightforward “click and save” process. Between navigating Earthdata credentials, understanding GEOS-5 native grid coordinates, and parsing complex NetCDF files, the learning curve can be steep.

In this tutorial, we will walk through a robust Python script designed to automatically download wind speed vectors (U and V) across multiple heights (2m, 10m, and 50m), calculate the true wind speeds, and aggregate the data into ready-to-use CSV files and time-series plots.

Credits & Acknowledgments: A special thanks to Emily Laiken’s merradownload GitHub repository, which provided the foundational multi-processing logic and GEOS-5 grid translation functions adapted for this tutorial.


Step 1: The Setup and Imports

First, we need to import our libraries. We will use xarray and pandas for handling the multi-dimensional climate data, and matplotlib for graphing.

Note: This script relies on a custom opendap_download module to handle the multiprocessing download requests. Make sure you have that available in your environment!

Code
# Imports
import os
import re
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from calendar import monthrange

# Custom module for NASA OPeNDAP downloads
from opendap_download.multi_processing_download import DownloadManager

Step 2: The Control Center (Understanding the Inputs)

This is the most important part of the script. This is where you tell the code what you want, where you want it, and when you want it. Let’s break down the key parts you can customize:

1. Choosing Your Location(s)

The locs variable is a list of locations. You can add as many as you want! Just follow the format ('Name', Latitude, Longitude). * Example for one location: locs = [('durham', 54.7753, -1.5849)] * Example for multiple: locs = [('durham', 54.7753, -1.5849), ('london', 51.5072, -0.1276), ('paris', 48.8566, 2.3522)]

2. Understanding Databases (database_id and database_name)

MERRA-2 doesn’t store all its data in one giant file. It splits data into different “Collections” (Databases) based on what the data represents. In our default code, we use M2T1NXSLV (Time-averaged, 1-Hourly, Single-Level Diagnostics). This database contains surface-level wind, temperature, and pressure. Want different data? Here is how you find it: * Radiation/Solar Data: You would change the ID to M2T1NXRAD and name to tavg1_2d_rad_Nx. * Tip: You can find a full list of MERRA-2 databases by searching for the “MERRA-2 File Specification Document” online.

3. Selecting Parameters and Heights (variables_to_download)

Once you choose a database, you must specify the exact variables you want. In our default code, we are asking for Wind Vectors (U is East-West wind, V is North-South wind) at different heights: 2 meters, 10 meters, and 50 meters. * Want just 10m wind? Change it to ['U10M', 'V10M'] * Want surface temperature instead of wind? Change it to ['T2M'] (2-meter temperature).

Code
####### THE CONTROL CENTER: INPUTS - CHANGE THESE #########
username = 'YOUR_USERNAME'  # Your Earthdata Login
password = 'YOUR_PASSWORD'  # Your Earthdata Password

# WHEN do you want the data?
years = [2017] 

# WHERE do you want the data? Add as many (Name, Lat, Lon) tuples as you want!
locs = [('durham', 54.7753, -1.5849)] 

# WHAT DATABASE do you want to pull from?
database_name = 'tavg1_2d_slv_Nx'  # Single-level diagnostics (Wind, Temp, Pressure)
database_id = 'M2T1NXSLV' 

# WHAT VARIABLES / HEIGHTS do you want from that database?
# U = Eastward wind vector, V = Northward wind vector. 
# Numbers represent height (2m, 10m, 50m)
variables_to_download = ['U2M', 'V2M', 'U10M', 'V10M', 'U50M', 'V50M']

aggregator = 'mean' 

####### CONSTANTS - DO NOT CHANGE BELOW THIS LINE #######
lat_coords = np.arange(0, 361, dtype=int)
lon_coords = np.arange(0, 576, dtype=int)
database_url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/' + database_id + '.5.12.4/'
NUMBER_OF_CONNECTIONS = 5

Step 3: The Translation Engine (Helper Functions)

NASA’s GEOS-5 grid doesn’t use standard decimal latitudes and longitudes. The functions below translate your standard GPS coordinates into the specific indices NASA’s servers require. You do not need to edit anything here.

Code
####### HELPER FUNCTIONS #########
def translate_lat_to_geos5_native(latitude):
    return ((latitude + 90) / 0.5)

def translate_lon_to_geos5_native(longitude):
    return ((longitude + 180) / 0.625)

def find_closest_coordinate(calc_coord, coord_array):
    index = np.abs(coord_array-calc_coord).argmin()
    return coord_array[index]

def translate_year_to_file_number(year):
    if year >= 1980 and year < 1992: return '100'
    elif year >= 1992 and year < 2001: return '200'
    elif year >= 2001 and year < 2011: return '300'
    elif year >= 2011: return '400'
    else: raise Exception('The specified year is out of range.')

def generate_url_params(parameter, time_para, lat_para, lon_para):
    parameter = list(map(lambda x: x + time_para, parameter))
    parameter = list(map(lambda x: x + lat_para, parameter))
    parameter = list(map(lambda x: x + lon_para, parameter))
    return ','.join(parameter)
    
def generate_download_links(download_years, base_url, dataset_name, url_params):
    urls = []
    for y in download_years: 
        y_str = str(y)
        file_num = translate_year_to_file_number(y)
        for m in range(1,13):
            m_str = str(m).zfill(2)
            _, nr_of_days = monthrange(y, m)
            for d in range(1,nr_of_days+1):
                d_str = str(d).zfill(2)
                file_name = 'MERRA2_{num}.{name}.{y}{m}{d}.nc4'.format(
                    num=file_num, name=dataset_name, y=y_str, m=m_str, d=d_str)
                query = '{base}{y}/{m}/{name}.nc4?{params}'.format(
                    base=base_url, y=y_str, m=m_str, name=file_name, params=url_params)
                urls.append(query)
    return urls

Step 4: Executing the Download

This loop takes the locations and variables you defined in Step 2, calculates the bounding boxes, and securely downloads the subsets from NASA’s OPeNDAP servers using multi-processing.

Code
print('DOWNLOADING DATA FROM MERRA')
for loc, lat, lon in locs:
    print('Downloading data for ' + loc)
    lat_coord = translate_lat_to_geos5_native(lat)
    lon_coord = translate_lon_to_geos5_native(lon)
    lat_closest = find_closest_coordinate(lat_coord, lat_coords)
    lon_closest = find_closest_coordinate(lon_coord, lon_coords)
    
    requested_lat = '[{lat}:1:{lat}]'.format(lat=lat_closest)
    requested_lon = '[{lon}:1:{lon}]'.format(lon=lon_closest)
    
    # We use the variables_to_download list you defined in the inputs!
    parameter = generate_url_params(variables_to_download, '[0:1:23]', requested_lat, requested_lon)
    
    generated_URL = generate_download_links(years, database_url, database_name, parameter)
    
    download_manager = DownloadManager()
    download_manager.set_username_and_password(username, password)
    download_manager.download_path = 'WindSpeed/' + loc
    download_manager.download_urls = generated_URL
    download_manager.start_download(NUMBER_OF_CONNECTIONS)

Step 5: Data Cleaning and Merging

MERRA-2 provides wind data as separate U and V vectors. This code block calculates the true “Wind Speed” magnitude using the Pythagorean theorem ( \(\sqrt{U^2 + V^2}\) ).

⚠️ IMPORTANT WARNING: If you changed your inputs in Step 2 to download something other than U and V wind vectors (for example, if you downloaded Temperature T2M), you will need to modify the math below. You wouldn’t use the Pythagorean theorem on Temperature! You would simply extract the df_hourly['T2M'] column.

Code
######### OPEN, CLEAN, MERGE, AND WRITE CSVS ##########
def extract_date(data_set):
    if 'HDF5_GLOBAL.Filename' in data_set.attrs:
        f_name = data_set.attrs['HDF5_GLOBAL.Filename']
    elif 'Filename' in data_set.attrs:
        f_name = data_set.attrs['Filename']
    else: 
        raise AttributeError('The attribute name has changed again!')
    exp = r'(?<=\.)[^\.]*(?=\.nc4)'
    res = re.search(exp, f_name).group(0)
    y, m, d = res[0:4], res[4:6], res[6:8]
    date_str = ('%s-%s-%s' % (y, m, d))
    data_set = data_set.assign(date=date_str) 
    return data_set

print('CLEANING AND MERGING DATA')
for loc, lat, lon in locs:
    print('Cleaning and merging data for ' + loc)
    dfs = []
    file_path = 'WindSpeed/' + loc
    
    os.makedirs(file_path, exist_ok=True)
    
    for file in os.listdir(file_path):
        if '.nc4' in file:
            try:
                with xr.open_mfdataset(file_path + '/' + file, preprocess=extract_date) as df:
                    dfs.append(df.to_dataframe())
            except Exception as e:
                print('Issue with file ' + file + ' | Error: ' + str(e))
                
    if len(dfs) == 0:
        print("No data downloaded!")
        continue
        
    df_hourly = pd.concat(dfs)
    df_hourly = df_hourly.reset_index()
    
    # --- IF YOU CHANGED VARIABLES, UPDATE THIS SECTION ---
    # Calculate actual wind speed for all three heights
    df_hourly['WindSpeed_2m'] = np.sqrt(df_hourly['U2M']**2 + df_hourly['V2M']**2)
    df_hourly['WindSpeed_10m'] = np.sqrt(df_hourly['U10M']**2 + df_hourly['V10M']**2)
    df_hourly['WindSpeed_50m'] = np.sqrt(df_hourly['U50M']**2 + df_hourly['V50M']**2)
    
    df_hourly['time'] = df_hourly['time'].dt.time
    df_hourly['date'] = pd.to_datetime(df_hourly['date'])
    
    # Extract the date, time, and all three wind speeds
    df_output = df_hourly[['date', 'time', 'WindSpeed_2m', 'WindSpeed_10m', 'WindSpeed_50m']]
    df_output.to_csv(file_path + '/' + loc + '_hourly.csv', index=False)
    
    # Define how to aggregate all three columns
    agg_dict = {'WindSpeed_2m': aggregator, 'WindSpeed_10m': aggregator, 'WindSpeed_50m': aggregator}
    # -----------------------------------------------------
    
    df_daily = df_output.groupby('date').agg(agg_dict)
    df_daily['date'] = df_daily.index
    df_daily.to_csv(file_path + '/' + loc + '_daily.csv', index=False)
    
    df_weekly = df_daily.copy()
    df_weekly['Week'] = pd.to_datetime(df_weekly['date']).apply(lambda x: x.isocalendar()[1])
    df_weekly['Year'] = pd.to_datetime(df_weekly['date']).apply(lambda x: x.year)
    df_weekly = df_weekly.groupby(['Year', 'Week']).agg(agg_dict).reset_index()
    df_weekly.to_csv(file_path + '/' + loc + '_weekly.csv', index=False)

Step 6: Visualizing the Data

Finally, we plot our aggregated daily data.

Note: Just like Step 5, if you changed the requested parameters from Wind Speed to something else, you will need to update the column names being plotted here!

Code
######### PLOT THE DATA ##########
print('PLOTTING DATA')
for loc, lat, lon in locs:
    file_path = 'WindSpeed/' + loc
    
    csv_file = file_path + '/' + loc + '_daily.csv'
    if not os.path.exists(csv_file):
        print(f"Skipping plot for {loc} - data file not found.")
        continue

    # Load the daily averaged data we just created
    df_plot = pd.read_csv(csv_file)
    df_plot['date'] = pd.to_datetime(df_plot['date'])
    
    plt.figure(figsize=(14, 6))
    
    # Plot the three lines
    plt.plot(df_plot['date'], df_plot['WindSpeed_50m'], label='50m Wind Speed', color='#1f77b4', linewidth=1.5, alpha=0.9)
    plt.plot(df_plot['date'], df_plot['WindSpeed_10m'], label='10m Wind Speed', color='#ff7f0e', linewidth=1.5, alpha=0.8)
    plt.plot(df_plot['date'], df_plot['WindSpeed_2m'], label='2m Wind Speed', color='#2ca02c', linewidth=1.5, alpha=0.7)
    
    plt.title(f'Daily Mean Wind Profile for {years[0]} - {loc.capitalize()}', fontsize=14, fontweight='bold')
    plt.xlabel('Date', fontsize=12)
    plt.ylabel('Wind Speed (m/s)', fontsize=12)
    plt.legend(loc='upper right', framealpha=1)
    plt.grid(True, linestyle='--', alpha=0.6)
    
    plt.tight_layout()
    plot_file = file_path + '/' + loc + '_wind_profile_plot.png'
    plt.savefig(plot_file, dpi=300)
    print(f'Saved graph to: {plot_file}')
    
    plt.show()

print('FINISHED')