by Osho Jha, Chief Data Scientist, Arbol
Part 1: Setting up a virtual environment and retrieving NOAA data
Creating a Virtual Environment
Geospatial data is becoming an increasingly powerful data set for a variety of applications. This type of data, which incorporates a geographical element, is a great representation of the “Big Data Revolution” as a whole because of its granularity, size, availability, and flexibility. Modern programming languages offer the ability to work with this type of data, something that was previously only possible through specialized software.
Python, the language we will use for our examples, offers many packages for working with geospatial data and its relevant file types. Specifically, we make use of xarray, netCDF4, geopandas, and shapely. Installing these packages is not always straightforward due to the many libraries that need to be installed as dependencies.
In our experience, we found that setting up a virtual environment was the simplest and cleanest way to successfully install all relevant packages. A Python virtual environment is an isolated sandbox for Python projects. This allows projects to have their own dependencies regardless of what dependencies other projects have. Virtual environments also avoid version-related issues when working with older or esoteric packages and when sharing code between developers.
This code was run on Python 3.6.6 and Conda 4.5.10 on Mac OS X El Capitan 10.11.4.
We highly recommend the use of Anaconda in order to manage package versions.
In order to set up the virtual environment, open up a terminal and execute the following commands (this assumes that you are in the correct directory or have Python in your PATH):
conda create -n env_name python=3.6
source activate env_name
env_name is the name of the environment you are creating; no quotes are necessary. The source activate command activates the environment and places you in it.
Now your environment will need new packages. Since it is a conda-based virtual environment, a lot of base packages come pre-installed. To start, we installed Jupyter using the following command:
conda install -n env_name package_name
Virtual environments allow you to pin the package version as well. Simply modify the command above with package_name=X.X, where X.X is the relevant version number for the package. For the purposes of working with geospatial data, we installed the following packages (example install commands follow the list):
1. Xarray version 0.10.8
2. netCDF4 version 1.4.0
3. geopandas version 0.3.0
4. shapely version 1.6.4.post1
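As a concrete example, the pinned installs for this environment would look something like the following (exact version strings can differ slightly between conda channels and PyPI, e.g. shapely's post-release suffix):
conda install -n env_name xarray=0.10.8
conda install -n env_name netcdf4=1.4.0
conda install -n env_name geopandas=0.3.0
conda install -n env_name shapely=1.6.4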
Finally, if using Jupyter (which we also recommend), run the following command:
python -m ipykernel install --user --name env_name --display-name "Name to Show in Jupyter"
“Name to Show in Jupyter” is the name you would like displayed when you list the kernels in a Jupyter Notebook. After this command has run, we can launch a Jupyter notebook as we normally would and select the new environment we just created as a kernel in Jupyter.
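If you want to confirm the kernel registered before launching a notebook, Jupyter can list the kernels it knows about; the name you passed to --name should appear alongside the default python3 kernel:
jupyter kernelspec list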
Retrieving NOAA Data
Now that we have set up our virtual environment we can pull down some data from the National Oceanic and Atmospheric Administration (NOAA). NOAA, through the National Centers for Environmental Information (NCEI), provides access to atmospheric, oceanographic, coastal, and geophysical data. Seemingly, you could even get data describing features on the surface of the sun:
“From the depths of the ocean to the surface of the sun and from million-year-old ice core records to near real-time satellite images, NCEI is the Nation’s leading authority for environmental information.”
The data we are using for our example will be from CPC Global Unified Gauge-Based Analysis of Daily Precipitation. This dataset has daily data going back to 1979 covering the globe in 0.5 degree longitude and latitude steps. The data itself consists of gauge measurements across stations for a set geography. Gauge observations are critical in constructing precipitation analyses, but we will discuss this data set in depth in another piece.
The data is accessible through an FTP site run by NOAA and can be downloaded from here: ftp://ftp.cdc.noaa.gov/Datasets/cpc_global_precip/
Our GitHub has code for downloading this data using Python, but it is possible to download the files manually as well. The files are in netCDF (network Common Data Form) format. netCDF files are commonly used for storing large quantities of spatial data. In particular, the format is a convenient way to store grid data as a time series.
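Grabbing the files programmatically is straightforward; the sketch below is a minimal example using only the standard library (it is not the exact code from our GitHub, and the end year and target directory are placeholders you should adjust):
import urllib.request

base_url = 'ftp://ftp.cdc.noaa.gov/Datasets/cpc_global_precip/'
for yr in range(1979, 2019):
    fname = 'precip.' + str(yr) + '.nc'
    # urlretrieve also handles ftp:// URLs; each file holds one year of daily data
    urllib.request.urlretrieve(base_url + fname, 'path_to_directory/' + fname)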
Earlier, we installed some packages in our virtual environment that help Python handle netCDF files. In particular, we use xarray's open_dataset function to open our .nc files. xarray provides the ability to compute on netCDF frames much in the way pandas provides the ability to compute on panel data. xarray also contains functions that allow data to be easily converted to DataFrames and vice versa. For familiarity, we convert netCDF data to DataFrames in this example. The following function opens the netCDF files and returns a daily precipitation time series for our chosen location and year range, which can then be rolled up to monthly totals:
import pandas as pd
import xarray as xr

def get_CPC_precip_lat_lon(lat, lon, yr_range):
    grid_precip_ts = pd.Series()
    for yr in yr_range:
        # each year of CPC data lives in its own netCDF file
        fname = 'path_to_directory/precip.' + str(yr) + '.nc'
        ds = xr.open_dataset(fname)
        df = ds.to_dataframe()
        # select the daily series for the requested grid point
        yr_ts = df.loc[(lat, lon)]
        yr_ts.index.name = None
        yr_ts = pd.Series(yr_ts.values.squeeze(), index=yr_ts.index.values)
        grid_precip_ts = pd.concat([grid_precip_ts, yr_ts], axis=0)
    return grid_precip_ts
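As a quick usage sketch, the daily series can be resampled to monthly totals with pandas; the grid point and year range below are purely illustrative and need to match coordinates that actually exist in your files:
# illustrative coordinates and years; adjust to a grid point present in the data
daily_precip = get_CPC_precip_lat_lon(40.25, 250.25, range(2010, 2018))
daily_precip.index = pd.to_datetime(daily_precip.index)
monthly_precip = daily_precip.resample('M').sum()
print(monthly_precip.head())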
So now we are up and running with some NOAA data and can dig into the analysis.
Part 2: Tag Missing Data Using Shapefiles
Find Missing Data
In order to start tagging missing data, we want to take a more granular look than the one provided by the function in our earlier post. Luckily, some of the core code of our previous function can easily be reworked to concatenate the raw data into a DataFrame:
# loop through years to process each file
# remember to change fname to the relevant path
stats = pd.DataFrame()
for yr in yr_range:
    print("Checking data for: " + str(yr))
    fname = '/Users/Osho/cpc_noaa/precip.' + str(yr) + '.nc'
    ds = xr.open_dataset(fname)
    df_data = ds.to_dataframe()
    missing_days = tabulateMissingDays_general(df_data)
    missing_days.insert(loc=0, column='year', value=[yr] * len(missing_days))
    stats = pd.concat([stats, missing_days], axis=0)
Within the loop, df_data holds the raw data from a single .nc file as a DataFrame indexed by [lat, lon, time] with a precip column; time has daily granularity. Each year's DataFrame is passed to the following function, which tabulates the number of locations missing x days of data for year y, and the per-year results are concatenated into stats:
import collections
import numpy as np

def tabulateMissingDays_general(df_all_data):
    df_all_data.reset_index(inplace=True)
    # cutting out Antarctica and the Arctic
    trim_locations = df_all_data[(df_all_data['lat'] >= -60) & (df_all_data['lat'] <= 80)].copy(deep=True)
    trim_locations['isNaN'] = np.isnan(trim_locations['precip'])
    unique_locations = trim_locations[['lat', 'lon']].drop_duplicates(subset=['lat', 'lon'], keep='first')
    total_measurements = len(unique_locations)
    nan_rows = trim_locations[trim_locations['isNaN'] == True]
    # you can do a lot of different subsets to get more detail on specific lat, lon
    # pairs and specific days; this approach is best for general data checking
    count_nan = nan_rows[['lat', 'lon', 'time']]
    count_nan = count_nan.groupby(['lat', 'lon']).agg({'time': np.size})
    count_nan.columns = ['missing_days']
    # count how many locations share each missing-day total
    ctr = collections.Counter(count_nan['missing_days'])
    unique_nans = pd.DataFrame.from_dict(ctr, orient='index').reset_index()
    unique_nans.columns = ['missing_days', 'num_locations']
    unique_nans['missing_locations_percent'] = unique_nans['num_locations'] / total_measurements
    unique_nans['missing_locations_percent'] = pd.Series(
        ["{0:.2f}%".format(val * 100) for val in unique_nans['missing_locations_percent']],
        index=unique_nans.index)
    return unique_nans
We trim out data for Antarctica and the Arctic because our focus is precipitation analysis to better understand vegetation trends, but this is not necessary for other types of analysis. Those regions actually have a lot of data, as they are crucial to studies on climate change.
Now that we have an understanding of the gaps in our data, we can dig further to look at the missing days by location:
# remember to swap this function into the loop we created above
def tabulateMissingDays_locations(df_all_data):
    df_all_data.reset_index(inplace=True)
    # cutting out Antarctica and the Arctic
    trim_locations = df_all_data[(df_all_data['lat'] >= -60) & (df_all_data['lat'] <= 80)].copy(deep=True)
    trim_locations['isNaN'] = np.isnan(trim_locations['precip'])
    nan_rows = trim_locations[trim_locations['isNaN'] == True]
    count_nan = nan_rows[['lat', 'lon', 'time']]
    # number of missing days per (lat, lon) grid point
    count_nan = count_nan.groupby(['lat', 'lon']).agg({'time': np.size})
    count_nan.columns = ['missing_days']
    return count_nan
This function returns a DataFrame where each row is a lat and lon pairing and the number of missing days of data for that location; the loop above tacks on the year. While this is useful, it would be much easier to understand if we could get a country- and state-level tagging of each lat and lon. In order to tag the data we will need to use shapefiles.
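Before moving on to shapefiles, one quick way to sanity-check stats (assuming it was rebuilt with this location-level function in the loop from earlier) is to sort by the number of missing days:
# rank grid points by how many days of data they are missing
worst_locations = stats.sort_values('missing_days', ascending=False)
print(worst_locations.head(10))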
Working with Shapefiles
Shapefiles store geometric locations and attributes such as points, lines, and polygons. Using a shapefile, we can create polygons and define their borders in terms of (lat, lon) pairings. Specifically, we will be utilizing the Database of Global Administrative Areas (GADM) files, which have pre-defined polygons that describe every country on Earth. The files come in six levels of increasing granularity. For our purposes, we use level 2, which gives a country and state breakdown.
The GADM files can be downloaded by country or for the whole world. The world file is easy enough to work with, but if you do want to download by country to save space, we have a module that can stitch shapefiles together into a DataFrame. In order to do this, we need to use the geopandas package, which can read shapefiles and provides functions for spatial operations with polygons, all in a convenient pandas-like DataFrame:
import glob
import os
import geopandas as gpd

shape_file_dir = "path_to_shapefile_directory"

def getShapeFiles(directory):
    os.chdir(directory)
    file_list = []
    for file in glob.glob("*.shp"):
        file_list.append(directory + "/" + file)
    return file_list

shape_files = getShapeFiles(shape_file_dir)
# read each shapefile into a GeoDataFrame, then stitch them together
gdf_shape_files = []
for file in shape_files:
    gdf_shape_files.append(gpd.read_file(file))
gdf_all_geos = pd.concat(gdf_shape_files, sort=False)
In order to stitch the files together, we must first create a list of geopandas frames and then use pandas' concat function. Looking at the data from our shapefile, we see that there is a column called “geometry” which lays out polygons for the associated country and state, named in NAME_0 and NAME_1, respectively. Now we can check to see where our lat, lon pairs with missing data are located.
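For a quick look at the stitched frame (column names as in the GADM level 2 files), you can peek at the country, state, and polygon columns:
# NAME_0 is the country, NAME_1 the state/province, geometry the polygon
print(gdf_all_geos[['NAME_0', 'NAME_1', 'geometry']].head())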
Using Geopandas to Tag Missing Data
In order to get our geographic tagging, we will have to convert our lat and lon pairs into POINT objects and then utilize geopandas' sjoin, which allows joining on a “within” operation. This checks to see whether a POINT falls within a POLYGON.
Consider a data frame called unique_locations with a column of lats and a column of lons that represent all unique lat, lon pairings in our NOAA data with missing data:
from shapely.geometry import Point, Polygon

# shapely points are (x, y) = (lon, lat); if your longitudes run 0-360, shift them to -180..180 first
geometry = unique_locations.apply(lambda x: Point([x['lon'], x['lat']]), axis=1)
unique_locations["geometry"] = geometry
unique_locations_gdp = gpd.GeoDataFrame(unique_locations, geometry=geometry)
unique_locations_gdp.crs = {"init": "epsg:4326"}
# join the points against the stitched GADM frame built earlier
tagged_locations = gpd.sjoin(unique_locations_gdp, gdf_all_geos, op="within")
CRS is a coordinate reference system. This needs to be specified in order to complete the join. More info on CRS can be found in the geopandas docs http://geopandas.org/projections.html
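After the join, each point carries the attributes of the polygon it falls within, so the country and state tags can be read straight off the result (a rough sketch; your frame may carry extra columns such as missing_days):
# country (NAME_0) and state (NAME_1) tags for each missing-data grid point
print(tagged_locations[['lat', 'lon', 'NAME_0', 'NAME_1']].head())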
Visit Arbol’s GitHub for more information on our project.
Arbol makes automated payments based on weather outcomes using smart contracts and third-party weather data.
Sign up for our Pilot Program
Say hi on Twitter