In order to harmonize several disparate datasets, we implemented the workflow
described below. For a more complete description of our workflow and quality control
measures, please see the manuscript Meyer & Labou et al. (2020). The authors
recommend that when citing the GLCP to please cite both the data descriptor publication
as well as the data product hosted on EDI.
The authors recommend that future users implement similar packages used in this
workflow for manipulating the GLCP. In particular, loading the GLCP into certain
environments, such as R, often require the use of external packages. Most notably, the
fread() function from the data.table package (Dowle & Srinivasan, 2019) can
quickly read the GLCP into the R environment through the command:
glcp = fread(x = 'glcp.csv', header = TRUE, integer64 = 'character').
Data Sources
Lake locations and boundaries
For the locations of lakes, we used the HydroLAKES database version 1.0 (Messager
et al. 2016) (https://www.hydrosheds.org/page/hydrolakes), which incorporates multiple
lake datasets (e.g., Shuttle Radar Topology Mission, Water Body Data, Global Lakes and
Wetlands Database) and includes 1,427,688 lakes of at least 10 hectares in surface area.
The majority of HydroLAKES lakes are defined as uncontrolled lakes (99.5%), with the
remainder identified as reservoirs (0.47%) and controlled lakes (0.03%). HydroLAKES,
which is available in the form of shapefiles, includes an extensive number of attributes
for lake polygons including: lake surface area (polygon area), elevation, shoreline
development, total volume, average depth, residence time, latitude and longitude of pour
point, lake type, and others. The HydroLAKES v1.0 identifier (Hylak_id) is retained in
the GLCP to facilitate future work making use of other attributes in the HydroLAKES
data, which are not included in the GLCP. Additionally, the GLCP also contains the
latitude (centr_lat) and longitude (centr_lon) of each lake’s centroid, which were
calculated within ArcGIS version 3.1 (ESRI 2015).
Hereafter, HydroLAKES lake polygons are referred to as lakes.
Basins
Because lakes are products of the landscapes in which they reside, we calculated
climate and population values relative to each lake’s basin. To identify basin
boundaries, we used the HydroBASINS dataset, a basin-level analog to the HydroLAKES
dataset. The HydroBASINS version 1.c format 1 database (Lehner and Grill 2013) includes
3,786,218 unique basins and is derived from the HydroSHEDS database (Lehner et
al. 2008), which uses 15 arc-second resolution data to identify river basins,
watersheds, and sub-basins globally. In HydroBASINS, basins are identified using the
Pfafstetter coding system, with Level 1 as the highest level (i.e., continent level) and
Level 12 as the smallest available sub-basin. Table 1 details the number and median size
of basins within each Pfafstetter level for basins used within the GLCP. We retain the
original HydroBASINS version 1.c identifier (HYBAS_ID) for each basin in the GLCP, for
ease of future integration with existing HydroBASINS attributes, such as distance from
basin outlet to next downstream sink and indicators of endorheic basins.
Hereafter, HydroBASINS polygons are referred to as basins.
Surface water extent
For changes in lake surface water area over time, we used the Joint Research Centre
(JRC) Global Surface Water Dataset described in Pekel et al.(2016), which used LANDSAT
imagery (30 meter resolution) from March 1984 through October 2015 to identify changes
in surface water area for lakes, rivers, streams, and wetlands. The data are hosted by
the European Commission JRC and are formally referred to as the Global Surface Water
Dataset. Hereafter, we use the abbreviation JRC to refer to this dataset.
The JRC data subsetted for Yearly Water Classification History v1.0 (1984-2015) are
publicly available through Google Earth Engine (GEE) as annually aggregated raster
images. Each image contains a waterClass band with the following values: 0 = no
observations, 1 = not water, 2 = seasonal water (defined as water that is present for at
least one month but not an entire year), 3 = permanent water (defined as water that is
present for all twelve months). For more detailed information on the complete LANDSAT
processing workflow used to create the JRC, Pekel et al.(2016) provides a methodology of
how waterClasses were assigned based on raw LANDSAT data.
While the JRC dataset is the most extensive global surface water dataset available
to date, it is limited by the LANDSAT data from which it is derived. Even though LANDSAT
coverage began in 1984, portions of northeastern Siberia (Kolyma region and Central
Siberian plateau) as well as central Greenland were not included in totality until 1999.
Additionally, the JRC is limited by the water identification algorithm used by
Pekel et al.(2016), which divided pixels into water, land, or non-valid observations,
where non-valid observations may include snow and ice. This system therefore does not
classify permanently frozen lakes as water. Seasonally frozen lakes, however, would be
coded as entirely seasonal water.
Climate data
We used the Modern-Era Retrospective analysis for Research and Applications,
Version 2 (MERRA-2) (Gelaro et al. 2017) as the source for climate data. Both
precipitation and temperature datasets(Global Modeling and Assimilation Office (GMAO)
2015) were hourly aggregates with original spatial resolution of 0.5 x 0.625 decimal
degrees. From these broader datasets, we extracted the variables PRECTOTCORRLAND (total
precipitation land; bias corrected; in kg m-2 s-1, or volumetrically, mm s-1) for
precipitation and T2MMEAN (2-meter air temperature in K) for temperature. These subsets
were exported from NASA Goddard Earth Sciences (GES) and Data Information Services
Center (DISC) in a netCDF format for local analysis.
Population estimates
We used the Gridded Population of the World (GPW) version 3(Center for
International Earth Science Information Network - CIESIN - Columbia University et
al. 2005) for 1995 and GPW version 4(Center for International Earth Science Information
Network - CIESIN - Columbia University) un-adjusted population count data for 2000,
2005, 2010, and 2015 population estimates. Resolution for the GPW version 3 is 2.5
arc-minutes and is available for download from NASA’s Socioeconomic Data and
Applications Center (SEDAC). Resolution for GPW version 4 is 30 arc-seconds and is
currently hosted on Google Earth Engine. Detailed methodology for the development of
these datasets is available in Doxsey-Whitfield et al. (2015)
Data harmonization process
As this project involved harmonizing multiple global datasets at different
resolutions, our workflow required multiple steps, each of which resulted in a cleaned
data subset. Here we detail the steps taken to integrate the HydroLAKES, HydroBASINS,
JRC, climate, and human population datasets described above.
Step #1: Calculate lake surface area
Lake surface area for each lake from 1995 to 2015 was calculated using Google Earth
Engine(Gorelick et al. 2017). Lake polygons were uploaded and imported into Earth Engine
as shapefiles. These lake polygons, which represent typical shape and area for
individual lakes, were buffered by a specified distance to allow water area calculations
to account for increases in lake area beyond the HydroLAKES polygon borders as specified
in HydroLAKES. Buffered lake polygons were then used as boundaries within which to
summarize pixels from annual JRC data for each waterClass category (i.e., no data, not
water, seasonal water, or permanent water). We calculated total water as the sum of
seasonal and permanent water pixels. Resulting area values were exported in .csv format
to Google Drive, then downloaded for local analysis using the R statistical
environment(R Core Team 2019). Commented Google Earth Engine code for lake area
calculations (jrc_water_class_sum.txt), the R script for formatting Google Earth Engine
output (01_import_format_JRC.R)(Wickham and Henry 2018; Wickham et al. 2018; Dowle and
Srinivasan 2019), and associated input data are available in the Environmental Data
Initiative (EDI) GLCP repository (Labou et al.) within the entity glcp.tar.gz.
To evaluate how lake waterClass areas fluctuated with various buffer sizes, we
calculated lake waterClass areas for 1995, 2000, 2005, 2010, and 2015 with 30 m, 60 m,
90 m, and 120 m buffers. Preliminary tests indicated smaller buffers were insufficient
to capture large area increases, while larger buffers increased risk of overlapping
neighboring lakes (especially in dense lake areas), smaller ponds, or input/output
rivers and erroneously increasing lake area totals. Our analysis of waterClass areas
between buffer sizes and years indicated 90 m as the most appropriate distance.
Additional details are provided in the Technical Validation section.
We identified a minority of lakes that were unable to be included in the final data
product. One lake in North America was identified as having a broken geometry (Hylak_id
= 109424), making it incompatible with Earth Engine-based analyses. Rather than attempt
to repair the lake shapefile boundaries and potentially change the size and shape, we
chose not to include this lake. Additionally, a small number of lakes were identified to
be outside the range of reliable LANDSAT data. The available JRC data has a maximum
extent of 80 degree N and Pekel et al. (Pekel et al. 2016) note that LANDSAT images
above 78 degree N are sparse, partially due to the short LANDSAT observation season in
high northern latitudes. As such, we limited further processing to lakes whose entire
extent is below 78 degree N, which excluded 3,220 lakes (0.23% of original 1.4 million
lakes).
Given the potential for lakes in this area to have inaccurate area measurements
prior to 1999, we calculated ratios of no data pixel areas to not water, seasonal water,
permanent water, and total water pixel areas. These ratios will enable future users to
set desired thresholds of no data coverage that are specific to their research
questions. These ratios are provided within a secondary .csv file
(JRC_all_no_data_proportions_yearly_95thru15.csv) and can be merged efficiently with the
full GLCP using the provided R script (combine_data_availability_metrics_with_glcp.R).
Step #2: Match lakes with basins
Because HydroBASINS is derived from river networks, rather than lake pour points,
the HydroLAKES and HydroBASINS data do not come with a pre-existing 1:1 matching scheme
for lake and basin. To match lakes with their equivalent basins, we performed spatial
joins for the lake shapefiles and basin shapefiles to identify the smallest basin that
enclosed a lake in its entirety. With this basin matching scheme, there is potential for
some lakes to be assigned to basins which are larger than their actual basin. As a
result, future users are encouraged to compare GLCP-associated basin area to a lake’s
known basin area, if those data exist. Comparing HydroBASINS river basins to known lake
basins would enable researchers to determine if differences in basin assignment are
meaningful of their specific research question. All lakes that fell within a Pfafstetter
Level 12 basin (85% of lakes, Table 1) were tagged with the Level 12 basin identifier,
because no smaller sub-basins were available. The highest level Pfafstetter basin used
was Level 2 (Level 1 being near continent-level), which was sufficient to capture the
watersheds of very large lakes, such as the Laurentian Great Lakes and the Caspian Sea.
Of the original 3,786,218 HydroBASINS basins, 232,827 were paired with lakes (6.15%).
This basin matching procedure was performed within Google Earth Engine
(hylak_hybasin_matching.txt) and outputs were formatted locally using R
(07_lake_basin_matching.R) (Wickham et al. 2018; Dowle and Srinivasan 2019).
Using this lake/basin matching procedure, 1,949 lakes (0.14% of the original 1.4
million HydroLAKES lakes) were unable to be properly associated with a basin. Manual
investigation indicated that these lakes were either located on islands (645 lakes,
0.05% of the original HydroLAKES) or would be associated with only a Level 1 basin
(1,304 lakes, 0.09% of the original HydroLAKES). Lakes located on islands are excluded
from the GLCP because their natural basins are not included in the continental basin
schema that HydroBASINs employs. Similarly, the 1,304 lakes associated with Level 1
basins were consistently located on the boundary between neighboring basins and
therefore never completely enclosed in a single basin. This peculiarity is largely
because HydroBASINS is constructed for river networks, as opposed to lakes. Because it
is unrealistic for these 1,304 lakes (average total area: 2.39 km2) to be influenced by
near continental-scale climate and human population forcings, we excluded these lakes
from further processing.
Step #3: Calculate basin-level precipitation and temperature estimates
Once basins were associated with lakes, basin-level climate values were calculated.
Within the R environment, precipitation values from MERRA-2 were converted to annually
accumulated precipitation by aggregating hourly data for each gridcell for each year
(Reichle et al. 2017b; a). We also derived the average monthly volume of precipitation
for each gridcell for each year (1995-2015) by taking the mean of each year’s total
monthly precipitation volumes (summing_hourly_data_precip_mm.R)(Bivand et al. 2019;
Hijmans 2019; Pierce 2019; Revolution Analytics and Weston 2019). Temperature values
were similarly used to derive an average annual temperature for each year
(summing_hourly_data_temp_K.R) (Tierney et al. 2018; Bivand et al. 2019; Hijmans 2019;
Pierce 2019; Revolution Analytics and Weston 2019). The resulting yearly data were saved
as rasters. Yearly total precipitation, average monthly total precipitation, and
temperature rasters (1995-2015) were then resampled at 1/10th cell size through a
bilinear interpolation resampling. The original rectangular grids were converted to
squares, with spatial resolution of 0.05 x 0.05 decimal degrees. Because MERRA-2
gridcells were originally sized at 0.5 x 0.625 decimal degree resolution, the initial
conversion from a netCDF to a raster format induced extra space (e.g., 90.25 degree N in
raster). As such, resampled rasters were clipped to 90 degree N/S and 180 degree W/E and
converted to geotiff format for upload to Google Earth Engine
(manipulate_climate_rasters.R) (Bivand et al. 2019; Hijmans 2019; Pierce 2019, p. 4;
Revolution Analytics and Weston 2019).
For each basin associated with a lake, basin-level average and total precipitation,
as well as average temperature, were calculated for each year of interest in Earth
Engine. The process was similar to the one described for lake polygons and JRC data,
whereby the basins were used as boundaries from which to extract and aggregate pixels.
Results were exported as .csv files to Google Drive, then downloaded for local analysis
using R. R scripts for data aggregation of climate variables
(04_post_gee_processing_temp.R , 05_post_gee_processing_precip_sum.R,
06_post_gee_processing_precip_average.R)(Wickham et al. 2018; Wickham 2019; Dowle and
Srinivasan 2019) are available on the EDI GLCP repository (Labou et al.) within the
entity glcp.tar.gz.
This process resulted in 10 matched basins (of the original 232,837 matched basins;
0.004%), which were associated with 19 lakes, with missing values for climate variables.
These 10 basins ranged in size from 1.1 to 181.7 km2 with a median of 76.5 km2. These
basins and lakes were removed from the dataset. Manual assessment showed that these
basins were located at higher northern latitudes in the United States and the Russian
Federation. We note that other temperature and precipitation datasets are available;
subsequent analyses can incorporate alternative climate data sources to match with these
basins through the scripts and workflow provided in the EDI entity glcp.tar.gz.
Future users should also note that while HydroBASINS provided a boundary to
calculate climate variables for each lake, these calculations may be overestimates, as
many lakes’ actual basins may be smaller than their associated river basin. However, as
with the addition of new climate variables, subsequent analyses can also incorporate
different basin schemes as lake basin shapefiles become available.
Step #4: Calculate basin-level population estimates
While other data sources in this project are annual, the global population data we
used, which was the current best available at the global scale, was for 5-year
increments (1995, 2000, 2005, 2010, 2015). Rather than interpolate the intervening
years’ values, we chose to leave these blank so that future researchers can personalize
statistical methodology to best address these data gaps in context of a specific
question. Aside from blank values, numerous basins have population estimates as decimal
values. Rather than truncate these values which were produced through the aggregation
process within Google Earth Engine, we retain these so that future users may decide how
they wish to round or otherwise interpret these values in the context of their
particular research question.
For each 5-year increment, human population calculations were performed with a
technique similar to the climate data aggregations. GPWv3 (1995 data) was converted to a
geotiff and imported into Earth Engine. GPWv4 (2000, 2005, 2010, and 2015) rasters were
available through the Earth Engine interface. Basin-level population totals were
calculated from GPWv3 and GPWv4 data with basin polygons as spatial boundaries. Results
were exported as .csv files to Google Drive and then downloaded for local analysis
within the R environment. R scripts for data aggregation of population counts
(02_load_shp_GPWv3.R, 03_load_shp_GPWv4.R) (Wickham et al. 2018; Wickham 2019; Dowle and
Srinivasan 2019) are available on the EDI GLCP repository (Labou et al.) within the
entity glcp.tar.gz .
Like climate variables, population calculations have the potential to be
overestimated, as many lakes’ actual basins may be smaller than their associated river
basin. Additionally, future users should be cognizant of heterogeneous human populations
within the basin, which could skew analyses of relationships between human population
and lake area. As is currently calculated, populations estimates are at the basin scale,
whereas certain research questions may be more focused at population estimates within a
particular distance from the lake (e.g., 500 m). However, subsequent analyses can also
incorporate different basin schemes as lake basin shapefiles become available.
Step #5: Merge lake- and basin-level data
Lake- and basin-level output were merged within the R environment. The R script for
GLCP production (08_cleaning_glcp_production.R) (Wickham and Henry 2018; Wickham et
al. 2018; Dowle and Srinivasan 2019) is available in the EDI GLCP repository within the
entity glcp.tar.gz.
Synopsis of lake exclusions during data harmonization:
The final GLCP dataset contains 1,422,499 lakes. These lakes were used to generate
all dataset summary statistics reported below.
References
Bivand, R., T. Keitt, and B. Rowlingson. 2019. rgdal: Bindings for the Geospatial
Data Abstraction Library,.
Center for International Earth Science Information Network - CIESIN - Columbia
University. Gridded Population of the World, Version 4 (GPWv4): Population Count,
Revision 11.
Center for International Earth Science Information Network - CIESIN - Columbia
University, United Nations Food and Agriculture Programme - FAO, and Centro
Internacional de Agricultura Tropical - CIAT. 2005. Gridded Population of the World,
Version 3 (GPWv3): Population Count Grid.
Dowle, M., and A. Srinivasan. 2019. data.table: Extension of data.frame
,.
Doxsey-Whitfield, E., K. MacManus, S. B. Adamo, L. Pistolesi, J. Squires, O.
Borkovska, and S. R. Baptista. 2015. Taking advantage of the improved availability of
census data: a first look at the gridded population of the world, version 4. Papers in
Applied Geography 1: 226-234.
ESRI. 2015. ArcGIS Desktop: Release 10.3.1, Environmental Systems Research
Institute.
Gelaro, R., W. McCarty, M. J. Suarez, and others. 2017. The Modern-Era
Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Journal of
Climate 30: 5419-5454. doi:10.1175/JCLI-D-16-0758.1
Global Modeling and Assimilation Office (GMAO). 2015. MERRA-2 tavg1_2d_flx_Nx:
2d,1-Hourly,Time-Averaged,Single-Level,Assimilation,Surface Flux Diagnostics V5.12.4.
Goddard Earth Sciences Data and Information Services Center (GES DISC).
doi:10.5067/7MCPBJ41Y0K6
Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. 2017.
Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of
Environment 202: 18-27.
Hijmans, R. J. 2019. raster: Geographic Data Analysis and Modeling,.
Labou, S. G., M. F. Meyer, M. R. Brousil, A. N. Cramer, and B. T. Luff. The global
lake area, climate, and population (GLCP) dataset.
Lehner, B., and G. Grill. 2013. Global river hydrography and network routing:
baseline data and new approaches to study the world’s large river systems. Hydrological
Processes 27: 2171-2186. doi:10.1002/hyp.9740
Lehner, B., K. Verdin, and A. Jarvis. 2008. New Global Hydrography Derived From
Spaceborne Elevation Data. Eos Trans. AGU 89: 93-94. doi:10.1029/2008EO100001
Messager, M. L., B. Lehner, G. Grill, I. Nedeva, and O. Schmitt. 2016. Estimating
the volume and age of water stored in global lakes using a geo-statistical approach.
Nature Communications 7: 13603. doi:10.1038/ncomms13603
Pekel, J.-F., A. Cottam, N. Gorelick, and A. S. Belward. 2016. High-resolution
mapping of global surface water and its long-term changes. Nature 540: 418-422.
doi:10.1038/nature20584
Pierce, D. 2019. ncdf4: Interface to Unidata netCDF (Version 4 or Earlier) Format
Data Files,.
R Core Team. 2019. R: A Language and Environment for Statistical Computing,.
Reichle, R. H., C. S. Draper, Q. Liu, M. Girotto, S. P. P. Mahanama, R. D. Koster,
and G. J. M. De Lannoy. 2017a. Assessment of MERRA-2 Land Surface Hydrology Estimates.
Journal of Climate 30: 2937-2960. doi:10.1175/JCLI-D-16-0720.1
Reichle, R. H., Q. Liu, R. D. Koster, C. S. Draper, S. P. Mahanama, and G. S.
Partyka. 2017b. Land surface precipitation in MERRA-2. Journal of Climate 30: 1643-1664.
Revolution Analytics, and S. Weston. 2019. doParallel: Foreach Parallel Adaptor for
the parallel Package,.
Tierney, L., A. J. Rossini, N. Li, and H. Sevcikova. 2018. snow: Simple Network of
Workstations,.
Wickham, H. 2019. stringr: Simple, Consistent Wrappers for Common String
Operations,.
Wickham, H., R. Francois, L. Henry, and K. Muller. 2018. dplyr: A Grammar of Data
Manipulation,.
Wickham, H., and L. Henry. 2018. tidyr: Easily Tidy Data with spread() and gather()
Functions,.