In order to harmonize several disparate datasets, we implemented the
workflow described below. For a more complete description of our
workflow and quality control measures, please see the manuscript Labou
et al. (Under Review).
Lake locations and boundaries
For the locations of lakes, we used the HydroLAKES database version
1.04, which unifies other lake datasets (e.g., SRTM Water Body Data,
Global Lakes and Wetlands Database) into a data product totaling
1,427,688 lakes of at least 10 hectares in surface area. The majority
of HydroLAKES lakes are defined as uncontrolled lakes (99.5%), with
the remainder identified as reservoirs (0.47%) and controlled lakes
(0.03%). HydroLAKES, which is available in the form of shapefiles,
includes an extensive number of attributes for lake polygons
including: lake surface area (polygon area), elevation, shoreline
development, total volume, average depth, residence time, latitude and
longitude of pour point, lake type, and others. The HydroLAKES v1.0
identifier ("Hylak_id") is retained in the GLCP to
facilitate future work making use of other attributes in the
HydroLAKES data, which are not included in the GLCP.
Hereafter, HydroLAKES lake polygons are referred to as lakes.
Basins
Because lakes are products of the landscapes in which they reside, we
calculated climate and population data for each lake's basin. To
delineate basin area, we used the HydroBASINS dataset, a basin analog
to the HydroLAKES dataset. The HydroBASINS version 1.c format 1
database5 is derived from the HydroSHEDS database6 with 15 arc-second
resolution data to identify river basins, watersheds, and sub-basins
globally. In HydroBASINS, basins are identified using the Pfafstetter
coding system, with Level 1 as the highest level (e.g., continent
level) and Level 12 as the smallest available sub-basin. Table 1
details the number and median size of basins within each Pfafstetter
level for basins used within the GLCP. We retain the original
HydroBASINS version 1.c identifier ("HYBAS_ID") for each
basin in the GLCP, for ease of future integration with existing
HydroBASINS attributes, such as distance from basin outlet to next
downstream sink and indicators of endorheic basins.
Hereafter, HydroBASINS polygons are referred to as basins.
Surface water extent
For changes in lake surface water area over time, we used the Joint
Research Centre (JRC) Global Surface Water Dataset described in Pekel
et al.2 , which leveraged LANDSAT imagery (30 meter resolution) from
March 1984 through October 2015 to identify changes in surface water
area for lakes, rivers, streams, and wetlands. The data are hosted by
the European Commission JRC and are formally referred to as the Global
Surface Water Dataset. Hereafter, we use the abbreviation
"JRC" to refer to this dataset.
The JRC data subsetted for Yearly Water Classification History v1.0
(1984-2015) are publicly available through Google Earth Engine as
raster images. Each image contains a "waterClass" band with
the following values: 0 = no observations, 1 = not water, 2 = seasonal
water (defined as water that is present for at least one month but not
an entire year), 3 = permanent water (defined as water that is present
for all twelve months). We calculated "total water" as the
sum of seasonal and permanent water pixels. For more detailed
information on the complete LANDSAT processing workflow used to create
the JRC, Pekel et al.2 provides a methodology of how waterClasses were
assigned based on raw LANDSAT data.
While the JRC dataset is the most extensive global surface water
dataset available to date, it is limited by the LANDSAT data from
which it is derived. Even though LANDSAT coverage began in 1984,
portions of northeastern Siberia (Kolyma region and Central Siberian
plateau) as well as central Greenland were not included in totality
until 1999. Given the potential for lakes in this area to have
inaccurate area measurements prior to 1999, we calculated ratios of
"no data" pixel areas to "not water",
"seasonal water", "permanent water", and
"total water" pixel areas. This ratio will enable future
users to set desired thresholds of "no data" coverage that
are specific to their research questions. Because of the computational
demand to produce these ratios for each lake and year, the ratios are
provided within a secondary .csv file
("JRC_all_no_data_proportions_yearly_95thru15.csv") and can
be merged efficiently with the full GLCP using the provided R script
("combine_data_availability_metrics_with_glcp.R").
Additionally, the JRC is limited by the water identification algorithm
used by Pekel et al.2, which divided pixels into water, land, or
non-valid observations, where non-valid observations may include snow
and ice. This system therefore does not classify permanently frozen
lakes as water. Seasonally frozen lakes, however, would be coded as
entirely seasonal water.
Population estimates
We used the Gridded Population of the World (GPW) version 37 for 1995
and GPW version 48 un-adjusted population count data for 2000, 2005,
2010, and 2015 population estimates. Resolution for the GPW version 3
is 2.5 arc-minutes and is available for download from NASA's
Socioeconomic Data and Applications Center (SEDAC). Resolution for GPW
version 4 is 30 arc-seconds and is currently hosted on Google Earth
Engine. Detailed methodology for the development of these datasets is
available in Doxsey-Whitfield et al.9.
Climate data
We used the Modern-Era Retrospective analysis for Research and
Applications, Version 2 (MERRA-2)10 as the source for climate data.
Both precipitation and temperature datasets11 were hourly aggregates
with original spatial resolution of 0.5 x 0.625 decimal degrees. From
these broader datasets, we extracted the variables PRECTOTCORRLAND
(total precipitation land; bias corrected; in kg m-2 s-1, or
volumetrically, mm s-1) for precipitation and T2MMEAN (2-meter air
temperature in K) for temperature. These subsets were exported from
NASA Goddard Earth Sciences Data (GES) and Information Services Center
(DISC) in a netCDF format for local analysis.
Data harmonization process
As this project involved harmonizing multiple global datasets at
different scales, certain lakes were excluded for methodological
reasons. Here we detail the exact steps taken to harmonize the JRC,
climate, and human population datasets described above.
Step 1: Calculate lake surface area
Lake surface area for each lake from 1995 to 2015 was calculated using
Google Earth Engine12. Lake polygons were uploaded and imported into
Earth Engine as shapefiles. These lake polygons, which represent
typical shape and area for individual lakes, were buffered to allow
calculations to account for increases in lake area beyond the
HydroLAKES polygon borders as specified in HydroLAKES. Pixel area
within each buffered lake boundary was then summed for each waterClass
category (e.g., no data, not water, seasonal water, or permanent
water). Resulting area values were exported in .csv format to Google
Drive, then downloaded for local analysis using R13. Commented Google
Earth Engine code for lake area calculations
("jrc_water_class_sum.txt") and the R script for data
aggregation ("01_import_format_JRC.R") are available on the
Environmental Data Initiative (EDI) GLCP repository within the folder
"glcp.tar.gz".
To evaluate how lake waterClass areas fluctuated with various buffer
sizes, we calculated lake waterClass areas for 1995, 2000, 2005, 2010,
and 2015 with 30 m, 60 m, 90 m, and 120 m buffers. Preliminary tests
indicated smaller buffers were insufficient to capture large area
increases, while larger buffers increased risk of overlapping
neighboring lakes (especially in dense lake areas), smaller ponds, or
input/output rivers and erroneously increasing lake area totals. Our
analysis of waterClass areas between buffer sizes and years suggested
90 m as the most appropriate distance. Additional details are provided
in the "Technical Validation" section.
The first step of the workflow identified a marginal fraction of lakes
to exclude from the final data product. One lake in North America was
identified as having a broken geometry (Hylak_id = 109424), making it
incompatible with Earth Engine-based analyses. Rather than attempt to
repair the lake shapefile boundaries and potentially change the size
and shape, we chose not to include this lake. Additionally, a number
of lakes were identified to be outside the range of reliable LANDSAT
data. The available JRC data has a maximum extent of 80degN, while
Pekel et al.2 note that LANDSAT images above 78degN are sparse,
partially due to the short LANDSAT observation season in high northern
latitudes. As such, we limited further processing to lakes whose
entire extent is below 78degN, which excluded 3,220 lakes (0.23% of
original 1.42 million lakes).
Step 2: Match lakes with basins
As HydroBASINS is derived from river networks, the HydroLAKES and
HydroBASINS data do not come with a pre-existing 1:1 matching scheme
for lake and basin. We performed spatial joins for the lake shapefiles
and basin shapefiles, wherein lakes whose boundaries fell within a
sub-basin boundary were tagged with that basin identifier (both basin
ID and Pfafstetter level). Because small lakes and their associated
basins are necessarily nested within subsequently larger basins, we
retained only the basin identifier at the highest Pfafstetter level
(i.e., smallest basin size) for each lake. For example, any lake that
falls within a Level 12 basin was tagged with the Level 12 basin
identifier, because no smaller sub-basins were available. Any lake
that is completely contained in a Level 11 basin (but not a Level 12
basin) was tagged with a Level 11 basin identifier. Approximately 85%
of the 1,422,499 GLCP lakes were matched at the Level 12 Pfafstetter
level (Table 1), the smallest basin level available. The highest level
Pfafstetter basin used was Level 2 (Level 1 being near
continent-level), which was sufficient to capture the watersheds of
very large lakes, such as the Laurentian Great Lakes and the Caspian
Sea. This basin matching procedure was performed within Google Earth
Engine ("hylak_hybasin_matching.txt") and outputs were
aggregated locally using R ("07_lake_basin_matching.R").
Using this lake/basin matching procedure, 1,949 lakes (0.14% of the
original 1.42 million HydroLAKES lakes) were unable to be properly
associated with a basin. Manual investigation showed that these lakes
were either located on islands (645 lakes, 0.05% of the original
HydroLAKES) or only associated with only a Level 1 basin (1,304 lakes,
0.09% of the original HydroLAKES). Lakes located on islands would be
excluded from the GLCP because their natural basins would not be
included in the continental basin schema that HydroBASINs employs.
Similarly, the 1,304 lakes associated with Level 1 basins were located
along the boundary of each nested basin level. This peculiarity is
largely because HydroBASINS is constructed for river networks, as
opposed to lakes. Because it is unrealistic for these 1,304 lakes to
be influenced by near continental-scale climate and human population
forcings, we excluded these lakes from further processes.
Step 3: Calculate basin-level precipitation and temperature estimates
Once basins were associated with lakes, basin-level climate values
were calculated. Within the R environment, bulk precipitation values
were converted to bulk volume precipitation by aggregating hourly data
for each gridcell for each year14,15. We also derived the average
monthly volume of precipitation for each grid cell for each year
(1995-2015) by taking the mean of each year's total monthly
precipitation volumes ("summing_hourly_data_precip.R").
Temperature values were similarly used to derive an average monthly
temperature for each year ("summing_hourly_data_temp.R").
The resulting yearly data were saved as rasters. Within ArcGIS16,
yearly total precipitation, average monthly total precipitation, and
temperature rasters (1995-2015) were resampled at 1/10th cell size.
The original rectangular grids were converted to squares, with spatial
resolution 0.05 x 0.05 decimal degrees through a bilinear
interpolation resampling. Because MERRA-2 grid cells are originally
sized at 0.5 x 0.625 decimal degree resolution, the conversion to a
raster format induced extra space (e.g., 90.25degN in raster). As
such, resampled rasters were clipped to 90degN/S and 180degW/E and
converted to geotiff format for upload to Google Earth Engine.
Basin-level average and total precipitation as well as average
temperature were calculated for each year of interest in Earth Engine,
exported as .csv files to Google Drive, and then downloaded for local
analysis using R. R scripts for data aggregation of climate variables
("04_post_gee_processing_temp.R ",
"05_post_gee_processing_precip_sum.R",
"06_post_gee_processing_precip_average.R") are available on
the EDI GLCP repository within the entity "glcp.tar.gz".
Preliminary analysis of climate data resulted in 10 matched basins
with NA values for climate variables. These 10 basins were associated
with 19 lakes and were removed because their climate data was
incomplete. Closer investigation revealed that these basins were
located generally at higher northern latitudes in Canada, the United
States, and Russia. We note that other temperature and precipitation
datasets are available; subsequent work is welcome to test alternative
climate data sources to match with these basins.
Step 4: Calculate basin-level population estimates
While other data sources in this project are annual, the global
population data we used, which was the best available at the global
scale, was for 5-year increments (1995, 2000, 2005, 2010, 2015).
Rather than interpolating the intervening years' values, we chose to
leave these blank so that future researchers can personalize
statistical methodology to best address these data gaps in context of
a specific question.
GPWv3 (1995 data) was converted to a geotiff and imported into Earth
Engine. GPWv4 (2000, 2005, 2010, and 2015) rasters were already
available through the Earth Engine interface. Basin-level population
totals were calculated using GPWv3 and GPWv4 data, exported as .csv
files to Google Drive, and downloaded for local analysis within the R
environment. R scripts for data aggregation of population counts
("02_load_shp_GPWv3.R", "03_load_shp_GPWv4.R") are
available on the EDI GLCP repository within the entity
"glcp.tar.gz".
Step 5: Merge lake- and basin-level data
Lake- and basin-level output were merged within the R environment. The
R script for GLCP production
("08_cleaning_glcp_production.R") is available in the EDI
GLCP repository within the entity "glcp.tar.gz".
Synopsis of data harmonization procedures:
A graphical summary of the harmonization process is provided in Figure
1.
The final GLCP dataset contains 1,422,499 lakes. These lakes were used
to generate all summary results reported below.
Code availability
All Google Earth Engine and R scripts are available from the
Environmental Data Initiative GLCP repository within the entity
"glcp.tar.gz". The EDI GLCP repository also includes a
standardized file structure that can be downloaded and run locally
with little alteration of the original R scripts if users wish to
reproduce the GLCP.