Data Package Metadata View Summary

Global lake area, climate, and population dataset

General Information

Data Package:
Local Identifier:	edi.394.6
Title:	Global lake area, climate, and population dataset
Alternate Identifier:	DOI PLACE HOLDER
Abstract:	An increasing population in conjunction with a changing climate necessitates a detailed understanding of water abundance at multiple spatial and temporal scales. Remote sensing has provided massive data volumes to track fluctuations in water quantity, yet contextualizing water abundance with other local, regional, and global trends remains challenging because combining multiple data sources into analysis-friendly formats often requires large computational resources. To bridge this gap and facilitate future freshwater research opportunities, we harmonized existing global datasets to create the Global Lake area, Climate, and Population (GLCP) dataset. The GLCP is a compilation of lake surface area for 1.42+ million lakes and reservoirs of at least 10 ha in size from 1995 to 2015 with co-located basin-level temperature, precipitation, and population data. The GLCP was created with FAIR (findable, accessible, interoperable, reusable) data principles in mind and retains unique identifiers from parent datasets to expedite interoperability. The GLCP offers critical data for basic and applied investigations of lake surface area and water quantity at local, regional, and global scales.
Publication Date:	2020-04-24

Time Period

Begin:

1995-01-01

End:

2015-10-31

People and Organizations
Contact:	Labou, Stephanie G (Center for Environmental Research, Education, & Outreach, Washington State University)
Contact:	Meyer, Michael F (School of the Environment, Washington State University)
Contact:	Brousil, Matthew R (Center for Environmental Research, Education, & Outreach, Washington State University)
Contact:	Cramer, Alli N (School of the Environment, Washington State University)
Contact:	Luff, Bradley T (School of the Environment, Washington State University)
Creator:	Labou, Stephanie G (Center for Environmental Research, Education, & Outreach, Washington State University)
Creator:	Meyer, Michael F (School of the Environment, Washington State University)
Creator:	Brousil, Matthew R (Center for Environmental Research, Education, & Outreach, Washington State University)
Creator:	Cramer, Alli N (School of the Environment, Washington State University)
Creator:	Luff, Bradley T (School of the Environment, Washington State University)

Data Entities
Data Table Name:	glcp.csv
Description:	lake area, climate, and population data
Data Table Name:	JRC_all_no_data_proportions_yearly_95thru15.csv
Description:	data availability metrics
Other Name:	combine_data_availability_metrics_with_glcp.R
Description:	Combines GLCP with data quality metrics
Other Name:	glcp_scripts.tar.gz
Description:	scripts only

Detailed Metadata

Data Entities

Data Table


Data:	https://pasta-s.lternet.edu/package/data/eml/edi/394/6/b3af2a6d3205ede2469d6d6ba410c101
Name:	glcp.csv
Description:	lake area, climate, and population data
Number of Records:	2000
Number of Columns:	15

Table Structure

Object Name:

glcp.csv

Size:

381000 bytes

Authentication:

bf417b7926e335c5ce767008e3b40abd Calculated By MD5

Text Format:

Number of Header Lines:

Record Delimiter:

\r\n

Orientation:

column

Simple Delimited:

Field Delimiter:	,
Quote Character:	"

Table Column Descriptions

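Column Name:	year
Definition:	Year, spans 1995-2015. Note that for the purposes of these data, 2015 ends on October 31.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit nominalYear; Type natural; Min 1995; Max 2015
Missing Value Code:	NA (not available)

Column Name:	Hylak_id
Definition:	HydroLAKES unique identifier of lake. Preserved from HydroLAKES input data to enable future merge with HydroLAKES attributes.
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: HydroLAKES unique identifier of lake. Preserved from HydroLAKES input data to enable future merge with HydroLAKES attributes.
Missing Value Code:	NA (not available)

Column Name:	centr_lat
Definition:	Lake centroid latitude.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit degree; Type real; Min -50.220863589; Max 74.535450083
Missing Value Code:	NA (not available)

Column Name:	centr_lon
Definition:	Lake centroid longitude.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit degree; Type real; Min -160.685355169; Max 111.290672397
Missing Value Code:	NA (not available)

Column Name:	continent
Definition:	Continent on which lake is located (from HydroLAKES dataset).
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: Continent on which lake is located (from HydroLAKES dataset).
Missing Value Code:	NA (not available)

Column Name:	country
Definition:	Country in which lake is located (from HydroLAKES dataset).
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: Country in which lake is located (from HydroLAKES dataset).
Missing Value Code:	NA (not available)

Column Name:	bsn_lvl
Definition:	Pfafstetter level of basin associated with lake.
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: Pfafstetter level of basin associated with lake.
Missing Value Code:	NA (not available)

Column Name:	HYBAS_ID
Definition:	HydroBASINS unique identifier of basin associated with lake. Preserved from HydroBASINS input data to enable future merge with HydroBASINS attributes.
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: HydroBASINS unique identifier of basin associated with lake. Preserved from HydroBASINS input data to enable future merge with HydroBASINS attributes.
Missing Value Code:	NA (not available)

Column Name:	mean_monthly_precip_mm
Definition:	Mean monthly basin-level precipitation
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit millimeter; Type real; Min 3.18527676112539; Max 230.822495540574
Missing Value Code:	NA (not available)

Column Name:	total_precip_mm
Definition:	Annually accumulated basin-level precipitation
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit millimeter; Type real; Min 59798.4922602635; Max 465717963.475415
Missing Value Code:	NA (not available)

Column Name:	mean_annual_temp_k
Definition:	Mean annual basin-level temperature
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit kelvin; Type real; Min 258.300686420877; Max 301.73118890012
Missing Value Code:	NA (not available)

Column Name:	pop_sum
Definition:	Total basin-level human population. Note that this column only has valid values for 1995, 2000, 2005, 2010, and 2015.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit number; Type real; Min 0; Max 123176226.413027
Missing Value Code:	NA (not available)

Column Name:	seasonal_km2
Definition:	Water area of seasonal water, as defined by Pekel et al. 2016.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit squareKilometers; Type real; Min 0; Max 4861.72457627024
Missing Value Code:	NA (not available)

Column Name:	permanent_km2
Definition:	Water area of permanent water, as defined by Pekel et al. 2016.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit squareKilometers; Type real; Min 0; Max 354482.304159207
Missing Value Code:	NA (not available)

Column Name:	total_km2
Definition:	Calculated total water as the sum of seasonal and permanent water.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit squareKilometers; Type real; Min 0; Max 359344.028735477
Missing Value Code:	NA (not available)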

Accuracy Report:

Accuracy Assessment:

Coverage:

Methods:

Data Table


Data:	https://pasta-s.lternet.edu/package/data/eml/edi/394/6/c3ca4da728dcdaca1176fc52a641bd0e
Name:	JRC_all_no_data_proportions_yearly_95thru15.csv
Description:	data availability metrics
Number of Records:	2000
Number of Columns:	8

Table Structure

Object Name:

JRC_all_no_data_proportions_yearly_95thru15.csv

Size:

204720 bytes

Authentication:

6ff8f96267d08e473bdf2f6aaed14341 Calculated By MD5

Text Format:

Number of Header Lines:

Record Delimiter:

\r\n

Orientation:

column

Simple Delimited:

Field Delimiter:	,
Quote Character:	"

Table Column Descriptions

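Column Name:	Hylak_id
Definition:	HydroLAKES unique identifier of lake. Preserved from HydroLAKES input data to enable future merge with HydroLAKES attributes.
Storage Type:	string
Measurement Type:	nominal
Measurement Values Domain:	Definition: HydroLAKES unique identifier of lake. Preserved from HydroLAKES input data to enable future merge with HydroLAKES attributes.

Column Name:	year
Definition:	Year, spans 1995-2015. Note that for the purposes of these data, 2015 ends on October 31.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit nominalYear; Type natural; Min 1995; Max 1995

Column Name:	no_obvs_km2
Definition:	Area of no data, as defined by Pekel et al. 2016.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit squareKilometers; Type real; Min 0; Max 172684.579419413

Column Name:	not_water_km2
Definition:	Area of not water, as defined by Pekel et al. 2016.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit squareKilometers; Type real; Min 0; Max 3472.50779594895

Column Name:	no_data_to_not_water
Definition:	Calculated ratio of no_data_km2 derived area to not_water_km2 derived area.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit number; Type real; Min 0; Max 375014.821894353

Column Name:	no_data_to_seasonal
Definition:	Calculated ratio of no_data_km2 derived area to seasonal_km2 derived area.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit number; Type real; Min 0; Max 358456.45166867

Column Name:	no_data_to_permanent
Definition:	Calculated ratio of no_data_km2 derived area to permanent_km2 derived area.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit number; Type real; Min 0; Max 958982.899537222

Column Name:	no_data_to_total
Definition:	Calculated ratio of no_data_km2 derived area to total_km2 calculated area.
Storage Type:	float
Measurement Type:	ratio
Measurement Values Domain:	Unit number; Type real; Min 0; Max 845491.826774098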

Missing Value Code:

Code	Inf
Expl	infinite
Code	NA
Expl	Not Available

Code	Inf
Expl	infinite
Code	NA
Expl	Not Available

Code	Inf
Expl	infinite
Code	NA
Expl	Not Available

Code	Inf
Expl	infinite
Code	NA
Expl	Not Available

Accuracy Report:

Accuracy Assessment:

Coverage:

Methods:

Non-Categorized Data Resource

Name:

combine_data_availability_metrics_with_glcp.R

Entity Type:

unknown

Description:

Combines GLCP with data quality metrics

Physical Structure Description:

Object Name:

combine_data_availability_metrics_with_glcp.R

Size:

794 bytes

Authentication:

7bb37b6fdb57df825a93970abf80a256 Calculated By MD5

Externally Defined Format:

Format Name:

unknown

Data:

https://pasta-s.lternet.edu/package/data/eml/edi/394/6/fcf5b281665990a6943e59abae62c275

Non-Categorized Data Resource

Name:

glcp_scripts.tar.gz

Entity Type:

unknown

Description:

scripts only

Physical Structure Description:

Object Name:

glcp_scripts.tar.gz

Size:

76845 bytes

Authentication:

d00604e08ccd57ec718bd422eb1589e2 Calculated By MD5

Externally Defined Format:

Format Name:

unknown

Data:

https://pasta-s.lternet.edu/package/data/eml/edi/394/6/b03a3f6b47a440f9f10c23c52be3f056

Data Package Usage Rights

This information is released under the Creative Commons license - Attribution - CC BY (https://creativecommons.org/licenses/by/4.0/). The consumer of these data (Data User herein) is required to cite it appropriately in any publication that results from its use. The Data User should realize that these data may be actively used by others for ongoing research and that coordination may be necessary to prevent duplicate publication. The Data User is urged to contact the authors of these data if any questions about methodology or results occur. Where appropriate, the Data User is encouraged to consider collaboration or co-authorship with the authors. The Data User should realize that misinterpretation of data may occur if used out of context of the original study. While substantial efforts are made to ensure the accuracy of data and associated documentation, complete accuracy of data sets cannot be guaranteed. All data are made available as is.

The Data User should be aware, however, that data are updated periodically and it is the responsibility of the Data User to check for new versions of the data. The data authors and the repository where these data were obtained shall not be liable for damages resulting from any use or misinterpretation of the data. Thank you.

Keywords

By Thesaurus:
(No thesaurus)	Hydrology, lentic systems, environmental synthesis
LTER Controlled Vocabulary	lake

Methods and Protocols

These methods, instrumentation and/or protocols apply to all data in this dataset:

Methods and protocols used in the collection of this data package

Description:

To harmonize several disparate datasets, we implemented the workflow described below. For a more complete description of our workflow and quality control measures, please see the manuscript Meyer & Labou et al. (2020). When citing the GLCP, the authors recommend citing both the data descriptor publication and the data product hosted on EDI.

The authors recommend that future users work with packages similar to those used in this workflow when manipulating the GLCP. In particular, loading the GLCP into certain environments, such as R, often requires external packages. Most notably, the fread() function from the data.table package (Dowle & Srinivasan, 2019) can quickly read the GLCP into the R environment through the commands:

library(data.table)
glcp = fread(file = 'glcp.csv', header = TRUE, integer64 = 'character')
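For readers working outside R, the same precautions (keeping large integer identifiers as text and mapping the NA code to a missing value) can be sketched in Python with only the standard library. This is an illustrative sketch rather than part of the GLCP workflow: the helper name read_glcp and the two sample rows, including the identifier values, are hypothetical, while the column names come from the glcp.csv table description.

```python
import csv
import io

# Columns stored as strings in the GLCP metadata; all others are numeric.
TEXT_COLUMNS = {"Hylak_id", "continent", "country", "bsn_lvl", "HYBAS_ID"}


def read_glcp(handle):
    """Parse GLCP-style CSV rows, keeping identifiers as text and NA as None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for name, value in raw.items():
            if value == "NA":
                row[name] = None      # documented missing value code
            elif name in TEXT_COLUMNS:
                row[name] = value     # keep identifiers exactly as written
            else:
                row[name] = float(value)
        rows.append(row)
    return rows


# Hypothetical two-row sample; for the real file, pass
# open("glcp.csv", newline="") instead of the StringIO object.
sample = io.StringIO(
    "year,Hylak_id,HYBAS_ID,pop_sum,total_km2\r\n"
    "1995,12,7120034520,1500.5,2.25\r\n"
    "1996,12,7120034520,NA,2.5\r\n"
)
glcp = read_glcp(sample)
```

Reading identifiers as text serves the same purpose as integer64 = 'character' in the fread() call: the values survive a later merge with HydroLAKES or HydroBASINS attributes unchanged.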

Data Sources

Lake locations and boundaries

For the locations of lakes, we used the HydroLAKES database version 1.0 (Messager et al. 2016) (https://www.hydrosheds.org/page/hydrolakes), which incorporates multiple lake datasets (e.g., Shuttle Radar Topography Mission Water Body Data, Global Lakes and Wetlands Database) and includes 1,427,688 lakes of at least 10 hectares in surface area. The majority of HydroLAKES lakes are defined as uncontrolled lakes (99.5%), with the remainder identified as reservoirs (0.47%) and controlled lakes (0.03%). HydroLAKES, which is available in the form of shapefiles, includes an extensive number of attributes for lake polygons, including lake surface area (polygon area), elevation, shoreline development, total volume, average depth, residence time, latitude and longitude of pour point, lake type, and others. The HydroLAKES v1.0 identifier (Hylak_id) is retained in the GLCP to facilitate future work making use of other HydroLAKES attributes that are not included in the GLCP. The GLCP also contains the latitude (centr_lat) and longitude (centr_lon) of each lake’s centroid, which were calculated within ArcGIS version 10.3.1 (ESRI 2015).

Hereafter, HydroLAKES lake polygons are referred to as lakes.

Basins

Because lakes are products of the landscapes in which they reside, we calculated climate and population values relative to each lake’s basin. To identify basin boundaries, we used the HydroBASINS dataset, a basin-level analog to the HydroLAKES dataset. The HydroBASINS version 1.c format 1 database (Lehner and Grill 2013) includes 3,786,218 unique basins and is derived from the HydroSHEDS database (Lehner et al. 2008), which uses 15 arc-second resolution data to identify river basins, watersheds, and sub-basins globally. In HydroBASINS, basins are identified using the Pfafstetter coding system, with Level 1 as the highest level (i.e., continent level) and Level 12 as the smallest available sub-basin. Table 1 details the number and median size of basins within each Pfafstetter level for basins used within the GLCP. We retain the original HydroBASINS version 1.c identifier (HYBAS_ID) for each basin in the GLCP, for ease of future integration with existing HydroBASINS attributes, such as distance from basin outlet to next downstream sink and indicators of endorheic basins.

Hereafter, HydroBASINS polygons are referred to as basins.

Surface water extent

For changes in lake surface water area over time, we used the Joint Research Centre (JRC) Global Surface Water Dataset described in Pekel et al. (2016), which used LANDSAT imagery (30 meter resolution) from March 1984 through October 2015 to identify changes in surface water area for lakes, rivers, streams, and wetlands. The data are hosted by the European Commission JRC. Hereafter, we use the abbreviation JRC to refer to this dataset.

The JRC Yearly Water Classification History v1.0 (1984-2015) data are publicly available through Google Earth Engine (GEE) as annually aggregated raster images. Each image contains a waterClass band with the following values: 0 = no observations, 1 = not water, 2 = seasonal water (defined as water that is present for at least one month but not an entire year), 3 = permanent water (defined as water that is present for all twelve months). For more detailed information on the complete LANDSAT processing workflow used to create the JRC, Pekel et al. (2016) describe how waterClass values were assigned from raw LANDSAT data.

While the JRC dataset is the most extensive global surface water dataset available to date, it is limited by the LANDSAT data from which it is derived. Even though LANDSAT coverage began in 1984, portions of northeastern Siberia (Kolyma region and Central Siberian plateau) as well as central Greenland were not included in totality until 1999.

Additionally, the JRC is limited by the water identification algorithm used by Pekel et al. (2016), which divided pixels into water, land, or non-valid observations, where non-valid observations may include snow and ice. This system therefore does not classify permanently frozen lakes as water. Seasonally frozen lakes, however, would be coded as entirely seasonal water.

Climate data

We used the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) (Gelaro et al. 2017) as the source for climate data. Both precipitation and temperature datasets (Global Modeling and Assimilation Office (GMAO) 2015) were hourly aggregates with original spatial resolution of 0.5 x 0.625 decimal degrees. From these broader datasets, we extracted the variables PRECTOTCORRLAND (total precipitation land; bias corrected; in kg m-2 s-1, or volumetrically, mm s-1) for precipitation and T2MMEAN (2-meter air temperature in K) for temperature. These subsets were exported from the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC) in netCDF format for local analysis.

Population estimates

We used the Gridded Population of the World (GPW) version 3 (Center for International Earth Science Information Network - CIESIN - Columbia University et al. 2005) for 1995 and GPW version 4 (Center for International Earth Science Information Network - CIESIN - Columbia University) un-adjusted population count data for 2000, 2005, 2010, and 2015 population estimates. Resolution for GPW version 3 is 2.5 arc-minutes, and the data are available for download from NASA’s Socioeconomic Data and Applications Center (SEDAC). Resolution for GPW version 4 is 30 arc-seconds, and the data are currently hosted on Google Earth Engine. Detailed methodology for the development of these datasets is available in Doxsey-Whitfield et al. (2015).

Data harmonization process

As this project involved harmonizing multiple global datasets at different resolutions, our workflow required multiple steps, each of which resulted in a cleaned data subset. Here we detail the steps taken to integrate the HydroLAKES, HydroBASINS, JRC, climate, and human population datasets described above.

Step #1: Calculate lake surface area

Lake surface area for each lake from 1995 to 2015 was calculated using Google Earth Engine (Gorelick et al. 2017). Lake polygons were uploaded and imported into Earth Engine as shapefiles. These lake polygons, which represent typical shape and area for individual lakes, were buffered by a specified distance so that water area calculations could account for increases in lake area beyond the polygon borders specified in HydroLAKES. Buffered lake polygons were then used as boundaries within which to summarize pixels from annual JRC data for each waterClass category (i.e., no data, not water, seasonal water, or permanent water). We calculated total water as the sum of seasonal and permanent water pixels. Resulting area values were exported in .csv format to Google Drive, then downloaded for local analysis using the R statistical environment (R Core Team 2019). Commented Google Earth Engine code for lake area calculations (jrc_water_class_sum.txt), the R script for formatting Google Earth Engine output (01_import_format_JRC.R) (Wickham and Henry 2018; Wickham et al. 2018; Dowle and Srinivasan 2019), and associated input data are available in the Environmental Data Initiative (EDI) GLCP repository (Labou et al.) within the entity glcp.tar.gz.
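The per-class summation described in this step can be sketched as follows. This is a simplified illustration, not the Earth Engine code in jrc_water_class_sum.txt: it assumes every pixel covers a nominal 30 m x 30 m footprint, whereas an Earth Engine reduction may use the true area of each pixel. The function name summarize_water_classes and the sample pixel list are hypothetical; the output keys reuse the column names from the two data tables.

```python
from collections import Counter

# waterClass codes of the JRC yearly classification, mapped to GLCP column names
WATER_CLASSES = {
    0: "no_obvs_km2",     # no observations
    1: "not_water_km2",   # not water
    2: "seasonal_km2",    # seasonal water
    3: "permanent_km2",   # permanent water
}

# Assumption: nominal 30 m x 30 m LANDSAT pixel, expressed in square kilometers
PIXEL_AREA_KM2 = (30 * 30) / 1_000_000


def summarize_water_classes(pixels):
    """Sum the area of each waterClass among the pixels inside one buffered lake."""
    counts = Counter(pixels)
    areas = {
        name: counts.get(code, 0) * PIXEL_AREA_KM2
        for code, name in WATER_CLASSES.items()
    }
    # Total water is the sum of seasonal and permanent water
    areas["total_km2"] = areas["seasonal_km2"] + areas["permanent_km2"]
    return areas


# Hypothetical lake: 1000 permanent, 200 seasonal, 50 not-water, 10 no-data pixels
pixels = [3] * 1000 + [2] * 200 + [1] * 50 + [0] * 10
areas = summarize_water_classes(pixels)
```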

To evaluate how lake waterClass areas fluctuated with various buffer sizes, we calculated lake waterClass areas for 1995, 2000, 2005, 2010, and 2015 with 30 m, 60 m, 90 m, and 120 m buffers. Preliminary tests indicated smaller buffers were insufficient to capture large area increases, while larger buffers increased risk of overlapping neighboring lakes (especially in dense lake areas), smaller ponds, or input/output rivers and erroneously increasing lake area totals. Our analysis of waterClass areas between buffer sizes and years indicated 90 m as the most appropriate distance. Additional details are provided in the Technical Validation section.

We identified a minority of lakes that could not be included in the final data product. One lake in North America was identified as having a broken geometry (Hylak_id = 109424), making it incompatible with Earth Engine-based analyses. Rather than attempt to repair the lake shapefile boundaries and potentially change the size and shape, we chose not to include this lake. Additionally, a small number of lakes were identified to be outside the range of reliable LANDSAT data. The available JRC data have a maximum extent of 80 degrees N, and Pekel et al. (2016) note that LANDSAT images above 78 degrees N are sparse, partially due to the short LANDSAT observation season in high northern latitudes. As such, we limited further processing to lakes whose entire extent is below 78 degrees N, which excluded 3,220 lakes (0.23% of original 1.4 million lakes).

Given the potential for lakes in regions with incomplete early LANDSAT coverage to have inaccurate area measurements prior to 1999, we calculated ratios of no data pixel areas to not water, seasonal water, permanent water, and total water pixel areas. These ratios will enable future users to set desired thresholds of no data coverage that are specific to their research questions. These ratios are provided within a secondary .csv file (JRC_all_no_data_proportions_yearly_95thru15.csv) and can be merged efficiently with the full GLCP using the provided R script (combine_data_availability_metrics_with_glcp.R).
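A minimal sketch of how such a ratio, a merge on lake and year, and a user-chosen threshold might fit together is shown below. It is not the provided R script: the function names, the sample rows, and the 0.5 threshold are hypothetical, and the handling of a zero denominator (Inf when there is no-data area, missing when both areas are zero) is an assumption modeled on the Inf and NA codes listed for the ratio columns.

```python
import math


def no_data_ratio(no_data_km2, class_km2):
    """Ratio of no-data area to another class area for one lake and year."""
    if no_data_km2 is None or class_km2 is None:
        return None
    if class_km2 == 0:
        # Assumption: follow R arithmetic, where x / 0 is Inf and 0 / 0 is undefined
        return math.inf if no_data_km2 > 0 else None
    return no_data_km2 / class_km2


def merge_on_lake_year(glcp_rows, metric_rows):
    """Left-join availability metrics onto GLCP rows by (Hylak_id, year)."""
    lookup = {(m["Hylak_id"], m["year"]): m for m in metric_rows}
    merged = []
    for row in glcp_rows:
        extra = lookup.get((row["Hylak_id"], row["year"]), {})
        merged.append({**extra, **row})
    return merged


def keep_below_threshold(rows, column, threshold):
    """Keep rows whose ratio in `column` is finite and at most `threshold`."""
    kept = []
    for row in rows:
        value = row.get(column)
        if value is not None and math.isfinite(value) and value <= threshold:
            kept.append(row)
    return kept


# Hypothetical rows for two lakes in 1995
glcp_rows = [
    {"Hylak_id": "1", "year": 1995, "total_km2": 2.0},
    {"Hylak_id": "2", "year": 1995, "total_km2": 0.0},
]
metric_rows = [
    {"Hylak_id": "1", "year": 1995, "no_obvs_km2": 0.5},
    {"Hylak_id": "2", "year": 1995, "no_obvs_km2": 1.0},
]

merged = merge_on_lake_year(glcp_rows, metric_rows)
for row in merged:
    row["no_data_to_total"] = no_data_ratio(row.get("no_obvs_km2"), row["total_km2"])

# Hypothetical threshold: keep lake-years with at most half as much no-data area as water
usable = keep_below_threshold(merged, "no_data_to_total", 0.5)
```

The appropriate threshold depends on the research question; the value used here only demonstrates the filtering step.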

Step #2: Match lakes with basins

Because HydroBASINS is derived from river networks, rather than lake pour points, the HydroLAKES and HydroBASINS data do not come with a pre-existing 1:1 matching scheme for lake and basin. To match lakes with their equivalent basins, we performed spatial joins for the lake shapefiles and basin shapefiles to identify the smallest basin that enclosed a lake in its entirety. With this basin matching scheme, there is potential for some lakes to be assigned to basins that are larger than their actual basin. As a result, future users are encouraged to compare GLCP-associated basin area to a lake’s known basin area, if those data exist. Comparing HydroBASINS river basins to known lake basins would enable researchers to determine if differences in basin assignment are meaningful for their specific research question. All lakes that fell within a Pfafstetter Level 12 basin (85% of lakes, Table 1) were tagged with the Level 12 basin identifier, because no smaller sub-basins were available. The highest level Pfafstetter basin used was Level 2 (Level 1 being near continent-level), which was sufficient to capture the watersheds of very large lakes, such as the Laurentian Great Lakes and the Caspian Sea. Of the original 3,786,218 HydroBASINS basins, 232,827 were paired with lakes (6.15%). This basin matching procedure was performed within Google Earth Engine (hylak_hybasin_matching.txt) and outputs were formatted locally using R (07_lake_basin_matching.R) (Wickham et al. 2018; Dowle and Srinivasan 2019).
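The matching rule (choose the smallest basin that encloses the lake entirely, and use no basin coarser than Level 2) can be sketched as below. To stay self-contained, the sketch replaces polygons with rectangular extents, which is a simplification and not the spatial join performed in hylak_hybasin_matching.txt. All names (Box, Basin, match_lake_to_basin) and the sample geometries are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Rectangular extent used here as a stand-in for a polygon."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


class Basin(NamedTuple):
    hybas_id: str
    level: int      # Pfafstetter level: larger number means smaller basin
    extent: Box


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (
        outer.min_lon <= inner.min_lon
        and outer.min_lat <= inner.min_lat
        and outer.max_lon >= inner.max_lon
        and outer.max_lat >= inner.max_lat
    )


def match_lake_to_basin(
    lake: Box, basins: Sequence[Basin], coarsest_level: int = 2
) -> Optional[Basin]:
    """Return the finest-level basin enclosing the lake, or None if unmatched."""
    candidates = [
        b for b in basins
        if b.level >= coarsest_level and contains(b.extent, lake)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.level)


# Hypothetical basins: one Level 1, one Level 2, and two adjacent Level 12 basins
basins = [
    Basin("A1", 1, Box(-180, -90, 180, 90)),
    Basin("A2", 2, Box(0, 0, 50, 50)),
    Basin("A12", 12, Box(10, 10, 11, 11)),
    Basin("B12", 12, Box(11, 10, 12, 11)),
]

inside = match_lake_to_basin(Box(10.2, 10.2, 10.4, 10.4), basins)
straddling = match_lake_to_basin(Box(10.9, 10.2, 11.1, 10.4), basins)
level1_only = match_lake_to_basin(Box(49.9, 10.0, 50.1, 10.1), basins)
```

In this toy setup the first lake matches its Level 12 basin, the second straddles two Level 12 basins and falls back to the enclosing Level 2 basin, and the third is enclosed only at Level 1 and is therefore left unmatched.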

Using this lake/basin matching procedure, 1,949 lakes (0.14% of the original 1.4 million HydroLAKES lakes) could not be properly associated with a basin. Manual investigation indicated that these lakes were either located on islands (645 lakes, 0.05% of the original HydroLAKES) or would be associated with only a Level 1 basin (1,304 lakes, 0.09% of the original HydroLAKES). Lakes located on islands are excluded from the GLCP because their natural basins are not included in the continental basin schema that HydroBASINS employs. Similarly, the 1,304 lakes associated with Level 1 basins were consistently located on the boundary between neighboring basins and therefore never completely enclosed in a single basin. This peculiarity is largely because HydroBASINS is constructed for river networks, as opposed to lakes. Because it is unrealistic for these 1,304 lakes (average total area: 2.39 km2) to be influenced by near continental-scale climate and human population forcings, we excluded these lakes from further processing.

Step #3: Calculate basin-level precipitation and temperature estimates

Once basins were associated with lakes, basin-level climate values were calculated. Within the R environment, precipitation values from MERRA-2 were converted to annually accumulated precipitation by aggregating hourly data for each gridcell for each year (Reichle et al. 2017a, b). We also derived the average monthly volume of precipitation for each gridcell for each year (1995-2015) by taking the mean of each year’s total monthly precipitation volumes (summing_hourly_data_precip_mm.R) (Bivand et al. 2019; Hijmans 2019; Pierce 2019; Revolution Analytics and Weston 2019). Temperature values were similarly used to derive an average annual temperature for each year (summing_hourly_data_temp_K.R) (Tierney et al. 2018; Bivand et al. 2019; Hijmans 2019; Pierce 2019; Revolution Analytics and Weston 2019). The resulting yearly data were saved as rasters. Yearly total precipitation, average monthly total precipitation, and temperature rasters (1995-2015) were then resampled at 1/10th cell size through bilinear interpolation. The original rectangular grids were converted to squares, with spatial resolution of 0.05 x 0.05 decimal degrees. Because MERRA-2 gridcells were originally sized at 0.5 x 0.625 decimal degree resolution, the initial conversion from netCDF to raster format introduced extra space (e.g., 90.25 degrees N in the raster). As such, resampled rasters were clipped to 90 degrees N/S and 180 degrees W/E and converted to geotiff format for upload to Google Earth Engine (manipulate_climate_rasters.R) (Bivand et al. 2019; Hijmans 2019; Pierce 2019; Revolution Analytics and Weston 2019).
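The precipitation aggregation for a single gridcell and year can be sketched as follows. This is not the code in summing_hourly_data_precip_mm.R: it assumes each hourly value is a mean rate over that hour in mm s-1 (equivalent to kg m-2 s-1), so that multiplying by 3600 gives the depth accumulated during the hour. The function name and the constant-rate sample series are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SECONDS_PER_HOUR = 3600


def aggregate_precipitation(hourly):
    """Return (annual total, mean of monthly totals) in mm.

    `hourly` is an iterable of (timestamp, rate_mm_per_s) pairs covering
    one gridcell and one year.
    """
    monthly_totals = defaultdict(float)
    for timestamp, rate in hourly:
        # Assumption: the rate is the hourly mean, so rate * 3600 s is mm per hour
        monthly_totals[timestamp.month] += rate * SECONDS_PER_HOUR
    if not monthly_totals:
        raise ValueError("no hourly values supplied")
    annual_total = sum(monthly_totals.values())
    mean_monthly = annual_total / len(monthly_totals)
    return annual_total, mean_monthly


# Hypothetical series: 1 mm of precipitation in every hour of 1995 (8760 hours)
start = datetime(1995, 1, 1)
hourly = [(start + timedelta(hours=h), 1 / 3600) for h in range(8760)]
total_mm, mean_monthly_mm = aggregate_precipitation(hourly)
```

With this constant series the annual total is 8760 mm and the mean of the twelve monthly totals is 730 mm, up to floating-point rounding.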

For each basin associated with a lake, basin-level average and total precipitation, as well as average temperature, were calculated for each year of interest in Earth Engine. The process was similar to the one described for lake polygons and JRC data, whereby the basins were used as boundaries from which to extract and aggregate pixels. Results were exported as .csv files to Google Drive, then downloaded for local analysis using R. R scripts for data aggregation of climate variables (04_post_gee_processing_temp.R, 05_post_gee_processing_precip_sum.R, 06_post_gee_processing_precip_average.R) (Wickham et al. 2018; Wickham 2019; Dowle and Srinivasan 2019) are available on the EDI GLCP repository (Labou et al.) within the entity glcp.tar.gz.

This process left 10 matched basins (of the original 232,837 matched basins; 0.004%), associated with 19 lakes, with missing values for climate variables. These 10 basins ranged in size from 1.1 to 181.7 km2 with a median of 76.5 km2. These basins and lakes were removed from the dataset. Manual assessment showed that these basins were located at higher northern latitudes in the United States and the Russian Federation. We note that other temperature and precipitation datasets are available; subsequent analyses can incorporate alternative climate data sources to match with these basins through the scripts and workflow provided in the EDI entity glcp.tar.gz.

Future users should also note that while HydroBASINS provided a boundary to calculate climate variables for each lake, these calculations may be overestimates, as many lakes’ actual basins may be smaller than their associated river basin. However, as with the addition of new climate variables, subsequent analyses can also incorporate different basin schemes as lake basin shapefiles become available.

Step #4: Calculate basin-level population estimates

While other data sources in this project are annual, the global population data we used, the best available at the global scale at the time, are provided in 5-year increments (1995, 2000, 2005, 2010, 2015). Rather than interpolate the intervening years’ values, we chose to leave these blank so that future researchers can tailor their statistical methodology to address these data gaps in the context of a specific question. Aside from blank values, numerous basins have population estimates as decimal values. Rather than truncate these values, which were produced through the aggregation process within Google Earth Engine, we retain them so that future users may decide how they wish to round or otherwise interpret them in the context of their particular research question.
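As one illustration of how a user might fill the intervening years, the sketch below applies simple linear interpolation between the 5-year estimates. This is only one of many possible choices and is not a method used or endorsed in the GLCP workflow; the function name and the sample values are hypothetical.

```python
def interpolate_population(pop_by_year):
    """Fill missing years between known estimates by linear interpolation.

    `pop_by_year` maps year -> population, with None for missing years.
    Years outside the range of known estimates are left unchanged.
    """
    known = sorted(
        (year, value) for year, value in pop_by_year.items() if value is not None
    )
    filled = dict(pop_by_year)
    for (y0, v0), (y1, v1) in zip(known, known[1:]):
        for year in range(y0 + 1, y1):
            filled[year] = v0 + (v1 - v0) * (year - y0) / (y1 - y0)
    return filled


# Hypothetical basin with estimates for 1995 and 2000 only
pop_sum = {1995: 1000.0, 1996: None, 1997: None, 1998: None, 1999: None, 2000: 1500.0}
filled = interpolate_population(pop_sum)
```

Interpolated values carry no independent information about the intervening years, so analyses that use them should treat them accordingly.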

For each 5-year increment, human population calculations were performed with a technique similar to the climate data aggregations. GPWv3 (1995 data) was converted to a geotiff and imported into Earth Engine. GPWv4 (2000, 2005, 2010, and 2015) rasters were available through the Earth Engine interface. Basin-level population totals were calculated from GPWv3 and GPWv4 data with basin polygons as spatial boundaries. Results were exported as .csv files to Google Drive and then downloaded for local analysis within the R environment. R scripts for data aggregation of population counts (02_load_shp_GPWv3.R, 03_load_shp_GPWv4.R) (Wickham et al. 2018; Wickham 2019; Dowle and Srinivasan 2019) are available on the EDI GLCP repository (Labou et al.) within the entity glcp.tar.gz.

Like climate variables, population calculations have the potential to be overestimated, as many lakes’ actual basins may be smaller than their associated river basin. Additionally, future users should be cognizant of heterogeneous human populations within the basin, which could skew analyses of relationships between human population and lake area. As currently calculated, population estimates are at the basin scale, whereas certain research questions may focus on population estimates within a particular distance from the lake (e.g., 500 m). However, subsequent analyses can also incorporate different basin schemes as lake basin shapefiles become available.

Step #5: Merge lake- and basin-level data

Lake- and basin-level output were merged within the R environment. The R script for GLCP production (08_cleaning_glcp_production.R) (Wickham and Henry 2018; Wickham et al. 2018; Dowle and Srinivasan 2019) is available in the EDI GLCP repository within the entity glcp.tar.gz.

Synopsis of lake exclusions during data harmonization:

The final GLCP dataset contains 1,422,499 lakes. These lakes were used to generate all dataset summary statistics reported below.
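The exclusion counts reported in Steps #1 through #3 can be tallied against this final count. The short check below uses only numbers stated in the text and assumes the exclusion groups do not overlap; the variable names and group labels are illustrative.

```python
HYDROLAKES_TOTAL = 1_427_688  # lakes in HydroLAKES v1.0

# Lakes excluded during harmonization, as reported in Steps #1 to #3
exclusions = {
    "broken geometry (Hylak_id 109424)": 1,
    "extent not entirely below 78 degrees N": 3_220,
    "located on islands": 645,
    "enclosed only by a Level 1 basin": 1_304,
    "missing basin-level climate values": 19,
}

excluded_total = sum(exclusions.values())
remaining = HYDROLAKES_TOTAL - excluded_total

for reason, count in exclusions.items():
    print(f"{reason}: {count}")
print(f"excluded: {excluded_total}, remaining: {remaining}")
# last line prints: excluded: 5189, remaining: 1422499
```

The remainder matches the 1,422,499 lakes reported for the final dataset.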

References

Bivand, R., T. Keitt, and B. Rowlingson. 2019. rgdal: Bindings for the Geospatial Data Abstraction Library.

Center for International Earth Science Information Network - CIESIN - Columbia University. Gridded Population of the World, Version 4 (GPWv4): Population Count, Revision 11.

Center for International Earth Science Information Network - CIESIN - Columbia University, United Nations Food and Agriculture Programme - FAO, and Centro Internacional de Agricultura Tropical - CIAT. 2005. Gridded Population of the World, Version 3 (GPWv3): Population Count Grid.

Dowle, M., and A. Srinivasan. 2019. data.table: Extension of data.frame.

Doxsey-Whitfield, E., K. MacManus, S. B. Adamo, L. Pistolesi, J. Squires, O. Borkovska, and S. R. Baptista. 2015. Taking advantage of the improved availability of census data: a first look at the gridded population of the world, version 4. Papers in Applied Geography 1: 226-234.

ESRI. 2015. ArcGIS Desktop: Release 10.3.1, Environmental Systems Research Institute.

Gelaro, R., W. McCarty, M. J. Suarez, and others. 2017. The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Journal of Climate 30: 5419-5454. doi:10.1175/JCLI-D-16-0758.1

Global Modeling and Assimilation Office (GMAO). 2015. MERRA-2 tavg1_2d_flx_Nx: 2d,1-Hourly,Time-Averaged,Single-Level,Assimilation,Surface Flux Diagnostics V5.12.4. Goddard Earth Sciences Data and Information Services Center (GES DISC). doi:10.5067/7MCPBJ41Y0K6

Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. 2017. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment 202: 18-27.

Hijmans, R. J. 2019. raster: Geographic Data Analysis and Modeling.

Labou, S. G., M. F. Meyer, M. R. Brousil, A. N. Cramer, and B. T. Luff. The global lake area, climate, and population (GLCP) dataset.

Lehner, B., and G. Grill. 2013. Global river hydrography and network routing: baseline data and new approaches to study the world’s large river systems. Hydrological Processes 27: 2171-2186. doi:10.1002/hyp.9740

Lehner, B., K. Verdin, and A. Jarvis. 2008. New Global Hydrography Derived From Spaceborne Elevation Data. Eos Trans. AGU 89: 93-94. doi:10.1029/2008EO100001

Messager, M. L., B. Lehner, G. Grill, I. Nedeva, and O. Schmitt. 2016. Estimating the volume and age of water stored in global lakes using a geo-statistical approach. Nature Communications 7: 13603. doi:10.1038/ncomms13603

Pekel, J.-F., A. Cottam, N. Gorelick, and A. S. Belward. 2016. High-resolution mapping of global surface water and its long-term changes. Nature 540: 418-422. doi:10.1038/nature20584

Pierce, D. 2019. ncdf4: Interface to Unidata netCDF (Version 4 or Earlier) Format Data Files.

R Core Team. 2019. R: A Language and Environment for Statistical Computing.

Reichle, R. H., C. S. Draper, Q. Liu, M. Girotto, S. P. P. Mahanama, R. D. Koster, and G. J. M. De Lannoy. 2017a. Assessment of MERRA-2 Land Surface Hydrology Estimates. Journal of Climate 30: 2937-2960. doi:10.1175/JCLI-D-16-0720.1

Reichle, R. H., Q. Liu, R. D. Koster, C. S. Draper, S. P. Mahanama, and G. S. Partyka. 2017b. Land surface precipitation in MERRA-2. Journal of Climate 30: 1643-1664.

Revolution Analytics, and S. Weston. 2019. doParallel: Foreach Parallel Adaptor for the parallel Package.

Tierney, L., A. J. Rossini, N. Li, and H. Sevcikova. 2018. snow: Simple Network of Workstations.

Wickham, H. 2019. stringr: Simple, Consistent Wrappers for Common String Operations.

Wickham, H., R. Francois, L. Henry, and K. Muller. 2018. dplyr: A Grammar of Data Manipulation.

Wickham, H., and L. Henry. 2018. tidyr: Easily Tidy Data with spread() and gather() Functions.

People and Organizations

Creators:

Individual:	Stephanie G Labou
Organization:	Center for Environmental Research, Education, & Outreach, Washington State University
Id:	https://orcid.org/0000-0001-5633-5983

Individual:	Michael F Meyer
Organization:	School of the Environment, Washington State University
Email Address:	michael.f.meyer@wsu.edu
Id:	https://orcid.org/0000-0002-8034-9434

Individual:	Matthew R Brousil
Organization:	Center for Environmental Research, Education, & Outreach, Washington State University
Email Address:	matthew.brousil@wsu.edu
Id:	https://orcid.org/0000-0001-8229-9445

Individual:	Alli N Cramer
Organization:	School of the Environment, Washington State University
Email Address:	allison.cramer@wsu.edu
Id:	https://orcid.org/0000-0002-0356-5782

Individual:	Bradley T Luff
Organization:	School of the Environment, Washington State University
Email Address:	bradley.luff@wsu.edu

Contacts:

Individual:	Stephanie G Labou
Organization:	Center for Environmental Research, Education, & Outreach, Washington State University
Id:	https://orcid.org/0000-0001-5633-5983

Individual:	Michael F Meyer
Organization:	School of the Environment, Washington State University
Email Address:	michael.f.meyer@wsu.edu
Id:	https://orcid.org/0000-0002-8034-9434

Individual:	Matthew R Brousil
Organization:	Center for Environmental Research, Education, & Outreach, Washington State University
Email Address:	matthew.brousil@wsu.edu
Id:	https://orcid.org/0000-0001-8229-9445

Individual:	Alli N Cramer
Organization:	School of the Environment, Washington State University
Email Address:	allison.cramer@wsu.edu
Id:	https://orcid.org/0000-0002-0356-5782

Individual:	Bradley T Luff
Organization:	School of the Environment, Washington State University
Email Address:	bradley.luff@wsu.edu

Temporal, Geographic and Taxonomic Coverage

Temporal, Geographic and/or Taxonomic information that applies to all data in this dataset:

Time Period

Begin:	1995-01-01
End:	2015-10-31

Geographic Region:

Description:	Global

Bounding Coordinates:

Northern:	78	Southern:	-78
Western:	-180	Eastern:	180

Project

Parent Project Information:

Title:	NSF Graduate Research Fellowship

Personnel:

Individual:	Michael F Meyer
Id:	https://orcid.org/0000-0002-8034-9434
Role:	Principal Investigator

Funding:	NSF DGE-1347973

Maintenance

Maintenance:

Description:	completed
Frequency:

