Returning Data from SlideRule in the GeoParquet Format

2022-12-23

Overview

This tutorial walks you through the steps necessary to return data from SlideRule in the GeoParquet format.

Prerequisites: This walk-through assumes you have already installed the SlideRule Python client and are familiar with how to use it. See the installation instructions in the reference documentation if you need help installing the SlideRule Python client. See the Making Your First Request tutorial if you’ve never made a SlideRule processing request before.

Background

GeoParquet is a cloud-optimized format for storing geospatial datasets. It is built on top of Apache’s Parquet format and is fully compatible with all Parquet-based tools. The official specification for GeoParquet can be found here: https://github.com/opengeospatial/geoparquet.

By default, SlideRule uses its own native streaming format for serializing and deserializing data across a network. As processing results are produced by SlideRule, they are immediately transmitted over the network to the requestor. When the SlideRule Python client is being used, these results are used to construct a GeoDataFrame on the fly. While this approach has some advantages, namely low-latency responses and low memory usage on the server, it also has some drawbacks. Chief among them is the processing required to construct the DataFrame. Data returned by the SlideRule service can be thought of as rows of information. As each row is received, it is appended to the final DataFrame. But because DataFrames are columnar, each append results in costly memory allocations and data copies. Effort has been made to optimize this processing in the client, but only so much can be done, and the problem remains that it is incumbent on the client to parse, rearrange, and construct the DataFrame.
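The cost of row-by-row appends can be illustrated with a small, self-contained sketch. It does not use the SlideRule client or any DataFrame library. The ColumnTable class and its copy counter are hypothetical and only model a columnar store whose columns are contiguous blocks that must be reallocated and copied in order to grow. Real DataFrame implementations differ in detail, so treat the numbers as an illustration of the trend rather than a measurement.

```python
class ColumnTable:
    """Toy columnar table (hypothetical): each column is an immutable tuple,
    so growing it means allocating a new tuple and copying every old value."""

    def __init__(self, names):
        self.columns = {name: () for name in names}
        self.values_copied = 0  # running count of existing values moved

    def append_rows(self, rows):
        """Append a batch of rows, where each row is a dict keyed by column name."""
        for name, column in self.columns.items():
            new_values = tuple(row[name] for row in rows)
            self.values_copied += len(column)  # every existing value is copied
            self.columns[name] = column + new_values


def copies_for(n_rows, batch_size):
    """Count how many existing values are copied when n_rows arrive in batches."""
    table = ColumnTable(["lat", "lon", "h_mean"])
    rows = [{"lat": float(i), "lon": float(i), "h_mean": float(i)}
            for i in range(n_rows)]
    for start in range(0, n_rows, batch_size):
        table.append_rows(rows[start:start + batch_size])
    return table.values_copied


one_at_a_time = copies_for(1000, batch_size=1)    # grows quadratically with n_rows
all_at_once = copies_for(1000, batch_size=1000)   # nothing to copy: built in one pass
```

In this model, appending 1000 rows one at a time copies roughly 1.5 million values across three columns, while building the columns in a single pass copies none.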

For small responses (less than 1M points), this approach works well. But as responses get larger, the client is unable to keep up with the SlideRule servers, and can bottleneck the process or even crash if it runs out of memory. To address these shortcomings, SlideRule supports sending responses back as GeoParquet files. When a GeoParquet file is requested, the results of the request are built entirely on the servers as a GeoParquet file, and then the final file is streamed back to the client where it is written directly to disk. This shifts the work of parsing, rearranging, and building a DataFrame-like structure onto server-side resources. Clients can then choose to open the resulting GeoParquet file immediately, or open it at some later time with different software.

Requesting the GeoParquet Format

The “output” parameter is used to request the GeoParquet format.

Step 1: Import and initialize the SlideRule Python package for ICESat-2.

>>> from sliderule import sliderule, icesat2
>>> icesat2.init("slideruleearth.io")

Step 2: Create parameters for a typical atl06p processing request (it could also be atl03sp).

>>> grand_mesa = sliderule.toregion('grandmesa.geojson')
>>> parms = {
...     "poly": grand_mesa["poly"],
...     "srt": icesat2.SRT_LAND,
...     "cnf": icesat2.CNF_SURFACE_HIGH,
...     "len": 40.0,
...     "res": 20.0
... }

The grandmesa.geojson file used in this example can be downloaded by navigating to our downloads page; alternatively, you can create your own GeoJSON file at geojson.io.

Step 3: Specify the GeoParquet format.

>>> parms["output"] = { "path": "grandmesa.parquet", "format": "parquet", "open_on_complete": True }

The “path” parameter is the name of the local file to which the client will write the Parquet output.

The “format” parameter specifies that the GeoParquet format is requested.

The “open_on_complete” parameter controls what the client returns from a call to the atl06p or atl03sp functions. If it is true, the client opens the GeoParquet file and returns it as a GeoDataFrame. If it is false, the client returns the name of the file.
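The two possible return values can be handled uniformly with a pair of small helpers. This is a minimal sketch, not part of the SlideRule client: geoparquet_output and as_geodataframe are hypothetical names, and the sketch assumes the file name is returned as a Python string when “open_on_complete” is false. The only third-party call is geopandas.read_parquet, which is imported lazily so that geopandas is needed only when a file name actually has to be opened.

```python
def geoparquet_output(path, open_on_complete=True):
    """Build the "output" entry described above (hypothetical helper)."""
    return {"path": path, "format": "parquet", "open_on_complete": open_on_complete}


def as_geodataframe(result, reader=None):
    """Return a GeoDataFrame whether the client handed back one or a file name.

    Assumption: a string result is the name of the GeoParquet file on disk.
    `reader` is any callable that opens such a file; it defaults to
    geopandas.read_parquet.
    """
    if not isinstance(result, str):
        return result  # already opened by the client
    if reader is None:
        import geopandas  # imported only when a file has to be read
        reader = geopandas.read_parquet
    return reader(result)


# Intended use with the client (not executed here):
#   parms["output"] = geoparquet_output("grandmesa.parquet", open_on_complete=False)
#   gdf = as_geodataframe(icesat2.atl06p(parms))
```

Leaving “open_on_complete” false and opening the file later is one way to defer the cost of loading a large result until it is actually needed.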

Step 4: Issue the processing request to SlideRule.

>>> gdf = icesat2.atl06p(parms)

When the request completes (~45 seconds), the gdf variable contains all of the elevations calculated by SlideRule, and a grandmesa.parquet file exists in the directory from which Python was run.
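A quick way to confirm that the file was written is to check for the Parquet magic bytes. The sketch below uses only the standard library and relies on the fact that an unencrypted Parquet file begins and ends with the four bytes PAR1. The looks_like_parquet function is a hypothetical helper, and passing this check does not prove the file is complete or valid GeoParquet.

```python
import os


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    magic = b"PAR1"
    # Smallest conceivable file: leading magic + 4-byte footer length + trailing magic.
    if not os.path.isfile(path) or os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == magic and tail == magic


# Intended use after the request completes (not executed here):
#   looks_like_parquet("grandmesa.parquet")
```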

Step 5: Display a summary of the results.

>>> print(gdf)

	extent_id 	distance 	segment_id 	rgt 	rms_misfit 	delta_time 	gt 	dh_fit_dy 	n_fit_photons 	lat 	h_sigma 	lon 	pflags 	spot 	h_mean 	cycle 	w_surface_window_final 	dh_fit_dx 	geometry
0 	3319153005902693763 	4.331749e+06 	216001 	737 	0.328377 	2.755125e+07 	60 	0.0 	31 	38.935416 	0.117401 	-108.032116 	0 	1 	2103.522185 	1 	3.000000 	0.112198 	POINT (38.93542 -108.03212)
1 	3319153005902693767 	4.331769e+06 	216002 	737 	0.347173 	2.755125e+07 	60 	0.0 	30 	38.935596 	0.088707 	-108.032137 	0 	1 	2105.439653 	1 	3.000000 	0.097912 	POINT (38.93560 -108.03214)
2 	3319153005902693771 	4.331790e+06 	216003 	737 	0.406988 	2.755125e+07 	60 	0.0 	10 	38.935776 	0.143279 	-108.032158 	0 	1 	2106.078998 	1 	4.484487 	-0.003341 	POINT (38.93578 -108.03216)
3 	3319153005902693775 	4.331810e+06 	216004 	737 	0.297643 	2.755125e+07 	60 	0.0 	40 	38.935956 	0.085242 	-108.032179 	0 	1 	2106.949294 	1 	3.000000 	0.106623 	POINT (38.93596 -108.03218)
4 	3319153005902693779 	4.331830e+06 	216005 	737 	0.280220 	2.755125e+07 	60 	0.0 	73 	38.936135 	0.032952 	-108.032201 	0 	1 	2109.070343 	1 	3.000000 	0.100320 	POINT (38.93614 -108.03220)
... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	...
262158 	1328563070116566907 	4.354303e+06 	217125 	295 	2.831141 	1.505993e+08 	60 	0.0 	184 	39.139200 	0.208896 	-108.295364 	0 	1 	1608.092662 	17 	16.216058 	-0.506227 	POINT (39.13920 -108.29536)
262159 	1328563070116566915 	4.354343e+06 	217127 	295 	3.100837 	1.505993e+08 	60 	0.0 	166 	39.139559 	0.242325 	-108.295405 	0 	1 	1596.932221 	17 	17.165378 	-0.239357 	POINT (39.13956 -108.29541)
262160 	1328563070116566911 	4.354323e+06 	217126 	295 	2.389764 	1.505993e+08 	60 	0.0 	169 	39.139380 	0.183829 	-108.295384 	0 	1 	1600.750765 	17 	13.859626 	-0.156372 	POINT (39.13938 -108.29538)
262161 	1328563070116566919 	4.354363e+06 	217128 	295 	3.053285 	1.505993e+08 	60 	0.0 	166 	39.139739 	0.236985 	-108.295426 	0 	1 	1591.588886 	17 	17.106291 	-0.291532 	POINT (39.13974 -108.29543)
262162 	1328563070116566923 	4.354383e+06 	217129 	295 	2.411602 	1.505993e+08 	60 	0.0 	176 	39.139919 	0.181812 	-108.295448 	0 	1 	1586.793077 	17 	12.933652 	-0.179786 	POINT (39.13992 -108.29545)

262163 rows × 19 columns

For a full description of all of the fields returned from the atl06p function, see the elevations documentation.