Arrow Output
Overview
By default, SlideRule returns all processing results in a native (i.e. custom to SlideRule) format as soon as they are generated. Those results are streamed back to the client and used by the client to construct a (Geo)DataFrame that is presented to the user. But sometimes it is desirable to have SlideRule build a (Geo)DataFrame on the server, and then stream that dataframe back to the client for easy reconstruction. This could be because the dataframe is quite large and the environment the client is running in does not have the resources to build the dataframe. Or it could be that the results need to be stored directly in an S3 bucket and having the dataframe already built expedites that process.
To support this functionality, SlideRule uses the Apache Arrow library to build dataframes in Parquet, CSV, or Feather format. When using Parquet, the server can also follow the GeoParquet convention to populate a geometry column and metadata compatible with GeoPandas.
Parameters
To control writing the data to an Arrow-supported format, the output parameter is used.
output
: settings to control how SlideRule outputs results
    path
    : the full path and filename of the file to be constructed by the client, NOTE - the path MUST BE less than 128 characters
    format
    : the format of the file constructed by the servers and sent to the client (currently, only GeoParquet is supported, specified as “parquet”)
    open_on_complete
    : boolean; if true then the client is to open the file as a DataFrame once it is finished receiving it and writing it out; if false then the client returns the name of the file that was written
    as_geo
    : if the parquet format is specified, write the data compliant with the GeoParquet specification
    with_checksum
    : include a checksum of the returned file in the response
    with_validation
    : run the Apache Arrow validation routine on the resulting file before returning it to the user
    region
    : AWS region when the output path is an S3 bucket (e.g. “us-west-2”)
    asset
    : the name of the SlideRule asset from which to get credentials for the optionally supplied S3 bucket specified in the output path
    credentials
    : the AWS credentials for the optionally supplied S3 bucket specified in the output path
        aws_access_key_id
        : AWS access key id
        aws_secret_access_key
        : AWS secret access key
        aws_session_token
        : AWS session token
parms = {
    "output": { "path": "grandmesa.parquet", "format": "parquet", "open_on_complete": True }
}
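Because the path must be shorter than 128 characters, it can be convenient to check the limit on the client before a request is sent. The following is a minimal sketch of that idea; `build_output_parms` is a hypothetical helper written for this illustration and is not part of the SlideRule client.

```python
def build_output_parms(path, fmt="parquet", open_on_complete=True, as_geo=None):
    """Hypothetical helper: assemble the "output" block of a request."""
    # Documented limit: the path must be less than 128 characters.
    if len(path) >= 128:
        raise ValueError(
            f"output path is {len(path)} characters; it must be less than 128"
        )
    output = {"path": path, "format": fmt, "open_on_complete": open_on_complete}
    # as_geo is described only for the parquet format, so attach it only there.
    if as_geo is not None and fmt == "parquet":
        output["as_geo"] = as_geo
    return output


# Reproduces the "output" block from the example above.
parms = {"output": build_output_parms("grandmesa.parquet")}
```

Failing early on the client avoids submitting a request that the path limit would make invalid.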
S3 Output to User Bucket
SlideRule supports writing the output to an S3 bucket instead of streaming the output back to the client. To enable this behavior, the output.path field must start with “s3://” followed by the bucket name and object key. For example, if you wanted the result to be written to a file named “grandmesa.parquet” in your S3 bucket “mybucket”, in the subfolder “maps”, then the output.path would be “s3://mybucket/maps/grandmesa.parquet”. When writing to S3, the user must supply the necessary credentials. Here is an example code snippet for writing to your own bucket.
import os
import configparser

# Setup Config Parser for Credentials
home_directory = os.path.expanduser('~')
aws_credential_file = os.path.join(home_directory, '.aws', 'credentials')
config = configparser.RawConfigParser()

# Read AWS Credentials
config.read(aws_credential_file)

# Populate Output Parameters
# (parms is the request parameter dictionary, as in the example above)
parms["output"] = {
    "path": "s3://mybucket/myfile.parquet",
    "format": "parquet",
    "open_on_complete": False,
    "region": "us-west-2",
    "credentials": {
        "aws_access_key_id": config.get('default', 'aws_access_key_id'),
        "aws_secret_access_key": config.get('default', 'aws_secret_access_key'),
        "aws_session_token": config.get('default', 'aws_session_token')
    }
}
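The “s3://” prefix, bucket name, and object key described above can be assembled with a small helper rather than by hand. This is only a sketch; `make_s3_path` and `is_s3_path` are hypothetical names introduced here, not SlideRule functions.

```python
def make_s3_path(bucket, key):
    """Hypothetical helper: join a bucket name and object key into an output.path value."""
    # Strip stray slashes so the result has exactly one separator between parts.
    return "s3://" + bucket.strip("/") + "/" + key.lstrip("/")


def is_s3_path(path):
    """Return True when the path uses the prefix that selects S3 output."""
    return path.startswith("s3://")


# The bucket "mybucket", subfolder "maps", file "grandmesa.parquet" from the text.
path = make_s3_path("mybucket", "maps/grandmesa.parquet")
```

The resulting string is the value that would be placed in the "path" field of the output parameters.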
S3 Staging
SlideRule also supports writing the output to its own S3 bucket for times when temporary storage is needed and the user does not have access to a bucket they own.
To use this feature, the following parameters can be used:
"output": {
    "asset": "sliderule-stage",
    "path": "myfile.parquet",
    "open_on_complete": False,
}
The sliderule-stage asset tells SlideRule to stage the output in SlideRule’s own bucket. The full path of the file is then returned to the user so that the user can later open the file directly. For instance, in the above example, a call to icesat2.atl06p could return “s3://sliderule-public/myfile.parquet”. If no path is specified, the server generates a random file name and uses it to store the results. The generated file name, including the path, is returned to the user.
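Since the staged location comes back as a full “s3://” path, a client that wants to fetch the object later may need the bucket and key separately. The sketch below splits such a path using only the standard library; `split_s3_uri` is a hypothetical helper, and the path shown is the illustrative one from the text, not a guaranteed return value.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Hypothetical helper: split "s3://bucket/key" into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 path: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Example path as it might be returned when staging with "sliderule-stage".
bucket, key = split_s3_uri("s3://sliderule-public/myfile.parquet")
```

The bucket and key can then be handed to whichever S3 client is available in your environment.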
There are a couple of constraints to using this feature:
To access the file staged in the SlideRule-owned bucket, the user must have read access to the bucket. Typically, this is only provided through partnered organizations like CryoCloud. When running on CryoCloud, a user can specify the “sliderule-stage” asset and know that the returned file path is immediately accessible from their environment.
Files are only stored for a short period of time before they are automatically deleted. Typically, the retention time is 2 weeks.