GeoPython 2022

[WORKSHOP CANCELLED] Pangeo Forge: Crowdsourcing Open Data in the Cloud
2022-06-22, 14:00–15:30, Room 2

Pangeo Forge is a new open-source platform that aims to make it easy to extract data from traditional data repositories and deposit it in cloud storage in analysis-ready, cloud-optimized (ARCO) formats. This workshop will teach users how to use Pangeo Forge and contribute to the growing, community-driven data library.


Geospatial datacubes (large, complex, interrelated multidimensional arrays with rich metadata) arise in analysis-ready geospatial imagery, level 3/4 satellite products, and especially in ocean, weather, and climate simulations and [re]analyses, where they can reach petabytes in size. The scientific Python community has developed a powerful stack for flexible, high-performance analytics of datacubes in the cloud. Xarray provides a core data model and API for analysis of such multidimensional array data. Combined with Zarr or TileDB for efficient storage in object stores (e.g. S3) and Dask for scaling out compute, these tools allow organizations to deploy analytics and machine learning solutions for both exploratory research and production in any cloud platform. Within the geosciences, the Pangeo open science community has advanced this architecture as the “Pangeo platform” (http://pangeo.io/).
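For example, analyzing an ARCO dataset with this stack typically amounts to a few lines of Python. The sketch below assumes a hypothetical public Zarr store and variable name ("sst"); it is illustrative, not a reference to a real dataset.

    import xarray as xr

    # Lazily open a Zarr store in object storage; chunks="auto" wires in Dask,
    # so no data is read until a computation is actually requested.
    # (Reading s3:// paths requires s3fs/fsspec to be installed.)
    ds = xr.open_zarr(
        "s3://example-bucket/sea-surface-temperature.zarr",  # placeholder path
        storage_options={"anon": True},
        chunks="auto",
    )

    # Operations build a lazy Dask task graph; .compute() triggers the parallel read.
    monthly_mean = ds["sst"].resample(time="1MS").mean()
    result = monthly_mean.compute()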

However, there is a major barrier preventing the community from easily transitioning to this cloud-native way of working: the difficulty of bringing existing data into the cloud in analysis-ready, cloud-optimized (ARCO) format. Typical workflows for moving data to the cloud currently consist of either bulk transfers of files into object storage (with a major performance penalty on subsequent analytics) or bespoke, case-by-case conversions to cloud-optimized formats such as TileDB or Zarr. The high cost of this toil is preventing the scientific community from realizing the full benefits of cloud computing. More generally, the products of the work of preparing scientific data for efficient analysis are rarely shared in an open, collaborative way.

To address these challenges, we are building Pangeo Forge (https://pangeo-forge.org/), the first open-source, cloud-native ETL (extract/transform/load) platform focused on multidimensional scientific data. Pangeo Forge consists of two main elements. The first is an open-source Python package, pangeo_forge_recipes (https://github.com/pangeo-forge/pangeo-forge-recipes), which makes it simple for users to define “recipes” for extracting many individual files, combining them along arbitrary dimensions, and depositing ARCO datasets into object storage. These recipes can be “compiled” to run on many different distributed execution engines, including Dask, Prefect, and Apache Beam. The second element is an orchestration backend that integrates tightly with GitHub as a continuous-integration-style service.
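To give a flavor of the recipe-writing step, here is a minimal sketch in the style of the pangeo_forge_recipes API as documented around this time (names and signatures may differ between versions); the source URL template, date range, and chunk sizes are purely illustrative.

    import pandas as pd
    from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
    from pangeo_forge_recipes.recipes import XarrayZarrRecipe

    # One source file per day over a hypothetical year of data.
    dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")

    def make_url(time):
        # Map one value of the concat dimension to the URL of one source file.
        return f"https://data.example.org/sst/{time:%Y/%m/%d}.nc"

    # Describe how the source files tile the "time" dimension...
    pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

    # ...and how they should be combined and rechunked into a single Zarr store.
    recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 30})

The same recipe object is what gets “compiled” for a particular execution engine (Dask, Prefect, or Apache Beam) and run, either locally for testing or by the Pangeo Forge backend.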

We are using Pangeo Forge to populate a multi-petabyte-scale shared library of open-access, analysis-ready, cloud-optimized ocean, weather, and climate data spread across a global federation of public cloud storage: not a “data lake” but a “data ocean”. Inspired directly by the success of Conda Forge, we aim to leverage the enthusiasm of the open science community to turn data preparation and cleaning from a private chore into a shared, collaborative activity. Because ARCO datasets are created only via version-controlled recipe feedstocks (GitHub repos), the library also maintains complete provenance tracking for all of its data.

You will leave this workshop with a clear understanding of how to access this data library, craft your own Pangeo Forge recipe, and become a contributor to our growing collection of community-sourced recipes.
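As a preview of the first of those steps, reading a dataset from the library requires nothing beyond xarray and a store URL; the path below is a placeholder, and real store URLs can be found via the Pangeo Forge website (https://pangeo-forge.org/).

    import xarray as xr

    # Open a library dataset directly over HTTPS; the store URL is hypothetical.
    ds = xr.open_zarr(
        "https://storage.example.org/pangeo-forge/global-sst.zarr",  # placeholder
        chunks="auto",
    )
    print(ds)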