»Open source web-based tool for quality control of large spatial datasets«
2019-06-26, 14:00–14:15, Room 1
The OSS QC tool is a web-based engine for checking vector and raster datasets. This talk presents lessons learnt while developing a configurable architecture for running resource-intensive data quality control jobs in a web-based environment, using the open-source geospatial technologies Django, PyWPS, GDAL and PostGIS.
Elements of spatial data quality such as logical consistency, semantic accuracy, attribute accuracy and completeness are essential when publishing spatial datasets. Typical quality requirements cover dataset and layer naming conventions, spatial reference system, attribute fields and values, geometry, minimum mapping unit, minimum mapping width, completeness, topological consistency and INSPIRE metadata conformance. Each dataset can have its own set of data quality requirements and mapping rules, defined by the product requirements.
The OSS QC tool has been developed to automate and speed up quality control of land use and land cover datasets produced for the European Environment Agency (EEA). These datasets include Urban Atlas, Riparian Zones and Natura 2000 vector products as well as Forest Type, Grassland, Imperviousness 100 meter and 20 meter resolution raster products covering the whole European Union. The tool consists of three components: a Django web console for uploading zipped vector or raster files, a PyWPS web processing service for managing quality control jobs and generating reports, and a suite of configurable check functions that use the PostGIS, GDAL and NumPy libraries. The system is deployed in Docker containers and can be installed on a local PC, a dedicated server or cloud hosting. The web processing service allows multiple quality control jobs to be launched in parallel. The QC tool is also extensible: for a new type of dataset, product specifications and mapping rules can be configured by editing a product definition .json configuration file or by adding a customized check function as a Python module.
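The idea of a product definition file driving a suite of check functions can be sketched as follows. This is a minimal illustration, not the tool's actual configuration schema: the JSON keys, the check names `naming` and `epsg`, the `register` decorator and the layer dictionary are all hypothetical.

```python
import json
import re

# Hypothetical registry mapping check identifiers to check functions.
CHECKS = {}


def register(name):
    """Decorator that adds a check function to the registry under `name`."""
    def decorator(func):
        CHECKS[name] = func
        return func
    return decorator


@register("naming")
def check_naming(layer, params):
    # The layer name must match the regular expression from the product definition.
    ok = re.fullmatch(params["pattern"], layer["name"]) is not None
    return {"check": "naming", "status": "ok" if ok else "failed"}


@register("epsg")
def check_epsg(layer, params):
    # The layer must use one of the allowed spatial reference systems.
    ok = layer["epsg"] in params["allowed"]
    return {"check": "epsg", "status": "ok" if ok else "failed"}


# Hypothetical product definition; in practice this would live in a .json file.
PRODUCT_DEFINITION = json.loads("""
{
  "product": "example_product",
  "checks": [
    {"check": "naming", "params": {"pattern": "[a-z]{2}[0-9]{3}_.+"}},
    {"check": "epsg", "params": {"allowed": [3035]}}
  ]
}
""")


def run_checks(layer, definition):
    """Run every check listed in the product definition, in order."""
    return [CHECKS[step["check"]](layer, step["params"])
            for step in definition["checks"]]
```

With this layout, supporting a new product would mean writing a new definition that lists existing checks with different parameters, and only adding a new registered function when no existing check fits.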
The major challenges encountered while developing the vector check functions were inconsistencies in tolerance and resolution between ArcGIS and PostGIS, optimizing the performance of PostGIS geometry queries, and checking the minimum mapping unit with complex rules and exceptions (polygon touching boundary, polygon touching road, different area thresholds for various land use / land cover classes). For raster checks, the key issue was checking the minimum mapping area for large 10-metre resolution rasters. Using connected component labeling (scikit-image) and processing the raster in smaller tiles in parallel allowed the QC tool to process rasters several gigabytes in size without running out of memory.
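The raster minimum mapping area check can be illustrated on a single tile. The sketch below is an assumption-laden stand-in rather than the tool's code: it uses a plain breadth-first flood fill from the standard library instead of scikit-image (whose `skimage.measure.label` performs the labeling step on real arrays), it expresses the threshold in cells rather than square metres, and the function name `find_small_patches` and the `touches_edge` flag are hypothetical.

```python
from collections import deque


def find_small_patches(grid, min_cells, nodata=0):
    """Return patches of equal-valued, 4-connected cells smaller than min_cells.

    `grid` is a list of rows holding class codes for one tile. Patches that
    touch the tile edge are flagged, because they may continue in a
    neighbouring tile and cannot be judged from this tile alone.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    small = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or grid[r0][c0] == nodata:
                continue
            value = grid[r0][c0]
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            size = 0
            touches_edge = False
            # Flood fill: visit every cell connected to the seed with the same value.
            while queue:
                r, c = queue.popleft()
                size += 1
                if r in (0, rows - 1) or c in (0, cols - 1):
                    touches_edge = True
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] == value):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if size < min_cells:
                small.append({"value": value, "cells": size,
                              "origin": (r0, c0), "touches_edge": touches_edge})
    return small
```

A tile-based runner would call such a function once per tile, possibly in separate worker processes, and would need an extra step to re-examine the edge-touching patches together with the adjacent tiles before reporting them as errors.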
The source code and documentation of the OSS QC tool are available at https://github.com/eea/copernicus_quality_tools. The presentation will provide insights and recommendations on how to automate spatial data quality control and how to deal with long-running web processing service jobs operating on large spatial datasets.