Marcin Niemyjski
I'm a surveyor by training, but for the last two years I have been developing towards data science. During my engineering studies I became interested in GIS, LiDAR and data processing, which I am currently exploring as part of my Master's degree in Geographic Information Systems. I see the spatial data industry as a jigsaw puzzle whose pieces are open datasets, Python, SQL and machine learning techniques.
Sessions
The Web Map Service (WMS) is the most popular standard for sharing spatial data remotely. It is commonly used for basemaps, for presenting governmental spatial data, and as a data source when creating vector datasets. Creating a WMS requires the original data to be read and then rendered. This process can be slow, especially if the source data is heavy and not optimized. This is the case, for example, with global Sentinel-1 satellite data, a collection of daily revisits with a total volume of 250 GB per day. Here we demonstrate an efficient way to share such a very large dataset as WMS using MapServer scaled with Kubernetes.
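For readers unfamiliar with the protocol, the sketch below shows what a single WMS GetMap request looks like from the client side; the endpoint URL and layer name are placeholders, not the service described in this talk.

```python
# Minimal sketch of a WMS GetMap request using the `requests` library.
import requests

WMS_URL = "https://example.com/wms"      # hypothetical endpoint

params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "sentinel1_vv",            # hypothetical layer name
    "CRS": "EPSG:3857",
    "BBOX": "2100000,6800000,2200000,6900000",
    "WIDTH": "512",
    "HEIGHT": "512",
    "FORMAT": "image/png",
}

response = requests.get(WMS_URL, params=params, timeout=30)
response.raise_for_status()

with open("tile.png", "wb") as f:
    f.write(response.content)           # rendered map image returned by the server
```

Every such request forces the server to read and render the source rasters, which is why the storage layout of the data matters so much.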
MapServer is used as the engine of our WMS because of its speed and ease of automation. To optimise the performance of the service, and therefore the user experience, it is recommended to store the data in the right format and with the right file structure, while staying aware of the read-speed limitations of the underlying storage, whether bucket or disk. GDAL provides a set of options, executable in a single command, to overwrite the original data with a new, cloud-optimized copy. It is usually good practice to store selected zoom levels as a cache; for time-series data that is enriched daily, the cache is not overwritten as new data arrives but incrementally extended.
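As an illustration of that single-command conversion, here is a sketch using the GDAL Python bindings to rewrite a raster as a Cloud Optimized GeoTIFF. The file names and creation options are generic defaults for illustration, not the exact settings used for the Sentinel-1 mosaic.

```python
# Sketch: rewriting a source raster as a Cloud Optimized GeoTIFF (COG)
# with the GDAL Python bindings (COG driver, GDAL >= 3.1).
from osgeo import gdal

gdal.UseExceptions()

src_path = "S1_scene.tif"        # placeholder input
dst_path = "S1_scene_cog.tif"    # placeholder output

gdal.Translate(
    dst_path,
    src_path,
    format="COG",                # COG driver builds internal tiles and overviews
    creationOptions=[
        "COMPRESS=DEFLATE",      # lossless compression to reduce read volume
        "BLOCKSIZE=512",         # internal tile size
        "NUM_THREADS=ALL_CPUS",  # parallelise compression
    ],
)
```

The internal tiling and overviews are what allow MapServer to read only the blocks needed for a given GetMap window instead of scanning the whole file.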
Despite its popularity and advantages, WMS as a standard for serving data has its limitations. The potentially large disk read time is multiplied by the number of users sending requests. Tests using JMeter (100 users sending 100 GetMap requests in a loop) have shown that even on a relatively strong machine (32 CPUs), the greatly increased traffic has the effect of a distributed denial-of-service (DDoS) attack: the server stops responding.
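The talk's figures come from JMeter; the Python sketch below only illustrates the shape of such a test (concurrent clients looping over GetMap requests) and uses a hypothetical endpoint and layer.

```python
# Illustrative load-test sketch: USERS concurrent clients, each sending
# REQUESTS_PER_USER GetMap requests in a loop, counting failures.
import concurrent.futures
import requests

WMS_URL = "https://example.com/wms"   # hypothetical endpoint
PARAMS = {
    "SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetMap",
    "LAYERS": "sentinel1_vv", "CRS": "EPSG:3857",
    "BBOX": "2100000,6800000,2200000,6900000",
    "WIDTH": "256", "HEIGHT": "256", "FORMAT": "image/png",
}
USERS = 100
REQUESTS_PER_USER = 100

def user_session(user_id: int) -> int:
    """Send GetMap requests in a loop and return the number of failures."""
    failures = 0
    with requests.Session() as session:
        for _ in range(REQUESTS_PER_USER):
            try:
                r = session.get(WMS_URL, params=PARAMS, timeout=10)
                if r.status_code != 200:
                    failures += 1
            except requests.RequestException:
                failures += 1
    return failures

with concurrent.futures.ThreadPoolExecutor(max_workers=USERS) as pool:
    total_failures = sum(pool.map(user_session, range(USERS)))

print(f"failed requests: {total_failures} / {USERS * REQUESTS_PER_USER}")
```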
This problem is solved with Kubernetes (K8s), which allows metric-based automatic horizontal scaling of containerised applications, in this case MapServer. Prometheus, as a K8s cluster monitoring tool, allows custom metrics to be defined, e.g. the number of HTTP requests per time interval. Scaling on such a metric makes it possible to distribute the traffic between newly created pods so that all requests can be answered.
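As a rough sketch of where such a custom metric could come from, the snippet below exposes a request counter with the Prometheus Python client. The metric name, port and the idea of counting requests in a small sidecar are assumptions for illustration; feeding the metric into a HorizontalPodAutoscaler (e.g. through a Prometheus adapter) is not shown.

```python
# Sketch: exposing a per-pod HTTP request counter for Prometheus to scrape.
import time
from prometheus_client import Counter, start_http_server

HTTP_REQUESTS = Counter(
    "wms_http_requests_total",             # hypothetical metric name
    "Number of GetMap requests handled by this MapServer pod",
)

def handle_request() -> None:
    """Placeholder for forwarding a request to MapServer."""
    HTTP_REQUESTS.inc()                    # count every incoming request

if __name__ == "__main__":
    start_http_server(9100)                # /metrics endpoint for Prometheus
    while True:
        handle_request()                   # simulate traffic for the demo
        time.sleep(1)
```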
The aim of the talk is to stimulate discussion, test the approach against expert feedback and demonstrate good practice in creating a publicly accessible WMS, with a focus on optimising speed when the source data is heavy, supported by a working example and statistics.