Patryk Grzybowski
Data Scientist in the area of Earth Observation.
Sessions
Destination Earth (DestinE) is a flagship initiative led by the European Commission, implemented by the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT), the European Space Agency (ESA) and the European Centre for Medium-Range Weather Forecasts (ECMWF). It aims to create highly detailed Digital Twins (DTs) of the Earth, enabling precise simulations for a variety of uses. Currently, the initiative focuses on two primary Digital Twins: the Weather Extremes Digital Twin (ExtremeDT) and the Climate Change Adaptation Digital Twin (ClimateDT). Over the coming years, the scope of Digital Twins is set to expand, necessitating improved access to data and streamlined methods for working with it. This is where the Destination Earth Data Lake (DEDL) plays a pivotal role, offering comprehensive data discovery, access, and processing services tailored to the needs of DestinE users.
The DEDL operates on two key levels: ‘Data Discovery and Access’ and ‘Edge Services’. The Discovery and Data Access services are provided by the Harmonized Data Access (HDA) tool, which offers a single, federated entry point to services and data, including existing datasets and complementary sources such as in-situ and socio-economic data. Notably, it also provides access to the unique datasets generated by DestinE’s Digital Twins. By combining these sources, users can seamlessly explore, integrate, and analyze both existing services and the innovative data produced by the Digital Twins. What is more, all of this data is provided as a full archive that is immediately available to the user. The services rely on the SpatioTemporal Asset Catalogs (STAC) standard, which means:
• The search in the dataset is done according to the STAC protocol;
• The Federated Catalog search proxy component converts STAC queries into queries adapted to the underlying catalog and returns the results to the user in STAC format;
• The services are presented in a service catalog.
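As an illustration of the STAC-based search flow described above, here is a minimal sketch in Python. The body fields (`collections`, `bbox`, `datetime`, `limit`) follow the STAC API item-search convention. Everything else is an assumption made for illustration: the collection ID is a placeholder, and `to_backend_query` with its parameter names is a hypothetical stand-in for the translation step, not the actual Federated Catalog search proxy.

```python
from typing import Any


def build_stac_search(collection: str, bbox: list[float],
                      start: str, end: str, limit: int = 10) -> dict[str, Any]:
    """Build a STAC API item-search body."""
    return {
        "collections": [collection],
        "bbox": bbox,                  # [west, south, east, north], WGS84
        "datetime": f"{start}/{end}",  # closed RFC 3339 interval
        "limit": limit,
    }


def to_backend_query(stac_body: dict[str, Any]) -> dict[str, str]:
    """Hypothetical translation: flatten a STAC body into key/value
    parameters for an imaginary underlying catalog."""
    west, south, east, north = stac_body["bbox"]
    start, end = stac_body["datetime"].split("/")
    return {
        "dataset": stac_body["collections"][0],
        "box": f"{west},{south},{east},{north}",
        "startDate": start,
        "endDate": end,
        "maxRecords": str(stac_body["limit"]),
    }


body = build_stac_search(
    "EXAMPLE_COLLECTION",          # placeholder, not a real collection ID
    [14.0, 49.0, 24.2, 54.9],
    "2024-06-01T00:00:00Z",
    "2024-06-30T23:59:59Z",
)
backend_params = to_backend_query(body)
```

The user only ever writes and reads the STAC form; the translation to and from the underlying catalog stays behind the single entry point.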
Edge Services offered by DEDL provide:
• Cloud Computing
• STACK Application Development Environment
• Hook Services
The cloud computing service is powered by the ISLET infrastructure, a distributed Infrastructure as a Service (IaaS) built on OpenStack, using the Horizon interface. It allows users to manage virtual machines and S3 storage, and to run advanced computations via a graphical user interface (GUI) or command-line interface (CLI). For more complex tasks, Kubernetes integration is available. A standout feature of ISLET is its proximity to data sources, operating near High-Performance Computing (HPC) facilities. This is achieved through data bridges, enabling efficient processing of large datasets, including those from Digital Twins, in conjunction with HPC systems.
The STACK environment supports application development using JupyterHub and DASK, with the Python and R languages. Users can create DASK clusters on selected infrastructure or cloud sites to process data directly where it resides, removing the need for extensive local setup and optimization.
Hook Services are a set of predefined workflows that users can run as ready-to-use processors, e.g. Sentinel-2: MAJA Atmospheric Correction; Sentinel-2: SNAP-Biophysical; Sentinel-1: Terrain-corrected backscatter. They also provide workflow functions that generate higher-level products on demand, such as temporal composites.
The DestinE Data Lake is a transformative initiative that revolutionizes how Earth Observation data is managed and utilized. By integrating innovative infrastructure (ISLET), data services (HDA), reliable processors (Hook Services), and user-friendly development tools (STACK), DEDL enables unprecedented levels of data harmonization, federation, and processing. Moreover, the DEDL plays a crucial role in empowering DestinE users by providing them with seamless access to vast datasets and advanced computational tools. It simplifies the process of data exploration, integration, and analysis, enabling researchers, policymakers, and developers to focus on innovation and decision-making rather than technical barriers. By offering a comprehensive suite of services designed to work close to the data, DEDL ensures that users can efficiently utilize the wealth of information generated by the Digital Twins and maximize the impact of their work. This cutting-edge system enhances climate research capabilities and supports sustainable development efforts on a scale previously unattainable.
The Destination Earth Data Lake Lab (DestinE-DataLake-Lab) is a comprehensive GitHub repository designed to facilitate users' interaction with the Destination Earth Data Lake (DEDL) services. Developed by EUMETSAT and partners, this repository offers a collection of Jupyter Notebook examples and Python tools that demonstrate how to effectively utilize various DEDL services, including Harmonized Data Access (HDA), STACK, and HOOK.
Harmonized Data Access (HDA)
The HDA service provides users with streamlined access to a diverse range of datasets within the DEDL ecosystem. Within the repository, the HDA directory contains Jupyter Notebook examples that guide users through the process of discovering available services, listing and searching for STAC collections, and retrieving specific data items. These examples are instrumental in helping users understand how to interact with the HDA API, manage authentication, and perform data queries efficiently.
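To give a flavor of the item-retrieval step, the sketch below lists the downloadable assets in a STAC search response. It relies only on the generic STAC structure (an ItemCollection is a GeoJSON FeatureCollection whose items carry an `assets` mapping with `href` values). The sample response, item ID, and URLs are made up for illustration and do not come from the HDA service; authentication is left out.

```python
from typing import Any


def list_asset_links(item_collection: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (item id, asset key, href) for every asset in a STAC ItemCollection."""
    links = []
    for item in item_collection.get("features", []):
        for key, asset in item.get("assets", {}).items():
            links.append((item["id"], key, asset["href"]))
    return links


# Made-up response in the shape a STAC search returns.
sample_response = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "example-item-1",
            "assets": {
                "data": {"href": "https://example.invalid/item-1/data.zip"},
                "thumbnail": {"href": "https://example.invalid/item-1/thumb.png"},
            },
        }
    ],
}
asset_links = list_asset_links(sample_response)
```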
STACK Service
The STACK service is designed to facilitate near-data processing by leveraging DASK, a flexible parallel computing library in Python. In the STACK directory of the repository, users will find Jupyter Notebook examples that illustrate how to set up and utilize DASK for processing large datasets distributed across different cloud locations. These examples demonstrate the deployment of DASK clusters, execution of parallel computations, and optimization of data processing workflows, enabling users to perform complex analyses efficiently.
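The near-data pattern can be sketched without DASK itself. The example below deliberately swaps in Python's standard-library `ThreadPoolExecutor` as a stand-in for DASK clusters: records are grouped by the site that holds them, each group is reduced to a small partial result by its own worker, and only the partials are combined. The site names and values are hypothetical, and the function assumes a non-empty input.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def site_mean_parts(values: list[float]) -> tuple[float, int]:
    """Per-site partial result: (sum, count), small enough to move cheaply."""
    return sum(values), len(values)


def near_data_mean(records: list[tuple[str, float]]) -> float:
    """Group records by site, reduce each group in its own worker,
    then combine the partial results into one overall mean."""
    by_site: dict[str, list[float]] = defaultdict(list)
    for site, value in records:
        by_site[site].append(value)
    # One task per site; with DASK these would run on per-site clusters.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(site_mean_parts, by_site.values()))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count  # assumes at least one record


records = [("site-a", 1.0), ("site-a", 3.0), ("site-b", 5.0), ("site-b", 7.0)]
overall_mean = near_data_mean(records)
```

The design point is that only the per-site `(sum, count)` pairs travel between locations, not the underlying data.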
HOOK Service
The HOOK service offers Function-as-a-Service (FaaS) capabilities, allowing users to define and execute workflows within the DEDL environment. The HOOK directory in the repository provides Jupyter Notebook examples that guide users through the process of creating, deploying, and managing workflows using the HOOK service. These tutorials cover various aspects, including defining functions, setting up triggers, and monitoring workflow execution, thereby enabling users to automate data processing tasks effectively.
Getting Started
To begin utilizing the resources provided in the DestinE-DataLake-Lab repository, users are encouraged to clone the repository into their local environment or access it through the JupyterHub provided by the DEDL STACK service. The repository includes a requirements.txt file that lists the necessary Python dependencies. Users should create a virtual environment, install the required packages, and select the appropriate kernel when running the provided notebooks. Detailed instructions for setting up the environment and installing dependencies are available in the repository's README file.
Additional Resources
For further information and comprehensive documentation on DEDL services, users can refer to the DestinE Data Lake documentation. This resource provides in-depth guides, API references, and additional tutorials to assist users in maximizing their utilization of DEDL services. Moreover, the DestinE Data Portfolio and Data Lake Edge services offer valuable insights into the available datasets and services within the DEDL ecosystem.
Summary
In summary, the DestinE-DataLake-Lab repository serves as a valuable resource for users aiming to effectively engage with the Destination Earth Data Lake services. By providing practical examples and comprehensive guides, it empowers users to harness the full potential of DEDL's offerings, facilitating efficient data access, processing, and workflow management within the Destination Earth initiative.