07-17, 12:00–12:30 (Europe/Sarajevo), PA01
The Open Container Initiative (OCI) 1.1 specification has expanded container registries beyond traditional software images, enabling them to store and distribute a wide variety of digital artifacts, from software build artifacts to machine learning (ML) models and arbitrarily large data blobs. As the volume of Earth Observation (EO) data generated by satellites and remote sensing applications continues to increase, scalable and efficient distribution methods are becoming essential. OCI registries are well suited to end-to-end supply chains because of their built-in capabilities for integrity verification and attestations (for example, quality-assurance records), which allow common tooling and best practices to be applied across the steps of the supply chain. Their layered design allows selective retrieval of specific parts, and optimizations like compression and deduplication can be applied per layer, making them a strong candidate for managing EO data of arbitrary size.
Challenges in Optimizing OCI Registries for EO Data
Despite these advantages, significant challenges remain, both in optimizing client-side parallelization and in addressing server-side limitations of existing OCI registries. A critical research question arises: How should EO data be structured within OCI registries to maximize performance? While OCI registries support multiple storage layers and optimizations, the practical implications of storing EO data in this format have not been thoroughly explored. Key concerns include whether OCI registries can effectively support arbitrarily sized EO data and how different storage layouts affect retrieval speed and storage efficiency.
Investigating Best Practices for EO Data Storage in OCI Registries
This research paper investigates how to structure EO data within OCI registries to optimize performance. By examining various physical data layouts (such as chunking data into blocks or organizing data into multiple layers), the goal is to identify best practices for storing and accessing large EO datasets. Common OCI client tooling will be benchmarked against a variety of OCI-compliant registries, including public offerings like Docker Hub and Quay.io, managed services like AWS ECR, and bespoke cloud-based implementations. These benchmarks will evaluate retrieval latency, throughput, and parallelization techniques for distributing EO data efficiently at scale. The research paper will also examine the impact of different compression, deduplication, and data layout strategies on storage efficiency and retrieval performance.
Advantages of OCI Registries for EO Data Storage and Distribution
An OCI image, as specified by the Open Container Initiative (a Linux Foundation project), is actually a collection of multiple components. At the top level is a manifest, optionally grouped with other manifests under an image index. The manifest is a JSON document that references the configuration and each layer by its digest, the cryptographic hash of the content itself. The OCI distribution spec describes how clients pull images from a registry: the manifest is fetched first, and the referenced blobs are then retrieved by digest, layer by layer.
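The digest-based referencing described above can be illustrated with a minimal sketch. It builds a manifest for a non-image artifact and derives the blob URL a client would request. The media types for the manifest and the empty config follow the OCI image spec, and the `/v2/<name>/blobs/<digest>` path follows the OCI distribution spec. The artifact type, registry host, and repository name are hypothetical placeholders, not values used in this research.

```python
from __future__ import annotations

import hashlib
import json


def descriptor(media_type: str, content: bytes) -> dict:
    """Return an OCI content descriptor (mediaType, digest, size) for a blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }


def build_manifest(layers: list[bytes], artifact_type: str) -> dict:
    """Assemble a minimal OCI image manifest that references each layer by digest."""
    empty_config = b"{}"  # the OCI 1.1 "empty" config used for non-image artifacts
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "artifactType": artifact_type,
        "config": descriptor("application/vnd.oci.empty.v1+json", empty_config),
        "layers": [descriptor("application/octet-stream", blob) for blob in layers],
    }


def manifest_digest(manifest: dict) -> str:
    """Digest of the serialized manifest; registries address manifests this way too."""
    payload = json.dumps(manifest, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def blob_url(registry: str, repository: str, digest: str) -> str:
    """URL of a blob under the OCI distribution spec's /v2/ API."""
    return f"https://{registry}/v2/{repository}/blobs/{digest}"


if __name__ == "__main__":
    # Hypothetical artifact type and registry location, for illustration only.
    manifest = build_manifest(
        [b"chunk-0", b"chunk-1"], "application/vnd.example.eo.scene.v1"
    )
    print(json.dumps(manifest, indent=2))
    for layer in manifest["layers"]:
        print(blob_url("registry.example.com", "eo/scene", layer["digest"]))
```

Because every layer is addressed by its own digest, a client can request any subset of layers independently, which is what makes selective retrieval possible.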
OCI registries offer several inherent benefits that make them attractive for EO data storage and distribution. These include:
- Layered Storage Model: OCI artifacts utilize a layered approach, allowing incremental and block-wise storage and retrieval, enabling efficient updates and minimizing redundant data transfers.
- Efficient Distribution: Content-addressable storage allows fetching only changed layers, which minimizes bandwidth and storage costs and supports incremental updates.
- Versioning and Tagging: Version control is inherent in OCI registries, enabling precise tracking of updates. This is crucial as data moves through various stages of processing, validation, and final distribution.
- Attestation and Integrity: Cryptographic digests ensure data integrity, while signatures and attestations attached to artifacts help verify authenticity and trustworthiness across the supply chain, from raw input to final products.
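The content-addressable behaviour behind these benefits can be sketched with an in-memory stand-in for a registry's blob store. This is an illustrative model, not a registry implementation: the class `ContentStore` and its methods are hypothetical names. It shows how identical content is stored once, how a client can work out which layers it still needs, and how a digest check detects corrupted content.

```python
import hashlib


def sha256_digest(content: bytes) -> str:
    """Return the digest string used to address a blob."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class ContentStore:
    """Hypothetical in-memory blob store keyed by content digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        """Store a blob; identical content maps to the same key and is kept once."""
        digest = sha256_digest(content)
        self._blobs.setdefault(digest, content)
        return digest

    def get(self, digest: str) -> bytes:
        """Return a blob after re-checking that it still matches its digest."""
        content = self._blobs[digest]
        if sha256_digest(content) != digest:
            raise ValueError(f"integrity check failed for {digest}")
        return content

    def missing(self, digests):
        """Digests from a manifest that are not held locally yet."""
        return [d for d in digests if d not in self._blobs]

    def stored_bytes(self) -> int:
        """Physical bytes held, after deduplication."""
        return sum(len(blob) for blob in self._blobs.values())


if __name__ == "__main__":
    registry = ContentStore()
    # Two versions of a dataset that share one unchanged tile.
    version_1 = [registry.put(b"tile-a"), registry.put(b"tile-b")]
    version_2 = [registry.put(b"tile-a"), registry.put(b"tile-c")]

    client = ContentStore()
    for digest in version_1:
        client.put(registry.get(digest))

    # Updating to version 2 only requires the layer that changed.
    print(client.missing(version_2))
```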
Addressing Practical Limitations in OCI Registries
Despite these benefits, the practical limitations of OCI registries for handling large-scale EO data are not fully understood. Specifically, the impact of physical data layout on retrieval speed and storage efficiency requires further investigation. This research paper will explore several strategies for structuring EO data within OCI registries:
- Chunking and Layering Strategies: Investigating whether data should be stored in large monolithic layers or smaller, granular chunks, and evaluating the effects of compression and deduplication on retrieval performance.
- Client-Side Parallelization: Analyzing the impact of parallelized downloads on pull speeds and comparing performance improvements with different concurrent retrieval configurations.
- Server-Side Constraints: Assessing registry performance limits, including bandwidth throttling and API rate limits, and comparing different OCI registry offerings and implementations.
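The first two strategies above can be sketched as follows: a payload is split into fixed-size chunks (one per layer), and the chunks are then fetched concurrently, verified against their digests, and reassembled in manifest order. The function names and the `fetch` callable are hypothetical; in a real benchmark `fetch` would issue an HTTP request to a registry, whereas here it is a plain dictionary lookup so the sketch runs offline.

```python
from __future__ import annotations

import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def sha256_digest(content: bytes) -> str:
    return "sha256:" + hashlib.sha256(content).hexdigest()


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split a payload into fixed-size chunks; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def pull_parallel(
    digests: list[str],
    fetch: Callable[[str], bytes],
    max_workers: int,
) -> bytes:
    """Fetch blobs concurrently, verify each one, and join them in manifest order."""

    def fetch_and_verify(digest: str) -> bytes:
        blob = fetch(digest)
        if sha256_digest(blob) != digest:
            raise ValueError(f"digest mismatch for {digest}")
        return blob

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return b"".join(pool.map(fetch_and_verify, digests))


if __name__ == "__main__":
    payload = bytes(range(256)) * 40  # 10240 bytes of stand-in data
    chunks = split_into_chunks(payload, chunk_size=4096)
    layer_digests = [sha256_digest(chunk) for chunk in chunks]

    # Stand-in for a registry: a digest-to-blob mapping instead of HTTP requests.
    fake_registry = {sha256_digest(chunk): chunk for chunk in chunks}

    restored = pull_parallel(layer_digests, fake_registry.__getitem__, max_workers=4)
    print(len(chunks), restored == payload)
```

Varying `chunk_size` and `max_workers` in such a harness is one way to explore the trade-off between many small requests and a few large ones. Server-side throttling and rate limits are not modeled here.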
Benchmarking and Evaluation Metrics
The research paper will employ a benchmark-based approach to evaluate different storage layouts and retrieval optimizations. Key metrics for evaluation include:
- Latency: Measuring the time required to pull (and extract) EO datasets from OCI registries.
- Throughput: Assessing how registry performance scales with concurrent downloads.
- Storage Overhead: Analyzing the efficiency of deduplication and compression techniques.
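As a rough illustration of how the metrics listed above could be computed, the sketch below defines small helpers for latency, throughput, compression ratio, and deduplication ratio. All function names are hypothetical. `zlib` is used only as a readily available stand-in for whatever compression a registry client applies, and the formulas are simple assumed definitions rather than the methodology of this research.

```python
import time
import zlib


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) as a latency measurement."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def throughput_mib_s(num_bytes: int, seconds: float) -> float:
    """Transfer rate in MiB/s for num_bytes moved in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / (1024 * 1024) / seconds


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Uncompressed size divided by compressed size (higher means better compression)."""
    return len(raw) / len(zlib.compress(raw, level))


def dedup_ratio(chunks) -> float:
    """Logical bytes divided by bytes left after storing each unique chunk once."""
    logical = sum(len(chunk) for chunk in chunks)
    unique = sum(len(chunk) for chunk in set(chunks))
    if unique == 0:
        raise ValueError("no data to measure")
    return logical / unique


if __name__ == "__main__":
    data = b"\x00" * 1_000_000  # highly compressible stand-in for a sparse raster
    compressed, seconds = timed(zlib.compress, data)
    print(f"latency: {seconds:.4f} s")
    print(f"throughput: {throughput_mib_s(len(data), seconds):.1f} MiB/s")
    print(f"compression ratio: {compression_ratio(data):.1f}")
    print(f"dedup ratio: {dedup_ratio([b'aa', b'aa', b'bb']):.2f}")
```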
Test datasets will include EO imagery and EO time-series data stored in cloud-native formats such as Cloud Optimized GeoTIFF (COG) and Zarr, which inherently support chunked data structures (compressed and uncompressed). Comparing different layouts and access patterns will show which ways of structuring EO data within OCI registries are most effective.
Research Questions and Expected Contributions
This research paper seeks to establish best practices for storing EO data in OCI registries by answering the following questions:
- What are the practical limitations of OCI registries for handling arbitrarily large EO datasets?
- How should EO data be physically structured within OCI registries to optimize performance?
- What are the trade-offs between different storage layouts in terms of retrieval speed, storage efficiency, and scalability?
By systematically evaluating these aspects, this research paper will contribute to the broader adoption of OCI registries for EO data management, ensuring efficient, scalable, and interoperable distribution. The findings will also guide future optimizations in registry implementations to better support large-scale geospatial datasets.
Goal
OCI registries offer a promising avenue for distributing EO data at scale. However, the performance implications of storing large datasets in this format remain underexplored. This research paper will benchmark various OCI registry implementations, investigating the impact of data structuring, parallelization, and registry limitations. By identifying best practices for EO data storage in OCI registries, the efficiency of geospatial data distribution can be enhanced while leveraging the robust ecosystem of container registries already in place.
The conference contribution, comprising the abstract, the text contribution for the conference proceedings, the presentation materials, and the video recording and live transmission of the presentation, is made available under the CC BY 4.0 license.
Stefan Achtsnit is a seasoned mathematical computer scientist with more than 20 years of experience contributing to both enterprise and scientific research software. Focused on advancing the adoption of cutting-edge Data Management and DataOps best practices, particularly in the Earth Observation (EO) domain, he is deeply engaged in software engineering, cloud-native operations, and data management communities, actively sharing his expertise.