CEOS Data Cube System Requirement

This document is intended to help a user estimate the hardware specifications required for a system to run the CEOS Data Cube given their data and performance requirements. Note that the CEOS Data Cube is currently supported only on Linux-based systems.

Storage

The amount of hard drive space required by the system depends on the amount of data the system is responsible for managing. For a general Landsat-specific estimate of storage requirements, use the following formula:

Estimate = (number of path/rows) * (365/revisit time in days) * (years of storage) * (data_size) * (margin_of_safety_multiplier)

The recommended margin_of_safety_multiplier is 3 to 4; it allows for future expansion and provides additional space for any pre/post-processing add-ons.
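
To make the calculation concrete, the formula can be coded directly. The sketch below is a minimal Python example; the 16-day revisit time is the nominal Landsat repeat cycle, and the ~1.74 GB per-scene size is an assumed value chosen so that the raw volume matches the Landsat 7 table below, not an official product size.

    # Minimal sketch of the storage formula above. The scene size used in the
    # example is an assumed illustrative value, not an official figure.
    def estimate_storage_gb(path_rows, revisit_days, years, scene_size_gb,
                            margin_of_safety=3.0):
        """Estimated disk space in GB for a given archive configuration."""
        scenes_per_year = 365.0 / revisit_days
        return path_rows * scenes_per_year * years * scene_size_gb * margin_of_safety

    # Example: 5 path/rows of Landsat 7 SR data for 10 years with a 3x margin.
    # With an assumed ~1.74 GB per scene this gives roughly 5950 GB (~6 TB),
    # i.e. the 1.98 TB raw figure in the table below times the 3x margin.
    print(estimate_storage_gb(path_rows=5, revisit_days=16, years=10,
                              scene_size_gb=1.74))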

Example:

Included below is a table of raw data volumes for satellite imagery based on varying spatial and temporal requirements.

Landsat 7 SR product

        1 path/row   5 path/rows   10 path/rows   15 path/rows   20 path/rows   30 path/rows
 1 yr   39.6 GB      198.0 GB      396.0 GB       594.0 GB       792.0 GB       990.0 GB
 5 yr   198.0 GB     990.0 GB      1.98 TB        2.97 TB        3.96 TB        4.95 TB
10 yr   396.0 GB     1.98 TB       3.96 TB        5.94 TB        7.92 TB        9.9 TB
15 yr   594.0 GB     2.97 TB       5.94 TB        8.91 TB        11.88 TB       14.85 TB
20 yr   792.0 GB     3.96 TB       7.92 TB        11.88 TB       15.84 TB       19.8 TB

These data volumes can be used with the formula above to estimate total storage requirements. Please refer to the appendix for examples related to Sentinel 2 Level-1C and MODIS Level 1B.

CPU Requirements for Concurrency

CPU requirements for concurrency support vary based on individual needs. The minimum requirements to support multiple concurrent users are detailed below:

  1. At least one core per desired concurrent user: To take advantage of parallel processing, more cores per user are desirable. For example, an analysis run from a custom-built user interface (like the CEOS Data Cube UI) can be split into 5+ parts that are then computed concurrently for that same user (see the sketch after this list).

  2. At least 1 GB of memory per core: Depending on the analysis case, requirements may be lower or higher. Certain algorithms allow for temporal as well as geospatial chunking; for such algorithms, memory usage per core is fairly low.

  3. Shared storage with capacity for all data: NFS is our current method of shared storage. Transfer speeds may become a limitation in the future but have not been an issue thus far. The feasibility of other shared storage solutions used in big data analytics will be explored in the future.
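
As a rough illustration of item 1, the sketch below splits a single user's analysis into five parts and processes them concurrently, one core per part. The analyze_chunk function and the latitude-slice split are hypothetical stand-ins for whatever per-chunk analysis the application actually runs; this is not the CEOS Data Cube UI's own task machinery.

    # Minimal sketch: split one user's analysis into 5 parts and run them
    # concurrently, one process (core) per part. analyze_chunk is a
    # hypothetical placeholder for the real per-chunk analysis.
    from multiprocessing import Pool

    def analyze_chunk(bounds):
        """Placeholder analysis over a (lat_min, lat_max) latitude slice."""
        lat_min, lat_max = bounds
        return {"bounds": bounds, "result": lat_max - lat_min}

    if __name__ == "__main__":
        # Split a 1-degree analysis region into five 0.2-degree slices.
        slices = [(i * 0.2, (i + 1) * 0.2) for i in range(5)]
        with Pool(processes=5) as pool:
            results = pool.map(analyze_chunk, slices)
        print(len(results), "parts computed concurrently")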

Concurrency and Chunking in Practice

A ‘chunk’ refers to a smaller portion of a larger dataset; large analysis regions are broken into many smaller subsets that are processed independently. ‘Chunking’ is a common practice when dealing with large datasets in a resource-constrained environment. Note that the chunking procedure applies to all data products, not just Landsat 7. Detailed below are examples of concurrent chunking.

  1. Chunking Landsat 7 SR products: For Landsat data, pulling in one square degree for one acquisition takes around 400 MB of memory. In a low-memory setup, it is possible to split analyses into smaller geographic chunks, e.g. splitting a one square degree analysis into ten 0.1 square degree analyses and combining the end results (illustrated in the sketch after this list).

  2. Chunking for the Standard Mosaic process on LS7 imagery: We chunk the required data for each analysis case so that each case uses a roughly constant amount of memory during execution. Generally, our goal is around 1 GB of memory per running process, which is tuned by adjusting chunk sizes. For a standard mosaic analysis, our chunk size is 0.5 square degrees, loading five chunks at a time, while a median pixel mosaic has a chunk size of 0.01 square degrees and loads all available scenes at once.
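
As an illustration of the geographic chunking in item 1, the sketch below processes a one square degree region as ten 0.1-degree latitude strips and combines the per-chunk results at the end. load_and_analyze is a hypothetical placeholder, not a Data Cube API call.

    # Minimal sketch of geographic chunking: process a 1 x 1 degree region as
    # ten 0.1-degree latitude strips and combine the results at the end.
    # load_and_analyze is a hypothetical placeholder for the real per-chunk
    # load/analysis step; chunking keeps per-process memory roughly constant.
    import numpy as np

    def load_and_analyze(lat_min, lat_max, lon_min, lon_max):
        """Placeholder: pretend each chunk yields a small result array."""
        return np.full((10, 100), lat_min)

    def chunked_analysis(lat_min, lat_max, lon_min, lon_max, chunk_deg=0.1):
        n_chunks = int(round((lat_max - lat_min) / chunk_deg))
        chunks = []
        for i in range(n_chunks):
            lo = lat_min + i * chunk_deg
            hi = min(lo + chunk_deg, lat_max)
            chunks.append(load_and_analyze(lo, hi, lon_min, lon_max))
        # Combine the per-chunk results (here, a simple concatenation).
        return np.concatenate(chunks, axis=0)

    result = chunked_analysis(0.0, 1.0, 36.0, 37.0)
    print(result.shape)  # (100, 100): ten strips of 10 rows each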

Memory Requirements

Memory requirements vary based on individual needs and depend on design decisions around the use of chunking and distributed systems, though at least 1 GB per processor core is recommended.

A minimum of 8 GB of memory is sufficient for small-scale setups or analyses of small regions.
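
As a simple rule-of-thumb check, the sketch below combines the two guidelines above (at least 1 GB per core, with an 8 GB floor for small setups); the function name is illustrative only.

    # Minimal sketch of the memory rule of thumb: at least 1 GB per core,
    # with an 8 GB floor for small-scale setups.
    def recommended_memory_gb(cores, gb_per_core=1.0, floor_gb=8.0):
        return max(floor_gb, cores * gb_per_core)

    print(recommended_memory_gb(16))  # 16.0 GB for a 16-core machine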

The “Existing Specs” section below lists example specifications for existing systems as well as their performance.

Existing Specs

Several existing CEOS Data Cube system specifications are listed below:

Scale                                CPU                      Memory    Storage   Uses and Capabilities
Locally Hosted Dev environment       1 x 8-core processor     64 GB     8 TB      Development server for 1-4 users, including Ceos-cube UI hosting and Jupyter notebooks
AWS hosted production environment    2 x 36-core processors   128 GB    16 TB     Server capabilities (Ceos-cube UI) and processing capabilities for ~20-30 users
Large Scale production environment   32 x 8-core processors   1536 GB   24 TB     Data access/analysis for ~100 or more users at environmental agencies

Primitive Benchmarking

Several existing UI algorithms available in the CEOS Data Cube framework were used to perform primitive benchmarking; some of the preliminary results are described in this section. Time is measured as UI task execution time. The tasks were fully parallelized and split into 10-200 chunks for multicore processing.

Larger systems are capable of higher concurrency with the same number of active users. While execution times may be similar between systems, systems with more cores can execute multiple concurrent jobs at once. This difference is most visible in tasks that split into a larger number of chunked tasks, such as the median pixel custom mosaic: the median pixel mosaic splits into 200 tasks, while the most recent pixel mosaic splits into 10.

Execution times refer to a full (2000-2016) analysis over a one square degree region in Colombia; TSM was run on a one square degree region of Lake Chad. No application settings differed between machines to take advantage of the additional cores, so the execution times of our AWS servers could be optimized further at the expense of the number of supported concurrent users if desired. These analysis cases involve processing 91 scenes for mosaicking, fractional cover, and water detection, and 314 scenes for the TSM analysis.

System                       Custom Mosaic (Median)   Custom Mosaic (Most Recent)   Water Detection   TSM    Fractional Cover
AMA Offices[1]               767s                     177s                          175s              597s   361s
AMA AWS Instance (1 node)    298s                     67s                           127s              357s   113s
AMA AWS Instance (2 nodes)   189s                     62s                           106s              318s   99s

[1] AMA Office Setup: Linux server with 64 GB of RAM, a 64-bit 1.2 GHz latest-generation Intel Xeon processor with 8 hyperthreaded cores, and an 8 TB hard drive.

It can be seen above that having more cores decreases processing time significantly, because the larger (AWS) machines have enough cores to process all chunks concurrently. The gains drop off noticeably between 36 and 72 cores, as most of the analyses were only split into 10-20 chunks. The main distinguishing point between the two AWS benchmarks is the ability to process more tasks concurrently. This is seen with the median pixel custom mosaic: the performance gain between the two AWS benchmarks is still significant, indicating that the larger instance can support more concurrent users.

Appendix

Additional Examples:

Included below are tables of raw data volumes for satellite imagery based on varying spatial and temporal requirements.

Sentinel 2 Level-1C

        1 path/row   10 path/rows   20 path/rows   30 path/rows   40 path/rows   50 path/rows
 1 yr   36.5 GB      365.0 GB       730.0 GB       1.095 TB       1.46 TB        1.825 TB
 5 yr   182.5 GB     1.825 TB       3.65 TB        5.475 TB       7.3 TB         9.125 TB
10 yr   365.0 GB     3.65 TB        7.3 TB         10.95 TB       14.6 TB        18.25 TB
15 yr   547.5 GB     5.475 TB       10.95 TB       16.425 TB      21.9 TB        27.375 TB
20 yr   730.0 GB     7.3 TB         14.6 TB        21.9 TB        29.2 TB        36.5 TB

MODIS Level 1B

        1 path/row   10 path/rows   20 path/rows   30 path/rows   40 path/rows   50 path/rows
 1 yr   120.45 GB    1.2045 TB      2.409 TB       3.6135 TB      4.818 TB       6.0225 TB
 5 yr   602.25 GB    6.0225 TB      12.045 TB      18.0675 TB     24.09 TB       30.1125 TB
10 yr   1.2045 TB    12.045 TB      24.09 TB       36.135 TB      48.18 TB       60.225 TB
15 yr   1.80675 TB   18.0675 TB     36.135 TB      54.2025 TB     72.27 TB       90.3375 TB
20 yr   2.409 TB     24.09 TB       48.18 TB       72.27 TB       96.36 TB       120.45 TB