Abstract

Indexes in data systems accelerate data access by adding structure to otherwise unstructured data at the cost of index construc- tion and maintenance. Data systems, and particularly, the underlying indexing data structures are designed to offer favorable ingestion (and query) performance for the two extremes of data sortedness, i.e., unsorted data (often assumed to follow a uniform random distribution) or fully- sorted data. However, in practice, data may arrive with an intermediate degree of pre-sortedness. In such cases, where data arrives nearly (but not necessarily fully) sorted, the intuition is that the indexing cost should be lower than when ingesting unsorted data. Such sortedness-aware index designs lack from the literature. In fact, there is a need for a framework to explore how index designs may be able to exploit pre-existing sortedness during data ingestion to amortize the index construction cost.

In this paper, we present Benchmark on Data Sortedness, BoDS for short, that highlights the performance of data systems in terms of in- dex construction and navigation costs when operating on data ingested with variable sortedness. To quantify data sortedness, we use the state- of-the-art (K,L)-sortedness metric. Specifically, BoDS benchmarks the indexing performance of a data system as we vary the two fundamental components of the metric: (i) K, that measures how many elements are out-of-order in a data collection; and (ii) L, that measures by how much the out-of-order entries are displaced from their respective in-order po- sitions; as well as (iii) the distribution of L. We present in detail the benchmark, and we run it on PostgreSQL, a popular, production-grade relational database system. Unsurprisingly, we observe that PostgreSQL cannot exploit data sortedness; however, through our experiments we show the headroom for improvement, and we lay the groundwork for ex- perimentation with sortedness-aware index designs.

Proceedings of the TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC), 2022
Aneesh Raman, Konstantinos Karatsenidis, Subhadeep Sarkar, Matthaios Olma, Manos Athanassoulis

Official Page | Local PDF | Presentation Video | Slides | Code