Scaling k-Nearest Neighbors Queries (The right way)
Atoshum Samuel Cahsai, Nikos Ntarmos, Christos Anagnostopoulos and Peter Triantafillou
University Of Glasgow, University Of Glasgow, University of Glasgow, University Of Glasgow

Recently parallel / distributed processing approaches have been proposed for processing k-Nearest Neighbours (kNN) queries over very large (multi-dimensional) datasets aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With this paper we offer a novel approach that alleviates the performance problems associated with state of the art methods. The essence of our approach, which differentiates it from related research, rests on (i) adopting a coordinatorbased distributed processing algorithm, instead of those employed over data-parallel execution engines (such as Hadoop/MapReduce or Spark), and (ii) on a way to organize data, to structure computation, and to index the stored datasets that ensures that only a very small number of data items are retrieved from the underlying data store, communicated over the network, and processed by the coordinator for every kNN query. Our approach also pays special attention to ensuring scalability in addition to low query processing times. Overall, kNN queries can be processed in just tens of milliseconds (as opposed to the (tens of) seconds required by state of the art. We have implemented our approach, using a NoSQL DB (HBase) as the data store, and we compare it against the state-of-the-art: the Hadoop-based Spatial Hadoop (SHadoop) and the Spark-based Simba methods. We employ different datasets of various sizes, showcasing the contributed performance advantages. Our approach outperforms the state of the art, by 2-3 orders of magnitude, and consistently for dataset sizes ranging from hundreds of millions to hundreds of billions of data points. We also show that the key constituent performance overheads incurred during query processing (such as the number of data items retrieved from the data store, the required network bandwidth, and the processing time at the coordinator) scale very well, ensuring the overall scalability of the approach.