Description of Work | CHOROLOGOS: Semantic Spatio-textual Data Analysis and Processing

The combination of spatio-textual data with spatio-temporal data at scale opens up new research directions, while at the same time challenging existing data processing solutions. As a result, the project needs to address the following main research and technological challenges:

Formulation of novel query types: The acquisition of massive complex data, described by spatial, temporal and textual dimensions, has motivated research into novel query types that retrieve data in flexible, expressive, and meaningful ways. Consequently, a variety of interesting query types have emerged, which raise challenges for query processing algorithms. These include reverse query operators (reverse top-k, reverse k-NN, etc.), why-not operators, queries that retrieve groups of objects (instead of single objects), complex joins, pattern queries, and optimal location queries. The resulting challenge for CHOROLOGOS is to formulate meaningful and useful query types, along with the theoretical properties that enable effective pruning of the search space.
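To make one of the query types named above concrete, the following is a minimal brute-force sketch of a reverse k-NN query in Python. It is only an illustration of the query semantics, not an algorithm proposed by the project: the function name `reverse_knn`, the sample points, and the use of Euclidean distance are all assumptions made for this example.

```python
from math import dist


def reverse_knn(query, points, k):
    """Return the points that would have `query` among their k nearest neighbours.

    Brute force (hypothetical helper): for every point p, count how many
    other points lie strictly closer to p than the query does. If fewer
    than k do, the query belongs to p's k-NN set.
    """
    result = []
    for i, p in enumerate(points):
        d_query = dist(p, query)
        closer = sum(
            1
            for j, other in enumerate(points)
            if j != i and dist(p, other) < d_query
        )
        if closer < k:
            result.append(p)
    return result


# Illustrative data: two points near the query, two far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
print(reverse_knn((0.4, 0.0), points, k=1))
```

This sketch compares every pair of points, so its cost grows quadratically with the data size. That cost is what pruning properties for such operators are meant to avoid.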
Indexing structures for data combining space, time and text: In the past, several indexing structures have been proposed for mobility data (spatio-temporal or trajectory indexes), while spatio-textual indexes have emerged more recently. Both types of indexes face significant research challenges, related to the dynamic nature of the temporal dimension in the former case and to the high dimensionality of text in the latter. Designing efficient index structures for the combination of all three dimensions (space, time and text) is far more difficult. CHOROLOGOS targets exactly this pressing need for efficient access methods for spatio-temporal-textual data, in order to increase the performance of data processing and analysis.
Efficient query processing algorithms: The combination of multiple dimensions with complex query types (e.g., joins) typically has a devastating effect on query processing performance, since the size of the search space increases significantly. To address this challenge, efficient algorithms are sought that eagerly prune the search space and retrieve the query result as fast as possible. CHOROLOGOS will exploit algorithms belonging to the filter-and-refine paradigm, where effective filtering of candidate results drastically reduces the number of object combinations that need to be evaluated in the refinement phase, leading to performance gains. To design such algorithms, we will derive appropriate search bounds that guarantee the correctness of filtering. Moreover, branch-and-bound algorithms will be proposed that capitalize on the derived search bounds to prune the search space.
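The filter-and-refine idea can be sketched on a small spatio-textual join in Python. The example below is an assumption-laden illustration rather than the project's algorithm: it joins objects whose Euclidean distance is at most `eps` and whose keyword sets have Jaccard similarity at least `tau`. The filter uses two safe bounds: objects within `eps` of each other must fall into the same or adjacent cells of a grid with cell side `eps`, and the Jaccard similarity of two sets can never exceed the ratio of the smaller set size to the larger one. All names (`spatio_textual_join`, `jaccard_upper_bound`) and the sample data are hypothetical.

```python
from collections import defaultdict
from math import dist, floor


def jaccard(a, b):
    """Exact Jaccard similarity of two keyword sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def jaccard_upper_bound(a, b):
    """Cheap bound: |a & b| / |a | b| <= min(|a|, |b|) / max(|a|, |b|)."""
    big = max(len(a), len(b))
    return min(len(a), len(b)) / big if big else 1.0


def spatio_textual_join(objects, eps, tau):
    """Return (pairs, refined) for objects given as ((x, y), keyword_set).

    pairs   : sorted index pairs (i, j), i < j, with distance <= eps
              and Jaccard similarity >= tau
    refined : number of candidate pairs that reached the exact checks
    """
    # Spatial filter: hash every object into a grid with cell side eps.
    grid = defaultdict(list)
    for idx, ((x, y), _) in enumerate(objects):
        grid[(floor(x / eps), floor(y / eps))].append(idx)

    pairs, refined = [], 0
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i >= j:
                            continue  # report each unordered pair once
                        (pi, ki), (pj, kj) = objects[i], objects[j]
                        # Textual filter: prune when even the bound fails.
                        if jaccard_upper_bound(ki, kj) < tau:
                            continue
                        # Refine: exact distance and exact similarity.
                        refined += 1
                        if dist(pi, pj) <= eps and jaccard(ki, kj) >= tau:
                            pairs.append((i, j))
    return sorted(pairs), refined


objects = [
    ((0.0, 0.0), {"coffee", "wifi"}),
    ((0.5, 0.2), {"coffee", "wifi", "bakery"}),
    ((0.6, 0.1), {"museum"}),
    ((9.0, 9.0), {"coffee", "wifi"}),
]
pairs, refined = spatio_textual_join(objects, eps=1.0, tau=0.5)
print(pairs, refined)
```

On this sample, only two of the six possible pairs reach the refinement step. Because both filters are upper bounds on what the exact checks could accept, no qualifying pair is lost.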
Parallel and scalable framework for processing massive data: Last but not least, as part of its innovation, CHOROLOGOS will deliver a parallel version of the algorithmic framework, in order to meet the scalability challenges posed by today’s massive data sets (social networks, surveillance networks, IoT and sensor networks). For the development, we will use a state-of-the-art parallel data processing framework, such as Apache Spark or Apache Flink, which offers salient features including fault tolerance and resource management (e.g., when coupled with YARN). This follows the common research practice of extending such frameworks to produce application-targeted prototypes (e.g., SpatialHadoop, ST-Hadoop, LocationSpark). The major underlying challenge in this context is to achieve efficient data partitioning, fair work allocation, and efficient indexing at both the local and the global level.
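The partitioning and work-allocation challenge can be illustrated with a small, framework-independent sketch in plain Python. It does not use the Spark or Flink APIs and is not the project's partitioning scheme. It assumes a uniform grid over the spatial domain and a greedy heuristic that hands the most populated cells, one at a time, to the currently least-loaded worker. The names `cell_of` and `balanced_partitioning` and the sample points are hypothetical.

```python
import heapq
from collections import Counter
from math import floor


def cell_of(point, cell_size):
    """Grid cell (column, row) that contains a 2D point."""
    x, y = point
    return (floor(x / cell_size), floor(y / cell_size))


def balanced_partitioning(points, cell_size, num_workers):
    """Map each non-empty grid cell to a worker, balancing point counts.

    Greedy heuristic: visit cells from most to least populated and give
    each one to the worker with the smallest load so far.
    Returns (cell -> worker id, list of per-worker loads).
    """
    counts = Counter(cell_of(p, cell_size) for p in points)
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for cell, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[cell] = worker
        heapq.heappush(heap, (load + n, worker))
    loads = [0] * num_workers
    for cell, worker in assignment.items():
        loads[worker] += counts[cell]
    return assignment, loads


# Skewed sample: one dense cell and three sparse ones.
points = (
    [(0.1 * i, 0.1) for i in range(1, 7)]        # 6 points in cell (0, 0)
    + [(1.2, 0.3), (1.5, 0.5), (1.9, 0.9)]       # 3 points in cell (1, 0)
    + [(0.2, 1.1), (0.8, 1.7)]                   # 2 points in cell (0, 1)
    + [(5.5, 5.5)]                               # 1 point in cell (5, 5)
)
assignment, loads = balanced_partitioning(points, cell_size=1.0, num_workers=2)
print(assignment, loads)
```

The resulting cell-to-worker map plays the role of a very small global index: a query point can be routed by looking up `assignment.get(cell_of(q, 1.0))`, while each worker remains free to build its own local index over the cells it owns.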