Project

Optimizing advanced analytic tasks over distributed data (R-9422)

In the era of big data, companies and scientific institutions face data in varieties and volumes never encountered before. At the same time, new needs and expectations have arisen about the insight and intelligence that can be derived from these datasets using predictive analytics via statistical and machine-learning models and algorithms. While sampling has been a commonly used technique to bridge the gap between large datasets and deep analytics via expert tools, today, driven by cheap storage and processing capacity, there is a strong desire to use the entire dataset to extract value in the most refined and holistic way possible. In this proposal, we focus on the support of advanced big data analytics by a new generation of distributed query engines. Here, the term big data analytics is used as an umbrella term for complex tasks that combine traditional query operations, like table joins, with operations from linear algebra, like matrix multiplication. In particular, we aim to support big data analytics from a database perspective, where a distributed query engine provides a solid supporting environment for effective computation and optimization of typical advanced analytic tasks. The overall goal of this project is to contribute to a better fundamental understanding of how complex data analytic workflows can be executed in a big data setting, where distribution and parallelization are key.
Date: 1 Jan 2019 → 31 Dec 2022
Keywords: big data, data science, distributed query processing
Disciplines: Computer theory not elsewhere classified, Other computer engineering, information technology and mathematical engineering not elsewhere classified