Background

The CloudIX project is a research project funded by the European Commission under the 7th Framework Programme for Research (Marie Curie Actions, Intra-European Fellowships) with partial support from the Norwegian Research Council.

Objectives

The aim of the CloudIX project (Cloud-based Indexing and Query Processing) is to conduct innovative research on indexing and advanced query processing in the cloud, focusing mainly on the MapReduce programming model. CloudIX aims to develop a unifying framework that treats multidimensional data in the cloud as “first-class” citizens, by providing built-in support for storage, effective access and efficient query processing, without compromising the salient features of MapReduce. The key objective of CloudIX is to increase the performance of MapReduce jobs significantly, by providing mechanisms for selective access to data, avoidance of wasteful processing, and support for early termination during query processing. Another important objective of the project is to investigate novel, advanced query types, and to identify corresponding efficient algorithms for query processing and cost-aware optimization.

Description of Work

CloudIX focuses on efficient support for online analytical processing (OLAP) applications in the cloud. In this context, arguably the most popular framework for contemporary large-scale data analytics is MapReduce/Hadoop, mainly due to its salient features, which include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in various analytical tasks, and this has stimulated a significant body of research, including CloudIX, that aims at improving its efficiency while maintaining its desirable properties.

CloudIX has identified a list of significant limitations and shortcomings of the MapReduce framework, which form the main reason for its reduced performance in various analytical processing tasks. The research activities of CloudIX address some of these limitations and demonstrate that improved performance can be achieved by specific modifications. In particular, CloudIX researches and develops technical solutions in the following areas:
  1. Data indexing at local (within HDFS blocks) and global level (statistics about HDFS blocks of a file)
  2. Selective access to subsets of a file that are sufficient to produce the desired result, without accessing the entire contents in a brute-force manner
  3. Early termination of query processing, as soon as a user-specified condition is fulfilled
  4. Advanced query operators and efficient processing algorithms
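
To make areas 1 to 3 more concrete, the following is a minimal sketch in plain Python rather than in MapReduce, and it is not the CloudIX implementation. It assumes that a global index stores the maximum value of each block, and it answers a top-k query by reading blocks in descending order of that maximum and stopping as soon as no unread block can improve the result. The names `Block`, `max_value` and `top_k` are hypothetical and introduced only for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one block of a file (assumed non-empty)."""
    records: List[float]

    @property
    def max_value(self) -> float:
        # Assumption: a global index would store this statistic per block,
        # so it is available without scanning the block at query time.
        return max(self.records)


def top_k(blocks: List[Block], k: int) -> Tuple[List[float], int]:
    """Return the k largest values and the number of blocks actually read."""
    # Selective access: visit the most promising blocks first.
    ordered = sorted(blocks, key=lambda b: b.max_value, reverse=True)
    heap: List[float] = []  # min-heap holding the best k values seen so far
    blocks_read = 0
    for block in ordered:
        # Early termination: the k-th best value already matches or beats
        # everything this block (and every later block) can contain.
        if len(heap) == k and heap[0] >= block.max_value:
            break
        blocks_read += 1
        for value in block.records:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), blocks_read


if __name__ == "__main__":
    data = [Block([1, 2, 3]), Block([10, 9, 8]), Block([4, 5, 6]), Block([7, 7.5, 2])]
    result, blocks_read = top_k(data, k=2)
    print(result, blocks_read)
```

In this toy input, the two largest values both sit in the block whose maximum is 10, so only one of the four blocks has to be read. The stopping test is safe because blocks are visited in descending order of their maximum.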

Results

The overall result of the CloudIX project is a set of mechanisms for boosting the performance of large-scale analytical query processing in the cloud. The proposed mechanisms include efficient access methods, advanced query processing algorithms, safe termination conditions for ceasing query processing and avoiding wasteful processing, and rank-aware processing of advanced query operators.
In practice, the research activities of CloudIX have led to significant results published in high-impact venues in data management research. The main results include: a survey article in the VLDB Journal on the limitations and weaknesses of MapReduce, with an overview of existing approaches that try to address them; papers on advanced query processing in top-tier conferences (ACM SIGMOD, IEEE ICDE, SSTD); papers on distributed processing of query operators in Springer’s Distributed and Parallel Databases journal and at SSDBM; and a position paper in the Cloud Intelligence workshop describing how to achieve rank-awareness and early termination in MapReduce.

Impact

In the era of "Big Data", characterized by the unprecedented volume of data, the velocity of data generation and the variety of data structures, support for large-scale data analytics is a particularly challenging task. In this context, the need to analyze vast data corpora to extract useful information becomes more and more intense. To cope with this challenge, efficient and effective data analysis methods and tools are required to address the needs of scientific data management, social network analysis and mining, large-scale web data analysis, various internet-scale services and applications, etc. CloudIX targets such data analytics applications and provides efficient techniques that speed up large-scale data analysis. The proposed techniques have been applied in the MapReduce framework, a popular data analysis framework for batch processing with a significant user base. As a consequence, the research findings of CloudIX can be employed to reduce the processing time of large-scale data analysis, make the management of datasets of increasing size feasible, and help various types of users dealing with "Big Data" (ranging from scientists to professionals) perform their everyday analytical tasks more easily and save work hours.

For more information please contact Dr. Christos Doulkeridis, email: cdoulk at idi dot ntnu dot no

For more information about the research group and the department, please visit the respective home pages:
The Data and Information Management Group
Department of Computer and Information Science