Background
The CloudIX project is a research project funded by the European
Commission under the 7th Framework Programme for Research (Marie Curie
Actions, Intra-European Fellowships) with partial support from the
Norwegian Research Council.
Objectives
The aim of the CloudIX project (Cloud-based Indexing and Query
Processing) is to conduct
innovative research on indexing and advanced query processing in the
cloud, focusing mainly on the
MapReduce programming model. CloudIX aims to develop a unifying
framework that treats
multidimensional data in the cloud as "first-class citizens", by
providing built-in support for storage,
effective access and efficient query processing, without compromising
the salient features of
MapReduce. The key objective of CloudIX is to increase the performance
of MapReduce jobs
significantly, by providing mechanisms for selective access to data,
avoidance of wasteful
processing, and support of early termination during query
processing. Another important objective
of the project is to investigate novel, advanced query types, and
identify corresponding efficient
algorithms for query processing and cost-aware optimization.
Description of Work
CloudIX focuses on efficient support for online analytical processing
(OLAP) applications in the
cloud. In this context, arguably the most popular framework for
contemporary large-scale data
analytics is MapReduce/Hadoop, mainly due to its salient features that
include scalability,
fault-tolerance, ease of programming, and flexibility. However,
despite its merits, MapReduce has
evident performance limitations in miscellaneous analytical tasks, and
this has stimulated a
significant body of research, including CloudIX, that aims at improving
its efficiency, while
maintaining its desirable properties.
CloudIX has identified a list of significant limitations and
shortcomings of the MapReduce
framework, which form the main reason for its reduced performance in
various analytical processing
tasks. The research activities of CloudIX address some of these
limitations and demonstrate that
improved performance can be achieved by specific modifications. In
particular, CloudIX researches
and develops technical solutions in the following areas:
- Data indexing at local (within HDFS blocks) and global level
(statistics about HDFS blocks of a
file)
- Selective access to subsets of a file that are sufficient to
produce the desired result, without
accessing the entire contents in a brute-force manner
- Early termination of query processing, as soon as a user-specified
condition is fulfilled
- Advanced query operators and efficient processing algorithms
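As an illustration of the first two areas, the following minimal sketch shows how block-level statistics can allow a range query to skip blocks that cannot contain matching records. It is not the CloudIX implementation: the `Block` class, the `range_query` function and the choice of (min, max) as the per-block statistic are assumptions made here for illustration only, and plain Python lists stand in for HDFS blocks.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one HDFS block of a file."""
    records: List[int]
    # Global-level index entry for this block: (min, max) of its values.
    stats: Tuple[int, int] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: statistics are computed once, when the block is written.
        self.stats = (min(self.records), max(self.records))


def range_query(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the values in [lo, hi] and the number of blocks actually scanned."""
    hits: List[int] = []
    scanned = 0
    for block in blocks:
        bmin, bmax = block.stats
        if bmax < lo or bmin > hi:
            # The block's value range does not overlap the query: skip it.
            continue
        scanned += 1
        hits.extend(v for v in block.records if lo <= v <= hi)
    return hits, scanned


blocks = [Block([1, 4, 7]), Block([10, 12, 19]), Block([20, 25, 31])]
result, scanned = range_query(blocks, 11, 21)
print(result, scanned)
```

In this toy input only two of the three blocks are read, because the statistics of the first block show that it cannot contribute to the result.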
Results
The overall result of the CloudIX project is a set of mechanisms for
boosting the performance of
large-scale analytical query processing in the cloud. The proposed
mechanisms include efficient
access methods, advanced query processing algorithms, safe termination
conditions for ceasing query
processing and avoiding wasteful processing, and rank-aware processing
of advanced query
operators.
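The idea of a safe termination condition can be sketched for a simple top-k query. The example below is an assumption-laden illustration, not the algorithm published by the project: it supposes that an upper bound (here, the maximum score) is known for every block, visits blocks in decreasing order of that bound, and stops as soon as the k-th best score found so far is at least as high as the bound of the next block. The function name `top_k` and the data are hypothetical, and blocks are assumed to be non-empty.

```python
import heapq
from typing import List, Tuple


def top_k(blocks: List[List[float]], k: int) -> Tuple[List[float], int]:
    """Return the k highest scores and the number of blocks scanned."""
    # Assumption: max(block) is a precomputed per-block statistic.
    ordered = sorted(blocks, key=max, reverse=True)
    heap: List[float] = []  # min-heap holding the k best scores seen so far
    scanned = 0
    for block in ordered:
        # Safe termination: no remaining block can beat the current k-th score.
        if len(heap) == k and heap[0] >= max(block):
            break
        scanned += 1
        for score in block:
            if len(heap) < k:
                heapq.heappush(heap, score)
            elif score > heap[0]:
                heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), scanned


blocks = [[0.2, 0.5, 0.1], [0.9, 0.8, 0.7], [0.3, 0.6, 0.4]]
best, scanned = top_k(blocks, 2)
print(best, scanned)
```

The termination test is safe because blocks are visited in decreasing order of their upper bound, so every unvisited block is bounded by the one that triggered the stop.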
In practice, the research activities of CloudIX have led to
significant results published in high impact
venues related to data management research. In particular, main
research results include a survey
article published in the VLDB Journal on the limitations and
weaknesses of MapReduce that
provides an overview of existing approaches that try to address its
limitations, papers on advanced
query processing published in top-tier conferences (ACM SIGMOD, IEEE
ICDE, SSTD), papers on
distributed processing of query operators published in Springer's
Distributed and Parallel Databases
Journal and SSDBM, and a position paper in the Cloud Intelligence
workshop describing how to
achieve rank-awareness and early termination in MapReduce.
Impact
In the era of "Big Data" characterized by the unprecedented
volume of data, the velocity of data
generation and the variety of the structure of data, support for
large-scale data analytics constitutes a
particularly challenging task. In this context, the requirements for
analyzing vast-sized data corpora
to extract useful information become more and more intense. To cope
with this challenge, efficient
and effective data analysis methods and tools are required, thus
addressing the needs of scientific
data management, social network analysis and mining, large-scale web
data analysis, various
internet-scale services and applications, etc. CloudIX targets such
data analytics applications and
provides efficient techniques that speed up the performance of
large-scale data analysis. The
proposed techniques have been applied in the MapReduce framework,
which is a popular data
analysis framework for batch processing, and has a significant user
base. In consequence, the
research findings of CloudIX can be employed to reduce the processing
time of large-scale data
analysis, make feasible the management of datasets of increasing size,
and assist various types of
users dealing with "Big Data" (ranging from scientists to
professionals) to perform their everyday
analytical tasks more easily and save work hours.
For more information please contact Dr. Christos Doulkeridis, email: cdoulk
at idi dot ntnu dot no
For more information about the research group and the department,
please visit the respective home pages:
The Data and Information
Management Group
Department of Computer and
Information Science