|Title:||Efficient big data system in the cloud : resource provisioning and scheduling|
Hong Kong Polytechnic University -- Dissertations
|Department:||Department of Computing|
|Pages:||xviii, 116 p. : ill. ; 30 cm.|
|Abstract:||Big data and big data analytics have achieved tremendous popularity in recent years. Moving beyond conventional analysis of sampled data, this new generation of systems represents a new era of data exploration and utilization across petabyte- and zettabyte-scale datasets. First proposed by Google, MapReduce has become the de facto standard framework for parallel processing in big data applications. Nevertheless, the MapReduce framework is often criticized for its inefficiency, and many studies have investigated different aspects of the framework to improve its performance. A MapReduce system consumes resources such as CPU, memory, disk, and network, and how these resources are managed determines the system's performance. In this thesis, we investigate several challenges in managing resources for MapReduce systems in the cloud. We study resource management at two levels: at the cluster level, we investigate the challenges of building clusters in the cloud; at the machine level, we investigate the challenges of scheduling the tasks of MapReduce jobs across machines. First, we focus on the resource provisioning problem of building clusters in the cloud. Running a MapReduce system requires a cluster of network-connected machines, an infrastructure requirement that prevents small companies from adopting big data techniques. The cloud offers these small and medium-sized companies an alternative, and cloud providers are glad to offer the service. However, we find that running MapReduce systems in the cloud is not straightforward: MapReduce systems are both computation-intensive and I/O-intensive, which causes severe interference unless the machines used to build the cluster are carefully chosen. We investigate this interference with detailed measurements, formulate it as a resource provisioning problem, and propose a set of novel algorithms to solve the problem.|
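The abstract does not give the provisioning algorithms themselves; as a minimal illustrative sketch of the idea, the snippet below greedily selects VMs for a cluster while discounting the effective capacity of a VM that would share a physical host with an already-chosen I/O-intensive VM. The names, the candidate list, and the simple multiplicative interference model are assumptions for illustration, not the thesis's actual formulation.

```python
# Hypothetical sketch of interference-aware cluster provisioning.
# The interference model (a fixed capacity penalty for sharing a host
# with an I/O-heavy VM) is an illustrative assumption.

def provision(candidates, demand_cores, interference_penalty=0.5):
    """Greedily pick VMs until the core demand is met, discounting a
    VM's effective capacity when it shares a host with an already
    chosen I/O-intensive VM."""
    chosen = []
    hosts_with_io = set()
    effective = 0.0
    # Prefer VMs with the most cores first.
    for vm in sorted(candidates, key=lambda v: -v["cores"]):
        if effective >= demand_cores:
            break
        cap = vm["cores"]
        # Sharing a host with an I/O-heavy VM degrades throughput.
        if vm["host"] in hosts_with_io:
            cap *= 1.0 - interference_penalty
        chosen.append(vm["name"])
        effective += cap
        if vm["io_heavy"]:
            hosts_with_io.add(vm["host"])
    return chosen, effective

candidates = [
    {"name": "vm1", "host": "h1", "cores": 8, "io_heavy": True},
    {"name": "vm2", "host": "h1", "cores": 8, "io_heavy": False},
    {"name": "vm3", "host": "h2", "cores": 4, "io_heavy": False},
]
chosen, cap = provision(candidates, demand_cores=14)
```

Here vm2's eight cores count only as four because it shares host h1 with the I/O-heavy vm1, so a third VM is needed to meet the demand.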
Second, we focus on task scheduling after a cluster is built. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation: MapReduce splits a job into map tasks and reduce tasks, which are parallelized across the cluster, but the reduce tasks must wait until all map tasks finish because they rely on all of the intermediate data the map tasks produce. To fully utilize the cluster's computation resources, multiple MapReduce jobs of different importance can be scheduled together, and scheduling their tasks efficiently is complicated. We mathematically formulate this special task scheduling problem and develop a 3-approximation algorithm. Comprehensive simulations and real experiments demonstrate the advantage of our approach. Third, we study data locality in task scheduling for real MapReduce systems. Data locality is important because data migration introduces heavy network communication. We formulate a task scheduling problem that takes data locality into account and develop an algorithm whose solution is within a constant factor of the optimal. We further develop a heuristic algorithm that achieves even better performance. We validate the advantage of our approaches with comprehensive simulations and real experiments.
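The locality-aware scheduling described above can be illustrated with a minimal sketch: when a node reports a free slot, prefer a pending map task whose input block is replicated on that node, falling back to any remaining task. The function, names, and data layout below are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch of locality-aware map-task scheduling.
# replicas maps each task to the set of nodes holding its input block.

def schedule(tasks, free_nodes, replicas):
    """Assign pending tasks to nodes as their free slots arrive,
    preferring tasks whose input data is local to the node.
    Returns a {task: node} assignment."""
    pending = list(tasks)
    assignment = {}
    for node in free_nodes:
        if not pending:
            break
        # Prefer a task with a local replica on this node.
        local = [t for t in pending if node in replicas[t]]
        task = local[0] if local else pending[0]
        pending.remove(task)
        assignment[task] = node
    return assignment

replicas = {"t1": {"n1"}, "t2": {"n2"}, "t3": {"n1", "n2"}}
plan = schedule(["t1", "t2", "t3"], ["n2", "n1", "n3"], replicas)
```

In this example node n2 takes t2 rather than t1 because t2's input is local, leaving t1 for n1; only t3, whose replicas are already taken, runs remotely on n3.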
|Rights:||All rights reserved|