Full metadata record
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.creator | Yuan, Yi | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/7449 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | -
dc.rights | All rights reserved | en_US
dc.title | Efficient big data system in the cloud : resource provisioning and scheduling | en_US
dcterms.abstract | Big data and big data analysis have achieved tremendous popularity recently. This new generation of systems, going beyond conventional data analysis with sampled data, represents a new era in data exploration and utilization across datasets of petabytes and zettabytes. First proposed by Google, MapReduce has become the de facto standard framework for parallel processing in big data applications. Nevertheless, the MapReduce framework is also criticized for its inefficient performance. Thus, many studies investigate different aspects of the MapReduce framework to improve its performance. A MapReduce system consumes resources such as computing, memory, disk, and network. How these resources are managed determines the performance of a MapReduce system. In this thesis, we investigate several challenges in managing resources for MapReduce systems in the cloud. We study resource management at two different levels: the cluster level and the machine level. At the cluster level, we investigate challenges in building clusters in the cloud. At the machine level, we investigate challenges in scheduling tasks of MapReduce jobs among machines. First, we focus on the resource provisioning problem for building clusters in the cloud. Since running a MapReduce system needs a cluster consisting of a set of network-connected machines, this infrastructure requirement prevents small companies from making use of big data techniques. The cloud offers these small and medium-sized companies a choice, and cloud providers are delighted to offer this service. However, we find that running MapReduce systems in the cloud is not straightforward. MapReduce systems are both computation-intensive and I/O-intensive, which causes severe interference if the machines for building the cluster are not carefully chosen. We investigate this interference with detailed measurements, formulate it as a resource provisioning problem, and propose a set of novel algorithms to solve the problem. | en_US
dcterms.abstract | Second, we focus on task scheduling after a cluster is built. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation. MapReduce divides a job into map tasks and reduce tasks, which are parallelized across the cluster. However, reduce tasks must wait until all map tasks finish, because reduce tasks rely on all the intermediate data produced by map tasks. To fully utilize the cluster's computation resources, multiple MapReduce jobs of different importance can be scheduled together. Scheduling these tasks efficiently is complicated. We mathematically formulate this special task scheduling problem and develop a 3-approximation algorithm. Comprehensive simulations and real experiments demonstrate the advantage of our approach. Third, we study data locality for task scheduling in a real MapReduce system. Data locality is very important because data migration introduces heavy network communication. We formulate a task scheduling problem that takes data locality into consideration and develop an algorithm whose solution is within a constant factor of the optimal solution. We further develop a heuristic algorithm that achieves better performance. We validate the advantage of our approaches with comprehensive simulations and real experiments. | en_US
dcterms.extent | xviii, 116 p. : ill. ; 30 cm. | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2014 | en_US
dcterms.educationalLevel | All Doctorate | en_US
dcterms.educationalLevel | Ph.D. | en_US
dcterms.LCSH | Big data. | en_US
dcterms.LCSH | Database management. | en_US
dcterms.LCSH | Cloud computing. | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US

Files in This Item:
File | Description | Size | Format
b27473090.pdf | For All Users | 1.31 MB | Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/7449