
Towards big data processing in clouds: An online cost-minimization approach

  • Received: 01 July 2015 Revised: 01 August 2015 Published: 01 January 2016
  • Due to the elastic and on-demand nature of its resource provisioning, cloud computing provides a cost-effective and powerful technology for processing big data. Under this paradigm, a Data Service Provider (DSP) may rent geographically distributed datacenters to process its large amounts of data. As the data are dynamically generated and resource pricing varies over time, moving the data from different geographic locations to different datacenters while provisioning adequate computation resources to process them is essential for the DSP to achieve cost effectiveness. In this paper, a joint online approach is proposed to address this task. We formulate the problem as a joint stochastic optimization problem, which is then decoupled into two independent subproblems via the Lyapunov framework. Our method minimizes the long-term time-average cost, including computing cost, storage cost, bandwidth cost and latency cost. Theoretical analysis shows that our online algorithm produces a solution within a provable bound of the optimal solution achieved through offline computing and guarantees that the data processing can be completed within preset delays.

    Citation: Weidong Bao, Wenhua Xiao, Haoran Ji, Chao Chen, Xiaomin Zhu, Jianhong Wu. Towards big data processing in clouds: An online cost-minimization approach[J]. Big Data and Information Analytics, 2016, 1(1): 15-29. doi: 10.3934/bdia.2016.1.15



    1. Introduction

    The cloud computing paradigm offers a convenient way for users to dynamically adjust the computing resources they rent from cloud service providers (CSPs) according to demand, in a Pay-As-You-Go (PAYG) manner. Specifically, in cloud computing, benefiting from the development of virtualization technology [3], virtual machine (VM) resources can be scaled up and down to match application demands. Compared with traditional approaches, the cloud computing paradigm eliminates users' costs of purchasing and maintaining their own infrastructures.

    The elastic and on-demand nature of resource provisioning attracts many users to deploy their applications, especially computation-intensive big data analysis, in the clouds. In the age of big data, data analysis is increasingly important for applications such as financial analysis, social networking web sites, and astronomical telescope services. For example, Facebook-like social media sites can uncover usage patterns and hidden correlations by analyzing web site history records (e.g., click records, activity records, etc.) to facilitate their marketing decisions. We refer to this kind of organization as a Data Service Provider (DSP) in this paper. Under this paradigm, a DSP must first solve two problems: 1) how to transfer the large-scale data sets from various locations into the clouds, and 2) how many resources, such as computing and storage resources, should be rented in the clouds for processing?

    Although much effort has been made to design computing models for fast big data analysis, such as MapReduce [6] and Spark [27], the problem of moving large-scale data to the clouds while provisioning adequate resources in the clouds has rarely been considered in the community. Currently, for the data moving problem, practices such as copying the data onto large-scale hard drives for physical transportation [2,15], or even moving entire machines [1] to datacenters, are adopted. These methods not only incur undesirable delays but are also insecure, given that hard drives may be damaged in transportation accidents. For the resource provisioning problem, some work has been done to cope with dynamic workloads in clouds [16,21], but these methods often consider the data moving problem and the resource provisioning problem in isolation.

    In this paper, targeting the analysis of big data from different locations with a MapReduce-like framework in the clouds, we propose an online approach that systematically addresses the data moving problem and the resource provisioning problem, with the goal of minimizing the overall cost of running big data analytics in the clouds. To achieve this goal, we first formulate the problem as a joint stochastic optimization problem and then apply the Lyapunov optimization framework. Such a stochastic formulation does not require predicting future system states and makes decisions based only on the current system state [13]. Based on the drift-plus-penalty transformation, we propose an online algorithm that moves data from multiple regions to distributed datacenters in an online manner and dynamically rents a near-optimal amount of computing and storage resources to satisfy user requirements for data analysis.

    The major contributions of this work are summarized as follows:

    ● We propose a novel framework that systematically handles data moving from multiple locations to multiple datacenters and resource renting in each datacenter in a nearly optimal manner. In particular, we consider the bandwidth cost, computing cost, storage cost and delay cost as the overall cost and guarantee that the data can be processed within a desirable delay. In our framework, VMs in the cloud have different types and are priced dynamically.

    ● We propose an algorithm to solve the joint stochastic problem using the Lyapunov optimization framework, which makes resource renting and data moving decisions online. Moreover, the algorithm admits a distributed implementation.

    ● We theoretically analyze the performance of the algorithm, demonstrating that it approximates the optimal solution within provable bounds and is capable of processing the tasks within a preset delay.

    The remainder of this paper is organized as follows: Section 2 summarizes related works; Section 3 describes the system modeling and the problem formulation; Section 4 gives the online algorithm for solving the problem; Section 5 analyzes the proposed algorithm; Section 6 concludes the paper.


    2. Related work

    Recent years have witnessed the proliferation of cloud-based services in both academia and industry. Much effort has been made to migrate applications such as cloud-based live streaming [9,21], cloud-based online games [18], cloud-based conferencing [7] and social media applications [24] into clouds. The majority of these studies focus on how to scale resources in the clouds up and down to meet user demand or how to migrate workflows into clouds.

    Few studies have been conducted on moving large-scale data into clouds. Paper [4] studied how to transfer data to the cloud provider via the Internet and courier services. Study [5] proposed a solution to minimize the transfer latency under a budget constraint. In [11], the authors studied data stream storage for real-time big data processing. Different from our study, these works deal with the data transfer problem in a static scenario in which the data amount is fixed, whereas our work considers dynamically generated data. In addition, the aforementioned studies considered a single datacenter, while our work takes into account multiple datacenters. The most relevant work is Zhang et al. [28], which proposed an online algorithm to migrate dynamically generated data from various locations to the clouds for processing. However, our work differs significantly in that we consider resource provisioning and data moving simultaneously and apply the Lyapunov framework to address the problem.

    There is also a line of research on resource provisioning in clouds, where the server pool and the capacity of each server become elastic. Studies [16,12] considered elastic server capacity supported by virtualization technologies. Work [16] proposed adaptive request allocation and service capacity scaling mechanisms mainly to cope with flash crowds. Study [23] took the VM renting cost and storage cost into account when making scheduling decisions. Different from these works, which often need certain mechanisms to predict future workloads, our work does not rely on any future information about big data tasks since the Lyapunov optimization framework is adopted. Studies on how to schedule tasks with different objectives in clouds have also been conducted. Works [29,31] proposed efficient scheduling strategies for real-time tasks with energy minimization, while studies [31,20] developed task scheduling algorithms with fault tolerance in mind. These works are typically confined to a single datacenter.

    In addition, the Lyapunov optimization technique was first proposed in [17] to address network stability problems and was then introduced into cloud computing to deal with job admission and resource allocation problems [19,10]. Yao et al. [26] extended it from a single time scale to a two-time-scale framework to achieve electricity cost reduction in geographically distributed datacenters. Recently, this approach has been used for resource management in cloud-based video services [22,25]. In our work, we utilize this approach to simultaneously address data moving from multiple locations to multiple datacenters and resource provisioning in each datacenter.

    To summarize, our work differs from existing works as follows: 1) we address the problems of data moving and resource provisioning jointly and design an online algorithm that can be implemented in a distributed manner; 2) with the Lyapunov framework, our method does not rely on predictions of the future big data processing workload, which significantly differs from the assumptions made in [21,23].


    3. Modeling and formulation


    3.1. System modeling

    We consider the system scenario presented in Fig. 1: a DSP (e.g., a global astronomical telescope organization) manages multiple geographically dispersed data locations that continuously produce large volumes of data. The DSP deploys its data analytics application in the cloud and connects the data sources to datacenters located in multiple places. All the data are moved to the datacenters and processed in the corresponding datacenter with a distributed computing model such as the MapReduce framework. In the system, the DSP observes the state of the datacenters (e.g., VM prices, datacenter load, network state) and decides the amount of data to be moved to each datacenter and the amount of resources rented from each datacenter, with cost minimization in mind. Finally, the datacenters return the analysis results to the DSP after the data have been processed and analyzed.

    Figure 1. System architecture.

    Formally, consider the set $\mathcal{D}$ of geo-distributed datacenters with size $D=|\mathcal{D}|$, indexed by $d$ ($1 \le d \le D$). A set $\mathcal{K}$ of distinct VM types (with size $K=|\mathcal{K}|$), each with a specific processing capacity $v_k$ under different configurations of CPU and memory, is provided in each datacenter. Data are dynamically generated at $R=|\mathcal{R}|$ different data locations (indexed by $r$, $1 \le r \le R$), denoted as the set $\mathcal{R}$. Data from any location can be moved to any datacenter for analytics via virtual private networks (VPNs), and the amount of data transmitted between a data generation location $r \in \mathcal{R}$ and a datacenter $d \in \mathcal{D}$ can be large as well. To be realistic, we assume that the bandwidth $B_r^d$ on the VPN link $(r,d)$ from data location $r$ to datacenter $d$ is limited and constitutes the bottleneck of the system. In addition, the data generation at each location is independent, and the prices of the resources (e.g., VMs, storage) in each datacenter vary in both the spatial and the temporal domains.

    The system operates in time slots, denoted by $t = 0, 1, \ldots, T$. In every time slot, the DSP needs to decide how much data to move from each data location $r$ to each datacenter $d$, and how many computing and storage resources to rent from each datacenter to support its data processing. Our goal is therefore to minimize the overall cost of big data analytics in clouds while guaranteeing the processing delay in the long run. For ease of reference, important notations are summarized in Table 1.

    Table 1. IMPORTANT NOTATIONS.
    $\mathcal{D}$ : set of datacenters distributed over multiple regions
    $\mathcal{R}$ : set of data locations
    $\mathcal{K}$ : set of VM types
    $a_r(t)$ : amount of data generated at region $r$ at $t$
    $A_r^{\max}$ : maximum amount of data generated at region $r$
    $\lambda_r^d(t)$ : amount of data allocated to $d$ from region $r$ at $t$
    $n_d^{k,\max}$ : maximum number of type-$k$ VMs in datacenter $d$
    $n_d^k(t)$ : number of type-$k$ VMs provisioned in datacenter $d$ at $t$
    $p_d^k(t)$ : price of a type-$k$ VM in datacenter $d$ at $t$
    $s_d$ : price of storage in datacenter $d$
    $b_r^d$ : price of bandwidth between location $r$ and datacenter $d$
    $v_k$ : data processing rate of a type-$k$ VM
    $\varepsilon_d$ : preset constant controlling the queueing delay of $H_d(t)$
    $l$ : maximum delay of data processing
    $H_d(t)$ : unprocessed data in datacenter $d$ at $t$
    $Z_d(t)$ : virtual queue associated with $H_d(t)$ to guarantee its delay

    3.2. Problem formulation

    In this subsection, we first formulate the cost incurred in the system and then define the objective of the problem mathematically.

    As aforementioned, the system runs in a time-slotted fashion and data are dynamically generated over different regions in each time slot. Let $a_r(t)$ be the amount of data generated at the $r$-th region in time slot $t$. Since the data generated at each location can be moved to any datacenter for analytics, we denote by $\lambda_r^d(t)$ the amount of data allocated to datacenter $d$ from region $r$ at $t$, and by $A_r^{\max}$ the maximum amount of data generated at location $r$. Hence, we have:

    $a_r(t) \le A_r^{\max}, \quad \forall r,\ t \in [1,T].$ (1)
    $a_r(t) = \sum_{d\in\mathcal{D}} \lambda_r^d(t), \quad \forall r,\ t \in [1,T].$ (2)

    The goal of the DSP is to minimize the overall cost incurred in the system by optimizing the amount of data allocated to each datacenter and the number of resources rented. Specifically, the following cost components are considered in this paper: bandwidth cost, latency cost, storage cost and computing cost. Each cost is defined as follows.

    (1) Usually, the bandwidth price varies across VPN links because they often belong to different Internet service providers. Let $b_r^d$ be the price of transferring 1 GB of data between data location $r \in \mathcal{R}$ and datacenter $d \in \mathcal{D}$; then the bandwidth cost at $t$ is:

    $C_b(t) = \sum_{d\in\mathcal{D}}\sum_{r\in\mathcal{R}} \lambda_r^d(t)\, b_r^d.$ (3)

    (2) Storage cost is an important factor in choosing the datacenter for data analytics, since big data applications often involve large amounts of data. Let $s_d$ represent the price of storing 1 GB of data for one time slot in datacenter $d \in \mathcal{D}$; then the storage cost at $t$ is:

    $C_s(t) = \sum_{d\in\mathcal{D}}\sum_{r\in\mathcal{R}} \lambda_r^d(t)\, s_d.$ (4)

    (3) Due to the variation of VM prices over time slots, the number of VMs rented from each datacenter has an important impact on the overall cost of the system as well as on the QoS of the big data application. Let $n_d^k(t)$ be the number of type-$k$ VMs rented from datacenter $d$ in time slot $t$, and let $p_d^k(t)$ be the type-$k$ VM price in datacenter $d$ at time slot $t$, which varies in both space and time. Then the computing cost is defined as follows:

    $C_p(t) = \sum_{d\in\mathcal{D}}\sum_{k\in\mathcal{K}} n_d^k(t)\, p_d^k(t).$ (5)

    (4) The latency incurred by uploading data to the datacenters is an important performance measure, which is to be minimized in the data moving process. Let $L_r^d$ be the latency between data location $r \in \mathcal{R}$ and datacenter $d \in \mathcal{D}$. These delays are determined by the respective geographic distances and can be obtained by a simple command such as Ping. As suggested in [28], we convert the latency into a monetary cost. Therefore, we can define the latency cost as:

    $C_l(t) = \sum_{d\in\mathcal{D}}\sum_{r\in\mathcal{R}} \alpha\, \lambda_r^d(t)\, L_r^d,$ (6)

    where α is a weight converting latency into a monetary cost.

    Based on the above cost formulation, the overall cost incurred in the system can be derived as:

    $C(t) = C_p(t) + C_s(t) + C_b(t) + C_l(t).$ (7)

    Hence, the problem of minimizing the time-average cost of data moving and processing within a long-term period [0,T] can be formulated as:

    P1: $\min \; \bar{C}$ (8)
    s.t. $\; a_r(t) \le A_r^{\max}, \quad \forall r,\ t \in [1,T]$ (9)
    $\; a_r(t) = \sum_{d\in\mathcal{D}} \lambda_r^d(t), \quad \forall r,\ t \in [1,T]$ (10)
    $\; 0 \le n_d^k(t) \le n_d^{k,\max}, \quad \forall d, k,\ t \in [1,T]$ (11)

    where $\bar{C} = \lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\{C(t)\}$. Constraint (10) ensures that, for each location, the data allocated to all datacenters in one time slot sum to the total amount of data generated at that location in that slot. Constraint (11) ensures that the number of VMs required is within the capacity that a datacenter can provide.
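    To make the cost structure concrete, the following minimal Python sketch evaluates the per-slot cost $C(t)$ of (7) for a given set of decisions. It is only an illustration of Eqs. (3)-(7): the function name, argument names and nested-list layout are our own assumptions, not part of the paper.

```python
def slot_cost(lam, n, p, s, b, L, alpha):
    """Per-slot cost C(t) = C_p(t) + C_s(t) + C_b(t) + C_l(t), Eqs. (3)-(7).

    lam[r][d] : GB moved from location r to datacenter d      (lambda_r^d(t))
    n[d][k]   : number of type-k VMs rented in datacenter d   (n_d^k(t))
    p[d][k]   : price of a type-k VM in datacenter d          (p_d^k(t))
    s[d]      : storage price per GB per slot in datacenter d (s_d)
    b[r][d]   : bandwidth price per GB on VPN link (r, d)     (b_r^d)
    L[r][d]   : latency between location r and datacenter d   (L_r^d)
    alpha     : weight converting latency into a monetary cost
    """
    R, D, K = len(lam), len(s), len(n[0])
    c_b = sum(lam[r][d] * b[r][d] for r in range(R) for d in range(D))          # Eq. (3)
    c_s = sum(lam[r][d] * s[d] for r in range(R) for d in range(D))             # Eq. (4)
    c_p = sum(n[d][k] * p[d][k] for d in range(D) for k in range(K))            # Eq. (5)
    c_l = sum(alpha * lam[r][d] * L[r][d] for r in range(R) for d in range(D))  # Eq. (6)
    return c_p + c_s + c_b + c_l                                                # Eq. (7)
```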

    From the problem formulation presented above, since the data generation process is unknown a priori, the problem is a constrained stochastic optimization problem, and our objective is to minimize the long-term average cost by optimizing the amount of data allocated to each datacenter as well as the number of VMs rented in each datacenter. To deal with this problem, a recently developed optimization technique is adopted in this paper; the details of the solution using the Lyapunov optimization framework are presented in the next section.


    4. Online algorithm design

    In this section, we exploit Lyapunov optimization theory to design our online control algorithm. An outstanding feature of this method is that it does not require future information about the workload. By greedily minimizing the drift-plus-penalty in each time slot, it can be proved to approach a time-averaged cost that is arbitrarily close to the optimum, while still maintaining system stability.

    According to the standard optimization framework in [13], we first transform problem P1 into an optimization problem of minimizing the Lyapunov drift-plus-penalty term and then design the corresponding online algorithm.


    4.1. Problem transformation

    Let $H_d(t)$ be the amount of unprocessed data in datacenter $d$ at time slot $t$. Initially, we set $H_d(0) = 0$; the evolution of the queue $H_d(t)$ can then be described as follows:

    $H_d(t+1) = \max\Big[H_d(t) - \sum_{k\in\mathcal{K}} n_d^k(t)\, v_k,\ 0\Big] + \sum_{r\in\mathcal{R}} \lambda_r^d(t).$ (12)

    The above queue update implies that the amounts of departed data and newly arrived data are $\sum_{k\in\mathcal{K}} n_d^k(t) v_k$ and $\sum_{r\in\mathcal{R}} \lambda_r^d(t)$, respectively.

    To guarantee that the worst-case queueing delay of queue $H_d(t)$, $d \in \mathcal{D}$, is bounded by the maximum workload delay $l$, we design a related virtual queue $Z_d(t)$ according to the $\varepsilon$-persistent service technique for delay bounding in [14]. The backlog of the virtual queue $Z_d(t)$ is initialized as $Z_d(0) = 0$ and then updated as follows:

    $Z_d(t+1) = \max\Big[Z_d(t) + 1_{\{H_d(t)>0\}}\Big(\varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(t)\, v_k\Big) - 1_{\{H_d(t)=0\}}\sum_{k\in\mathcal{K}} n_d^{k,\max}\, v_k,\ 0\Big],$ (13)

    where the indicator function $1_{\{H_d(t)>0\}}$ equals 1 when $H_d(t) > 0$ and 0 otherwise; similarly, $1_{\{H_d(t)=0\}}$ equals 1 when $H_d(t) = 0$ and 0 otherwise. $\varepsilon_d$ is a preset constant that can be tuned to control the queueing delay bound. It can be proved that all data can be processed with a delay of at most $l$ time slots if the algorithm bounds the lengths of $H_d(t)$ and $Z_d(t)$ over time slots. It is also proved that $l$ can be set as $l = \lceil (H_d^{\max} + Z_d^{\max})/\varepsilon_d \rceil$, where $H_d^{\max}$ and $Z_d^{\max}$ are the bounds of the queues $H_d(t)$ and $Z_d(t)$, respectively. For details, please see Theorem 5.3.
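    The two queue updates are straightforward to implement. The sketch below applies (12) and (13) for a single datacenter $d$ over one slot; the identifier names and the per-datacenter function signature are our own assumptions.

```python
def update_queues(H_d, Z_d, n_d, lam_d, v, n_max_d, eps_d):
    """One-slot update of the workload queue H_d (Eq. 12) and virtual queue Z_d (Eq. 13).

    H_d, Z_d  : backlogs at the start of slot t
    n_d[k]    : type-k VMs rented in this slot                 (n_d^k(t))
    lam_d[r]  : data moved in from location r in this slot     (lambda_r^d(t))
    v[k]      : processing rate of a type-k VM per slot        (v_k)
    n_max_d[k]: maximum type-k VMs available in datacenter d   (n_d^{k,max})
    eps_d     : constant controlling the queueing delay bound  (epsilon_d)
    """
    service = sum(nk * vk for nk, vk in zip(n_d, v))     # data that can be served this slot
    arrivals = sum(lam_d)                                # data newly moved into datacenter d
    H_next = max(H_d - service, 0.0) + arrivals          # Eq. (12)

    if H_d > 0:                                          # Eq. (13), case H_d(t) > 0
        Z_next = max(Z_d + eps_d - service, 0.0)
    else:                                                # Eq. (13), case H_d(t) = 0
        Z_next = max(Z_d - sum(nm * vk for nm, vk in zip(n_max_d, v)), 0.0)
    return H_next, Z_next
```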

    Let $\mathbf{Z}(t) = (Z_d(t))$ and $\mathbf{H}(t) = (H_d(t))$, $d \in \mathcal{D}$, denote the vectors of virtual queues and actual queues, respectively. We then use $\Theta(t) = [\mathbf{Z}(t), \mathbf{H}(t)]$ to denote the combined vector of actual and virtual queues. Following the Lyapunov framework [13], we define the Lyapunov function as follows:

    $L(\Theta(t)) = \frac{1}{2}\sum_{d\in\mathcal{D}}\big\{Z_d(t)^2 + H_d(t)^2\big\},$ (14)

    where L(Θ(t)) measures the queue backlogs in the system. The one-slot Lyapunov drift is:

    $\Delta(\Theta(t)) = \mathbb{E}\big\{L(\Theta(t+1)) - L(\Theta(t)) \,\big|\, \Theta(t)\big\}.$ (15)

    In the Lyapunov optimization framework, the drift-plus-penalty is obtained by adding the cost incurred by the system to the above Lyapunov drift, namely,

    $\Delta(\Theta(t)) + V\,\mathbb{E}\{C(t) \,|\, \Theta(t)\},$ (16)

    where $V$ is a non-negative parameter that controls the tradeoff between system stability and cost: the larger $V$ is, the smaller the cost, and vice versa. Hence, the original problem P1 can be transformed into the following problem P2:

    P2: $\min$ (16) (17)
    s.t. (9), (10), (11). (18)

    To solve problem P2, rather than directly minimizing the drift-plus-penalty expression (16), we seek to minimize an upper bound on it, which, according to [13], does not undermine the optimality or performance of the algorithm. Therefore, the key is to find an upper bound for problem P2. It can be proved that expression (16) is bounded as:

    $\Delta(\Theta(t)) + V\,\mathbb{E}\{C(t)\,|\,\Theta(t)\} \le B + \mathbb{E}\Big\{\sum_{d\in\mathcal{D}}\sum_{k\in\mathcal{K}} n_d^k(t)\big(V p_d^k(t) - H_d(t)\, v_k - Z_d(t)\, v_k\big) \,\Big|\, \Theta(t)\Big\} + \mathbb{E}\Big\{\sum_{d\in\mathcal{D}}\sum_{r\in\mathcal{R}} \lambda_r^d(t)\big(V s_d + V b_r^d + V\alpha L_r^d + H_d(t)\big) \,\Big|\, \Theta(t)\Big\},$ (19)

    where $B = \frac{1}{2}\sum_{d\in\mathcal{D}}\Big\{2\big(\sum_{k\in\mathcal{K}} n_d^{k,\max} v_k\big)^2 + (\varepsilon_d)^2 + \big(\sum_{r\in\mathcal{R}} A_r^{\max}\big)^2\Big\}$. For the detailed proof, please see Appendix A.


    4.2. Online control algorithm design

    Fortunately, a careful investigation of the R.H.S of inequality (19) reveals that the optimization problem can be equivalently decoupled into two subproblems: 1) data allocation and 2) resource provisioning. The details of solving the two subproblems are presented as follows.

    1) Data Allocation: To minimize the R.H.S of (19), by observing the relationship among variables, the part related to Data Allocation can be extracted from the R.H.S of (19) as:

    $\mathbb{E}\Big\{\sum_{d\in\mathcal{D}}\sum_{r\in\mathcal{R}} \lambda_r^d(t)\big(V s_d + V b_r^d + V\alpha L_r^d + H_d(t)\big) \,\Big|\, \Theta(t)\Big\}.$ (20)

    Furthermore, since the data generated at each location are independent, the centralized minimization can be carried out independently and in a distributed fashion. Considering the data allocation at location $r$ at time $t$, we should solve the following problem:

    $\min \; \sum_{d\in\mathcal{D}} \lambda_r^d(t)\big[V s_d + V b_r^d + V\alpha L_r^d + H_d(t)\big] \qquad \text{s.t. } (9), (10).$ (21)

    In fact, the above problem is a generalized min-weight problem in which the amount of data $\lambda_r^d(t)$ moved from location $r$ to datacenter $d$ is weighted by the queue backlog $H_d(t)$, the bandwidth price $b_r^d$, the storage price $s_d$ and the latency cost $\alpha L_r^d$. Using linear programming theory (e.g., the simplex method), we can obtain the following solution:

    $\lambda_r^d(t) = \begin{cases} a_r(t), & d = d^{*} \\ 0, & \text{otherwise}, \end{cases}$ (22)

    where $d^{*} = \arg\min_{d}\big[V s_d + V b_r^d + V\alpha L_r^d + H_d(t)\big]$. The solution shows that the data generated at location $r$ tend to be moved to the datacenter with the shortest weighted workload queue and the lowest operating prices (e.g., storage price, bandwidth price, latency cost) at the current time slot.
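    A minimal sketch of this rule, with names of our own choosing, is shown below; it evaluates the weight of each datacenter and routes all of $a_r(t)$ to the minimizer $d^{*}$ as in (22).

```python
def allocate_data(a_r, H, s, b_r, L_r, V, alpha):
    """Data allocation rule of Eq. (22) for one location r.

    a_r      : data generated at location r in this slot   (a_r(t))
    H[d]     : workload backlog of datacenter d             (H_d(t))
    s[d]     : storage price; b_r[d]: bandwidth price; L_r[d]: latency to datacenter d
    V, alpha : control parameter and latency weight
    """
    D = len(H)
    weight = [V * s[d] + V * b_r[d] + V * alpha * L_r[d] + H[d] for d in range(D)]
    d_star = min(range(D), key=lambda d: weight[d])  # datacenter with the smallest weight
    lam_r = [0.0] * D
    lam_r[d_star] = a_r                              # all data from location r go to d*
    return lam_r
```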

    2) Resource Provisioning: The remaining part of the R.H.S of (19), related to the variable $n_d^k(t)$, can be considered as the resource provisioning problem once the constant term is removed. Therefore, we can obtain the optimal VM provisioning strategy by solving the following problem:

    $\min \; \mathbb{E}\Big\{\sum_{d\in\mathcal{D}}\sum_{k\in\mathcal{K}} n_d^k(t)\big(V p_d^k(t) - H_d(t)\,v_k - Z_d(t)\,v_k\big) \,\Big|\, \Theta(t)\Big\} \qquad \text{s.t. } (11).$ (23)

    Since the resource provisioning problems in the datacenters are independent, similar to the data allocation problem, (23) can be solved in a distributed way within each datacenter. For a single datacenter $d$, the resource provisioning problem can be further rewritten as (24):

    $\min \; \mathbb{E}\Big\{\sum_{k\in\mathcal{K}} n_d^k(t)\big(V p_d^k(t) - H_d(t)\,v_k - Z_d(t)\,v_k\big) \,\Big|\, \Theta(t)\Big\} \qquad \text{s.t. } (11).$ (24)

    The optimal solution to the above linear problem is:

    $n_d^k(t) = \begin{cases} n_d^{k,\max}, & \text{if } H_d(t) + Z_d(t) > \dfrac{V p_d^k(t)}{v_k} \\ 0, & \text{if } H_d(t) + Z_d(t) \le \dfrac{V p_d^k(t)}{v_k}. \end{cases}$ (25)

    The above solution indicates that a type-$k$ VM is more likely to be rented at $t$ when its price $p_d^k(t)$ is small, and that VMs with a large capacity $v_k$ are also more likely to be rented.
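    The bang-bang structure of (25) reduces to a single threshold comparison per VM type. The sketch below, again with illustrative names, returns the VM counts for one datacenter.

```python
def provision_vms(H_d, Z_d, p_d, v, n_max_d, V):
    """VM provisioning rule of Eq. (25) for one datacenter d.

    H_d, Z_d  : current workload and virtual queue backlogs
    p_d[k]    : price of a type-k VM in this slot   (p_d^k(t))
    v[k]      : processing rate of a type-k VM      (v_k)
    n_max_d[k]: maximum type-k VMs available        (n_d^{k,max})
    V         : Lyapunov control parameter
    """
    n_d = []
    for k in range(len(v)):
        if H_d + Z_d > V * p_d[k] / v[k]:
            n_d.append(n_max_d[k])  # backlog pressure outweighs the price: rent the maximum
        else:
            n_d.append(0)           # price outweighs the backlog pressure: rent none
    return n_d
```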

    So far, the two coupled problems of data allocation and resource provisioning have been solved efficiently using the Lyapunov framework. The simple structure of the resulting strategies facilitates the online deployment of the algorithm in real-world systems. The details of the online algorithm are presented in Algorithm 1.

    Algorithm 1: Procedure of the Proposed Online Algorithm in Time Slot $t$
    Input: $H_d(t), Z_d(t), a_r(t), v_k, s_d, b_r^d, L_r^d, n_d^{k,\max}, A_r^{\max}, p_d^k(t), V, \alpha$ ($d\in\mathcal{D}, r\in\mathcal{R}, k\in\mathcal{K}$)
    Output: $n_d^k(t), \lambda_r^d(t)$ ($d\in\mathcal{D}, r\in\mathcal{R}, k\in\mathcal{K}$)
    Resource provisioning:
        foreach datacenter $d\in\mathcal{D}$ do
            obtain the VM provisioning strategy $n_d^k(t)$ by solving problem (23) using (25);
    Data allocation:
        foreach location $r\in\mathcal{R}$ do
            obtain the data allocation strategy $\lambda_r^d(t)$ by solving problem (20) using (22);
    Update the queues $H_d(t), Z_d(t)$ according to the queue dynamics (12) and (13), respectively.
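    For illustration, the following self-contained Python sketch puts the two rules and the queue updates together into one slot of Algorithm 1. All identifiers and the nested-list layout are assumptions made for readability; it is a sketch of the control logic, not the authors' implementation.

```python
def online_control_step(H, Z, a, p, s, b, L, v, n_max, eps, V, alpha):
    """One slot of Algorithm 1 (illustrative sketch).

    H[d], Z[d] : workload and virtual queue backlogs at the start of slot t
    a[r]       : data generated at location r in slot t   (a_r(t))
    p[d][k]    : VM prices observed in slot t              (p_d^k(t))
    s, b, L, v, n_max, eps, V, alpha : as in Table 1
    Returns the decisions lam[r][d], n[d][k] and the updated queues.
    """
    R, D, K = len(a), len(H), len(v)

    # Resource provisioning: threshold rule (25), independently for each datacenter.
    n = [[n_max[d][k] if H[d] + Z[d] > V * p[d][k] / v[k] else 0
          for k in range(K)] for d in range(D)]

    # Data allocation: min-weight rule (22), independently for each location.
    lam = [[0.0] * D for _ in range(R)]
    for r in range(R):
        d_star = min(range(D),
                     key=lambda d: V * s[d] + V * b[r][d] + V * alpha * L[r][d] + H[d])
        lam[r][d_star] = a[r]

    # Queue updates (12) and (13).
    H_next, Z_next = [], []
    for d in range(D):
        service = sum(n[d][k] * v[k] for k in range(K))
        arrivals = sum(lam[r][d] for r in range(R))
        H_next.append(max(H[d] - service, 0.0) + arrivals)
        if H[d] > 0:
            Z_next.append(max(Z[d] + eps[d] - service, 0.0))
        else:
            Z_next.append(max(Z[d] - sum(n_max[d][k] * v[k] for k in range(K)), 0.0))
    return lam, n, H_next, Z_next
```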

    5. Performance analysis

    Next, we theoretically analyze the performance of Algorithm 1 in terms of cost optimality, queue backlog bounds, and the worst-case delay of data processing.

    Theorem 5.1. (Cost Optimality) Suppose the data generation rates $a_r(t)$, $r\in\mathcal{R}$, are independent and identically distributed over time slots. Then, for any control parameter $V > 0$, the proposed algorithm achieves a time-average cost related to the optimal one as follows:

    $\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\{C(t)\} \le C^{*} + \frac{B}{V},$ (26)

    where $C^{*}$ is the infimum of the time-average cost over all feasible control actions, representing the theoretically optimal solution to the optimization problem, and $B$ is the constant defined in (19).

    Proof. Please see the Appendix B.

    This theorem shows that the gap between the time-average cost obtained by the proposed algorithm and the optimal cost obtained offline is $O(1/V)$. In particular, by increasing the control parameter $V$, the time-average cost can be made arbitrarily close to the optimal cost $C^{*}$.

    Theorem 5.2. (Queue Backlog Bounds) Assume that $\varepsilon_d$ satisfies $\varepsilon_d < \sum_{k\in\mathcal{K}} n_d^{k,\max} v_k$. Let $H_d^{\max}$ and $Z_d^{\max}$ be the upper bounds of the queue $H_d(t)$ and the virtual queue $Z_d(t)$, respectively. We have:

    $Z_d^{\max} = \frac{V p_d^{\max}}{v_{\min}} + \varepsilon_d,$ (27)

    and

    $H_d^{\max} = \frac{V p_d^{\max}}{v_{\min}} + \sum_{r\in\mathcal{R}} A_r^{\max},$ (28)

    where $p_d^{\max}$ is the maximum price of any VM type in datacenter $d$ over time slots, and $v_{\min}$ is the minimum capacity among all VM types.

    Proof. Please see the Appendix C.

    This theorem shows that the queue backlogs are $O(V)$, which means that, to keep the queue backlogs small, we should choose a small $V$. Noticing that decreasing $V$ causes a larger cost, as shown in (26), the cost and the queue backlog (system stability) exhibit an $[O(1/V), O(V)]$ tradeoff. In practice, given an acceptable cost, we can choose $V$ to maximize system stability, and vice versa.

    Theorem 5.3. (Worst-Case Delay) Assume that the system runs in a First-In-First-Out (FIFO) manner. Then the worst-case delay of the data processing in queue $d$ is bounded by $l$, defined below:

    $l = \big\lceil (H_d^{\max} + Z_d^{\max})/\varepsilon_d \big\rceil,$ (29)

    where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$, and $H_d^{\max}$, $Z_d^{\max}$ are defined in (28) and (27).

    Proof. Please see the Appendix D.

    This implies that the data arriving in queue $H_d$ at any time slot $t$ can be processed within $l$ time slots, demonstrating that our algorithm is able to guarantee the QoS (Quality of Service) for the DSP. In addition, given the system parameters, the QoS for the DSP can be tuned by choosing a suitable $\varepsilon_d$. Moreover, with different settings of $\varepsilon_d$ for $d\in\mathcal{D}$, we can achieve heterogeneous QoS across datacenters.
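    As a small worked example of Theorems 5.2 and 5.3, the sketch below computes $Z_d^{\max}$, $H_d^{\max}$ and the resulting delay bound $l$ from given parameters; the numbers in the usage comment are purely illustrative and not taken from the paper.

```python
import math

def worst_case_delay(V, p_max_d, v_min, eps_d, A_max):
    """Delay bound of Theorem 5.3 for one datacenter d (illustrative names).

    p_max_d : maximum VM price in datacenter d over time   (p_d^max)
    v_min   : minimum VM processing rate                    (v_min)
    eps_d   : delay-control constant                        (epsilon_d)
    A_max   : list of per-location maximum arrivals         (A_r^max)
    """
    Z_max = V * p_max_d / v_min + eps_d        # Eq. (27)
    H_max = V * p_max_d / v_min + sum(A_max)   # Eq. (28)
    return math.ceil((H_max + Z_max) / eps_d)  # Eq. (29)

# Example (illustrative numbers): a larger V tightens the cost gap of Theorem 5.1
# but lengthens this delay bound, reflecting the [O(1/V), O(V)] tradeoff.
# worst_case_delay(V=100, p_max_d=0.5, v_min=1.0, eps_d=2.0, A_max=[5.0, 8.0])
```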


    6. Conclusion

    Targeting the processing of big data from different locations in geo-distributed datacenters, we have proposed a systematic approach to data moving and resource provisioning with the goal of cost minimization. The model takes into consideration the fact that the data analysis application runs in a dynamic environment (e.g., unpredictable data generation, dynamic VM pricing). Using the Lyapunov technique, we transformed the original problem into two independent subproblems that can be solved efficiently online. Theoretical analysis demonstrates that the algorithm is able to maintain the stability of the dynamic system and to complete the data processing within a bounded number of time slots. It remains to further validate the effectiveness of the proposed algorithm via extensive experiments. Other considerations that may be incorporated into our framework include the data processing relevance between two consecutive time slots, data processing migration among datacenters, etc.


    Appendix A. Derivation of bound for the Drift-plus-penalty

    For the vector $\Theta(t) = [\mathbf{Z}(t); \mathbf{H}(t)]$, note that $(\max[x-y,\,0] + z)^2 \le x^2 + y^2 + z^2 + 2x(z-y)$; we then have:

    $Z_d(t+1)^2 - Z_d(t)^2 \le \Big\{1_{\{H_d(t)>0\}}\big(\varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big) - 1_{\{H_d(t)=0\}}\sum_{k\in\mathcal{K}} n_d^{k,\max}v_k\Big\}^2 + 2Z_d(t)\Big\{1_{\{H_d(t)>0\}}\big(\varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big) - 1_{\{H_d(t)=0\}}\sum_{k\in\mathcal{K}} n_d^{k,\max}v_k\Big\} \le (\varepsilon_d)^2 + \big(\sum_{k\in\mathcal{K}} n_d^{k,\max}v_k\big)^2 + 2Z_d(t)\big(\varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big)$ (30)
    $H_d(t+1)^2 - H_d(t)^2 \le \big(\sum_{k\in\mathcal{K}} n_d^k(t)v_k\big)^2 + \big(\sum_{r\in\mathcal{R}} \lambda_r^d(t)\big)^2 + 2H_d(t)\big(\sum_{r\in\mathcal{R}}\lambda_r^d(t) - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big)$ (31)

    Since $\lambda_r^d(t)$ and $n_d^k(t)$ are bounded by $A_r^{\max}$ and $n_d^{k,\max}$, respectively, by defining $B = \frac{1}{2}\sum_{d\in\mathcal{D}}\big\{2\big(\sum_{k\in\mathcal{K}} n_d^{k,\max} v_k\big)^2 + (\varepsilon_d)^2 + \big(\sum_{r\in\mathcal{R}} A_r^{\max}\big)^2\big\}$, we can bound the one-slot Lyapunov drift as follows:

    $\Delta(\Theta(t)) = \frac{1}{2}\sum_{d\in\mathcal{D}}\mathbb{E}\{H_d(t+1)^2 - H_d(t)^2 \,|\, \Theta(t)\} + \frac{1}{2}\sum_{d\in\mathcal{D}}\mathbb{E}\{Z_d(t+1)^2 - Z_d(t)^2 \,|\, \Theta(t)\} \le B + \sum_{d\in\mathcal{D}}\mathbb{E}\big\{H_d(t)\big(\sum_{r\in\mathcal{R}}\lambda_r^d(t) - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big) \,\big|\, \Theta(t)\big\} + \sum_{d\in\mathcal{D}}\mathbb{E}\big\{Z_d(t)\big(\varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(t)v_k\big) \,\big|\, \Theta(t)\big\}$ (32)

    Finally, by adding the penalty term $V\,\mathbb{E}\{C(t)\,|\,\Theta(t)\}$ to the above bound, we obtain the drift-plus-penalty bound in (19).


    Appendix B. Proof of Theorem 5.1

    To prove this theorem, we first give the following lemma.

    Lemma B.1. (Existence of an Optimal Randomized Stationary Policy) There exists at least one policy $\pi$ that chooses a feasible solution ($n_d^{k,\pi}(t), \lambda_r^{d,\pi}(t)$ for $d\in\mathcal{D}, r\in\mathcal{R}, k\in\mathcal{K}, t\in[1,T]$) and satisfies:

    $\mathbb{E}\{C^{\pi}(t)\} = C^{*}, \qquad \mathbb{E}\Big\{\sum_{r\in\mathcal{R}}\lambda_r^{d,\pi}(t)\Big\} \le \mathbb{E}\Big\{\sum_{k\in\mathcal{K}} n_d^{k,\pi}(t)v_k\Big\}, \qquad \varepsilon_d \le \mathbb{E}\Big\{\sum_{k\in\mathcal{K}} n_d^{k,\pi}(t)v_k\Big\},$ (33)

    where $C^{*}$ is the theoretical lower bound of the cost. This lemma can be proved using Caratheodory's theorem, as in [8].

    Based on lemma B.1, next we prove the time averaged cost bound of our algorithm in (26) as follows.

    Proof. From Lemma B.1, it can be derived that there exists a constant $\delta > 0$ such that:

    $\mathbb{E}\Big\{\sum_{r\in\mathcal{R}}\lambda_r^{d,\pi}(t)\Big\} \le \mathbb{E}\Big\{\sum_{k\in\mathcal{K}} n_d^{k,\pi}(t)v_k\Big\} - \delta,$ (34)
    $\varepsilon_d \le \mathbb{E}\Big\{\sum_{k\in\mathcal{K}} n_d^{k,\pi}(t)v_k\Big\} - \delta.$ (35)

    Therefore, recalling that our algorithm minimizes the right-hand side of inequality (19) over all feasible decisions in each time slot, and applying Lemma B.1, (34) and (35) to (19), we obtain:

    $\Delta(\Theta(t)) + V\,\mathbb{E}\{C(t)\,|\,\Theta(t)\} \le B + VC^{*} - \delta\sum_{d\in\mathcal{D}}\mathbb{E}\{H_d(t)\} - \delta\sum_{d\in\mathcal{D}}\mathbb{E}\{Z_d(t)\}.$ (36)

    Taking the expectation of (36) and using the fact that $\Delta(\Theta(t)) = \mathbb{E}\{L(\Theta(t+1)) - L(\Theta(t)) \,|\, \Theta(t)\}$, we obtain:

    $\mathbb{E}\{L(\Theta(t+1)) - L(\Theta(t))\} + V\,\mathbb{E}\{C(t)\} \le B + VC^{*} - \delta\sum_{d\in\mathcal{D}}\mathbb{E}\{H_d(t)\} - \delta\sum_{d\in\mathcal{D}}\mathbb{E}\{Z_d(t)\}.$ (37)

    Applying telescoping sums over $t = 0, \ldots, T-1$ and then dividing the result by $T$ gives:

    $\frac{\mathbb{E}\{L(\Theta(T)) - L(\Theta(0))\}}{T} + \frac{V}{T}\sum_{t=0}^{T-1}\mathbb{E}\{C(t)\} \le B + VC^{*} - \frac{\delta}{T}\sum_{t=0}^{T-1}\sum_{d\in\mathcal{D}}\mathbb{E}\{H_d(t)\} - \frac{\delta}{T}\sum_{t=0}^{T-1}\sum_{d\in\mathcal{D}}\mathbb{E}\{Z_d(t)\}.$ (38)

    Rearranging the terms and using the facts that $L(\Theta(0)) = 0$, $H_d(t) \ge 0$ and $Z_d(t) \ge 0$, we obtain:

    $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\{C(t)\} \le C^{*} + \frac{B}{V}.$ (39)

    Now (26) follows by taking the limit as $T \to \infty$.


    Appendix C. Proof of Theorem 5.2

    Proof. For $Z_d(t)$, we have $Z_d(0) = 0 < Z_d^{\max}$. For any time slot $t \in [0,T]$, if $Z_d(t) \le \frac{V p_d^{\max}}{v_{\min}}$, then $Z_d(t+1) \le \frac{V p_d^{\max}}{v_{\min}} + \varepsilon_d$, because $Z_d(t+1)$ can increase by at most $\varepsilon_d$, as shown in (13). If $Z_d(t) > \frac{V p_d^{\max}}{v_{\min}}$, then $H_d(t) + Z_d(t) > \frac{V p_d^{\max}}{v_{\min}} \ge \frac{V p_d^k(t)}{v_k}$, and the queue will decrease by $\sum_{k\in\mathcal{K}} n_d^{k,\max} v_k$ (refer to (25)). Noting that $\varepsilon_d < \sum_{k\in\mathcal{K}} n_d^{k,\max} v_k$, we have $Z_d(t+1) < Z_d(t) \le Z_d^{\max}$. The bound on the queue $Z_d(t)$ is thus proved.

    Similarly, for $H_d(t)$, we have $H_d(0) = 0 < H_d^{\max}$. For any time slot $t$, if $H_d(t) \le \frac{V p_d^{\max}}{v_{\min}}$, then $H_d(t+1) \le \frac{V p_d^{\max}}{v_{\min}} + \sum_{r\in\mathcal{R}} A_r^{\max}$, because $H_d(t+1)$ can increase by at most $\sum_{r\in\mathcal{R}} A_r^{\max}$, as shown in (12). If $H_d(t) > \frac{V p_d^{\max}}{v_{\min}}$, then $H_d(t) + Z_d(t) > \frac{V p_d^{\max}}{v_{\min}} \ge \frac{V p_d^k(t)}{v_k}$, and the queue will decrease by $\sum_{k\in\mathcal{K}} n_d^{k,\max} v_k$ (refer to (25)). Noting that $\sum_{r\in\mathcal{R}} A_r^{\max} \le \sum_{k\in\mathcal{K}} n_d^{k,\max} v_k$, the queue cannot increase in the next slot, i.e., $H_d(t+1) \le H_d(t) \le H_d^{\max}$. Hence, the bound on the queue $H_d(t)$ is proved.


    Appendix D. Proof of Theorem 5.3

    Proof. If there is a time slot $\tau \in [t+1, t+l]$ such that $H_d(\tau) = 0$, then the data arriving at time slot $t$ have been processed within $l$ time slots. Consider the case that $H_d(\tau) > 0$ for all $\tau \in [t+1, t+l]$; then the virtual queue evolves with service rate $\sum_{k\in\mathcal{K}} n_d^k(\tau) v_k$, as shown in (13). Therefore, we have:

    $Z_d(\tau+1) = \max\Big[Z_d(\tau) + \varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(\tau)v_k,\ 0\Big] \ge Z_d(\tau) + \varepsilon_d - \sum_{k\in\mathcal{K}} n_d^k(\tau)v_k.$ (40)

    Summing the above inequality over time slots $\tau \in [t+1, t+l]$, we have:

    $Z_d(t+l+1) \ge Z_d(t+1) + l\varepsilon_d - \sum_{\tau=t+1}^{t+l}\sum_{k\in\mathcal{K}} n_d^k(\tau)v_k.$ (41)

    Hence, it can be derived that $\sum_{\tau=t+1}^{t+l}\sum_{k\in\mathcal{K}} n_d^k(\tau)v_k \ge l\varepsilon_d - Z_d^{\max}$. Noting that $l\varepsilon_d - Z_d^{\max} \ge H_d^{\max}$ when $l = \lceil (H_d^{\max} + Z_d^{\max})/\varepsilon_d \rceil$, we have:

    $\sum_{\tau=t+1}^{t+l}\sum_{k\in\mathcal{K}} n_d^k(\tau)v_k \ge H_d^{\max} \ge H_d(t).$ (42)

    Since the system runs in a First-In-First-Out manner, the data arriving at time slot $t$ are processed before the data that arrive after time slot $t$. As the total amount of data processed within $l$ time slots is no less than $H_d(t)$, all the data arriving at time slot $t$ will have been served within $l$ time slots; thus the worst-case delay is $l$ time slots.


    [1] Moving an elephant: Large scale Hadoop data migration at Facebook, http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920.
    [2] AWS Import/Export, http://aws.amazon.com/importexport/.
    [3] P. Barham, B. Dragovic and K. Fraser, Xen and the art of virtualization, SIGOPS Operating Systems Review, 37 (2003), 164-177.
    [4] B. Cho and I. Gupta, New algorithms for planning bulk transfer via internet and shipping networks, in Proc. IEEE ICDCS, (2010), 305-314.
    [5] B. Cho and I. Gupta, Budget-constrained bulk data transfer via internet and shipping networks, in Proc. ACM ICAC, (2011), 71-80.
    [6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113.
    [7] Y. Feng, B. Li and B. Li, Airlift: Video conferencing as a cloud service using interdatacenter networks, in Proceedings of the IEEE International Conference on Network Protocols (ICNP'12), (2012), 1-11.
    [8] L. Georgiadis, M. J. Neely and L. Tassiulas, Resource allocation and cross-layer control in wireless networks, Foundations and Trends in Networking, 1 (2006), 1-144.
    [9] Z. Huang, C. Mei, L. Li and T. Woo, CloudStream: Delivering high-quality streaming videos through a cloud-based SVC proxy, in Proceedings of the IEEE INFOCOM, (2011), 201-205.
    [10] F. Liu, Z. Zhou, H. Jin, B. Li, B. Li and H. Jiang, On arbitrating the power-performance tradeoff in SaaS clouds, IEEE Transactions on Parallel and Distributed Systems, 25 (2014), 2648-2658.
    [11] X. Mo and H. Wang, Asynchronous index strategy for high performance real-time big data stream storage, in Network Infrastructure and Digital Content (IC-NIDC), (2012), 232-236.
    [12] X. Nan, Y. He and L. Guan, Optimal resource allocation for multimedia cloud based on queuing model, in Proc. of IEEE MMSP Workshop, (2011), 1-6.
    [13] M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems, Morgan and Claypool, 2010.
    [14] M. J. Neely, Opportunistic scheduling with worst case delay guarantees in single and multi-hop networks, in Proc. of INFOCOM, (2011), 1728-1736.
    [15] E. E. Schadt, M. D. Linderman, J. Sorenson, L. Lee and G. P. Nolan, Computational solutions to large-scale data management and analysis, Nature Reviews Genetics, 11 (2010), 647-657.
    [16] J. Tang, W. P. Tay and Y. Wen, Dynamic request redirection and elastic service scaling in cloud-centric media networks, IEEE Transactions on Multimedia, 16 (2014), 1434-1445.
    [17] L. Tassiulas and A. Ephremides, Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks, IEEE Transactions on Automatic Control, 37 (1992), 1936-1948.
    [18] Cloud Union, Homepage, http://www.cloudunion.cn/.
    [19] R. Urgaonkar, U. Kozat, K. Igarashi and M. J. Neely, Resource allocation and power management in virtualized data centers, in Proceedings of the IEEE Network Operations and Management Symposium (NOMS'10), (2010), 479-486.
    [20] J. Wang, W. Bao, X. Zhu, L. T. Yang and Y. Xiang, FESTAL: Fault-tolerant elastic scheduling algorithm for real-time tasks in virtualized clouds, IEEE Transactions on Computers, 64 (2014), 2445-2558.
    [21] F. Wang, J. Liu and M. Chen, CALMS: Cloud-assisted live media streaming for globalized demands with time/region diversities, in Proceedings of the IEEE INFOCOM, (2012), 199-207.
    [22] D. Wu, Z. Xue and J. He, iCloudAccess: Cost-effective streaming of videogames from the cloud with low latency, IEEE Transactions on Circuits and Systems for Video Technology, 28 (2014), 1405-1416.
    [23] Y. Wu, C. Wu, B. Li, X. Qiu and F. C. M. Lau, CloudMedia: When cloud on demand meets video on demand, in Proc. of IEEE ICDCS, (2011), 268-277.
    [24] Y. Wu, C. Wu, B. Li, L. Zhang, Z. Li and F. Lau, Scaling social media applications into geo-distributed clouds, in Proc. IEEE INFOCOM, (2012), 684-692.
    [25] W. Xiao, W. Bao, X. Zhu, C. Wang, L. Chen and L. T. Yang, Dynamic request redirection and resource provisioning for cloud-based video services under heterogeneous environment, IEEE Transactions on Parallel and Distributed Systems, (2015).
    [26] Y. Yao, L. Huang, A. B. Sharma, L. Golubchik and M. J. Neely, Power cost reduction in distributed data centers: A two-time-scale approach for delay tolerant workloads, IEEE Transactions on Parallel and Distributed Systems, 25 (2014), 200-211.
    [27] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, Spark: Cluster computing with working sets, in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10), Berkeley, CA, USA, (2010).
  • © 2016 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
