Investigation of Amazon and Google for Fault Tolerance Strategies in Cloud Computing Services

Cloud computing has recently become an attractive topic because it offers information technology solutions through virtual machines as on-demand services, allowing resources to be shared and consumed over the Internet. With the rapid development of such services, fault tolerance in the cloud has become a major concern, since reliability, availability, and dependability are critical to this new service type. This work investigates techniques and means of keeping cloud services, as well as the systems and enterprises that customers run over the cloud, safe from failures. Failures in cloud-enabled services should be expected to occur, and hence they should be handled. Implementing fault tolerance strategies helps guarantee business continuity, avoid financial loss, recover systems from failures, and provide disaster recovery. The specific focus is to explore scenarios of avoiding or recovering from failures through redundancy, checkpointing, and replication. The commercial IaaS providers Amazon (AWS) and Google (GCE) are taken as examples of how providers protect their infrastructure from failures; in this way, a robust architecture with fault tolerance properties can be built for a system or enterprise. Finally, general conceptual steps with fault tolerance considerations are proposed.


Introduction
Nowadays, cloud computing is a popular paradigm. It can be identified as a business model that offers online computing resources. It aims to deliver large amounts of IT infrastructure (servers, storage, applications, networks, and services) on demand for multiple users to share and consume as a utility, in a virtualized manner and highly abstracted by cloud service providers (e.g., AWS, Azure, GCP). Virtualization is considered the backbone of cloud computing (Kepes, 2011); cloud technology could not exist without it. The concept of virtualizing a computer system's resources is based on adding a layer between the hardware and the operating system that allows diverse operating system instances to work simultaneously on a single server. Physical resources, including processors, memory, storage, network, and I/O devices, are dynamically partitioned and shared (Carlin & Curran, 2012).
Fault tolerance is a key issue in cloud computing. It is the means to guarantee the availability and reliability (dependability) of critical cloud services, and to keep applications executing and fulfilling their functions correctly even in the presence of a failure affecting system resources (e.g., hardware, software, network, overflow, timeout, power, or database loss). Fault tolerance techniques are employed at the procurement or development level of the system; fault tolerance is thus a survival attribute of cloud computing systems, needed to satisfy the quality of service (QoS) requirements set out in the service level agreement (SLA) (Ganga, Karthik, & Paul, 2012; Latchoumy & Khader, 2011; Pullum, 2001).
The remainder of this paper is organized as follows. The next section surveys the relation between fault tolerance and dependability in a distributed system, and addresses fault tolerance approaches, methods, and techniques in the cloud. The following section explores fault tolerance within commercial infrastructure cloud computing providers. After that, we propose a general conceptual model with fault tolerance considerations. The next section addresses related work. Finally, the conclusion of the paper and our future work are presented in the last section.

Background
Fault tolerance aims to achieve dependability in a system; system dependability is therefore an essential objective of fault tolerance. Generally speaking, dependability is the justified trustworthiness of a system to deliver services to its customers (Dubrova, 2013). Figure 1 shows the tree of the major dependability characteristics.

Dependability attributes
The dependability tree illustrates its attributes; the two primary attributes are reliability and availability:
• Reliability: the continuity of delivering correct services without disruption, such as loss of data or a code reset during execution. It is a function of time and is related to the mean time to failure (MTTF) and the mean time between failures (MTBF). The mean time to repair (MTTR) is the difference between MTBF and MTTF, so that MTBF = MTTF + MTTR (Latchoumy & Khader, 2011).
• Availability: the immediate readiness of the system to perform its services or tasks when asked to do so. Availability = MTTF / (MTTF + MTTR) (Selic, 2006). Availability in the cloud can be attained by redundancy, through replicating services or data and spreading them across various resources (Patel, Taghavi, Bakhtiyari, & Junior, 2013). System availability is measured by downtime per year. Table 1 shows standard values of availability and their corresponding downtime (Dubrova, 2013).
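The two relations above, MTBF = MTTF + MTTR and Availability = MTTF / (MTTF + MTTR), can be illustrated with a minimal Python sketch. The function names and the sample figures (an MTTF of 999 hours and an MTTR of 1 hour) are assumptions chosen for illustration only; they are not values taken from the cited sources. A year is approximated as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored in this sketch


def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected yearly downtime implied by an availability value in [0, 1]."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails on average after 999 h, repaired in 1 h.
    a = availability(mttf_hours=999.0, mttr_hours=1.0)
    print(f"MTBF          = {mtbf(999.0, 1.0):.0f} h")
    print(f"availability  = {a:.3%}")
    print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")
```

The same downtime function can be applied to any availability target to obtain the corresponding yearly downtime.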

Dependability threats
On one hand, a fault is a software or hardware defect, shortcoming, or flaw that leads to an incorrect or inaccurate value at the computational level, which is called an error. On the other hand, a failure is a system problem caused by the error. In other words, faults are the sources of errors, and errors are the sources of failures, so a fault is the root of a failure, as defined in (Dave & Raghuvanshi, 2012).

Dependability means
Four techniques are used to achieve dependability. On one hand, two means are deployed during software configuration or construction with the goal of providing trusted services to customers (Avizienis, Laprie, Randell, & Landwehr, 2004):
• Fault tolerance, which is the subject of this work, aims to avoid application or service failures when a fault happens. Early examples include (Laprie, 1995).
• Fault avoidance/prevention, which aims to prevent faults from occurring or being introduced, or to reduce them as far as possible (Lussier et al., 2005).
On the other hand, two techniques are deployed after software development to build confidence in the system, as presented in (Pullum, 2001):
• Fault removal, which detects and eliminates existing faults during the development and operational life of the system.
• Fault forecasting, which evaluates, estimates, or ranks the system behavior in the presence or activation of faults.
Fault tolerance solutions are based on redundancy. Redundancy is one of the key principles for supporting availability and reducing the risk of single points of failure (VMware, 2009). A variety of resource redundancy techniques and structures exist for fault tolerance mechanisms, which can be grouped into four essential categories (H. Li, Shang, Dang, & Jin, 2009):
• Spatial redundancy: Extra copies of the hardware computing resources are added to detect and overcome the impact of a component failure. Static, dynamic, or hybrid hardware redundancy can be used, depending on the complexity of the system structure (Runge, 2012).
• Temporal redundancy: Also called time redundancy; additional time is used to re-execute the same task several times on the same resources in the event of a failure. Time redundancy has the advantage of detecting faults effectively at a low hardware cost (H. Li et al., 2009).
• Information redundancy: Additional information, called check bits, is added to the original data to help protect against soft errors. This form of redundancy needs hardware redundancy to process the check bits.
• Software redundancy: Extra software is added to overcome software failures.
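Two of the categories above can be illustrated with a minimal Python sketch, under stated assumptions. Temporal redundancy is shown as re-executing a task on the same resource after a transient fault, and information redundancy is shown as a single even-parity check bit. The helper names (run_with_time_redundancy, make_flaky_task, add_parity, parity_ok) are hypothetical, and a single parity bit is only the simplest possible check-bit scheme; it is not the scheme used by any particular provider.

```python
def run_with_time_redundancy(task, max_attempts=3):
    """Temporal redundancy: re-execute `task` until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # treat any exception as a (possibly transient) fault
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success):
    """Build a hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("transient fault")
        return "ok"

    return task


def add_parity(bits):
    """Information redundancy: append one even-parity check bit to a list of 0/1 values."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A word with an even-parity bit is consistent when its bits sum to an even number."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    print(run_with_time_redundancy(make_flaky_task(2)))  # succeeds on the third attempt
    word = add_parity([1, 0, 1, 1])
    print(word, parity_ok(word))
    word[0] ^= 1  # simulate a single-bit soft error
    print(word, parity_ok(word))
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them; stronger codes trade more check bits for that ability.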
Interestingly, managing and recovering from failures in the cloud differs from what happens in traditional datacenters. Redundancy in the cloud is managed and formulated in software instead of in hardware components. For example, Amazon EC2 offers a wide variety of options, such as redirecting a failed virtual machine to another virtual machine image (VMI), replication, or live migration, and the configuration of the recovered virtual machine is the same as that of the original (Babcock, 2010).
In most current clouds, checkpointing and replication are the two common fault tolerance redundancy strategies used in case of a system outage. Researchers have investigated various types of checkpointing strategies, but three of them are popular:
Full checkpointing: This is the traditional mechanism, which periodically saves the total state of the application or the system to a storage platform. The drawbacks of this mechanism are the time consumed in taking a snapshot of the whole system and the large amount of storage consumed in saving the whole running state of the system (Gokuldev & Valarmathi, 2013).

Table 1. Standard availability values
Incremental checkpointing: Typically, the first checkpoint is full, while the subsequent checkpoints save only the pages that have been modified. This procedure produces a large recovery overhead because the system must recover from the initial checkpoint (Garg & Singh, 2011).
Hybrid checkpointing: This is a combination of the full and incremental strategies. The aim is to achieve a balance between the checkpointing overhead and the fault recovery overhead (Sun, Chang, Miao, & Wang, 2013).
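The three strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mechanism of any specific system: the application state is modeled as a dictionary of named pages, pages are never deleted, and the hypothetical CheckpointStore class takes a full checkpoint every full_every checkpoints and an incremental one otherwise. Setting full_every=1 gives pure full checkpointing, while a very large value approximates pure incremental checkpointing.

```python
import copy


class CheckpointStore:
    """Hybrid checkpointing over a dict of pages (illustrative model only)."""

    def __init__(self, full_every=4):
        self.full_every = full_every
        self.checkpoints = []   # list of (kind, payload) in the order taken
        self._last_state = {}   # state as of the previous checkpoint

    def checkpoint(self, state):
        """Save a full checkpoint every `full_every` calls, otherwise only modified pages."""
        if len(self.checkpoints) % self.full_every == 0:
            kind, payload = "full", copy.deepcopy(state)
        else:
            kind = "incr"
            payload = {
                page: copy.deepcopy(value)
                for page, value in state.items()
                if page not in self._last_state or self._last_state[page] != value
            }
        self.checkpoints.append((kind, payload))
        self._last_state = copy.deepcopy(state)
        return kind

    def recover(self):
        """Restore the latest full checkpoint, then replay the later increments in order."""
        full_indexes = [i for i, (kind, _) in enumerate(self.checkpoints) if kind == "full"]
        if not full_indexes:
            raise RuntimeError("no checkpoint to recover from")
        start = full_indexes[-1]
        state = copy.deepcopy(self.checkpoints[start][1])
        for _, delta in self.checkpoints[start + 1:]:
            state.update(copy.deepcopy(delta))
        return state


if __name__ == "__main__":
    store = CheckpointStore(full_every=3)
    state = {"p0": 1, "p1": 2}
    store.checkpoint(state)   # full
    state["p1"] = 20
    store.checkpoint(state)   # incremental: only p1 is saved
    print(store.checkpoints)
    print(store.recover())
```

The trade-off described in the text is visible in the sketch: a larger full_every stores less per checkpoint, but recover() has more increments to replay.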

Replication
The availability of replicated resources is a key requirement for building fault-tolerant systems in the cloud. Simply put, replication means that several copies of an application with the same input set are executed simultaneously on alternative sites (Ghoreyshi, 2013). For example, a proxy server and web browser caching can be considered forms of replication. The essential goal of replication is to guarantee that at least one replica can complete the task correctly in case the others fail (Latchoumy & Khader, 2011). Several replication mechanisms, such as active, passive, and semi-active replication, have been used in cloud computing.
Hadoop, HAProxy (high availability proxy), and Amazon EC2 are common tools used to manage replicas at different locations (Ganga et al., 2012). Moreover, four important questions should be addressed to achieve efficient replication: "Which data should be replicated?", "When to replicate in cloud systems?", "How many suitable new replicas should be created in the cloud?", and "Where should the new replica be placed?" (Sun et al., 2013).
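The goal stated above, that at least one replica completes the task when others fail, can be sketched as a simple form of active replication. In this minimal Python illustration the replicas are plain callables run in threads on one machine; in a real deployment they would be services at different sites. The function name run_replicated and the sample replicas are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_replicated(replicas, request):
    """Send the same request to every replica and return the first successful result."""
    if not replicas:
        raise ValueError("at least one replica is required")
    errors = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed replica is recorded and skipped
                errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def healthy_replica(x):
    """Hypothetical replica that completes the task."""
    return x * 2


def broken_replica(x):
    """Hypothetical replica that is unreachable."""
    raise ConnectionError("replica down")


if __name__ == "__main__":
    print(run_replicated([broken_replica, healthy_replica, broken_replica], 21))
```

Returning the first successful answer favors latency; a design that must also mask wrong answers, not only crashes, would compare the outputs of several replicas instead.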

Fault tolerance collective studies for commercial cloud IaaS providers
To understand how fault tolerance currently operates in cloud computing services, Figure 2 and Figure 3 explore the infrastructure as a service of Amazon Web Services (AWS) and Google Compute Engine (GCE). Both services provide many tools and features for designing fault-tolerant (FT), disaster recovery (DR), and highly available (HA) systems. In addition, some infrastructure building blocks include fault tolerance solutions by default.
Amazon Web Services started in 2006. At the core of Amazon's virtual computing resources is the EC2 instance, a virtual machine server. Amazon EC2 is built over eleven geographical locations known as regions. In December 2013, Google Cloud launched Google Compute Engine (GCE) as a new entry to the IaaS market. Google's datacenters are distributed across three regions worldwide. Both providers overcome failures within a region by splitting each region into two or more zones, which are geographically isolated from each other within the same region but connected by low-latency network links.
In AWS, the first step in EC2 is to launch a VM instance by creating an Amazon Machine Image (AMI), a master template that helps to define instances (web server, application server, etc.). Multiple instances can be launched from one AMI. Moreover, instances created from an AMI can be scaled up or down (Amazon Web Services, 2016). In GCE, different virtual machine servers can be launched, and all resources within a region can be accessed by the zones within it through a static IP address that is provided by default; GCE also promises to launch VMIs with an instance management property (Ferraioli, 2014). For a critical application or system, it is safer to keep a spare instance running in a different AZ; if the hot instance fails, the spare takes over simply by redirecting users' requests to it. Moreover, VM instances are not replicated automatically within the same region or across regions; replication is the customer's responsibility and requires the customer's intervention.
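The spare-instance scenario described above can be sketched as follows. This is a provider-neutral Python illustration, not AWS or GCE code: the Instance and FailoverRouter classes, the instance names, and the zone labels are all hypothetical, and a failure is simulated by a flag rather than detected over the network.

```python
class Instance:
    """Hypothetical VM instance running in a given zone."""

    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}@{self.zone} served {request}"


class FailoverRouter:
    """Send requests to the active instance; switch to the running spare on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def route(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Redirect users' requests to the spare kept running in another zone.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


if __name__ == "__main__":
    primary = Instance("web-1", "zone-a")
    standby = Instance("web-2", "zone-b")
    router = FailoverRouter(primary, standby)
    print(router.route("GET /"))
    primary.healthy = False  # simulate a failure of the hot instance
    print(router.route("GET /"))
```

Because the spare is already running, the switch costs one failed attempt; the price is paying for an instance that is idle most of the time.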
An instance's local storage is not persistent, so another infrastructure storage component is required, such as Amazon Elastic Block Store (EBS) in AWS and Persistent Disk (PD) in GCE. As a result, the root device volume of an instance is an Amazon EBS or PD volume. Also, an instance's hot data should be stored in Amazon EBS, Amazon S3, or GCE PD; when the instance dies, the customer can replace it by attaching the volume to a new instance (Varia & Mathew, 2014).
Checkpointing (the "snapshot" strategy) is a redundancy feature of Amazon EBS and GCE that reduces the possibility of failure by automatically taking an image of the instance volume.
Replication is implemented in Amazon EBS: EBS volume data can easily be replicated across multiple servers in a region's AZs. Amazon EBS snapshots are automatically stored in Amazon S3. A GCE snapshot, in contrast, is a global resource, which means that a PD snapshot can be applied to a new PD volume in the same zone, a different zone, or a different region. Many VM instances can be connected to one PD volume in read-only mode (Amazon Web Services, 2016; Google, 2016a).

Fig. 2. Amazon Web Service (AWS) Infrastructure Building Block
Amazon Relational Database Service (RDS) is an infrastructure component that provides a fully featured virtual database server in the cloud, with online access to and use of customers' databases (such as MySQL, Oracle, SQL Server, or PostgreSQL). Amazon RDS automatically takes snapshots of the data, which are stored on Amazon S3 for more protection and availability. A customer's database can be replicated across multiple Availability Zones (AZs) within the same region. There is no direct service to replicate to another region. Moreover, Amazon SimpleDB is a non-relational database service that creates, stores, and queries varied sets of data, with automatic administration of infrastructure, hardware, software, etc. Geo-replication of data among multiple AZs within the same region is done automatically (Baron & Kotecha, 2013). Google's database services, in contrast, are deployed over Google Cloud Platform, and Google developers are working on launching database services within their IaaS.

[Figure panels: EC2 Elastic Compute Cloud (regions and AZs; 99.95% uptime). GCE Instance (regions and zones; advanced routing; public IP).]
Comparing these features shows which provider covers a customer's requirements. Interestingly, Amazon AWS and Google Compute Engine are similar in their main infrastructure-as-a-service components. Both AWS and GCE try to offer a high-quality, high-availability service, resilient and redundant infrastructure, easy access to resources, independent storage, VM snapshots, and a high level of network options, and both offer at least 99.95% monthly uptime. Note also that both use the common fault tolerance strategies of checkpointing and replication to keep customers' applications and enterprises alive. Some differences lie in the efficiency of their infrastructure resources. For example, a Google PD volume can be attached to multiple VM instances in read-only mode; GCE applies its own billing scheme, whereas AWS bills hourly; live migration of VM instances during datacenter maintenance is done automatically in GCE; and AWS datacenters are spread across eleven regions, while GCE has three.
Clearly, failures happen all the time, and cloud vendors design for them: geo-location tolerates the failure of a region, additional facilities help AZs survive, network problems can be solved by load balancing, IPs, DNS routing, and VPNs, and deploying multiple instances overcomes the failure of an instance. Therefore, the key to successfully deploying services or enterprises over cloud IaaS is how customers can design and deploy an effective architecture that mitigates faults and avoids real risk to their business. Accordingly, in the following section, general conceptual steps are proposed to help design and architect a solution with fault tolerance considerations for customers' applications over the cloud infrastructure.

Fault tolerance consideration for IaaS cloud computing application
In reality, few works concentrate on a good understanding of how to build and set up a solution strategy in the cloud computing environment. In the cloud context, designing a solution strategy is a complex decision-making process that requires strategic, well-ordered methods and considerations.

Build fault tolerance strategy within a workload definition
The workload is a key concept for understanding a service or enterprise architecture and describing exactly what will be deployed over the cloud virtual infrastructure. According to NetApp research (Villatore-Silva, 2012), a successful solution needs to address the real application's workload elements, requirements, technical aspects, and metrics, such as:
- Availability.
- Design solution costs, such as resource cost.
- Security, regulatory, and privacy demands.
- Capacity.
- Infrastructure resources, which are the essential requirement to run the enterprise in a pay-as-you-go manner.
- Financial considerations that affect the workload, such as the IT capital budget.
- Business services such as Enterprise Resource Planning (ERP), finance and accounting, Customer Relationship Management (CRM), Human Resources (HR), payroll, and project management systems.
- Data structure, such as SQL or NoSQL.
- Fault tolerance strategy setup. This characteristic depends on the type of the enterprise: whether it is a traditional solution or a new-generation solution.
In essence, the workload should be built on the idea that web application failures happen all the time. Enterprise developers should therefore think about how to choose convenient methods that fit the workload, and they should determine the workload type: a client-server (traditional) application or a new-generation application (gaming, mobile app, HPC, social app). Moreover, a cloud application has sub-workloads; for example, a web site may have a search and browse workload and a registration workload.

Build up life-cycle development model
Clearly, after establishing an application workload, its operational side should be designed over a specific period of time. According to Marks and Lozano (2010), the life cycle describes the process of planning activities in a systematic manner by establishing milestones at which key scope decisions are made: the start and end points of the system, team responsibilities, direction, and the identification of resource demands and their availability. As a result, the life cycle includes tracking the expected behavior of an application at different times and life-cycle stages such as planning, system analysis, system design, implementation, testing, and maintenance. The next step is to draw charts (weekly, annual) of time against usage, availability, or capacity, taking entities, aspects, and conditions into consideration. In this way, recognizing the life cycle helps determine the next step.
- Identify critical resources that are exposed to failure, and which components need to be replicated, snapshotted, changed or modified (software), improved, or upgraded (such as OSs).
- Trust cloud services that are provided by third parties.
- Automating everything is valuable, as it minimizes developers' mistakes, which in turn minimizes business risk and improves efficiency.
- Implement the right redundancy strategies. Several scenarios are clearly architected by Amazon Web Services and Google Compute Engine: either a full redundancy solution, which, however, increases the cost by 100%, or a (cost-effective) partial redundancy solution in another region.
- Traffic management is valuable to ensure that a request is directed to the appropriate node.

Related works
In (Das & Khilar, 2013), the authors discussed VFT, a model that addresses fault tolerance in a hypervisor virtualization architecture inside a cloud datacenter. VFT is capable of decreasing the service time while ensuring and raising system availability. The model is built using a reactive fault tolerance mechanism with replication and redundancy techniques. Its scheme involves two stages. The cloud manager (CM) stage carries out hypervisor virtualization, load balancing, and performance recording (the success rate parameter), and then detects and repairs faults through a fault handler. The second stage is the decision maker (DM), which is responsible for checking node status and task deadlines; the algorithm then makes the final decision to create a checkpoint if all checks give a correct result. The proposed model was evaluated by analyzing the success rate (SR) metric; the higher the SR, the better the performance the scheme achieves.
In another work (Gómeza, Carril, Valin, Mouriño, & Cotelo, 2014), the authors suggested a virtual cluster architecture to tolerate faults in a cloud IaaS platform. Cloud site failures can be recovered from by deploying virtual clusters at two different geographical locations. This is a significant method for high performance computing (HPC) applications in case all or part of a site fails. The architecture is suitable for executing a particular application of a customer. It is totally independent and can be deployed in one or several cloud providers. An additional technique periodically monitors application performance.
There are also works on storage architecture for the cloud reported in the literature. Examples include Magicube (Feng, Han, Gao, & Meng, 2012), a cloud storage architecture model that can reduce the space cost of redundancy, improve performance, and guarantee the reliability of the system. Unlike Amazon S3, the Hadoop Distributed File System (HDFS), and the Google GFS storage system, which keep three replicas of each file by default, Magicube keeps just one replica of a file.
In (Egwutuoha, Chen, Levy, & Selic, 2012), researchers defined a new cloud computing capability that employs fault tolerance for computation-intensive applications, which need a lot of computation and acceleration to perform tasks such as medical imaging, bioscience, or financial trading. They presented a framework for a high performance computing system that relies on a proactive FT strategy to predict faults, a reactive FT strategy that uses checkpointing to reduce the cost in execution time, live migration, and an FT protocol for communication.
Fault Tolerance as a Service (FTaaS) (Nandi, Paul, Banerjee, & Ghosh, 2013) is another service, proposed as a formal model. It is a service offered by a cloud vendor to provide fault tolerance guarantees to customers' applications. FTaaS functions as an agent at the lowest level of service in cloud computing, which is IaaS. It is provided as part of the SLA with two fault tolerance strategies: spatial redundancy, which depends on majority voting, and temporal redundancy, which is based on a checkpointing mechanism. Additionally, in (Cully et al., 2008), researchers report their work on virtual machine replication, Remus. They propose this model to provide OS high availability as a service based on a virtualization infrastructure platform. Virtualization guarantees the capability of creating a copy of a running machine as well as migrating running VMs to other hosts. Remus offers protection similar to or better than that of commercial providers' offerings by replicating snapshots of the entire running operating system to another location approximately every 25 ms.
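Majority voting, the basis of the spatial redundancy strategy mentioned above, can be sketched generically in Python. This is not the FTaaS algorithm itself, whose details are not given here; it only shows the general idea of accepting the output that a strict majority of redundant replicas agree on. The function name majority_vote is an assumption, and outputs are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or raise if none exists."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no strict majority among replica outputs")
    return value


if __name__ == "__main__":
    # Three redundant replicas; one of them returns a corrupted result.
    print(majority_vote([42, 42, 7]))
```

With 2f + 1 replicas, a vote of this kind masks up to f faulty outputs, which is why spatial redundancy is usually deployed with an odd number of replicas.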

Conclusion and future work
In conclusion, cloud computing provides many benefits to small and medium business enterprises, such as cost-effective infrastructure resources, managed infrastructure, availability, and scalability. From a technical point of view, cloud services are commercial, and they are not designed to ensure the continuity of customers' applications. Rather, providers ensure the availability of the infrastructure and components that they offer to customers. As a result, fault tolerance should be addressed so as to guarantee the continuity of customer systems over the cloud. The material in the background section and the discussion of general conceptual steps show a predefined strategy for mapping an enterprise onto a workload so as to specify its availability and failover requirements. Accordingly, a life-cycle model and milestones can be established. After this step, the nature of the availability plan must be specified and a map of expected failures established. All these key points should be considered in designing a highly available and resilient architecture with which the customer can achieve a better fault tolerance plan. The section on fault tolerance considerations discusses the key lessons learned in detail.
Future research in the direction of fault tolerance will introduce possible evaluations and analyze the cost of implementing FT within Amazon Web Services (AWS) and Google Compute Engine on-demand infrastructure.

AJIT-e: Online Academic Journal of Information Technology 2016 Fall/Güz -Cilt/Vol: 7 -Sayı/Num: 25 DOI: 10.5824/1309-1581.2016.4.001.x
Checkpoint/restart (C/R) recovery: C/R is the typical technique for tolerating failures on unreliable systems. It periodically saves a snapshot of the running application to stable storage so that the application can be restarted from the latest checkpoint image in case of a crash (Y. Li & Lan, 2011).
Features that apply within Amazon and Google IaaS: Amazon Elastic Load Balancing (ELB) distributes concurrently arriving user traffic across multiple EC2 instances. A better fault tolerance architecture in AWS can be achieved through ELB, which routes traffic across different Availability Zones within a specific region. Amazon Virtual Private Cloud (VPC), a "virtual datacenter", supports launching AWS resources in a virtual network topology that is constructed by customers, who have full control over this network (IP address ranges, subnets, route table configuration, ports, VLANs, and gateways). As a result, cross-region recovery of a customer's application is provided. Amazon Route 53 is the cloud-based DNS service, which automatically routes user requests to EC2 instances, ELB, or Amazon S3, or routes users to infrastructure outside AWS. With Geo DNS, a strong fault tolerance architecture can be configured: health-check services globally monitor resources and direct traffic only to healthy and reachable ones by returning the IP addresses of healthy resources (Amazon Web Services, 2014; Varia & Mathew, 2014). Google IaaS has launched an Advanced Networking feature. A per-region static IP address is assigned to an instance by default, and a reserved static IP feature is also available, so when a zone fails it is easy to route traffic to another zone.
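The combination of load balancing and health checks described above can be illustrated with a minimal, provider-neutral Python sketch. It does not use the ELB or Route 53 APIs: the HealthCheckedBalancer class, the is_healthy callback, and the sample IP addresses are hypothetical, and the health check is a simple set lookup rather than a real network probe.

```python
import itertools


class HealthCheckedBalancer:
    """Round-robin over backends, returning only those that pass the health check."""

    def __init__(self, backends, is_healthy):
        self.backends = list(backends)
        self.is_healthy = is_healthy  # callable: backend -> bool
        self._cursor = itertools.cycle(range(len(self.backends)))

    def pick(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self.backends)):
            backend = self.backends[next(self._cursor)]
            if self.is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    down = {"10.0.1.5"}  # hypothetical address of an instance in a failed zone
    balancer = HealthCheckedBalancer(
        ["10.0.0.5", "10.0.1.5", "10.0.2.5"],
        is_healthy=lambda backend: backend not in down,
    )
    print([balancer.pick() for _ in range(4)])
```

Traffic keeps alternating between the healthy addresses while the failed one is skipped; once it passes the health check again, it rejoins the rotation without any change to the callers.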