Cloud Architectures
Jinesh Varia, Technology Evangelist, Amazon Web Services ([email protected])
June 2008
Introduction
This paper illustrates the style of building applications using services available in the Internet cloud. Cloud Architectures are designs of software applications that use Internet-accessible on-demand services. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request), draw the necessary resources on-demand (like compute servers or storage), perform a specific job, then relinquish the unneeded resources and often dispose of themselves after the job is done. While in operation, the application scales up or down elastically based on resource needs.

This paper is divided into two sections. In the first section, we describe an example of an application that is currently in production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on-demand, runs a parallel computation on them using an open source distributed processing framework called Hadoop, then shuts down all the virtual servers, releasing all its resources back to the cloud, all with low programming effort and at a very reasonable cost for the caller. In the second section, we discuss some best practices for using each Amazon Web Service (Amazon S3, Amazon SQS, Amazon SimpleDB, and Amazon EC2) to build an industrial-strength scalable application.
Keywords
Amazon Web Services, Amazon S3, Amazon EC2, Amazon SimpleDB, Amazon SQS, Hadoop, MapReduce, Cloud Computing
4. Usage-based costing: Utility-style pricing allows billing the customer only for the infrastructure that has been used. The customer is not liable for the entire infrastructure that may be in place. This is a subtle difference between desktop applications and web applications. A desktop application or a traditional client-server application runs on the customer's own infrastructure (a PC or a server), whereas an application built on Cloud Architectures uses a third-party infrastructure, and the customer is billed only for the fraction of it that was used.

5. Potential for shrinking the processing time: Parallelization is one of the great ways to speed up processing. If a compute-intensive or data-intensive job that can be run in parallel takes 500 hours to process on one machine, then with Cloud Architectures it is possible to spawn and launch 500 instances and process the same job in 1 hour. Having an elastic infrastructure available gives the application the ability to exploit parallelization in a cost-effective manner, reducing the total processing time.
In this paper, we will discuss one application example in detail, code-named GrepTheWeb.
In the following sections, we zoom in to see different levels of the architecture of GrepTheWeb. Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, which is a sorted list of links gzipped (compressed using the Unix gzip utility) into a single file, is given to GrepTheWeb as input. It takes a regular expression as a second input. It then returns a filtered subset of document links, sorted and gzipped into a single file. Since the overall process is asynchronous, developers can get the status of their jobs by calling GetStatus() to see whether the execution is completed.

Running a regular expression against millions of documents is not trivial. Different factors could combine to cause the processing to take a lot of time:

- Regular expressions could be complex
- The dataset could be large, even hundreds of terabytes
- Request patterns are unknown, e.g., any number of people can access the application at any given point in time
Hence, the design goals of GrepTheWeb included scaling in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result qualities) while keeping the cost of processing down. The approach was to build an application that not only scales with demand, but does so without a heavy upfront investment and without the cost of maintaining idle machines. To get a response in a reasonable amount of time, it was important to distribute the job into multiple tasks and to perform a Distributed Grep operation that runs those tasks on multiple nodes in parallel.
[Figure 1: GrepTheWeb architecture, zoom level 1 - users submit a RegEx and call GetStatus; the GrepTheWeb application takes its input from and writes its output to Amazon S3.]

[Figure 2: GrepTheWeb architecture, zoom level 2 - a controller manages the phases (launch, monitor, shutdown) through Amazon SQS, keeps user info and job status info in Amazon SimpleDB, and gets input from and puts output to Amazon S3.]
Zooming in further, the GrepTheWeb architecture looks as shown in Figure 2 (above). It uses the following AWS components:

- Amazon S3 for retrieving input datasets and for storing the output dataset
- Amazon SQS for durably buffering requests and acting as glue between the controllers
- Amazon SimpleDB for storing intermediate status, logs, and user data about tasks
- Amazon EC2 for running a large distributed processing Hadoop cluster on-demand
- Hadoop for distributed processing, automatic parallelization, and job scheduling
Workflow
GrepTheWeb is modular. It does its processing in four phases, as shown in Figure 3. The launch phase is responsible for validating and initiating the processing of a GrepTheWeb request, instantiating Amazon EC2 instances, launching the Hadoop cluster on them, and starting all the job processes. The monitor phase is responsible for monitoring the EC2 cluster and the map and reduce tasks, and for checking for success and failure. The shutdown phase is responsible for billing and for shutting down all Hadoop processes and Amazon EC2 instances, while the cleanup phase deletes the transient Amazon SimpleDB data.
[Figure 3: The four phases of GrepTheWeb - Launch Phase, Monitor Phase, Shutdown Phase, Cleanup Phase.]
Detailed workflow for Figure 4:

1. On application start, the queues are created if they do not already exist, and all the controller threads are started. Each controller thread starts polling its respective queue for messages.
2. When a StartGrep user request is received, a launch message is enqueued in the launch queue (a sketch of this step follows the list).
3. Launch phase: The launch controller thread picks up the launch message, executes the launch task, updates the status and timestamps in the Amazon SimpleDB domain, enqueues a new message in the monitor queue, and deletes the message from the launch queue after processing.
   a. The launch task starts Amazon EC2 instances using an AMI with a JRE pre-installed, deploys the required Hadoop libraries, and starts a Hadoop job (a run of Map/Reduce tasks).
   b. Hadoop runs map tasks on the Amazon EC2 slave nodes in parallel. Each map task takes files (multithreaded in the background) from Amazon S3, runs a regular expression (passed as a queue message attribute) against each file, and writes the match results, along with a description of up to 5 matches, locally. The combine/reduce task then combines and sorts the results and consolidates the output.
   c. The final results are stored on Amazon S3 in the output bucket.
4. Monitor phase: The monitor controller thread picks up this message, validates the status/error in Amazon SimpleDB, executes the monitor task, updates the status in the Amazon SimpleDB domain, enqueues a new message in the shutdown queue and the billing queue, and deletes the message from the monitor queue after processing.
   a. The monitor task checks the Hadoop status (JobTracker success/failure) at regular intervals and updates the SimpleDB items with the status/error and the Amazon S3 output file.
5. Shutdown phase: The shutdown controller thread picks up this message from the shutdown queue, executes the shutdown task, updates the status and timestamps in the Amazon SimpleDB domain, and deletes the message from the shutdown queue after processing.
   a. The shutdown task kills the Hadoop processes, terminates the Amazon EC2 instances after getting the EC2 topology information from Amazon SimpleDB, and disposes of the infrastructure.
   b. The billing task gets the EC2 topology information, the SimpleDB box usage, and the Amazon S3 file and query input, calculates the billing, and passes it to the billing service.
6. Cleanup phase: Archives the SimpleDB data with the user info.
7. Users can execute GetStatus on the service endpoint to get the status of the overall system (all controllers and Hadoop) and download the filtered results from Amazon S3 after completion.
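As a concrete illustration of steps 2 and 3, the sketch below shows how a request handler might record the initial job status and enqueue a launch message. It is a minimal sketch assuming the AWS SDK for Java (which post-dates the original 2008 implementation); the queue name, domain name, and attribute names are hypothetical, not taken from the GrepTheWeb source.

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import java.util.Arrays;
import java.util.UUID;

public class StartGrepHandler {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

    /** Accepts a StartGrep request: records the initial status and enqueues a launch message. */
    public String startGrep(String regex, String inputFileUrl) {
        String jobId = UUID.randomUUID().toString();

        // Record the initial status and timestamp in a SimpleDB domain (hypothetical names).
        sdb.putAttributes(new PutAttributesRequest("gtw-status", jobId, Arrays.asList(
                new ReplaceableAttribute("status", "Queued", true),
                new ReplaceableAttribute("regex", regex, true),
                new ReplaceableAttribute("createdAt", Long.toString(System.currentTimeMillis()), true))));

        // Enqueue a launch message; the launch controller polls this queue (step 3).
        String launchQueueUrl = sqs.createQueue("gtw-launch-queue").getQueueUrl();
        sqs.sendMessage(launchQueueUrl, jobId + "|" + regex + "|" + inputFileUrl);

        return jobId; // callers poll GetStatus with this id for progress
    }
}
```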
[Figure 4: GrepTheWeb components - Amazon SQS launch, monitor, shutdown, and billing queues feed the corresponding controllers; the controllers record progress in a status DB on Amazon SimpleDB; the Hadoop cluster (with HDFS) gets input files from and puts output files to Amazon S3; the billing controller hands usage off to the billing service.]
Since it was difficult to know how much time each phase would take to execute (for example, the launch phase decides dynamically how many instances need to start based on the request, so its execution time is unknown), Amazon SQS helped in building asynchronous systems. Now, if the launch phase takes more time to process or the monitor phase fails, the other components of the system are not affected and the overall system is more stable and highly available.
The launch controller makes an educated guess, based on reservation logic, of how many slaves are needed to perform a particular job. The reservation logic is based on the complexity of the query (number of predicates, etc.) and the size of the input dataset (number of documents to be searched). This was also kept configurable, so that the processing time can be reduced by simply specifying a larger number of instances to launch. After launching the instances and starting the Hadoop cluster on them, Hadoop appoints a master and slaves, handles the negotiating, handshaking, and file distribution (SSH keys, certificates), and runs the grep job.
A Hadoop job typically works in three phases: a map phase transforms the input into an intermediate representation of key-value pairs, a combine phase (handled by Hadoop itself) combines and sorts by the keys, and a reduce phase recombines the intermediate representation into the final output. Developers implement two interfaces, Mapper and Reducer, while Hadoop takes care of all the distributed processing (automatic parallelization, job scheduling, job monitoring, and result aggregation). In Hadoop, there is a master process running on one node that oversees a pool of slave processes (also called workers) running on separate nodes. Hadoop splits the input into chunks. These chunks are assigned to slaves; each slave performs the map task (logic specified by the user) on each key-value pair found in its chunk, writes the results locally, and informs the master of its completed status. Hadoop combines all the results and sorts them by the keys. The master then assigns keys to the reducers. Each reducer pulls the results using an iterator, runs the reduce task (logic specified by the user), and sends the final output back to the distributed file system.
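To make the Mapper/Reducer division concrete, here is a minimal sketch of a grep-style map task using the Hadoop "mapred" API that was current when GrepTheWeb was built. The class name and configuration key are illustrative, not the actual GrepTheWeb implementation; the built-in IdentityReducer serves as the pass-through reduce step described later.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Map task: emit the document reference for every line that matches the regular expression. */
public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Pattern pattern;

    @Override
    public void configure(JobConf job) {
        // The regular expression is passed to every map task through the job configuration.
        pattern = Pattern.compile(job.get("grep.pattern"));
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        Matcher matcher = pattern.matcher(line.toString());
        if (matcher.find()) {
            // Key: the matched fragment; value: the line (document link) that contained it.
            output.collect(new Text(matcher.group()), line);
        }
    }
}

// The reduce step can be the built-in pass-through reducer:
//   conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
```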
[Figure: Hadoop jobs - combine and reduce steps for each Hadoop job (StopJob1, StopJob2).]
...to respond for some reason, the other components in the system are built so as to continue to work as if no failure is happening.

3. Implement parallelization for better use of the infrastructure and for performance. Distributing the tasks on multiple machines, multithreading your requests, and effective aggregation of results obtained in parallel are some of the techniques that help exploit the infrastructure.
4. After designing the basic functionality, ask the question "What if this fails?" Use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, fail over, and re-sync back to the last known state as if nothing had failed.
5. Don't forget the cost factor. The key to building a cost-effective application is using on-demand resources in your design. It's wasteful to pay for infrastructure that is sitting idle.
Reducer implementation: a pass-through (the built-in Identity function) that writes the results back to Amazon S3.
[Figure: Loose coupling - independent controllers (Controller A, B, C) communicate through queues (Queue A, B, C) that act as buffers between them.]
Think Parallel
In this era of tera and multi-core processors, we ought to think in terms of multi-threaded processes when programming. In GrepTheWeb, wherever possible, the processes were made thread-safe through a share-nothing philosophy and were multi-threaded to improve performance. For example, objects are fetched from Amazon S3 by multiple concurrent threads, as such access is faster than fetching objects sequentially one at a time.

If multi-threading is not sufficient, think multi-node. Until now, parallel computing across a large cluster of machines was not only expensive but also difficult to achieve. First, it was difficult to get the funding to acquire a large cluster of machines, and once acquired, it was difficult to manage and maintain it. Second, after the cluster was acquired and managed, there were technical problems: it was difficult to run massively distributed tasks on the machines and to store and access large datasets, parallelization was not easy, and job scheduling was error-prone.
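The concurrent Amazon S3 fetch described above can be sketched as follows, assuming the AWS SDK for Java (which post-dates the original implementation); the pool size, bucket, and keys are placeholders rather than GrepTheWeb's actual values.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.util.IOUtils;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Fetch many S3 objects with a pool of worker threads instead of one at a time. */
public class ParallelS3Fetcher {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    public List<byte[]> fetchAll(String bucket, List<String> keys) throws Exception {
        List<Future<byte[]>> futures = new ArrayList<>();
        for (String key : keys) {
            futures.add(pool.submit((Callable<byte[]>) () ->
                    // Each worker downloads one object; downloads proceed concurrently.
                    IOUtils.toByteArray(s3.getObject(bucket, key).getObjectContent())));
        }
        List<byte[]> results = new ArrayList<>();
        for (Future<byte[]> f : futures) {
            results.add(f.get()); // wait for and aggregate each download
        }
        return results;
    }
}
```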
For example, if the regular expression does not have many predicates, or if the input dataset has just 500 documents, it will spawn only 2 instances. However, if the input dataset is 10 million documents, it will spawn up to 100 instances.
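The reservation logic itself can be as simple as a table-driven heuristic. The sketch below only mirrors the example numbers in the text; the thresholds and per-instance capacity are assumptions, not the production values.

```java
/** Rough reservation heuristic: pick an instance count from dataset size and query complexity. */
public class ReservationLogic {

    /**
     * @param documentCount  number of documents to be searched
     * @param predicateCount rough complexity of the regular expression
     * @return how many Amazon EC2 instances to launch (capped at 100)
     */
    public static int instancesToLaunch(long documentCount, int predicateCount) {
        if (documentCount <= 500 && predicateCount <= 2) {
            return 2;                                   // tiny job: a couple of instances suffice
        }
        // Assume each instance can comfortably grep ~100,000 documents; scale with complexity.
        long estimate = (documentCount / 100_000) * Math.max(1, predicateCount);
        return (int) Math.min(100, Math.max(2, estimate)); // e.g., 10 million docs -> up to 100
    }
}
```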
On reboot, the controller dequeues the messages from its Amazon SQS queue and recovers their states from the Amazon SimpleDB domain items. If a task tracker (slave) node dies due to a hardware failure, Hadoop reschedules the task on another node automatically. This fault tolerance enables Hadoop to run on large commodity server clusters, overcoming hardware failures.
Conclusion
Instead of building your applications on fixed and rigid infrastructures, Cloud Architectures provide a new way to build applications on on-demand infrastructures. GrepTheWeb demonstrates how such applications can be built. Without any upfront investment, we were able to run a massively distributed job on multiple nodes in parallel and scale incrementally based on demand (users, size of the input dataset). With no idle time, the application infrastructure was never underutilized. In the next section, we will look at how each of the Amazon infrastructure services (Amazon EC2, Amazon S3, Amazon SimpleDB, and Amazon SQS) was used, and we will share some of the lessons learned and some of the best practices.
Good cloud architectures should be impervious to reboots and re-launches. In GrepTheWeb, by using a combination of Amazon SQS and Amazon SimpleDB, the overall controller architecture is more resilient. For instance, if the instance on which a controller thread was running dies, it can be brought up again and resume the previous state as if nothing had happened. This was accomplished by creating a pre-configured Amazon Machine Image which, when launched, dequeues all the messages from the Amazon SQS queue and recovers their states from the Amazon SimpleDB domain.
Sample GetStatus responses (abridged):

Request ID: f474b439-ee32-4af0-8e0f-a62d1f7de897, Action: StartGrep, regular expression: A(.*)zon
Status: Queued - "Your request has been queued."
Input: http://s3.amazonaws.com/com.alexa.msr.prod/msr_f474b439-ee32-4af0-8e0fa979907de897.dat.gz?Signature=CvD9iHA%3D&Expires=1204840434&AWSAccessKeyId=DDXCXCCDEEDSDFGSDDX

Request ID: f474b439-ee32-4af0-8e0f-a62d1f7de897, Action: StartGrep, completed: 2008-03-05T12:33:05
Status: Completed - "Results are now available for download from DownloadUrl"
DownloadUrl: http://s3.amazonaws.com/com.alexa.gtw.prod/gtw_f474b439-ee32-4af0-8e0fa62de897.dat.gz?Signature=CvD9iIGGjUIlkOlAeHA%3D&Expires=1204840434&AWSAccessKeyId=DDXCXCCDEEDSDFGSDDX
Use Process-oriented Messaging and Document-oriented Messaging

There are two messaging approaches that have worked effectively for us: process-oriented and document-oriented messaging. Process-oriented messaging is often defined by processes or actions. The typical approach is to delete the old message from the "from" queue and then to add a new message with new attributes to the new "to" queue. Document-oriented messaging happens when one message per user/job thread is passed through the entire system with different message attributes. This is often implemented using XML/JSON because of its extensible model. In this solution, messages can evolve, and each receiver only needs to understand those parts that are important to it. This way a single message can flow through the system, and the different
components only need to understand the parts of the message that are important to them. For GrepTheWeb, we decided to use the process-oriented approach.

Take Advantage of the Visibility Timeout Feature

Amazon SQS has a special piece of functionality that is not present in many other messaging systems: when a message is read from the queue, it is not automatically deleted; instead, it is hidden from other readers of the queue for a configurable period of time. The consumer needs to explicitly delete the message from the queue. If this has not happened within a certain period after the message was read, the consumer is considered to have failed and the message re-appears in the queue to be consumed again. This is done by setting the so-called visibility timeout when creating the queue. In GrepTheWeb, the visibility timeout is very important because certain processes (such as the shutdown controller) might fail and not respond (e.g., instances would stay up). With the visibility timeout set to a certain number of minutes, another controller thread would pick up the old message and resume the task (of shutting down).
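A minimal consumption loop that relies on the visibility timeout might look like the sketch below (AWS SDK for Java assumed; the queue URL, timeout value, and shutDownCluster helper are illustrative). If the consumer dies before calling deleteMessage, the message becomes visible again after the timeout and another controller thread picks it up.

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class ShutdownQueueWorker {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void poll(String queueUrl) {
        while (true) {
            // Read one message; it stays in the queue but is hidden from other
            // readers for the duration of the visibility timeout (here 15 minutes).
            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(1)
                    .withVisibilityTimeout(15 * 60);

            for (Message message : sqs.receiveMessage(request).getMessages()) {
                try {
                    shutDownCluster(message.getBody());           // do the work
                    // Only delete after the work succeeds; otherwise the message
                    // re-appears after the timeout and is retried.
                    sqs.deleteMessage(queueUrl, message.getReceiptHandle());
                } catch (Exception e) {
                    // Swallow and let the visibility timeout expire so another
                    // controller thread can resume the shutdown task.
                }
            }
        }
    }

    private void shutDownCluster(String jobId) { /* terminate Hadoop processes and EC2 instances */ }
}
```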
attributes of each item in the list. As you can guess, the execution time would be slow. To address this, it is highly recommended to multi-thread your GetAttributes calls and run them in parallel. The overall performance increases dramatically (up to 50 times) when the calls are run in parallel. In the GrepTheWeb application, this approach helped create more dynamic monthly activity reports.

Use Amazon SimpleDB in Conjunction With Other Services

Build frameworks, libraries, and utilities that use the functionality of two or more services together. For GrepTheWeb, we built a small framework that uses Amazon SQS and Amazon SimpleDB together to externalize the appropriate state. For example, all controllers inherit from the BaseController class. The BaseController class's main responsibility is to dequeue the message from the "from" queue, validate the statuses in a particular Amazon SimpleDB domain, execute the function, update the statuses with a new timestamp and status, and put a new message in the "to" queue. The advantage of such a setup is that in the event of a hardware failure, or when a controller instance dies, a new node can be brought up almost immediately and resume the state of operation by getting the messages back from the Amazon SQS queue and their statuses from Amazon SimpleDB upon reboot; this makes the overall system more resilient.

Although not used in this design, a common practice is to store the actual files as objects on Amazon S3 and to store all the metadata related to each object on Amazon SimpleDB. Using the Amazon S3 key of the object as the item name in Amazon SimpleDB is also a common practice.
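For the parallel GetAttributes pattern mentioned above, a sketch under the same SDK assumption (domain name, pool size, and timeout are illustrative):

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.GetAttributesRequest;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Fetch attributes for many SimpleDB items concurrently instead of one GetAttributes call at a time. */
public class ParallelGetAttributes {

    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

    public Map<String, List<Attribute>> fetch(String domain, List<String> itemNames)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        Map<String, List<Attribute>> results = new ConcurrentHashMap<>();

        for (String itemName : itemNames) {
            // Each worker issues one GetAttributes call; calls proceed concurrently.
            pool.execute(() -> results.put(itemName,
                    sdb.getAttributes(new GetAttributesRequest(domain, itemName)).getAttributes()));
        }
        pool.shutdown();                            // stop accepting work ...
        pool.awaitTermination(5, TimeUnit.MINUTES); // ... and wait for all calls to finish
        return results;
    }
}
```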
[Figure: Controller message flow - 1. The controller thread dequeues a message from Queue A (GetMessage()). 2. The controller executes its tasks (e.g., launch, monitor). 3. The controller updates the statuses in the status DB (ReplaceableAttribute() on Amazon SimpleDB). 4. The controller enqueues a new message in Queue B (PutMessage()).]
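The sketch below captures the controller loop shown in the figure: dequeue from the "from" queue, run the task, update the status item in Amazon SimpleDB, and enqueue a message in the "to" queue. It assumes the AWS SDK for Java; the class, queue, domain, and attribute names are illustrative rather than the actual GrepTheWeb code.

```java
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.util.Arrays;

/** Skeleton of a BaseController-style thread: Queue A -> task -> status DB -> Queue B. */
public abstract class BaseController implements Runnable {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();
    private final String fromQueueUrl, toQueueUrl, statusDomain;

    protected BaseController(String fromQueueUrl, String toQueueUrl, String statusDomain) {
        this.fromQueueUrl = fromQueueUrl;
        this.toQueueUrl = toQueueUrl;
        this.statusDomain = statusDomain;
    }

    /** The phase-specific work (launch, monitor, shutdown, ...); returns the new status. */
    protected abstract String executeTask(String jobId) throws Exception;

    @Override
    public void run() {
        while (true) {
            for (Message message : sqs.receiveMessage(fromQueueUrl).getMessages()) { // 1. dequeue
                String jobId = message.getBody();
                try {
                    String newStatus = executeTask(jobId);                           // 2. execute task
                    sdb.putAttributes(new PutAttributesRequest(statusDomain, jobId, Arrays.asList(
                            new ReplaceableAttribute("status", newStatus, true),     // 3. update status DB
                            new ReplaceableAttribute("updatedAt",
                                    Long.toString(System.currentTimeMillis()), true))));
                    sqs.sendMessage(toQueueUrl, jobId);                              // 4. enqueue next phase
                    sqs.deleteMessage(fromQueueUrl, message.getReceiptHandle());     // done with this message
                } catch (Exception e) {
                    // Leave the message in the queue; the visibility timeout will
                    // surface it again so the task can be retried elsewhere.
                }
            }
        }
    }
}
```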
Avoid embedding the AWS credentials in the AMI. Instead of embedding the credentials, they should be passed in as arguments using the parameterized launch feature and encrypted before being sent over the wire. The general steps are:

1. Generate a new RSA keypair (use OpenSSL tools).
2. Copy the private key onto the image before you bundle it (so it will be embedded in the final AMI).
3. Post the public key along with the image details, so users can use it.
4. When a user launches the image, they must first encrypt their AWS secret key (or private key, if you want to use SOAP) with the public key you gave them in step 3. This encrypted data should be injected via user-data at launch (i.e., the parameterized launch feature).
5. Your image can then decrypt this at boot time and use it to decrypt the data required to contact Amazon S3. Also be sure to delete this private key upon reboot before installing the SSH key (i.e., before users can log into the machine). If users won't have root access, then you don't have to delete the private key; just make sure it's not readable by users other than root.
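Step 4, on the user's side, might be sketched as follows: encrypt the secret with the published RSA public key, base64-encode it, and pass it as user-data through the parameterized launch feature. The sketch assumes the AWS SDK for Java; the AMI ID, instance type, and key handling are placeholders.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

import javax.crypto.Cipher;
import java.nio.charset.StandardCharsets;
import java.security.PublicKey;
import java.util.Base64;

public class ParameterizedLaunch {

    /** Encrypt the AWS secret key with the image publisher's public key and launch the AMI. */
    public static void launch(PublicKey imagePublicKey, String awsSecretKey) throws Exception {
        // Encrypt the secret so it never crosses the wire (or sits in user-data) in the clear.
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, imagePublicKey);
        byte[] encrypted = rsa.doFinal(awsSecretKey.getBytes(StandardCharsets.UTF_8));

        // User-data must be base64-encoded; the image decrypts it at boot with its private key.
        String userData = Base64.getEncoder().encodeToString(encrypted);

        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        ec2.runInstances(new RunInstancesRequest()
                .withImageId("ami-12345678")   // placeholder AMI with the private key bundled in
                .withInstanceType("m1.small")
                .withMinCount(1).withMaxCount(1)
                .withUserData(userData));      // parameterized launch: injected at boot time
    }
}
```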
Credits
Special thanks to Kenji Matsuoka and Tinou Bao, the core team that developed the GrepTheWeb architecture.
Further Reading
- Amazon SimpleDB White Papers
- Amazon SQS White Paper
- Hadoop Wiki
- Hadoop Website
- Distributed Grep Examples
- MapReduce Paper
- Blog: Taking Massive Distributed Computing to the Common Man - Hadoop on Amazon EC2/S3
Appendix 1: Amazon S3, Amazon SQS, Amazon SimpleDB When to Use Which?
The following will help explain which Amazon service to use when:

Amazon S3
- Ideal for: Storing large write-once, read-many types of objects
- Ideal examples: Media-like files, audio, video, large images
- Not recommended for: Querying, content distribution
- Not recommended examples: Database, file systems

Amazon SQS
- Ideal for: Small, short-lived, transient messages
- Ideal examples: Workflow jobs, XML/JSON/TXT messages
- Not recommended for: Large objects, persistent objects
- Not recommended examples: Persistent data stores

Amazon SimpleDB
- Ideal for: Querying light-weight attribute data
- Ideal examples: Querying, mapping, tagging, click-stream logs, metadata, state management
- Not recommended for: Transactional systems
- Not recommended examples: OLTP, DW cube rollups
Recommendations
Since the Amazon Web Services are primitive building-block services, the most value is derived when they are used in conjunction with other services.

Use Amazon S3 and Amazon SimpleDB together whenever you want to query Amazon S3 objects using their metadata

We recommend you store large files on Amazon S3 and the associated metadata and reference information on Amazon SimpleDB, so that developers can query the metadata. Read-only metadata can also be stored on Amazon S3 as metadata on the object (e.g., author, create date, etc.).
Amazon S3 entity -> Amazon SimpleDB entity
Bucket -> Domain (private to subscriber)
Key / S3 URI -> Item name
Metadata describing the S3 object -> Attributes of an item
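Following the mapping above, storing a file and its queryable metadata might look like this sketch (AWS SDK for Java assumed; the bucket, domain, and attribute names are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.SelectRequest;

import java.io.File;
import java.util.Arrays;

public class S3WithSimpleDBMetadata {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

    /** Store the large file in S3 and its queryable metadata in SimpleDB, keyed by the S3 key. */
    public void store(String bucket, String key, File file, String author) {
        s3.putObject(bucket, key, file);
        sdb.putAttributes(new PutAttributesRequest("documents", key, Arrays.asList(
                new ReplaceableAttribute("bucket", bucket, true),
                new ReplaceableAttribute("author", author, true),
                new ReplaceableAttribute("sizeBytes", Long.toString(file.length()), true))));
    }

    /** Query the metadata in SimpleDB to discover which S3 objects to fetch. */
    public void findByAuthor(String author) {
        sdb.select(new SelectRequest(
                "select * from `documents` where author = '" + author + "'"))
           .getItems()
           .forEach(item -> System.out.println("S3 key: " + item.getName()));
    }
}
```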
Use Amazon SimpleDB and Amazon SQS together whenever you want an application to work in phases

Store the transient messages in Amazon SQS and the statuses of jobs/messages in Amazon SimpleDB, so that you can update statuses frequently and get the status of any request at any time by simply querying the item. This works especially well in asynchronous systems.
Use Amazon S3 and Amazon SQS together whenever you want to create processing pipelines or producer-consumer solutions

Store the raw files on Amazon S3 and insert a corresponding message in an Amazon SQS queue with the reference and metadata (S3 URI, etc.).
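A producer-consumer pipeline along these lines might be sketched as follows (AWS SDK for Java assumed; the bucket/key convention, queue URL, and process helper are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.io.File;

public class S3SqsPipeline {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    /** Producer: store the raw file on S3 and enqueue a message that references it. */
    public void produce(String bucket, String key, File rawFile, String queueUrl) {
        s3.putObject(bucket, key, rawFile);
        sqs.sendMessage(queueUrl, bucket + "/" + key); // S3 reference as the message body
    }

    /** Consumer: receive a reference, download the object, process it, then delete the message. */
    public void consume(String queueUrl) {
        for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
            String[] ref = message.getBody().split("/", 2);   // bucket / key
            String payload = s3.getObjectAsString(ref[0], ref[1]);
            process(payload);
            sqs.deleteMessage(queueUrl, message.getReceiptHandle());
        }
    }

    private void process(String payload) { /* application-specific work */ }
}
```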