NoSQL Unit 3
SELVA KUMAR S
B.M.S COLLEGE OF ENGINEERING
▪ NOSQL in CLOUD
▪ Exploring ready-to-use NoSQL databases in the cloud
▪ Leveraging Google AppEngine and its scalable data store
▪ Using Amazon SimpleDB
▪ Google and Amazon have achieved:
▪ High availability
▪ Ability to concurrently service millions of users
▪ Scaling out horizontally among multiple machines
▪ Spread across multiple data centers.
▪ Success stories of large-scale web applications like those from Google and Amazon have
proven the effectiveness of:
▪ Horizontally scaled environments
▪ NoSQL solutions
▪ On-demand availability
▪ The Google App Engine (GAE) provides a sandboxed
deployment environment for applications.
▪ Applications can be written in:
▪ Python
▪ Languages that run on the Java Virtual Machine (JVM)
▪ Google provides developers with a set of rich APIs and an
SDK to build applications for the app engine.
▪ Google App Engine (GAE) is a Platform as a Service (PaaS) cloud computing
platform for developing and hosting web applications in Google-managed data
centers.
▪ Google’s Platform to build web applications on Cloud.
▪ Easy to build.
▪ Easy to maintain.
▪ Easy to scale as the traffic and storage needs grow.
▪ Automatic scaling and load balancing.
▪ Transactional data store model.
▪ Free for up to 1 GB of storage and enough CPU and bandwidth to support 5 million
page views a month; 10 applications per Google account.
▪ Lower total cost of ownership
▪ Rich set of APIs
▪ Fully featured SDK for local development
▪ Ease of Deployment
▪ Java:
• App Engine runs Java apps on a Java 7 virtual machine
(it currently supports Java 6 as well).
• Uses the Java Servlet standard for web applications:
• WAR (Web Application ARchive) directory structure
• Servlet classes
• JavaServer Pages (JSP)
• Static and data files
• Deployment descriptor (web.xml)
• Other configuration files
▪ Python:
• Uses WSGI (Web Server Gateway Interface) standard.
• Python applications can be written using:
• Webapp2 framework
• Django framework
• Any Python code that uses the CGI (Common Gateway Interface)
standard.
▪ PHP (Experimental support):
• Local development servers are available to anyone for developing
and testing local applications.
▪ Google’s Go:
• Go is Google’s open-source programming language and environment.
• Tightly coupled with Google App Engine.
• Applications can be written using App Engine’s Go SDK.
▪ App Engine Datastore:
• NoSQL schema-less, object-based data storage with a query engine and
atomic transactions.
• A data object is called an “entity”; it has a kind (~ table name) and a set of
properties (~ column names).
• Accessed via Java JDO/JPA interfaces and Python datastore interfaces.
▪ Google Cloud Storage:
• RESTful service for storing and querying data.
• Fast, scalable, and highly available solution.
• Provides multiple layers of redundancy; all data is replicated to multiple
data centers.
• Provides different levels of access control.
• HTTP-based APIs.
▪ Use App Engine when:
▪ The app engine provides a SQL-like query language called GQL.
▪ GQL queries on entities and their properties.
▪ Entities manifest as objects in the GAE Python and the Java SDK.
▪ GQL is quite similar to the object-oriented query languages used to
query, filter, and get model instances and their properties.
from google.appengine.ext import db

class Person(db.Model):
    name = db.StringProperty()
    age = db.IntegerProperty()
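As an illustration, a GQL query over a kind like the Person model above might look like this (the kind and property names here are just examples):

```
SELECT * FROM Person WHERE age >= 18 ORDER BY age DESC
```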
address_k = db.Key.from_path('Employee', 'asalieri', 'Address', 1)
address = db.get(address_k)
▪ To update an existing entity:
▪ Modify the attributes of the object
▪ Call its put() method.
▪ The object data overwrites the existing entity.
▪ The entire object is sent to Datastore with every call to
put().
employee_k = db.Key.from_path('Employee', 'asalieri')
employee = db.get(employee_k)
# ...
employee.delete()
▪ Amazon SimpleDB is a ready-to-run database alternative to the app engine
data store.
▪ Amazon SimpleDB is a web service for running queries on structured data in
real time.
▪ Amazon SimpleDB requires no schema, automatically indexes your data and
provides a simple API for storage and access.
▪ This eliminates the administrative burden of data modeling, index
maintenance, and performance tuning.
▪ This service works in close conjunction with Amazon Simple Storage Service
(Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively
providing the ability to store, process and query data sets in the cloud.
▪ Domain
▪ Attributes
▪ Item
▪ A domain is like a table.
▪ An attribute is analogous to a field or column.
▪ An item is similar to a database row.
▪ We can change the structure of a domain easily, since it
has no schema.
▪ In addition, attributes are of string type and can contain
multiple values.
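The domain/item/attribute model can be sketched in plain Python as nested dictionaries, with each attribute holding a set of string values. This is an illustrative simulation of the concepts above, not the SimpleDB API; the domain and item names are made up.

```python
# A domain is a dict of items; each item maps attribute names to
# sets of string values (SimpleDB attributes are multi-valued strings).
mydomain = {
    "Item123": {
        "Title": {"The Right Stuff"},
        "Year": {"1983"},
        "Keyword": {"Book", "Hardcover"},
    },
    "Item456": {
        "Title": {"Hackers"},
        "Year": {"2010"},
        "Keyword": {"Paperback"},
    },
}

def select(domain, attr, value):
    """Return item names where any value of `attr` equals `value`."""
    return sorted(
        name for name, attrs in domain.items()
        if value in attrs.get(attr, set())
    )

print(select(mydomain, "Keyword", "Book"))  # ['Item123']
```

Because every attribute value is a string, range comparisons in real SimpleDB are lexicographic, which is why dates and numbers are usually zero-padded.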
▪ SimpleDB can be queried in one of the following ways:
▪ Making RESTful GET and POST requests over HTTP or
HTTPS.
▪ Issuing SQL-like queries from a programming language.
▪ For example, a single REST request can put
three attributes and values for an item
named Item123 into the domain
named MyDomain.
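In outline, such a request is an HTTP GET or POST carrying a PutAttributes action. The parameter scheme below follows the SimpleDB query API; the attribute names and values are illustrative, and the authentication parameters are elided:

```
https://sdb.amazonaws.com/
  ?Action=PutAttributes
  &DomainName=MyDomain
  &ItemName=Item123
  &Attribute.1.Name=Color&Attribute.1.Value=Blue
  &Attribute.2.Name=Size&Attribute.2.Value=Med
  &Attribute.3.Name=Price&Attribute.3.Value=0014.99
  &AWSAccessKeyId=...&Signature=...&Timestamp=...
```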
▪ Simple Queries:
▪ These are the usual queries we perform, as in any other database:
▪ Examples: select * from mydomain where Title = 'The Right Stuff'
select * from mydomain where Year > '1985'
▪ Range Queries:
▪ Amazon SimpleDB enables us to execute more than one comparison against
attribute values within the same predicate.
▪ This is most commonly used to specify a range of values.
▪ select * from mydomain where Year between '1975' and '2008'
▪ select * from mydomain where (Year > '1950' and Year < '1960') or Year like '193%'
or Year = '2007'
▪ Amazon SimpleDB allows you to associate multiple values with a
single attribute.
▪ Each attribute is considered individually against the comparison
conditions defined in the predicate.
▪ Example: select * from mydomain where Keyword = 'Book' and
Keyword = 'Hardcover'
▪ Each value is evaluated individually against the predicate
expression, so this query returns no items: no single Keyword value
can equal both 'Book' and 'Hardcover'.
▪ To retrieve items whose Keyword attribute contains both values, use
the intersection operator (next slide).
▪ Multiple attribute queries work by producing a set of item names
from each predicate and applying the intersection operator.
▪ The intersection operator only returns item names that appear in
both result sets.
▪ select * from mydomain where Keyword = 'Book' intersection
Keyword = 'Hardcover'
▪ The first predicate produces item names 100, 200, and 50. The second
produces 50.
▪ The result is item 50: the intersection operator returns only the item
names that appear in both result sets.
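The intersection semantics can be simulated in plain Python with sets. This is an illustration of the operator's behavior on the example above, not the SimpleDB API:

```python
# Item names matched by each predicate, per the example above.
first_predicate = {"100", "200", "50"}   # Keyword = 'Book'
second_predicate = {"50"}                # Keyword = 'Hardcover'

# The intersection operator keeps only names present in both sets.
result = first_predicate & second_predicate
print(result)  # {'50'}
```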
▪ Amazon performs query optimization on its own and lets
users simply store the data and query it.
▪ The 10 GB domain size limit was created with optimization in
mind.
▪ Users can optimize further by splitting data across
multiple domains.
▪ To improve performance, we can partition our
dataset among multiple domains to parallelize queries
and have them operate on smaller individual datasets.
▪ Scenarios where partitioning helps parallelize queries:
▪ Natural Partitions— The data set naturally partitions along some
dimension. For example, a university catalog might be partitioned
into "Grad", "UnderGrad" and "Staff" domains. Although we could
store all the catalog data in a single domain, partitioning can
improve overall performance.
▪ High Performance Application— This can be useful when the
application requires higher throughput than a single domain can
provide.
▪ Large Data Set—This can be useful when timeout limits are reached
because of the data size or query complexity.
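For the high-performance case, a partitioning scheme can be sketched by hashing items across domains and fanning each query out to every partition. The dictionaries below stand in for SimpleDB domains; all names are illustrative:

```python
import hashlib

NUM_DOMAINS = 4
# Each dict stands in for one SimpleDB domain.
domains = [dict() for _ in range(NUM_DOMAINS)]

def domain_for(item_name):
    """Pick a partition by hashing the item name."""
    digest = hashlib.md5(item_name.encode()).hexdigest()
    return int(digest, 16) % NUM_DOMAINS

def put(item_name, attrs):
    domains[domain_for(item_name)][item_name] = attrs

def select_all(attr, value):
    """Fan the query out to every domain and merge the results."""
    hits = []
    for d in domains:
        hits.extend(name for name, a in d.items() if a.get(attr) == value)
    return sorted(hits)

put("Item1", {"Year": "1983"})
put("Item2", {"Year": "1983"})
put("Item3", {"Year": "2007"})
print(select_all("Year", "1983"))  # ['Item1', 'Item2']
```

In a real deployment the per-domain queries would run in parallel, each over a smaller dataset, which is where the throughput gain comes from.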
▪ If we need aggregation, SimpleDB is not the right solution.
▪ It is built around the school of thought that the DB is just a key value
store, and aggregation should be handled by an aggregation
process that writes the results back to the key value store.
▪ A count() function was recently introduced to the set of functions.
▪ Since a single query returns at most 2,500 records, counts over
larger result sets must be accumulated across multiple paged requests.
▪ We cannot perform joins in SimpleDB: a query can execute against a
single domain only, and this is one of its limitations.
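Counting past the per-query cap can be sketched as a next-token loop that sums page counts on the client. This simulates the pagination pattern only; the function names are made up and this is not the SimpleDB API:

```python
PAGE_LIMIT = 2500

def select_count(records, next_token=0):
    """Return (count_for_this_page, next_token_or_None)."""
    page = records[next_token:next_token + PAGE_LIMIT]
    token = next_token + len(page)
    return len(page), (token if token < len(records) else None)

def total_count(records):
    """Accumulate counts page by page until no token remains."""
    total, token = 0, 0
    while token is not None:
        n, token = select_count(records, token)
        total += n
    return total

print(total_count(list(range(6001))))  # 6001 (three pages: 2500 + 2500 + 1001)
```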
▪ Amazon does not provide enough information about how indexes
are created or managed on SimpleDB, except for the fact that they
are automatically created and managed.
▪ SimpleDB users do not have any control over it.
▪ Following are some of the salient features of indexes:
▪ Domain keys are indexed.
▪ Data are indexed when we enter or modify them in the database.
▪ SimpleDB takes all data as input and indexes all the attributes.
▪ Asynchronous replication is supported.
▪ Amazon SimpleDB creates and manages multiple
geographically distributed replicas of the data
automatically.
▪ Every time we store a data item, multiple replicas are
created in different data centers within the region we
select.
▪ HBase provides a TableInputFormat, to which you provide a table scan; it splits
the rows resulting from the scan according to the regions in which those rows reside.
▪ The map process is passed an ImmutableBytesWritable that contains the row key
for a row and a Result that contains the columns for that row.
▪ The map process outputs its key/value pair based on its business logic, in whatever
form makes sense to your application.
▪ The reduce process builds its results and emits the row key as an
ImmutableBytesWritable and a Put command to store the results back to HBase.
▪ Finally, the results are stored in HBase by the HBase MapReduce infrastructure.
▪ InputFormat
▪ First it splits the input data, then returns a RecordReader instance that
defines the classes of the key and value objects and provides a next() method
used to iterate over each input record.
▪ Mapper
▪ In this step, each record read using the RecordReader is processed using the
map() method.
▪ Reducer
▪ The Reducer stage and class hierarchy is very similar to the Mapper stage. This
time we get the output of a Mapper class and process it after the data has been
shuffled and sorted.
▪ OutputFormat
▪ The final stage is the OutputFormat class, and its job is to persist the data in
various locations. There are specific implementations that allow output to files, or
to HBase tables in the case of the TableOutputFormat class. It uses a
TableRecordWriter to write the data into the specific HBase output table.
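The four stages can be illustrated with a small pure-Python simulation of a word count: split the input, map each record, shuffle and sort by key, reduce, and write the output. The functions below merely stand in for the Hadoop/HBase classes named above; this is not the Hadoop API:

```python
from collections import defaultdict

def splits(records, size=2):
    """"InputFormat" role: cut the input records into fixed-size splits."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_fn(row_key, value):
    """"Mapper" role: emit (word, 1) for each word in a row's value."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """"Reducer" role: sum the counts for one key after shuffle/sort."""
    return key, sum(values)

def run(records):
    shuffled = defaultdict(list)            # shuffle/sort phase
    for split in splits(records):
        for row_key, value in split:
            for k, v in map_fn(row_key, value):
                shuffled[k].append(v)
    # "OutputFormat" role: write reduced results into a dict
    # (standing in for an HBase output table).
    return dict(reduce_fn(k, vs) for k, vs in sorted(shuffled.items()))

print(run([("r1", "hbase hadoop"), ("r2", "hbase")]))  # {'hadoop': 1, 'hbase': 2}
```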
▪ Apache Mahout is a project of the Apache Software Foundation which is
implemented on top of Apache Hadoop and uses the MapReduce paradigm.
▪ It is used to create implementations of scalable and distributed
machine learning algorithms focused on the areas of
▪ Clustering,
▪ Collaborative filtering and
▪ Classification.
▪ Mahout contains Java libraries for common math algorithms and operations
focused on statistics and linear algebra, as well as primitive Java
collections.
▪ To build a recommender engine, Mahout provides the following components:
• DataModel
• UserSimilarity
• ItemSimilarity
• UserNeighborhood
• Recommender
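How these components fit together can be sketched in plain Python. The function names below mirror the user-based roles (DataModel, UserSimilarity, UserNeighborhood, Recommender) but are illustrative, with a made-up similarity measure; this is not the Mahout API:

```python
# Toy user -> item -> preference data (the "DataModel" role).
data = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 5.0, "B": 3.0, "D": 4.0},
    "carol": {"A": 1.0, "D": 5.0},
}

def user_similarity(u, v):
    """"UserSimilarity" role: closeness of ratings on co-rated items."""
    common = set(data[u]) & set(data[v])
    if not common:
        return 0.0
    diff = sum(abs(data[u][i] - data[v][i]) for i in common)
    return 1.0 / (1.0 + diff)

def neighborhood(user, k=1):
    """"UserNeighborhood" role: the k most similar other users."""
    others = [u for u in data if u != user]
    return sorted(others, key=lambda v: user_similarity(user, v), reverse=True)[:k]

def recommend(user, k=1):
    """"Recommender" role: items the neighbors rated that `user` has not."""
    seen = set(data[user])
    candidates = {}
    for v in neighborhood(user, k):
        for item, pref in data[v].items():
            if item not in seen:
                candidates[item] = max(candidates.get(item, 0.0), pref)
    return sorted(candidates, key=candidates.get, reverse=True)

print(recommend("alice"))  # ['D'] -- bob is alice's nearest neighbor
```

An item-based recommender (the ItemSimilarity component) follows the same shape, but compares columns (items) instead of rows (users).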
DataModel datamodel = new FileDataModel(new File("input file"));
▪ What is HIVE?
▪ Create database mydb;
▪ Show databases;
▪ Use mydb;
▪ Create table customer(custId INT, custName String, mobile INT)
row format delimited
fields terminated by ',';
▪ Load data local inpath 'c:/temp/cust.txt' into table customer;
▪ Select * from customer;
▪ Select count(*) from customer;
▪ Create table out(custId INT, custName String, amount INT, product String)
row format delimited
fields terminated by ',';
▪ Insert overwrite table out
select a.custId, a.custName, b.amount, b.product
from customer a JOIN products b ON a.custId = b.custId;
▪ Select * from out limit 5;
▪ Insert overwrite table out1
select *, case
when age < 30 then 'young'
when age >= 30 and age < 50 then 'middle'
when age >= 50 then 'old'
else 'others'
end as level
from out;
▪ Insert overwrite table out2
select level, sum(amount) from out1 group by level;
hive> SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender
    > FROM ratings JOIN movies ON (ratings.movieid = movies.movieid)
    > JOIN users ON (ratings.userid = users.userid)
    > LIMIT 5;
▪ An explain plan in Hive reveals the MapReduce behind a query.
hive> EXPLAIN SELECT COUNT(*) FROM ratings
    > WHERE movieid = 1 and rating = 5;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF ratings))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT)))
(TOK_WHERE (and (= (TOK_TABLE_OR_COL movieid) 1)
(= (TOK_TABLE_OR_COL rating) 5)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage