
Data Science

CS3EL13
Medi-Caps University
UNIT-V
Data Science and Different Tools

Dr. Pramod S. Nair, Dean Engineering & Professor, CSE, Medi-Caps University


Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• SciKit-Learn
Visualization libraries
• matplotlib
• Seaborn

and many more …

Python Libraries for Data Science

NumPy:
▪ Introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects

▪ Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance

▪ Many other Python libraries are built on NumPy

Link: http://www.numpy.org/

Python Libraries for Data Science

SciPy:
▪ Collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more

▪ Part of SciPy Stack

▪ Built on NumPy

Link: https://www.scipy.org/scipylib/

Python Libraries for Data Science
Pandas:
▪ Adds data structures and tools designed to work with table-like data (Series and DataFrames, similar to data frames in R)

▪ Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation, etc.

▪ Allows handling of missing data

Link: http://pandas.pydata.org/
Python Libraries for Data Science

SciKit-Learn:
▪ Provides machine learning algorithms: classification, regression, clustering, model
validation etc.

▪ Built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/

Python Libraries for Data Science
matplotlib:
▪ A Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats

▪ A set of functionalities similar to those of MATLAB

▪ Line plots, scatter plots, bar charts, histograms, pie charts, etc.

▪ Relatively low-level; some effort is needed to create advanced visualizations

Link: https://matplotlib.org/

Python Libraries for Data Science

Seaborn:
▪ Based on matplotlib

▪ Provides a high-level interface for drawing attractive statistical graphics

▪ Similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

Start Jupyter notebook
jupyter notebook

Loading Python Libraries

In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Press Shift+Enter to execute the Jupyter cell

Reading data using pandas

In [ ]: #Read csv file
df = pd.read_csv("F:/Salaries.csv")

Note: The above command has many optional arguments to fine-tune the data import process.

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
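As an illustration, a minimal sketch of a few common read_csv options (the separator and column choices here are only examples):

In [ ]: #Read csv with explicit options (illustrative values)
df = pd.read_csv("F:/Salaries.csv",
                 sep=',',                      # field separator (default is ',')
                 header=0,                     # row to use for the column names
                 usecols=['rank','salary'],    # load only selected columns
                 na_values=['NA'])             # extra strings to treat as missing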

Reading data using pandas
There are a number of pandas commands to read other data formats:

pd.read_stata('myfile.dta')

Stata stores data in its own format that most other programs cannot read directly; Stata data files have the extension .dta. Stata itself can read data in several other formats; a standard one is a comma-separated values file with the extension .csv

Reading data using pandas
There are a number of pandas commands to read other data formats:

pd.read_sas('myfile.sas7bdat')

A .sas file is an ASCII (text) file that contains a series of SAS statements to run against a data set; the actual data set is stored in a binary file such as .sas7bdat, which is the format pd.read_sas reads.

Reading data using pandas
There are a number of pandas commands to read other data formats:

pd.read_hdf('myfile.h5','df')

HDF (Hierarchical Data Format) files are a standardized file format for scientific data storage. They are used mainly in fields such as non-destructive testing, aerospace applications, environmental science and neutron scattering.

Exploring data frames

In [3]: #List first 5 records
df.head()

Out[3]:

Try at Home

✓ Try to read the first 10, 20, 50 records;

✓ Can you guess how to view the last few records? (See the sketch below.)
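A minimal sketch (assuming the Salaries data frame from above):

In [ ]: #First 10 records
df.head(10)

In [ ]: #Last 5 records
df.tail()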

Data Frame data types
Pandas Type                Native Python Type   Description
object                     string               The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64                      int                  Numeric values; 64 refers to the memory allocated to hold each value.
float64                    float                Numeric values with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns]  N/A (but see the datetime module in Python's standard library)   Values meant to hold time data. Look into these for time series experiments.

Data Frame data types

In [4]: #Check a particular column type
df['salary'].dtype
Out[4]: dtype('int64')

In [5]: #Check types for all the columns
df.dtypes
Out[5]: rank         object
        discipline   object
        phd           int64
        service       int64
        sex          object
        salary        int64
        dtype: object

Data Frames and Attributes
Python objects have attributes and methods.

df.attribute   description
dtypes         list the types of the columns
columns        list the column names
axes           list the row labels and column names
ndim           number of dimensions
size           number of elements
shape          return a tuple representing the dimensionality
values         numpy representation of the data
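A quick sketch of these attributes in action (the shapes shown depend on the actual data):

In [ ]: df.shape    # e.g. a tuple such as (78, 6): rows and columns
df.columns          # Index of column names
df.dtypes           # type of each column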

Try at Home
✓ Find how many records this data frame has;

✓ How many elements are there?

✓ What are the column names?

✓ What types of columns do we have in this data frame?

Data Frames and Methods
Unlike attributes, Python methods have parentheses.
All attributes and methods can be listed with the dir() function: dir(df)

df.method()            description
head([n]), tail([n])   first/last n rows
describe()             generate descriptive statistics (for numeric columns only)
max(), min()           return max/min values for all numeric columns; if the values are strings, an alphabetical comparison is done
mean(), median()       return mean/median values for all numeric columns
std()                  standard deviation
sample([n])            returns a random sample of the data frame
dropna()               drop all the records with missing values
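A quick sketch of two of these methods:

In [ ]: #Summary statistics for the numeric columns
df.describe()

In [ ]: #A random sample of 5 records
df.sample(5)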
Try at Home

✓ Give the summary for the numeric columns in the dataset

✓ Calculate standard deviation for all numeric columns;

✓ What are the mean values of the first 50 records in the dataset?

Hint: use head() method to subset the first 50 records and then calculate the mean

Selecting a column in a Data Frame

Method 1: Subset the data frame using the column name (returns a Series):
df['sex']

Method 2: As a data frame (double brackets return a DataFrame):
df[['sex']]

Method 3: Use the column name as an attribute:
df.sex
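A sketch showing the types returned by the three methods:

In [ ]: type(df['sex'])    # pandas Series
type(df[['sex']])          # pandas DataFrame
type(df.sex)               # pandas Series (same as Method 1)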

Try at Home

✓ Calculate the basic statistics for the salary column;

✓ Find how many values are in the salary column (use the count method);

✓ Calculate the average salary;

Data Frames groupby method
Using the "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) for each group
• Similar to the dplyr package in R

In [ ]: #Group data using rank
df_rank = df.groupby(['rank'])

In [ ]: #Calculate mean value for each numeric column per group
df_rank.mean()

Data Frames groupby method
Once the groupby object is created, we can calculate various statistics for each group:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()

Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object. When double brackets are used, the output is a Data Frame.
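A sketch contrasting the two bracket styles:

In [ ]: df.groupby('rank')['salary'].mean()    # single brackets -> Series
df.groupby('rank')[['salary']].mean()          # double brackets -> Data Frame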

Pandas Series object

• Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.).
• The axis labels are collectively called the index.
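A minimal sketch of constructing a Series directly (the values and labels are only examples):

In [ ]: s = pd.Series([89000, 103000, 126000], index=['AsstProf','AssocProf','Prof'])
s['Prof']    # access an element by its label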

Data Frames groupby method

groupby performance notes:
- No grouping/splitting occurs until it is needed; creating the groupby object only verifies that you have passed a valid mapping.
- By default the group keys are sorted during the groupby operation.
- You may pass sort=False for a potential speedup:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby(['rank'], sort=False)[['salary']].mean()

Data Frame: filtering
To subset the data we can apply Boolean indexing, commonly known as a filter. For example, if we want to subset the rows in which the salary value is greater than $120K:

In [ ]: #Subset rows with salary greater than $120K:
df_sub = df[ df['salary'] > 120000 ]

Data Frame: filtering

Any Boolean operator can be used to subset the data:
>   greater;    >=  greater or equal;
<   less;       <=  less or equal;
==  equal;      !=  not equal;

In [ ]: #Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
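Conditions can also be combined with & (and) and | (or); each condition needs its own parentheses. A sketch:

In [ ]: #Female professors earning more than $120K
df_fs = df[ (df['sex'] == 'Female') & (df['salary'] > 120000) ]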

Data Frames: Slicing

There are a number of ways to subset the Data Frame:
• one or more columns
• one or more rows
• a subset of rows and columns

Rows and columns can be selected by their position or label

Data Frames: Slicing
When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame):
In [ ]: #Select column salary:
df['salary']

When we need to select more than one column and/or make the output a DataFrame, we should use double brackets:
In [ ]: #Select columns rank and salary:
df[['rank','salary']]
Data Frames: Selecting rows
If we need to select a range of rows, we can specify the range using ":"

In [ ]: #Select rows by their position:
df[10:20]

Notice that the first row has position 0, and the last value in the range is omitted: for the range 0:10, the first 10 rows are returned, with positions starting at 0 and ending at 9.

Data Frames: method loc
If we need to select a range of rows using their labels, we can use the loc method:
In [ ]: #Select rows by their labels:
df.loc[10:20,['rank','sex','salary']]

Out[ ]:
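Note that, unlike positional slicing, the label range in loc is inclusive of both endpoints: df.loc[10:20] returns the rows labelled 10 through 20 (11 rows when the default integer index is used). A sketch contrasting the two:

In [ ]: df.loc[10:20, ['rank','sex','salary']]   # labels 10..20 inclusive
df.iloc[10:20, [0, 4, 5]]                        # positions 10..19 only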

Data Frames: method iloc
If we need to select a range of rows and/or columns using their positions, we can use the iloc method:
In [ ]: #Select rows and columns by their positions:
df.iloc[10:20, [0, 3, 4, 5]]
Out[ ]:

Data Frames: method iloc (summary)
df.iloc[0]             # First row of a data frame
df.iloc[i]             # (i+1)th row
df.iloc[-1]            # Last row

df.iloc[:, 0]          # First column
df.iloc[:, -1]         # Last column

df.iloc[0:7]           # First 7 rows
df.iloc[:, 0:2]        # First 2 columns
df.iloc[1:3, 0:2]      # Second and third rows, first 2 columns
df.iloc[[0,5], [1,3]]  # 1st and 6th rows; 2nd and 4th columns

Data Frames: Sorting
We can sort the data by the values in a column. By default the sorting occurs in ascending order and a new data frame is returned.

In [ ]: # Create a new data frame from the original, sorted by the column service
df_sorted = df.sort_values( by ='service')
df_sorted.head()
Out[ ]:

Data Frames: Sorting
We can sort the data using 2 or more columns:

In [ ]: df_sorted = df.sort_values( by =['service', 'salary'], ascending = [True, False])
df_sorted.head(10)

Out[ ]:

Missing Values
Missing values are marked as NaN
In [ ]: # Read a dataset with missing values
flights = pd.read_csv("f:/flights.csv")

In [ ]: # Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

Out[ ]:

Missing Values
There are a number of methods to deal with missing values in the data frame:
df.method()                description
dropna()                   Drop observations with missing values
dropna(how='all')          Drop observations where all cells are NA
dropna(axis=1, how='all')  Drop a column if all its values are missing
dropna(thresh=5)           Drop rows that contain fewer than 5 non-missing values (a row needs at least 5 non-NaN values to survive)
fillna(0)                  Replace missing values with zeros
isnull()                   Returns True if the value is missing
notnull()                  Returns True for non-missing values
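A short sketch of two of these methods on the flights data:

In [ ]: #Drop rows with any missing value
flights_clean = flights.dropna()

In [ ]: #Or fill missing values with zeros instead
flights_filled = flights.fillna(0)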
Missing Values

• When summing the data, missing values are treated as zero
• If all values are missing, the sum will be equal to NaN
• The cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays
• Missing values in the GroupBy method are excluded (just like in R)
• Many descriptive statistics methods have a skipna option to control whether missing data should be excluded; it is set to True by default (unlike R). See the sketch below.
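A sketch of the skipna option (the values are only examples):

In [ ]: s = pd.Series([1.0, None, 3.0])
s.sum()                # 4.0 -- the NaN is skipped by default
s.sum(skipna=False)    # nan -- the NaN propagates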

Aggregation Functions in Pandas

Aggregation: computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad
std, var

Aggregation Functions in Pandas

The agg() method is useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

Out[ ]:

Basic Descriptive Statistics

df.method()          description
describe             Basic statistics (count, mean, std, min, quantiles, max)
min, max             Minimum and maximum values
mean, median, mode   Arithmetic average, median and mode
var, std             Variance and standard deviation
sem                  Standard error of the mean
skew                 Sample skewness
kurt                 Kurtosis

Graphics to explore the data

The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.

To show graphs within the Jupyter notebook, include the inline directive:

In [ ]: %matplotlib inline

Graphics
function     description
distplot     histogram
barplot      estimate of central tendency for a numeric variable
violinplot   similar to boxplot, also shows the probability density of the data
jointplot    scatterplot
regplot      regression plot
pairplot     pairplot
boxplot      boxplot
swarmplot    categorical scatterplot
factorplot   general categorical plot
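A minimal sketch of one of these plots on the Salaries data (assuming df is loaded and %matplotlib inline is set):

In [ ]: #Distribution of salary for each professor rank
sns.boxplot(x='rank', y='salary', data=df)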

Basic statistical Analysis
statsmodels has a number of functions for statistical analysis

statsmodels is mostly used for classical statistical analysis with R-style formulas (a short sketch follows the list)

statsmodels:
• linear regressions
• ANOVA tests
• hypothesis testing
• many more ...
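A minimal sketch of an R-style formula regression on the Salaries data (the model choice is only an example):

In [ ]: import statsmodels.formula.api as smf

#Linear regression of salary on years of service
model = smf.ols('salary ~ service', data=df).fit()
model.summary()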

Basic statistical Analysis
scikit-learn has a number of functions for statistical analysis

scikit-learn is more tailored for Machine Learning (a short sketch follows the list).

scikit-learn:
• k-means
• support vector machines
• random forests
• many more ...
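A minimal sketch of k-means clustering on the Salaries data (the columns and the number of clusters are only examples):

In [ ]: from sklearn.cluster import KMeans

#Cluster professors by years of service and salary into 3 groups
X = df[['service','salary']]
km = KMeans(n_clusters=3, random_state=0).fit(X)
km.labels_[:10]    # cluster assignment of the first 10 rows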

NoSQL
RDBMS Characteristics
• Values are atomic.
• All of the values in a column have the same data type.
• Each row is unique.
• The sequence of columns is insignificant.
• The sequence of rows is insignificant.
• Each column has a unique name.
• Integrity constraints maintain data consistency across multiple tables.

Transactions – ACID Properties
• Atomic – The entire transaction takes place at once (commit) or does not happen at all
  • A transaction to transfer funds from one account to another involves making a withdrawal operation from the first account and a deposit operation on the second. If the deposit operation fails, you do not want the withdrawal operation to happen either.

• Consistent – A database must be consistent before and after the transaction
  • A database tracking a checking account may only allow unique check numbers to exist for each transaction

Transactions – ACID Properties
• Isolated – The results of any changes made during a transaction are not visible until the transaction has committed.
  • A teller looking up a balance must be isolated from a concurrent transaction involving a withdrawal from the same account. Only when the withdrawal transaction commits successfully and the teller looks at the balance again will the new balance be reported.

• Durable – The results of a committed transaction survive failures
  • A system crash or any other failure must not be allowed to lose the results of a transaction or the contents of the database. Durability is often achieved through separate transaction logs that can "re-create" all transactions from some chosen point in time (like a backup).

Why RDBMS not suitable for Big Data
• The context is Internet
• RDBMS assumes that data are
• Dense
• Largely uniform (structured data)
• Data coming from Internet are
• Massive and sparse
• Semi-structured or unstructured
• With massive sparse data sets, the typical storage mechanisms and access methods get stretched

Dealing with Big Data and Scalability

• Issues arise with scaling up when the dataset is just too big
• RDBMSs were not designed to be distributed
• Traditional DBMSs are designed to run well on a "single" machine
• Handling larger volumes of data/operations requires upgrading the server with faster CPUs or more computing power, known as 'scaling up' or 'vertical scaling'

Dealing with Big Data and Scalability

• NoSQL solutions are designed to run on clusters or multi-node database deployments
• Handling larger volumes of data/operations requires adding more machines to the cluster, known as 'scaling out' or 'horizontal scaling'
• Different approaches include:
  • Replication
  • Sharding (partitioning)

Dealing with Big Data and Scalability
• Replication
• Replication copies data across multiple servers.
• Each bit of data can be found in multiple places.
• Replication comes in two forms
• Master-Slave
• Peer-to-peer

Scaling RDBMS
• Master-Slave
• Master-slave replication makes one node the authoritative copy that handles writes while
slaves synchronize with the master and may handle reads.
• All writes are written to the master.
• All reads are performed against the replicated slave databases.
• Master-slave replication reduces the chance of update conflicts.
• Large data sets can pose problems, as the master needs to duplicate data to the slaves.

Scaling RDBMS

• Peer-to-Peer
  • Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
  • Peer-to-peer replication avoids loading all writes onto a single server, which would create a single point of failure.

Scaling RDBMS

• Sharding
  • Sharding is the process of breaking up large tables into smaller chunks, called shards, that are spread across multiple servers.
  • Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of the data.
  • Any DB distributed across multiple machines needs to know on which machine a piece of data is stored or must be stored.
  • A sharding system makes this decision for each row, using its key (see the sketch below).
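A minimal sketch of key-based shard routing in Python (hash-mod placement is one simple strategy; real systems often use consistent hashing or range partitioning):

import hashlib

# Hypothetical hash-mod router: maps a row key to one of n shards
def shard_for(key, n_shards=4):
    digest = hashlib.md5(key.encode()).hexdigest()   # stable hash of the key
    return int(digest, 16) % n_shards                # same key -> same shard

shard_for('user:1042')    # always routes this row to the same shard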

NoSQL, No ACID

• NoSQL
• Does not give importance to ACID properties
• In some cases completely ignores them
• In distributed parallel systems it is difficult/impossible to ensure ACID properties
• Long-running transactions don't work because keeping resources blocked for a long time is
not practical

BASE Transactions
• Basically Available
  • Rather than enforcing immediate consistency, BASE-modelled NoSQL databases ensure availability of data by spreading and replicating it across the nodes of the database cluster.

• Soft state
  • Due to the lack of immediate consistency, data values may change over time. The BASE model breaks with the concept of a database that enforces its own consistency, delegating that responsibility to developers.

• Eventually Consistent
  • The fact that BASE does not enforce immediate consistency does not mean that it never achieves it. However, until it does, data reads are still possible (even though they might not reflect reality).

BASE
• Characteristics
• Weak consistency – stale data OK
• Availability first
• Best effort
• Approximate answers OK
• Aggressive (optimistic)
• Simpler and faster

ACID Vs BASE

• The fundamental difference between the ACID and BASE database models is the way they deal with the limitations of distributed systems (see the CAP theorem later)
• The ACID model provides a consistent system
• The BASE model provides high availability

NoSQL Properties
• Higher scalability.
• Distributed computing.
• Cost effective.
• Support for flexible schemas.
• Processing of both unstructured and semi-structured data.
• No complex relationships, such as the ones between tables in RDBMS.

NoSQL Distinguishing Characteristics

• Large data volumes (Google-scale "big data")
• Scalable replication and distribution
  • Potentially thousands of machines
  • Potentially distributed around the world
• Queries need to return answers quickly
• Mostly query, few updates
• Asynchronous inserts & updates
• Schema-less
• ACID transaction properties are not needed – BASE
• CAP Theorem
• Open source development

No SQL?
• NoSQL stands for:
• No Relational
• No RDBMS
• Not Only SQL
• NoSQL is an umbrella term for all databases and data stores that don’t follow the
RDBMS principles
• A class of products
• A collection of several (related) concepts about data storage and manipulation
• Often related to large data sets

How did we get here?
• Explosion of social media sites (Facebook, Twitter) with large data needs
• Rise of cloud-based solutions such as Amazon S3 (simple storage solution)
• Just as development moved to dynamically-typed languages (Python, Ruby, Groovy), there has been a shift to dynamically-typed data with frequent schema changes
• Open-source community

Where does NoSQL come from?
• Non-relational DBMSs are not new
• But NoSQL represents a different approach
  • Driven by massively scalable Internet applications
  • Based on distributed and parallel computing
• Development
  • Started with Google, whose first research paper was published in 2003
  • Continued thanks to Lucene's developers/Apache (Hadoop) and Amazon (Dynamo)
  • Then a lot of products and interest came from Facebook, Netflix, Yahoo, eBay, Hulu, IBM, and many more

NoSQL and Big Data
• NoSQL comes from the Internet, thus it is often related to the "big data" concept
• How big are "big data"?
  • Over a few terabytes: enough to start spanning multiple storage units
• Challenges
  • Efficiently storing and accessing large amounts of data is difficult, even more so considering fault tolerance and backups
  • Manipulating large data sets involves running immensely parallel processes
  • Managing continuously evolving schema and metadata for semi-structured and unstructured data is difficult

Dynamo and BigTable

• Three major papers were the seeds of the NoSQL movement
  • BigTable (Google)
    • Bigtable is a compressed, high-performance, proprietary data storage system built on the Google File System
  • Dynamo (Amazon)
    • Distributed key-value data store
    • Eventual consistency
  • CAP Theorem

NoSQL Database Types

Classifying NoSQL databases is complicated because there is a variety of types:

• Sorted ordered column store
  • Optimized for queries over large datasets; stores columns of data together, instead of rows
• Document databases
  • Pair each key with a complex data structure known as a document

NoSQL Database Types
• Key-value stores
  • The simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value.
• Graph databases
  • Used to store information about networks of data, such as social connections.

Sorted Ordered Column-Oriented Stores
• Data are stored in a column-oriented way
• Data efficiently stored
• Avoids consuming space for storing nulls
• Columns are grouped in column-families
• Data isn’t stored as a single table but is stored by column families
• Unit of data is a set of key/value pairs
• Identified by “row-key”
• Ordered and sorted based on row-key

Sorted Ordered Column-Oriented Stores

• Notable for:
• Google's Bigtable (used in all Google's services)

• HBase (Facebook, StumbleUpon, Hulu, Yahoo!, ...)

Document Databases (Document Store)

• Documents
• Loosely structured sets of key/value pairs in documents, e.g., XML, JSON, BSON
• Encapsulate and encode data in some standard formats or encodings
• Are addressed in the database via a unique key
• Documents are treated as a whole, avoiding splitting a document into its constituent
name/value pairs

Document Databases (Document Store)
• Allow documents retrieving by keys or contents
• Notable for:
• MongoDB (used in FourSquare, Github, and more)
• CouchDB (used in Apple, BBC, Canonical, Cern, and more)

Document Databases (Document Store)
• The central concept is the notion of a "document", which corresponds to a row in an RDBMS.
• A document comes in some standard formats like JSON (BSON).
• Documents are addressed in the database via a unique key that represents that
document.
• The database offers an API or query language that retrieves documents based on their
contents.
• Documents are schema free, i.e., different documents can have structures and schema
that differ from one another. (An RDBMS requires that each row contain the same
columns.)
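As an illustration of such a query API, a minimal sketch using pymongo (the connection string, database and collection names are hypothetical; the query matches the JSON documents on the next slide):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
articles = client['blog']['articles']    # hypothetical database and collection

# Retrieve documents by their contents
for doc in articles.find({'author': 'Derick Rethans'}):
    print(doc['title'])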
Document Databases, JSON
{
_id: ObjectId("51156a1e056d6f966f268f81"),
type: "Article",
author: "Derick Rethans",
title: "Introduction to Document Databases with MongoDB",
date: ISODate("2013-04-24T16:26:31.911Z"),
body: "This arti…"
},
{
_id: ObjectId("51156a1e056d6f966f268f82"),
type: "Book",
author: "Derick Rethans",
title: "php|architect's Guide to Date and Time Programming with PHP",
isbn: "978-0-9738621-5-7"
}

Key/Value stores
• Store data in a schema-less way
• Store data as maps
  • HashMaps or associative arrays
• Provide very efficient average-case running time for accessing data (see the sketch below)
• Voldemort (LinkedIn, eBay, ...)
• Riak (Github, Comcast, Mochi, ...)
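A minimal sketch of the key/value access pattern using redis-py (the keys and server address are hypothetical):

import redis

r = redis.Redis(host='localhost', port=6379)
r.set('user:1042:name', 'Alice')    # write a value under a key
r.get('user:1042:name')             # b'Alice' -- constant-time average lookup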

Key/Value stores

• Notable for:
  • Couchbase (Zynga, Vimeo, NAVTEQ, ...)
  • Redis (Craigslist, Instagram, StackOverflow, flickr, ...)
  • Amazon Dynamo (Amazon, Elsevier, IMDb, ...)
  • Apache Cassandra (Facebook, Digg, Reddit, Twitter, ...)

Graph Databases

• Graph-oriented
• Everything is stored as an edge, a node or an attribute.
• Each node and edge can have any number of attributes.
• Both the nodes and edges can be labelled.
• Labels can be used to narrow searches.

CAP Theorem

The CAP theorem provides a coherent and logical way to assess the problems involved in assuring ACID-like guarantees in distributed systems.

CAP Theorem

At most two of the following three can be maximized at one time:
• Consistency
  • Each client has the same view of the data
• Availability
  • Each client can always read and write
• Partition tolerance
  • The system works well across distributed physical networks

CAP Theorem: Two out of Three
• CAP theorem – at most two of the three properties can be addressed
• The choices could be as follows:
  • CP: availability is compromised, but consistency and partition tolerance are preferred over it
  • CA: the system has little or no partition tolerance; consistency and availability are preferred
  • AP: consistency is compromised, but the system is always available and can work even when parts of it are partitioned

Consistency or Availability

• Consistency versus availability is not a "binary" decision
• AP systems relax consistency in favor of availability, but are not inconsistent
• CP systems sacrifice availability for consistency, but are not unavailable
• This suggests both AP and CP systems can offer a degree of consistency, and availability, as well as partition tolerance
[Figure: Venn diagram of Consistency (C), Availability (A) and Partition tolerance (P)]
Performance

• There is no perfect NoSQL database
• Every database has its advantages and disadvantages, depending on the type of tasks (and preferences) to accomplish

Performance
• NoSQL is a set of concepts, ideas, technologies, and software dealing with
• Big data
• Sparse un/semi-structured data
• High horizontal scalability
• Massive parallel processing
• Different applications, goals, targets, approaches need different NoSQL solutions

Where would we use it?
• Where would we use a NoSQL database?
  • Do you have a large set of uncontrolled, unstructured data that you are trying to fit into an RDBMS?
    • Log analysis
    • Social networking feeds (many firms hook in through Facebook or Twitter)
    • External feeds from partners
    • Data that is not easily analyzed in an RDBMS, such as time-based data
    • Large data feeds that need to be massaged before entry into an RDBMS

Don’t forget about the DBA

• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.
• Still need to address:
• Backups & recovery
• Capacity planning
• Performance monitoring
• Data integration
• Tuning & optimization
