Unit 5
CS3EL13
Medi-Caps University
Data Science and Different Tools
NumPy:
▪ Introduces objects for multidimensional arrays and matrices, as well as functions for
easily performing advanced mathematical and statistical operations on those
objects
SciPy:
▪ Collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
▪ Built on NumPy
Link: https://www.scipy.org/scipylib/
SciKit-Learn:
▪ Provides machine learning algorithms: classification, regression, clustering, model
validation etc.
Link: http://scikit-learn.org/
Seaborn:
▪ based on matplotlib
Link: https://seaborn.pydata.org/
Note: The pd.read_* commands below have many optional arguments to fine-tune the data import process.
pd.read_stata('myfile.dta')
Stata stores data in its own binary format that cannot be read directly by most other programs. Stata
data files have the extension .dta. Stata can also read data in several other formats; a standard
format is a comma-separated values file with extension .csv
pd.read_sas('myfile.sas7bdat')
A SAS file may be an ASCII (text) file containing a series of SAS statements to be run against a
data set, or, as with .sas7bdat files, it may contain the actual data set.
pd.read_hdf('myfile.h5','df')
HDF files use the Hierarchical Data Format, a standardized file format for scientific data storage.
These data files are used mainly in non-destructive testing, aerospace applications, environmental
science and neutron scattering.
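As a sketch of these optional arguments, the snippet below applies pd.read_csv to a small in-memory CSV (the file contents and column names are invented for illustration) and fine-tunes the import with usecols and dtype:

```python
import io

import pandas as pd

# A small CSV held in memory; in practice this would be a file path
csv_data = io.StringIO(
    "name,salary,service\n"
    "Alice,65000,3\n"
    "Bob,72000,7\n"
)

# Optional arguments fine-tune the import: keep only two columns
# and force the salary column to a specific integer dtype
df = pd.read_csv(csv_data, usecols=["name", "salary"], dtype={"salary": "int64"})
print(df.shape)   # (2, 2)
```

The other pd.read_* functions (read_stata, read_sas, read_hdf) accept similar keyword arguments for column selection and type control.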
df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
size number of elements
shape returns a tuple representing the dimensionality
values NumPy representation of the data
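A minimal sketch of these attributes on a toy DataFrame (the rank/salary columns are invented to resemble the course dataset):

```python
import pandas as pd

# Toy frame to illustrate the attributes listed above
df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]})

print(df.dtypes)    # type of each column
print(df.columns)   # column names
print(df.axes)      # row labels and column names
print(df.ndim)      # 2  (number of dimensions)
print(df.size)      # 4  (number of elements)
print(df.shape)     # (2, 2)
print(df.values)    # underlying NumPy representation of the data
```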
✓ What are the mean values of the first 50 records in the dataset?
Hint: use the head() method to subset the first 50 records and then calculate the mean
✓ Find how many non-missing values are in the salary column (use the count method);
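Both exercises can be sketched on invented salary data (the values below are hypothetical stand-ins for the course dataset):

```python
import pandas as pd

# Hypothetical salary data; None stands in for a missing value
df = pd.DataFrame({"salary": [50000, 60000, 70000, 80000, None]})

# Mean of the first 50 records (this toy frame has fewer than 50 rows,
# so head(50) simply returns all of them)
first_50_mean = df.head(50).mean(numeric_only=True)
print(first_50_mean["salary"])   # 65000.0

# count() returns the number of non-missing values in the column
n_salaries = df["salary"].count()
print(n_salaries)   # 4 (the None is excluded)
```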
In [ ]: #Calculate the mean value of each numeric column for each group
df_rank.mean()
Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object.
When double brackets are used, the output is a DataFrame.
When we need to select more than one column and/or want the output to be a
DataFrame, we should use double brackets:
In [ ]: #Select columns rank and salary:
df[['rank','salary']]
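The Series-vs-DataFrame distinction can be checked directly; the frame below is a made-up stand-in for the course data:

```python
import pandas as pd

df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]})

single = df["salary"]              # single brackets -> Series
double = df[["rank", "salary"]]   # double brackets -> DataFrame

print(type(single).__name__)   # Series
print(type(double).__name__)   # DataFrame
```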
Dr.Pramod S. Nair, Dean Engineering & Professor, CSE, Medi-Caps University
Data Frames: Selecting rows
If we need to select a range of rows, we can specify the range using ":"
Notice that the first row has a position 0, and the last value in the range is omitted:
So for the 0:10 range, the first 10 rows are returned, with positions starting
at 0 and ending at 9
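A small sketch of this positional slicing on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(20)})

# 0:10 returns the first 10 rows, positions 0 through 9;
# the last value in the range (10) is omitted
subset = df[0:10]
print(len(subset))        # 10
print(subset.index[0])    # 0
print(subset.index[-1])   # 9
```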
In [ ]: # Create a new data frame from the original, sorted by the column service
df_sorted = df.sort_values( by ='service')
df_sorted.head()
The agg() method is useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
df.method() description
describe Basic statistics (count, mean, std, min, quantiles, max)
min, max Minimum and maximum values
mean, median, mode Arithmetic average, median and mode
var, std Variance and standard deviation
sem Standard error of mean
skew Sample skewness
kurt Sample kurtosis
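The describe() and agg() calls above can be sketched on invented delay data (the dep_delay/arr_delay values are made up, standing in for the flights data):

```python
import pandas as pd

# Toy delay data standing in for the flights dataset
df = pd.DataFrame({"dep_delay": [5, -2, 30], "arr_delay": [10, -5, 45]})

# describe() bundles count, mean, std, min, quantiles and max
print(df.describe())

# agg() computes several statistics per column at once
stats = df[["dep_delay", "arr_delay"]].agg(["min", "mean", "max"])
print(stats)
```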
The Seaborn package is built on matplotlib but provides a high-level interface for
drawing attractive statistical graphics, similar to the ggplot2 library in R. It
specifically targets statistical data visualization
In [ ]: %matplotlib inline
statsmodels:
• linear regressions
• ANOVA tests
• hypothesis testing
• many more ...
scikit-learn:
• kmeans
• support vector machines
• random forests
• many more ...
• Peer-to-Peer
• Peer-to-peer replication avoids loading all writes onto a single server, which would
otherwise create a single point of failure
• Sharding
• Sharding is the process of breaking up large tables into smaller chunks, called shards,
that are spread across multiple servers.
• Sharding distributes different data across multiple servers, so each server acts as the
single source for a subset of data
• Any DB distributed across multiple machines needs to know on which machine each
piece of data is stored or must be stored
• A sharding system makes this decision for each row, using its key
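The per-row routing decision can be sketched as a stable hash over the key; the server names below are hypothetical, and a real sharding system would also handle rebalancing and replication:

```python
import hashlib

# Hypothetical shard servers
SHARDS = ["server-a", "server-b", "server-c"]

def shard_for(key: str) -> str:
    # A stable hash, so every client routes the same key to the same shard
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Each row's key deterministically selects the server that owns it
print(shard_for("user:42"))
print(shard_for("user:42") == shard_for("user:42"))   # True
```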
• NoSQL
• Does not give importance to ACID properties
• In some cases completely ignores them
• In distributed parallel systems it is difficult/impossible to ensure ACID properties
• Long-running transactions don't work because keeping resources blocked for a long time is
not practical
• Soft state
• Due to the lack of immediate consistency, data values may change over time. The BASE model breaks with
the concept of a database that enforces its own consistency, delegating that responsibility to developers.
• Eventually Consistent
• The fact that BASE does not enforce immediate consistency does not mean that it never achieves it. However,
until it does, data reads are still possible (even though they might not reflect reality).
• The fundamental difference between ACID and BASE database models is the way they
deal with this limitation
• The ACID model provides a consistent system
• The BASE model provides high availability.
• Notable for:
• Google's Bigtable (used in all Google's services)
• Documents
• Loosely structured sets of key/value pairs in documents, e.g., XML, JSON, BSON
• Encapsulate and encode data in some standard formats or encodings
• Are addressed in the database via a unique key
• Documents are treated as a whole, avoiding splitting a document into its constituent
name/value pairs
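A toy sketch of this whole-document model, assuming JSON as the encoding (the class and method names are invented for illustration):

```python
import json

class DocStore:
    """A toy document store: documents addressed by a unique key."""

    def __init__(self):
        self._docs = {}  # key -> encoded document

    def put(self, key, document):
        # Encode the whole document in a standard format (JSON here)
        self._docs[key] = json.dumps(document)

    def get(self, key):
        # The document comes back as a whole, not split into
        # its constituent name/value pairs
        return json.loads(self._docs[key])

store = DocStore()
store.put("user:1", {"name": "Ada", "tags": ["admin", "dev"]})
print(store.get("user:1")["name"])   # Ada
```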
• Notable for:
• Couchbase (Zynga, Vimeo, NAVTEQ, ...)
• Redis (Craiglist, Instagram, StackOverfow,
flickr, ...)
• Amazon Dynamo (Amazon, Elsevier,
IMDb, ...)
• Apache Cassandra (Facebook, Digg,
Reddit, Twitter,...)
• Graph-oriented
• Everything is stored as an edge, a node or an attribute.
• Each node and edge can have any number of attributes.
• Both the nodes and edges can be labelled.
• Labels can be used to narrow searches.
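A minimal sketch of labelled nodes and edges with attributes, using a label to narrow a neighbour search (all names are invented; real graph databases add indexing and a query language on top):

```python
class Graph:
    """A toy graph store: labelled nodes and edges, each with attributes."""

    def __init__(self):
        self.nodes = {}   # node id -> {"label": ..., "attrs": {...}}
        self.edges = []   # (src, dst, label, attrs)

    def add_node(self, nid, label, **attrs):
        self.nodes[nid] = {"label": label, "attrs": attrs}

    def add_edge(self, src, dst, label, **attrs):
        self.edges.append((src, dst, label, attrs))

    def neighbours(self, nid, edge_label=None):
        # An edge label narrows the search to one relationship type
        return [dst for src, dst, lbl, _ in self.edges
                if src == nid and (edge_label is None or lbl == edge_label)]

g = Graph()
g.add_node("alice", "Person", age=30)
g.add_node("bob", "Person", age=25)
g.add_edge("alice", "bob", "FOLLOWS", since=2020)
print(g.neighbours("alice", "FOLLOWS"))   # ['bob']
```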
The CAP theorem provides a coherent and logical way to assess the problems involved
in assuring ACID-like guarantees in distributed systems
• It does not matter if the data is deployed on a NoSQL platform instead of an RDBMS.
• Still need to address:
• Backups & recovery
• Capacity planning
• Performance monitoring
• Data integration
• Tuning & optimization