PSK Unit 1 Merged
Modern cars have close to 100 sensors for monitoring tire pressure, fuel level, etc.,
thus generating a lot of sensor data.
Facebook stores and analyzes more than 30 Petabytes of data generated by the users
each day.
YouTube users upload about 48 hours of video every minute of the day.
Big Data
Big Data is any data that is expensive to manage and hard to extract value from
Volume
The size of the data
Velocity
The latency of data processing relative to the growing demand for interactivity
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?
Theories and techniques from many fields and disciplines are used
to investigate and analyze a large amount of data to help decision
makers in many industries such as science, engineering, economics,
politics, finance, and education
Computer Science
Pattern recognition, visualization, data warehousing, High performance
computing, Databases, AI
Mathematics
Mathematical Modeling
Statistics
Statistical and Stochastic modeling, Probability.
Data Science
Applications of data science
Augmented realities
Self driving cars
Robots
Data Scientists
Data Scientist
The Sexiest Job of the 21st Century
They find stories, extract knowledge. They are not
reporters
Data Scientist Roles and Responsibilities
Daimler-Benz, ISL, NCR & OHRA
Founded in 1996
1. Determine the business question and objective:
What to solve from the business perspective, what the customer wants,
and the business success criteria.
2. Situation assessment:
Assess resource availability, project requirements, risks,
and the cost-benefit of the project.
3. Determine the project goals:
4. Project plan:
Understand the scope and depth of the problem; if we make a mistake here,
we end up spending a lot of time.
Collect Data:
Describe data:
Explore data:
Verify data quality:
Faulty or incorrect data is insufficient to solve the problem.
Collect the data needed from reliable sources.
Get data directly from customers, with their knowledge.
Get data from websites using web scraping.
• Missing values in several rows or columns: fill them with zero or
with the average
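The two fill strategies named above can be sketched in plain Python. This is a minimal sketch, not a prescribed method: the function name `fill_missing`, the use of `None` to mark a missing value, and the sample column are all assumptions made for illustration.

```python
from statistics import mean

def fill_missing(values, strategy="mean"):
    """Replace None entries with 0 or with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = mean(observed)  # assumes at least one observed value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [25, None, 35, None, 30]   # hypothetical column with two gaps
print(fill_missing(ages, "zero"))
print(fill_missing(ages, "mean"))
```

Filling with zero can distort a column whose real values are far from zero, which is one reason the average is often preferred.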
We can extract patterns from our data that lead us toward a solution to
our problem.
Exploration can be performed using visualizations and numerical
summaries of the data and its columns.
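A numerical summary of one column might look like the sketch below. The helper name `summarize` and the sample values are hypothetical; only the standard-library `statistics` module is used.

```python
from statistics import mean, median, stdev

def summarize(column):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(column),
        "min": min(column),
        "max": max(column),
        "mean": mean(column),
        "median": median(column),
        "stdev": stdev(column),  # sample standard deviation, needs >= 2 values
    }

fuel_level = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sensor readings
for name, value in summarize(fuel_level).items():
    print(f"{name}: {value}")
```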
Use statistical and numerical methods to draw inferences about the data.
Find how different columns are related to each other by computing their
correlation.
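One common way to quantify how two columns are related is the Pearson correlation coefficient. The slide does not say which correlation measure is meant, so Pearson is an assumption here, and the column values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)  # undefined for constant columns

hours = [1, 2, 3, 4]      # hypothetical column
score = [2, 4, 6, 8]      # rises in step with hours
print(pearson(hours, score))
```

A value near 1 means the columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.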
Consolidate the results so that they can be analyzed and understood
by stakeholders.
Similarity & dissimilarity
What is data?
• Data denotes a collection of objects and their attributes.
• An attribute (feature, variable, or field) is a property or characteristic
of an object.
• A collection of attributes describes an object (individual, entity, case,
or record).
• Proximity refers to either a similarity or dissimilarity
Similarity might be used to identify
• duplicate data that may have differences due to typos.
• equivalent instances from different data sets. E.g. names and/or
addresses that are the same but have misspellings.
• groups of data that are very close (clusters)
Dissimilarity might be used to identify
• outliers
• interesting exceptions, e.g. credit card fraud
• boundaries to clusters
Proximity measures for
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numerical attributes
• Mixed attributes
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Supremum (L∞) distance, e.g. for p1 and p2:
• d(p1, p2) = max(|0 - 2|, |2 - 0|)
• = max{2, 2}
• = 2

L∞    p1   p2   p3   p4
p1    0
p2    2    0
p3    3    1    0
p4    5    3    2    0
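The distance matrix above can be reproduced with a few lines of Python. This is a minimal sketch using the four points from the table; the function name `chebyshev` (another name for the supremum distance) is chosen here for illustration.

```python
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def chebyshev(a, b):
    """Supremum (L-infinity) distance: the largest per-attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

names = list(points)
matrix = [[chebyshev(points[r], points[c]) for c in names] for r in names]

print("     " + "  ".join(names))
for name, row in zip(names, matrix):
    print(name + "   " + "   ".join(str(d) for d in row))
```

The full matrix is symmetric, which is why the slide shows only the lower triangle.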
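Bhattacharyya distance:
• Measures the similarity of two probability distributions
• Developed by Anil Kumar Bhattacharyya.
• Often described as more reliable than the Mahalanobis distance
• It generalizes the Mahalanobis distance: for two Gaussian distributions
with equal covariance, it reduces to one eighth of the squared
Mahalanobis distance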
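For discrete distributions the Bhattacharyya distance is the negative logarithm of the Bhattacharyya coefficient, the sum of sqrt(p_i * q_i) over all outcomes. The sketch below assumes both inputs are probability vectors over the same outcomes; the example distributions are hypothetical.

```python
from math import inf, log, sqrt

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    coefficient = sum(sqrt(pi * qi) for pi, qi in zip(p, q))
    if coefficient == 0:
        return inf  # no overlap at all between the two distributions
    return -log(coefficient)

fair_coin = [0.5, 0.5]      # hypothetical distributions
biased_coin = [0.9, 0.1]
print(bhattacharyya_distance(fair_coin, fair_coin))
print(bhattacharyya_distance(fair_coin, biased_coin))
```

Identical distributions give a distance of 0, and the distance grows as the overlap between the distributions shrinks.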
3 in intersection.
8 in union.
Jaccard similarity = 3/8
• Extreme behavior:
• Jsim(X,Y) = 1, iff X = Y
• Jsim(X,Y) = 0 iff X,Y have no elements in common
• Jsim is symmetric
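The definition and its extreme cases can be checked with Python sets. The two sets below are hypothetical, chosen only so that they share 3 elements out of 8 distinct ones, like the example above. Returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(x, y):
    """Jaccard similarity: size of the intersection over size of the union."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention: two empty sets are identical
    return len(x & y) / len(x | y)

a = {1, 2, 3, 4, 5}        # hypothetical sets: 3 shared, 8 in total
b = {3, 4, 5, 6, 7, 8}
print(jaccard(a, b))       # 3/8
print(jaccard(a, a))       # identical sets
print(jaccard(a, {9, 10})) # nothing in common
```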
Disadvantage :
• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y
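Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal sketch, assuming X and Y are equal-length numeric vectors that are not all zeros; the sample vectors are hypothetical.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = [1, 2, 3]   # hypothetical term-count vectors
doc2 = [2, 4, 6]   # same direction as doc1, twice the length
print(cosine_similarity(doc1, doc2))
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the angle matters, vectors that point the same way score 1 regardless of their magnitudes.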
1. 6.708
2. 11
3. 6.1534
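The question these three answers belong to did not survive on the slide. They are numerically consistent with the Euclidean, Manhattan, and Minkowski (h = 3) distances between the tuples (22, 1, 42, 10) and (20, 0, 36, 8), a common textbook exercise. That pairing is an assumption, and the sketch below only shows how such values would be computed.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h=1 Manhattan, h=2 Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x = (22, 1, 42, 10)   # assumed exercise data, not stated on the slide
y = (20, 0, 36, 8)

print(round(minkowski(x, y, 2), 3))  # Euclidean
print(round(minkowski(x, y, 1), 3))  # Manhattan
print(round(minkowski(x, y, 3), 3))  # Minkowski with h = 3
```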
How similar are two strings?
• Spell correction
• The user typed “graffe”
Which is closest?
• graf
• graft
• grail
• giraffe
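One way to answer "which is closest?" is the minimum edit distance. The slide does not name a measure at this point, so Levenshtein distance with unit cost for insertion, deletion, and substitution is an assumption here.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost per insert, delete, or substitute."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # delete s_char
                               current[j - 1] + 1,       # insert t_char
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

typed = "graffe"
candidates = ["graf", "graft", "grail", "giraffe"]
for word in sorted(candidates, key=lambda w: edit_distance(typed, w)):
    print(word, edit_distance(typed, word))
```

Under this cost model "giraffe" is the closest candidate, since a single insertion turns "graffe" into it.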
Examples:
Input: s1 = “CRATE”, s2 = “TRACE”;
Output: Jaro Similarity = 0.733333
Input: s1 = “DwAyNE”, s2 = “DuANE”;
Output: Jaro Similarity = 0.822222
Jaro distance :
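The two example values above can be reproduced with the standard Jaro similarity definition: count characters that match within a window of max(len1, len2) // 2 - 1 positions, then take t as half the number of matched characters that appear in a different order. This is a sketch of that definition, not necessarily the implementation the slides used.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start, stop = max(0, i - window), min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

print(round(jaro_similarity("CRATE", "TRACE"), 6))
print(round(jaro_similarity("DwAyNE", "DuANE"), 6))
```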