PSK Unit 1 Merged

The document provides an overview of data science, emphasizing the vast amounts of data generated daily across various platforms and industries. It defines data science as a multidisciplinary field focused on extracting insights from large datasets, utilizing methods from computer science, mathematics, and statistics. Additionally, it outlines the roles and responsibilities of data scientists and introduces concepts such as big data, data types, and various similarity measures used in data analysis.


Introduction to Data Science

Data All Around

 Lots of data is being collected and warehoused:
 Web data, e-commerce
 Financial transactions, bank/credit transactions
 Online trading and purchasing
 Social networks
How Much Data Do We Have?

 Google processes 20 PB a day (2008)
 Facebook has 60 TB of daily logs
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 1000 Genomes Project: 200 TB

 Cost of 1 TB of disk: $35
 Time to read a 1 TB disk: 3 hrs (100 MB/s)

 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
 Each day, 500 million tweets are sent.

 Amazon, in order to recommend products, handles on average more than 15 million customer clickstreams per day.

 Walmart, an American multinational retail corporation, handles about 1 million+ customer transactions per hour.

 65 billion+ messages are sent on WhatsApp every day.

 On average, 294 billion+ emails are sent every day.

 Modern cars have close to 100 sensors for monitoring tire pressure, fuel level, etc., thus generating a lot of sensor data.

 Facebook stores and analyzes more than 30 petabytes of user-generated data each day.

 YouTube users upload about 48 hours of video every minute of the day.
Big Data
Big Data is any data that is expensive to manage and hard to extract value from.
 Volume
 The size of the data

 Velocity
 The latency of data processing relative to the growing demand for interactivity

 Variety and Complexity
 The diversity of sources, formats, quality, and structures
Types of Data We Have

 Relational data (tables/transactions/legacy data)
 Text data (Web)
 Semi-structured data (XML)
 Graph data
 Social networks, Semantic Web (RDF), …
 Streaming data
 You can afford to scan the data only once
What is Data Science?

“Data Science is about extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field which uses scientific methods and processes to draw insights from data.”
What is Data Science?

 An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
 Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data
 Data science principles apply to all data, big and small

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?

 Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries, such as science, engineering, economics, politics, finance, and education
 Computer Science
 Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
 Mathematics
 Mathematical modeling
 Statistics
 Statistical and stochastic modeling, probability
Data Science
Applications of data science
 Augmented reality
 Self-driving cars
 Robots
Data Scientists

 Data Scientist
 “The Sexiest Job of the 21st Century”
 They find stories and extract knowledge; they are not reporters
Data Scientist Roles and Responsibilities

 Collect data and identify data sources
 Analyze huge amounts of data, both structured and unstructured
 Create solutions and strategies for business problems
 Work with team members and leaders to develop data strategy
 Combine various algorithms and modules to discover trends and patterns
 Present data using various data visualization techniques and tools
 Investigate additional technologies and tools for developing innovative data strategies
 Create comprehensive analytical solutions, from data gathering to display; assist in the construction of data engineering pipelines
 Support the data scientists, BI developers, and analysts as needed on their projects
 Work with the sales and pre-sales teams on cost reduction, effort estimation, and cost optimization
 Stay current with the newest tools, trends, and technologies to boost general effectiveness and performance
 Collaborate with the product team and partners to provide data-driven solutions built on original concepts
 Create analytics solutions for businesses by combining various tools, applied statistics, and machine learning
 Lead discussions and assess the feasibility of AI/ML solutions for business processes and outcomes
 Architect, implement, and monitor data pipelines, and conduct knowledge-sharing sessions with peers to ensure effective data use
 CRISP-DM is the most popular data mining process model

 Founded in 1996 by Daimler-Benz, ISL, NCR & OHRA

 Non-proprietary, documented, freely available
1. Determine the business question and objective:
 What to solve from the business perspective, what the customer wants; define the business success criteria

 2. Situation assessment:
 Assess resource availability
 Project requirements
 Risks and cost-benefit of the project
 3. Determine the project goals:
 4. Project plan:

 Understand the scope and depth of the problem; if we make a mistake, we end up spending a lot of time.

Key questions that must be asked in framing the problem:

• What kind of a system would the company like to build?
• What kind of data is available for us to use?
• How many movies are there in the library?
• How many movies should there be in a recommendation?
• How are these recommendations going to be used?
 Collect data:
 Describe data:
 Explore data:
 Verify data quality:

 Faulty or incorrect data is insufficient to solve the problem
 Collect data from reliable sources
 Get data directly from customers, with their knowledge
 Collect data from websites using web scraping
• Missing values in several rows or columns: fill them with zero or with the average

• The data could have many outliers, incorrect values, or timestamps in different time zones

• Issues related to date ranges

• E.g., if the data is collected from multiple thermometers and any of those are faulty
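The fill-with-zero / fill-with-average fixes above can be sketched in plain Python (a minimal illustration; the sensor readings below are made up):

```python
# Mean imputation: replace missing readings (None) with the column average.
# A minimal sketch; the temperature data is invented for illustration.

def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

temperatures = [21.0, None, 23.0, 22.0, None]
print(fill_with_mean(temperatures))  # missing entries become 22.0
```

Filling with zero is the same one-liner with `0` in place of `mean`; which choice is right depends on what a missing value means in the domain.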

We can extract some patterns from our data, which can lead us to solve our problem. Exploration can be performed using visualizations and numerical summaries of the data and its columns.

For example, imagine we are analyzing data from an e-commerce platform to help devise a meaningful strategy to attract more customers to each product.
Use statistical and numerical methods to draw inferences about the data.

Identify the relationships between multiple columns in our dataset.

Summarize the data using images, graphs, charts, plots, etc.

Find how different columns are related to each other by computing their correlation.
Consolidate the results so that they can be analyzed and understood by stakeholders.

Create documents that justify our conclusions by describing the insights and visualizations.
Similarity & dissimilarity
What is data?
• Data denotes a collection of objects and their attributes.
• An attribute (feature, variable, or field) is a property or characteristic
of an object.
• A collection of attributes describes an object (individual, entity, case, or record).
• Proximity refers to either a similarity or a dissimilarity.
Similarity might be used to identify:
• duplicate data that may have differences due to typos.
• equivalent instances from different data sets, e.g. names and/or addresses that are the same but have misspellings.
• groups of data that are very close (clusters).
Dissimilarity might be used to identify:
• outliers
• interesting exceptions, e.g. credit card fraud
• boundaries of clusters
Proximity measures for
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numerical attributes
• Mixed attributes

• Why proximity measures?
 For clustering
 Outlier analysis
 Nearest-neighbor classification
Mahalanobis distance:
• Measures the distance between a point and a distribution
• Used for anomaly detection and classification
• Euclidean distance works only when dimensions are on the same scale

Area (sq ft)  Price ($)   Area (acre)  Price ($K)
2400          156000      0.0550944    156
1950          126750      0.0447642    126.75
2100          105000      0.0482076    105
1200          78000       0.0275472    78
2000          130000      0.045912     130
900           54000       0.0206604    54

• Euclidean distance gives different values even though the underlying distances are the same.
• This can be overcome by scaling.
• But even after scaling, the Euclidean distance between a point and the center of the distribution can give misleading information about how close the point is to the cluster.
• Hence the Mahalanobis distance.
• It was introduced by Prof. P. C. Mahalanobis in 1936.
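A minimal two-dimensional sketch of the computation (the 2×2 covariance matrix is inverted by hand here; real code would use numpy/scipy, and the sample points are made up):

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance from a 2-D point to a sample distribution.

    Minimal illustration: works only for two dimensions, because the
    2x2 covariance matrix is inverted with the closed-form formula.
    """
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # 2x2 inverse
    dx, dy = point[0] - mx, point[1] - my
    return math.sqrt(dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy)

samples = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
print(mahalanobis_2d((3, 4), samples))  # 0.0 — this point is the sample mean
```

Unlike Euclidean distance, the result accounts for the spread and correlation of the distribution, so it is unit-free.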
Supremum distance:
• d(x, y) = max_i |x_i − y_i|

point  x  y
P1     0  2
P2     2  0
P3     3  1
P4     5  1

• d(P2, P1) = max(|2 − 0|, |0 − 2|) = max(2, 2) = 2

L∞  p1  p2  p3  p4
P1  0
P2  2   0
P3  3   1   0
P4  5   3   2   0
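The distance matrix above can be reproduced with a one-line supremum (L∞) distance:

```python
def supremum_distance(p, q):
    """L-infinity (supremum) distance: maximum coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(p, q))

points = {"P1": (0, 2), "P2": (2, 0), "P3": (3, 1), "P4": (5, 1)}
print(supremum_distance(points["P2"], points["P1"]))  # 2
print(supremum_distance(points["P4"], points["P1"]))  # 5
```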
Bhattacharyya distance:
• Measures the similarity of two probability distributions
• Developed by Anil Kumar Bhattacharyya
• More reliable than the Mahalanobis distance
• It is a generalization of the Mahalanobis distance

• Bhattacharyya distance = −log(BC(P, Q))
• where BC is the Bhattacharyya coefficient:
• BC(P, Q) = Σ_x √(P(x) · Q(x))
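A direct transcription of the two formulas above for discrete distributions (the example distributions are made up):

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

# Identical distributions: coefficient 1, so distance 0
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))

# Very different distributions give a larger distance
print(bhattacharyya_distance([0.9, 0.1], [0.1, 0.9]))
```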
Similarity measures for symmetric and
asymmetric binary data
• Binary attributes: binary data has only 2 values/states, e.g. yes or no, affected or unaffected, true or false.

• Symmetric: both values are equally important (e.g. gender).

• Asymmetric: both values are not equally important (e.g. a test result).
• For sparse asymmetric data (e.g. market-basket transactions), the Simple Matching Coefficient (SMC) would say all transactions are very similar, because most attributes are 0 in both.
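The SMC-vs-Jaccard point can be illustrated on two made-up sparse transaction vectors:

```python
def smc(x, y):
    """Simple Matching Coefficient: fraction of positions that agree (0-0 and 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard_binary(x, y):
    """Jaccard coefficient for asymmetric binary data: ignores 0-0 matches."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    mismatches = sum(a != b for a, b in zip(x, y))  # f01 + f10
    return f11 / (f11 + mismatches)

# Two sparse "transactions" that share no purchased items:
t1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
t2 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(smc(t1, t2))             # 0.8 — dominated by the shared zeros
print(jaccard_binary(t1, t2))  # 0.0 — no items in common
```

This is why Jaccard, not SMC, is the usual choice for asymmetric binary data.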
Hamming distance:
• Compares binary strings
• Calculates the distance between two binary vectors
• Computed by counting the number of positions at which the values differ
• Used for error detection and error correction during transfer over a network
• E.g. D1 = 1010010, D2 = 0011001
• Hamming distance = 1 + 0 + 0 + 1 + 0 + 1 + 1 = 4
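The slide's example as code:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("1010010", "0011001"))  # 4
```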
Similarity measures for textual data:
Text similarity: how close two pieces of text are.
Applications:
• Search engines
• Legal matters
• Customer service
Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union.
• JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

Example: 3 elements in the intersection, 8 in the union → Jaccard similarity = 3/8.

• Extreme behavior:
• JSim(X, Y) = 1 iff X = Y
• JSim(X, Y) = 0 iff X and Y have no elements in common
• JSim is symmetric
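A minimal set-based sketch reproducing the 3-in-intersection / 8-in-union example (the element names are made up):

```python
def jaccard(s1, s2):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

# 3 elements in the intersection, 8 in the union:
a = {"cat", "dog", "fish", "bird", "ant", "bee"}
b = {"cat", "dog", "fish", "horse", "cow"}
print(jaccard(a, b))  # 0.375 (= 3/8)
```

For text, the sets are typically the words (or character n-grams) of each document.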
Disadvantages:

• Jaccard similarity can capture neither the semantic similarity nor the lexical semantics of two sentences.
• As the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
Cosine Similarity

• Sim(X, Y) = cos(X, Y)
• The cosine of the angle between X and Y

• If the vectors are aligned (correlated), the angle is zero degrees and cos(X, Y) = 1
• If the vectors are orthogonal (no common coordinates), the angle is 90 degrees and cos(X, Y) = 0

• Cosine is commonly used for comparing documents, where we assume that the vectors are normalized by the document length.
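A minimal sketch of cosine similarity on raw vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # ≈ 1.0 — aligned vectors
print(cosine_similarity((1, 0), (0, 1)))        # 0.0 — orthogonal vectors
```

For documents, x and y would be term-frequency vectors; the length normalization is built into the division by the norms.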
• Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
• Compute the Euclidean distance between the two objects.
• Compute the Manhattan distance between the two objects.
• Compute the Minkowski distance between the two objects using q = 3.

Answers:
1. 6.708
2. 11
3. 6.1534
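The three answers can be checked with a single Minkowski function (q = 1 gives Manhattan, q = 2 gives Euclidean):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(x, y, 2), 3))  # 6.708  (Euclidean)
print(minkowski(x, y, 1))            # 11.0   (Manhattan)
print(round(minkowski(x, y, 3), 4))  # 6.1534 (Minkowski, q = 3)
```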
How similar are two strings?

• Spell correction
• The user typed “graffe”. Which is closest?
• graf
• graft
• grail
• giraffe

• Also used for machine translation, information extraction, and speech recognition


Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations
• Insertion
• Deletion
• Substitution
• needed to transform one string into the other

Minimum Edit Distance
• Two strings and their alignment: “intention” and “execution”

• If each operation has a cost of 1, the distance between them is 5
• If substitutions cost 2 (Levenshtein), the distance between them is 8
The Edit Distance Table (source “intention” down the rows, bottom to top; target “execution” across the columns; substitution cost 2):

N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
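The table above can be reproduced with the standard dynamic-programming recurrence (substitution cost 2 as on the slide, 1 for the unit-cost variant):

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance with unit insert/delete costs and a
    configurable substitution cost (2 matches the slide's table)."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete everything
    for j in range(1, m + 1):
        d[0][j] = j  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```

The value 8 in the top-right corner of the table is exactly `d[9][9]` here.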
Jaro Similarity
Jaro Similarity is a measure of similarity between two strings. Its value ranges from 0 to 1, where 1 means the strings are equal and 0 means no similarity between the two strings.

Examples:
Input: s1 = “CRATE”, s2 = “TRACE”;
Output: Jaro Similarity = 0.733333
Input: s1 = “DwAyNE”, s2 = “DuANE”;
Output: Jaro Similarity = 0.822222
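The two examples can be checked with a direct implementation of the Jaro formula (matches are sought within a window of ⌊max(|s1|, |s2|)/2⌋ − 1, and half the out-of-order matched characters count as transpositions):

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions between matched characters (in order)
    t, j = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

print(jaro("CRATE", "TRACE"))   # ≈ 0.733333
print(jaro("DwAyNE", "DuANE"))  # ≈ 0.822222
```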
Jaro distance:
• Jaro distance = 1 − Jaro similarity