PSK Unit 1 Merged

The document provides an overview of data science, emphasizing the vast amounts of data generated daily across various platforms and industries. It defines data science as a multidisciplinary field focused on extracting insights from large datasets, utilizing methods from computer science, mathematics, and statistics. Additionally, it outlines the roles and responsibilities of data scientists and introduces concepts such as big data, data types, and various similarity measures used in data analysis.


Introduction to Data Science

Data All Around

 Lots of data is being collected and warehoused:
 Web data, e-commerce
 Financial transactions, bank/credit transactions
 Online trading and purchasing
 Social networks
How Much Data Do We Have?

 Google processes 20 PB a day (2008)
 Facebook has 60 TB of daily logs
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 1000 Genomes Project: 200 TB

 Cost of 1 TB of disk: $35
 Time to read a 1 TB disk: 3 hrs (100 MB/s)

 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
 Each day, 500 million tweets are sent.

 Amazon, in order to recommend products, handles on average more than 15 million customer clickstreams per day.

 Walmart, an American multinational retail corporation, handles about 1 million+ customer transactions per hour.

 65 billion+ messages are sent on WhatsApp every day.

 On average, 294 billion+ emails are sent every day.

 Modern cars have close to 100 sensors for monitoring tire pressure, fuel level, etc., thus generating a lot of sensor data.

 Facebook stores and analyzes more than 30 petabytes of user-generated data each day.

 YouTube users upload about 48 hours of video every minute of the day.
Big Data
Big Data is any data that is expensive to manage and hard to extract value from.
 Volume
 The size of the data

 Velocity
 The latency of data processing relative to the growing demand for interactivity

 Variety and Complexity
 The diversity of sources, formats, quality, and structures
Types of Data We Have

 Relational data (tables/transactions/legacy data)
 Text data (Web)
 Semi-structured data (XML)
 Graph data
 Social networks, Semantic Web (RDF), …
 Streaming data
 You can afford to scan the data only once
What is Data Science?

“Data Science is about extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field which uses scientific methods and processes to draw insights from data.”
What is Data Science?

 An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
 Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data
 Data science principles apply to all data, big and small

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Science?

 Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries, such as science, engineering, economics, politics, finance, and education
 Computer Science
 Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
 Mathematics
 Mathematical modeling
 Statistics
 Statistical and stochastic modeling, probability
Data Science
Applications of data science
 Augmented reality
 Self-driving cars
 Robots
Data Scientists

 Data Scientist
 “The Sexiest Job of the 21st Century”
 They find stories and extract knowledge; they are not reporters
Data Scientist Roles and Responsibilities

 Collect data and identify data sources
 Analyze huge amounts of data, both structured and unstructured
 Create solutions and strategies for business problems
 Work with team members and leaders to develop data strategy
 Combine various algorithms and modules to discover trends and patterns
 Present data using various data visualization techniques and tools
 Investigate additional technologies and tools for developing innovative data strategies
 Create comprehensive analytical solutions, from data gathering to display; assist in the construction of data engineering pipelines
 Support the data scientists, BI developers, and analysts as needed on their projects
 Work with the sales and pre-sales teams on cost reduction, effort estimation, and cost optimization
 Stay current with the newest tools, trends, and technologies to boost general effectiveness and performance
 Collaborate with the product team and partners to provide data-driven solutions built on original concepts
 Create analytics solutions for businesses by combining various tools, applied statistics, and machine learning
 Lead discussions and assess the feasibility of AI/ML solutions for business processes and outcomes
 Architect, implement, and monitor data pipelines, and conduct knowledge-sharing sessions with peers to ensure effective data use
 CRISP-DM is the most popular data mining process model

 Founded in 1996 by Daimler-Benz, ISL, NCR & OHRA

 Non-proprietary, documented, freely available
1. Determine the business question and objective:
 What to solve from the business perspective, what the customer wants; define the business success criteria

 2. Situation assessment:
 Assess resource availability
 Project requirements
 Risks and cost-benefit of the project
 3. Determine the project goals:
 4. Project plan:

 Understand the scope and depth of the problem; if we make a mistake, we end up spending a lot of time.

Key questions that must be asked in framing the problem:

• What kind of a system would the company like to build?
• What kind of data is available for us to use?
• How many movies are there in the library?
• How many movies should there be in a recommendation?
• How are these recommendations going to be used?
 Collect data:
 Describe data:
 Explore data:
 Verify data quality:

 Faulty or incorrect data is insufficient to solve the problem
 Collect data from reliable sources
 Get data directly from customers, with their knowledge
 Collect data from websites using web scraping
• Missing values in several rows or columns: fill them with zero or with the average

• The data could have many outliers, incorrect values, or timestamps in different time zones

• Issues related to date ranges

• E.g., if the data is collected from multiple thermometers and any of those are faulty
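The fill-with-zero / fill-with-average fixes above can be sketched in plain Python (a minimal illustration; the sensor readings below are made up):

```python
# Mean imputation: replace missing readings (None) with the column average.
# A minimal sketch; the temperature data is invented for illustration.

def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

temperatures = [21.0, None, 23.0, 22.0, None]
print(fill_with_mean(temperatures))  # missing entries become 22.0
```

Filling with zero is the same one-liner with `0` in place of `mean`; which choice is right depends on what a missing value means in the domain.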

We can extract some patterns from our data, which can lead us to solve our problem. Exploration can be performed using visualizations and numerical summaries of the data and its columns.

For example, imagine we are analyzing data from an e-commerce platform to help devise a meaningful strategy to attract more customers to each product.
Use statistical and numerical methods to draw inferences about the data.

Identify the relationships between multiple columns in our dataset.

Summarize the data using images, graphs, charts, plots, etc.

Find how different columns are related to each other by computing their correlation.
Consolidate the results so that they can be analyzed and understood by stakeholders.

Create documents that justify our conclusions by describing the insights and visualizations.
Similarity & dissimilarity
What is data?
• Data denotes a collection of objects and their attributes.
• An attribute (feature, variable, or field) is a property or characteristic
of an object.
• A collection of attributes describes an object (individual, entity, case, or record).
• Proximity refers to either a similarity or a dissimilarity.
Similarity might be used to identify:
• duplicate data that may have differences due to typos.
• equivalent instances from different data sets, e.g. names and/or addresses that are the same but have misspellings.
• groups of data that are very close (clusters).
Dissimilarity might be used to identify:
• outliers
• interesting exceptions, e.g. credit card fraud
• boundaries of clusters
Proximity measures for
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numerical attributes
• Mixed attributes

• Why proximity measures?
 For clustering
 Outlier analysis
 Nearest-neighbor classification
Mahalanobis distance:
• Measures the distance between a point and a distribution
• Used for anomaly detection and classification
• Euclidean distance works only when dimensions are on the same scale

Area (sq ft)  Price ($)   Area (acre)  Price ($K)
2400          156000      0.0550944    156
1950          126750      0.0447642    126.75
2100          105000      0.0482076    105
1200          78000       0.0275472    78
2000          130000      0.045912     130
900           54000       0.0206604    54

• Euclidean distance gives different values even though the underlying distances are the same.
• This can be overcome by scaling.
• But even after scaling, the Euclidean distance between a point and the center of the distribution can give misleading information about how close the point is to the cluster.
• Hence the Mahalanobis distance.
• It was introduced by Prof. P. C. Mahalanobis in 1936.
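A minimal two-dimensional sketch of the computation (the 2×2 covariance matrix is inverted by hand here; real code would use numpy/scipy, and the sample points are made up):

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance from a 2-D point to a sample distribution.

    Minimal illustration: works only for two dimensions, because the
    2x2 covariance matrix is inverted with the closed-form formula.
    """
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # 2x2 inverse
    dx, dy = point[0] - mx, point[1] - my
    return math.sqrt(dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy)

samples = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
print(mahalanobis_2d((3, 4), samples))  # 0.0 — this point is the sample mean
```

Unlike Euclidean distance, the result accounts for the spread and correlation of the distribution, so it is unit-free.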
Supremum distance:
• d(x, y) = max_i |x_i − y_i|

point  x  y
P1     0  2
P2     2  0
P3     3  1
P4     5  1

• d(P2, P1) = max(|2 − 0|, |0 − 2|) = max(2, 2) = 2

L∞  p1  p2  p3  p4
P1  0
P2  2   0
P3  3   1   0
P4  5   3   2   0
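The distance matrix above can be reproduced with a one-line supremum (L∞) distance:

```python
def supremum_distance(p, q):
    """L-infinity (supremum) distance: maximum coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(p, q))

points = {"P1": (0, 2), "P2": (2, 0), "P3": (3, 1), "P4": (5, 1)}
print(supremum_distance(points["P2"], points["P1"]))  # 2
print(supremum_distance(points["P4"], points["P1"]))  # 5
```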
Bhattacharyya distance:
• Measures the similarity of two probability distributions
• Developed by Anil Kumar Bhattacharyya
• More reliable than the Mahalanobis distance
• It is a generalization of the Mahalanobis distance

• Bhattacharyya distance = −log(BC(P, Q))
• where BC is the Bhattacharyya coefficient:
• BC(P, Q) = Σ_x √(P(x) · Q(x))
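A direct transcription of the two formulas above for discrete distributions (the example distributions are made up):

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

# Identical distributions: coefficient 1, so distance 0
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))

# Very different distributions give a larger distance
print(bhattacharyya_distance([0.9, 0.1], [0.1, 0.9]))
```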
Similarity measures for symmetric and
asymmetric binary data
• Binary attributes: binary data has only 2 values/states, e.g. yes or no, affected or unaffected, true or false.

• Symmetric: both values are equally important (e.g. gender).

• Asymmetric: both values are not equally important (e.g. a test result).
• For sparse asymmetric data (e.g. market-basket transactions), the Simple Matching Coefficient (SMC) would say all transactions are very similar, because most attributes are 0 in both.
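The SMC-vs-Jaccard point can be illustrated on two made-up sparse transaction vectors:

```python
def smc(x, y):
    """Simple Matching Coefficient: fraction of positions that agree (0-0 and 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard_binary(x, y):
    """Jaccard coefficient for asymmetric binary data: ignores 0-0 matches."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    mismatches = sum(a != b for a, b in zip(x, y))  # f01 + f10
    return f11 / (f11 + mismatches)

# Two sparse "transactions" that share no purchased items:
t1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
t2 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(smc(t1, t2))             # 0.8 — dominated by the shared zeros
print(jaccard_binary(t1, t2))  # 0.0 — no items in common
```

This is why Jaccard, not SMC, is the usual choice for asymmetric binary data.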
Hamming distance:
• Compares binary strings
• Calculates the distance between two binary vectors
• Computed by counting the number of positions at which the values differ
• Used for error detection and error correction during transfer over a network
• E.g. D1 = 1010010, D2 = 0011001
• Hamming distance = 1 + 0 + 0 + 1 + 0 + 1 + 1 = 4
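The slide's example as code:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("1010010", "0011001"))  # 4
```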
Similarity measures for textual data:
Text similarity: how close two pieces of text are.
Applications:
• Search engines
• Legal matters
• Customer service
Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union.
• JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

Example: 3 elements in the intersection, 8 in the union → Jaccard similarity = 3/8.

• Extreme behavior:
• JSim(X, Y) = 1 iff X = Y
• JSim(X, Y) = 0 iff X and Y have no elements in common
• JSim is symmetric
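A minimal set-based sketch reproducing the 3-in-intersection / 8-in-union example (the element names are made up):

```python
def jaccard(s1, s2):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

# 3 elements in the intersection, 8 in the union:
a = {"cat", "dog", "fish", "bird", "ant", "bee"}
b = {"cat", "dog", "fish", "horse", "cow"}
print(jaccard(a, b))  # 0.375 (= 3/8)
```

For text, the sets are typically the words (or character n-grams) of each document.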
Disadvantages:

• Jaccard similarity can capture neither the semantic similarity nor the lexical semantics of two sentences.
• As the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
Cosine Similarity

• Sim(X, Y) = cos(X, Y)
• The cosine of the angle between X and Y

• If the vectors are aligned (correlated), the angle is zero degrees and cos(X, Y) = 1
• If the vectors are orthogonal (no common coordinates), the angle is 90 degrees and cos(X, Y) = 0

• Cosine is commonly used for comparing documents, where we assume that the vectors are normalized by the document length.
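A minimal sketch of cosine similarity on raw vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # ≈ 1.0 — aligned vectors
print(cosine_similarity((1, 0), (0, 1)))        # 0.0 — orthogonal vectors
```

For documents, x and y would be term-frequency vectors; the length normalization is built into the division by the norms.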
• Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
• Compute the Euclidean distance between the two objects.
• Compute the Manhattan distance between the two objects.
• Compute the Minkowski distance between the two objects using q = 3.

Answers:
1. 6.708
2. 11
3. 6.1534
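The three answers can be checked with a single Minkowski function (q = 1 gives Manhattan, q = 2 gives Euclidean):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(x, y, 2), 3))  # 6.708  (Euclidean)
print(minkowski(x, y, 1))            # 11.0   (Manhattan)
print(round(minkowski(x, y, 3), 4))  # 6.1534 (Minkowski, q = 3)
```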
How similar are two strings?

• Spell correction
• The user typed “graffe”. Which is closest?
• graf
• graft
• grail
• giraffe

• Also used for machine translation, information extraction, and speech recognition


Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations
• Insertion
• Deletion
• Substitution
• needed to transform one string into the other

Minimum Edit Distance
• Two strings and their alignment: “intention” and “execution”

• If each operation has a cost of 1, the distance between them is 5
• If substitutions cost 2 (Levenshtein), the distance between them is 8
The Edit Distance Table (source “intention” down the rows, bottom to top; target “execution” across the columns; substitution cost 2):

N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
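The table above can be reproduced with the standard dynamic-programming recurrence (substitution cost 2 as on the slide, 1 for the unit-cost variant):

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance with unit insert/delete costs and a
    configurable substitution cost (2 matches the slide's table)."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete everything
    for j in range(1, m + 1):
        d[0][j] = j  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```

The value 8 in the top-right corner of the table is exactly `d[9][9]` here.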
Jaro Similarity
Jaro Similarity is a measure of similarity between two strings. Its value ranges from 0 to 1, where 1 means the strings are equal and 0 means no similarity between the two strings.

Examples:
Input: s1 = “CRATE”, s2 = “TRACE”;
Output: Jaro Similarity = 0.733333
Input: s1 = “DwAyNE”, s2 = “DuANE”;
Output: Jaro Similarity = 0.822222
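The two examples can be checked with a direct implementation of the Jaro formula (matches are sought within a window of ⌊max(|s1|, |s2|)/2⌋ − 1, and half the out-of-order matched characters count as transpositions):

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions between matched characters (in order)
    t, j = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

print(jaro("CRATE", "TRACE"))   # ≈ 0.733333
print(jaro("DwAyNE", "DuANE"))  # ≈ 0.822222
```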
Jaro distance:
• Jaro distance = 1 − Jaro similarity