Lecture No 1 Introduction

The document is an introduction to a Data Science course taught by Dr. Khalid Mahboob, outlining the assessment plan, key concepts, and importance of data science. It covers various topics including types of data, data analysis, proximity measures, and tools used in data science. The course emphasizes the significance of data in decision-making and predictive analysis, highlighting its role in modern business and technology.

Uploaded by

Marium Zehra

Introduction Data Science

Lecture No: 01
Introduction to Data Science
Instructor: Dr. Khalid Mahboob
Email-id: [email protected]
Marking Scheme

Assessment Plan              Marks Distribution
Assignments + Presentation   15
Quizzes                      15
Midterm Exam                 30
Final Exam                   40
Outline

 Data
 Types of Data Structured
 Simplest form of Data
 Can Data Speak?
 Objects and features of a data table
 Dimension, Vector, Proximity, Distance and similarity measurements
 Introduction to Data Science
 Importance of Data Science
 Big Data
 Technology and tools
 Data Science Components
 Business Intelligence vs. Data Science
 Applications of Data Science
DATA

 The word “Data” is the plural form of “datum”.

 Data is a piece of information that is collected from different sources for analysis purposes.

 Data on its own is often described as meaningless information, because an isolated piece of information carries no context.

 For example, 13 is data: by itself it is meaningless, because we do not know whether it is a roll number, an age, a weight, a classroom number, etc.

Data

 Based on the above definition, data has three aspects:

1. Data comes from facts and statistics

2. Data is collected

3. Data is used for reference or analysis.

 In this course, our major focus will be on the third aspect — the analysis part of the data.
Simplest form of Data

 A table is the simplest form of data. Most data science algorithms still use tabular data as input.

 Data scientists prefer to convert any type of complex data, such as text, images, or time series, to tables so that existing tools can be leveraged for analysis.

 One reason for the popularity of the tabular representation is the ease of storing tabular data directly in the main memory of the computer.

Name Salary ($) Age (Years)
Jane 90000 52
John 85000 48
Delilah 75000 32
Can Data Speak?

 Closely look at the table for several minutes. Then, write down anything interesting you can find.
Can Data Speak?

 Suppose a company stores its employee data in an MS Excel file.

 Findings

1. Jane and Dave earn the highest salary

2. Delilah earns the least

3. Jane and Dave are the oldest people in the group

4. Delilah is the youngest person.

 Conclusion: In the company from which the data was collected, older people earn more.
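The findings above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the findings refer to the five-employee table (Jane, John, Delilah, Dave, Ellen) that appears later in these slides under "Space", since the three-row table does not include Dave. The variable names are illustrative only.

```python
# Employee table: (name, salary in $, age in years).
# Assumption: this is the five-row table shown later in the slides.
employees = [
    ("Jane", 90000, 52),
    ("John", 85000, 48),
    ("Delilah", 75000, 32),
    ("Dave", 90000, 53),
    ("Ellen", 82000, 44),
]

# Findings 1 and 2: highest and lowest salary.
max_salary = max(salary for _, salary, _ in employees)
top_earners = sorted(name for name, salary, _ in employees if salary == max_salary)
lowest_paid = min(employees, key=lambda e: e[1])[0]

# Findings 3 and 4: the two oldest people and the youngest person.
by_age = sorted(employees, key=lambda e: e[2])
two_oldest = sorted(name for name, _, _ in by_age[-2:])
youngest = by_age[0][0]

# Conclusion: when sorted by age, salaries never decrease.
salary_rises_with_age = all(a[1] <= b[1] for a, b in zip(by_age, by_age[1:]))

print("Top earners:", top_earners)
print("Lowest paid:", lowest_paid)
print("Two oldest:", two_oldest)
print("Youngest:", youngest)
print("Salary rises with age:", salary_rises_with_age)
```

With only five rows this supports the conclusion for this table; it does not establish a general rule.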
Can Data Speak?

 Now, let us go back to the definition: Data refers to “facts and statistics collected for reference or analysis.”

 This table has facts.

1. This table is collected from a company.

2. We used the table for analysis.

3. We revealed that the company appreciates experienced employees.

Can Data Speak?

 Basically, the data reflects a general trend: experience, wisdom (and money, which is the salary in this case) come with age.

 Data speaks: data gives us insights. Data gives us those light-bulb moments.
Objects and Features of a data table

 Data is generally described using two things — objects and features.

 An object is explained or defined by features.

 Features are the attributes of an object.

 For example, consider a data table containing records of employees of a company.

Name Salary ($) Age (Years)

Jane 90000 52
John 85000 48
Delilah 75000 32

 In the table above, the two columns, salary and age, are features.

 Notice that there are 3 people, whose names are placed in 3 rows. Each row is an object.

 For example, given the data table above, Jane is defined or explained as a person with a salary of 90000 and an age of 52. That is, two features, salary and age, explain Jane.

Space

 In data science, we use the word “space” to refer to the mathematical space.

 For example, if we have a two-dimensional dataset like the following one, a space with two axes is formed.
Name Salary ($) Age (Years)
Jane 90000 52
John 85000 48
Delilah 75000 32
Dave 90000 53
Ellen 82000 44
Dimension

 In general, a dimension refers to a direction.

 In programming, the word “dimension” describes how the cells of an array are laid out and counted (for example, in rows and columns).

 In data science, “dimension” refers to the mathematical space, such as Euclidean space.

 Number of features = number of dimensions

90000 52 10

85000 48 20

75000 32 30
Dimension

 As an example, the given data table has three columns or three features.

 In programming, we will say that this table can be stored in a 2-dimensional array of size 3 times 3. That
means it has three rows and three columns.

 In data science, this table is called a three-dimensional dataset because it composes a mathematical space of
three dimensions.
90000 52 10

85000 48 20

75000 32 30
Dimension
 If the dataset has 1 feature, it is called 1-dimensional; with 2 features it is called 2-dimensional, and so on. With n features or n columns, the data is called n-dimensional.
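The two meanings of "dimension" can be placed side by side in code. The sketch below stores the three-row, three-column table from the slides as a nested Python list; the variable names are illustrative assumptions.

```python
# Each inner list is one object (row); each position is one feature (column).
table = [
    [90000, 52, 10],
    [85000, 48, 20],
    [75000, 32, 30],
]

n_objects = len(table)       # number of rows
n_features = len(table[0])   # number of columns

# Programming view: a 2-dimensional array of size rows x columns.
array_shape = (n_objects, n_features)

# Data science view: the number of features is the dimensionality of the space.
dataset_dimensions = n_features

print(f"Stored as a 2-dimensional array of size {array_shape[0]} x {array_shape[1]}")
print(f"Treated as a {dataset_dimensions}-dimensional dataset with {n_objects} data points")
```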

Vector

 The word “vector” in physics refers to a quantity with a direction.

 A vector in data science is no different than a vector in physics. Each object of tabular data is sometimes referred
to as a vector.

 A point in the space = an object of the data table = a vector

Vector

 Why do we call an object a vector?

 Before answering this question, we need to know what an origin is.

 The origin: Regardless of where the data points lie, the space formed by any data table has an origin.

 The origin is the coordinate where the value on every axis is zero.

 That is, the origin in a two-dimensional space is (0, 0). The origin in a three-dimensional space is (0, 0, 0).
Vector

 We have five objects or five data points. Each of these five data points is a vector.

 All five vectors are drawn in the following figure. Notice that the vectors have a direction from the origin toward

the data points.

Proximity

 The concept of nearness or farness in the space is known as proximity.

 Proximity is quantified in two ways:

 by computing distance between two vectors (i.e., two data points.)

 by computing similarity between two vectors (i.e., two data points.)

Proximity

 Distance refers to how far one data point is from another in the space.

 Many of us use the word “distance” when we are really referring to “dissimilarity.”

 Most commonly, the distance between two points is measured using the Euclidean or Manhattan distance.
Proximity

A distance formula must satisfy the following four axioms.

1. D(p1, p2) ≥ 0. This indicates that the distance between two points p1 and p2 cannot be negative.

2. D(p1, p2) = 0 iff p1 = p2. The distance is zero if and only if the two points are the same.

3. D(p1, p2) = D(p2, p1). This indicates that the distance from p1 to p2 cannot be different from the distance from p2 to p1.

4. D(p1, p2) ≤ D(p1, p3) + D(p3, p2). The distance from one point to another cannot be greater than the distance between the same two points via a third point. This is commonly known as the triangle inequality: the length of one side of a triangle cannot be greater than the sum of the lengths of the other two sides.
Proximity

 To illustrate the four axioms, consider a simple example with coordinates:

 Point p1 : (0, 0)

 Point p2 : (4, 0)

 Point p3 : (2, 3)
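The four axioms can be checked numerically for these three points. The sketch below assumes the Euclidean distance as the distance formula D (the slides do not say which formula the example uses) and relies on `math.dist` from the Python standard library (Python 3.8 or later).

```python
import math

p1, p2, p3 = (0, 0), (4, 0), (2, 3)
points = (p1, p2, p3)

def D(a, b):
    """Euclidean distance between two points (assumed distance formula)."""
    return math.dist(a, b)

# Axiom 1: distances are never negative.
non_negative = all(D(a, b) >= 0 for a in points for b in points)

# Axiom 2: distance is zero only when the two points are the same.
identity = all((D(a, b) == 0) == (a == b) for a in points for b in points)

# Axiom 3: distance is symmetric.
symmetric = all(math.isclose(D(a, b), D(b, a)) for a in points for b in points)

# Axiom 4: triangle inequality, going from p1 to p2 via p3.
triangle = D(p1, p2) <= D(p1, p3) + D(p3, p2)

print(f"D(p1, p2) = {D(p1, p2):.3f}")  # 4.000
print(f"D(p1, p3) = {D(p1, p3):.3f}")  # 3.606
print(f"D(p3, p2) = {D(p3, p2):.3f}")  # 3.606
print(non_negative, identity, symmetric, triangle)
```

Going directly from p1 to p2 costs 4, while the detour through p3 costs about 7.21, so the triangle inequality holds for this example.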
Proximity

Name Salary ($) Age (Years)
Jane 90000 52
John 85000 48
Delilah 75000 32
Dave 90000 53
Ellen 82000 44
Importance of Proximity

1. Clustering: Clustering is the task of grouping similar data points together. This can be done by finding the distance between each pair of data points and then grouping the data points that are closest together.

2. Recommendation systems: Recommendation systems are used to recommend products or services to users. This can be done by finding the similarity between the user's profile and the profiles of other users who have purchased or rated the product or service.

3. Anomaly detection: Anomaly detection is the task of finding data points that are significantly different from the rest of the data. This can be done by finding the distance between each data point and the rest of the data.
Importance of Proximity

4. Pattern recognition: In pattern recognition, distance measures are used to determine how closely a given pattern matches a reference pattern. This is important in applications like fingerprint recognition and speech recognition.

5. Information retrieval: In information retrieval systems such as search engines, similarity measures are crucial for ranking search results. They determine how closely a document matches a user's query, allowing for more relevant search results.
Distance Calculations techniques

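Euclidean distance (two features): $\|A-B\| = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$

Generalized Euclidean distance ($n$ features): $\|A-B\| = \sqrt{\sum_{i=1}^{n} (a_i-b_i)^2}$, where $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$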

Distance Calculations techniques

Feature 1 Feature 2
Row 1 10 3
Row 2 5 4
Row 3 10 4
Row 4 8 6
Row 5 9 2

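Generalized Euclidean distance: $\|A-B\| = \sqrt{\sum_{i=1}^{n} (a_i-b_i)^2}$, where $a_i$ and $b_i$ are the $i$-th feature values of $A$ and $B$

For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).

Therefore, the distance between Row 2 and Row 5 is equal to $\sqrt{(5-9)^2 + (4-2)^2} = \sqrt{16 + 4} = \sqrt{20} \approx 4.47$.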

Distance Calculations techniques

Feature 1 Feature 2 Feature 3

Row 1 10 3 3
Row 2 5 4 5
Row 3 10 4 6
Row 4 8 6 2
Row 5 9 2 1

Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).

The Euclidean distance between Row 2 and Row 5 of the data above is: $\sqrt{(5-9)^2 + (4-2)^2 + (5-1)^2} = \sqrt{16 + 4 + 16} = \sqrt{36} = 6$.
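The same calculation can be written once for any number of features. This is a minimal sketch of the generalized Euclidean distance; the function name `euclidean` is an illustrative choice, not something defined in the slides.

```python
import math

def euclidean(a, b):
    """Generalized Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Row 2 and Row 5 with two features, then with three features.
d_two_features = euclidean((5, 4), (9, 2))
d_three_features = euclidean((5, 4, 5), (9, 2, 1))

print(f"{d_two_features:.3f}")    # 4.472
print(f"{d_three_features:.3f}")  # 6.000
```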
Class Activity

Feature 1 Feature 2
Row 1 10 3
Row 2 5 4
Row 3 10 4
Row 4 8 6
Row 5 9 2

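Manhattan distance: $\|A-B\|_1 = \sum_{i=1}^{n} |a_i-b_i|$, where $a_i$ and $b_i$ are the $i$-th feature values of $A$ and $B$

For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).

Therefore, the distance between Row 2 and Row 5 is equal to $|5-9| + |4-2| = 4 + 2 = 6$.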

Class Activity

Manhattan

Feature 1 Feature 2 Feature 3

Row 1 10 3 3
Row 2 5 4 5
Row 3 10 4 6
Row 4 8 6 2
Row 5 9 2 1

Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).

The Manhattan distance between Row 2 and Row 5 of the data above is: $|5-9| + |4-2| + |5-1| = 4 + 2 + 4 = 10$.
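For checking answers to the class activity, the Manhattan distance can be sketched as the sum of absolute feature differences. The function name `manhattan` is an illustrative choice.

```python
def manhattan(a, b):
    """Manhattan distance: sum of absolute differences, feature by feature."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    return sum(abs(x - y) for x, y in zip(a, b))

# Row 2 and Row 5 with two features, then with three features.
m_two_features = manhattan((5, 4), (9, 2))
m_three_features = manhattan((5, 4, 5), (9, 2, 1))

print(m_two_features)    # 6
print(m_three_features)  # 10
```

For the same pair of rows the Manhattan distance is never smaller than the Euclidean distance, because it follows the axes instead of the straight line between the points.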
Similarity Calculations techniques

 The Jaccard index, also known as the Jaccard coefficient, is a similarity measurement technique that is used to compute the similarity and dissimilarity between sets or vectors.

 It is the ratio of the items common to both sets to all the items in either set.

 If X and Y are two sets, then the Jaccard index between the two sets is the ratio of the size of their intersection to the size of their union: $J(X, Y) = |X \cap Y| / |X \cup Y|$.

Similarity Calculations techniques
Set Theory Technique

 The Jaccard index is a measurement technique that is used to compute the similarity and dissimilarity between sets or vectors.
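A short sketch of the set-based Jaccard index, computed as the size of the intersection divided by the size of the union. The two example sets are hypothetical (they do not come from the slides), and returning 1.0 for two empty sets is a convention chosen here.

```python
def jaccard(x, y):
    """Jaccard index of two sets: |X ∩ Y| / |X ∪ Y|."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(x & y) / len(union)

# Hypothetical example sets, for illustration only.
basket_a = {"milk", "bread", "eggs", "butter"}
basket_b = {"bread", "eggs", "jam"}

similarity = jaccard(basket_a, basket_b)  # 2 common items out of 5 distinct items
dissimilarity = 1 - similarity

print(f"Jaccard similarity: {similarity:.2f}")        # 0.40
print(f"Jaccard dissimilarity: {dissimilarity:.2f}")  # 0.60
```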
Weighted Jaccard index/coefficient/similarity

 The Jaccard index can be computed between two vectors too.

 Suppose we have a four-dimensional dataset (Features 1 through 4). Let us compute the Jaccard similarity between Row 1 and Row 3.

Feature 1 Feature 2 Feature 3 Feature 4
Row 1 10 3 3 5
Row 2 5 4 5 3
Row 3 9 4 6 4
Row 4 8 6 2 6
Row 5 20 15 10 20
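The slides do not spell out the formula for vectors. The sketch below assumes the commonly used weighted Jaccard definition for non-negative vectors: the sum of the element-wise minima divided by the sum of the element-wise maxima. The function name `weighted_jaccard` is an illustrative choice.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity (assumed definition):
    sum of element-wise minima / sum of element-wise maxima.
    Intended for vectors with non-negative values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    total_max = sum(max(x, y) for x, y in zip(a, b))
    if total_max == 0:
        return 1.0  # convention assumed here: two all-zero vectors are identical
    return sum(min(x, y) for x, y in zip(a, b)) / total_max

row_1 = (10, 3, 3, 5)
row_3 = (9, 4, 6, 4)

# Minima: 9 + 3 + 3 + 4 = 19; maxima: 10 + 4 + 6 + 5 = 25.
similarity_1_3 = weighted_jaccard(row_1, row_3)
print(f"{similarity_1_3:.2f}")  # 0.76
```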
Data Acquisition

 Data acquisition is the process of collecting data from a variety of sources.

 If data is already available in your office, use it, whether it is a large amount or a small amount (in the latter case, collect more).

 If data is not available, collect it (as primary or secondary data) from different sources.
Types of Data Structured

1. Unstructured data: Information retrieval systems work on unstructured data.

2. Structured (table) data: Databases work on structured data.

3. Semi-structured (partially structured) data: Web pages are an example of semi-structured data.

Introduction Data Science

 Data science is a field of study that focuses on techniques and algorithms to extract knowledge from

structured and unstructured data.

 In simple words: Data science is the deep study of data using technology to gain insights from data.

 Data science uses the most powerful hardware, programming systems, and efficient algorithms to solve data-

 It is the future of artificial intelligence.

Importance of Data Science

 Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinct business advantage.

 Better and faster decisions (should we choose A or B)

 Predictive analysis (what will happen next?)

 Pattern discoveries (find the pattern, or maybe hidden information in the data)

 Data science can help you detect fraud using advanced machine learning algorithms.

 It allows us to build intelligent capabilities into machines.

 You can perform sentiment analysis to gauge customer brand loyalty.

Technology and tools for Data Science

 Tools and Languages
 Power BI: it works on structured data
 MS Excel: it is not sufficient for big data
 Python Language
 R Language

 Store data
 Apache Hadoop
 Apache Spark

 Data Visualization Tools
 Power BI
 Tableau
 Matplotlib

 Machine Learning & Deep Learning tools
 TensorFlow
 PyTorch
 Scikit-learn
Business Intelligence vs Data Science

 Data Source
 Business intelligence: deals with structured data, e.g., a data warehouse.
 Data science: deals with structured and unstructured data, e.g., weblogs, feedback, etc.

 Method
 Business intelligence: analytical (historical data).
 Data science: scientific (goes deeper to know the reason for the data report).

 Skills
 Business intelligence: statistics and visualization are the two required skills.
 Data science: statistics, visualization, and machine learning are the required skills.

 Focus
 Business intelligence: focuses on both past and present data.
 Data science: focuses on past data, present data, and also future predictions.
Data Science Components

1. Big Data: Every day, humans produce huge amounts of data in the form of clicks, orders, videos, images, comments, articles, etc. This data is generally unstructured and is often called Big Data.

2. Machine Learning: Machine learning is the backbone of data science. Machine learning is all about training a machine so that it can act like a human brain.

3. Business Intelligence: Every business holds and produces a great deal of data every day. This data, when analyzed carefully and then presented in visual reports involving graphs, can bring good decision-making to life.
Data Science Components

4. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze large amounts of numerical data and find meaningful insights in it.

5. Domain Expertise: Domain expertise binds data science together. Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.

6. Data Engineering: Data engineering is a part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also includes adding metadata (data about data) to the data.
Data Science Components

7. Visualization: Data visualization means representing data in a visual context so that people can easily understand the significance of the data. Data visualization makes it easy to take in a huge amount of data through visuals.

8. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
Applications of Data Science

1. Healthcare: Healthcare companies are using data science to build sophisticated medical instruments to detect and
cure diseases.
Applications of Data Science

2. Fraud Detection: Banking and financial institutions use data science and related algorithms to detect
fraudulent transactions.
Applications of Data Science

3. Image Recognition: Identifying patterns in images and detecting objects in an image is one of the most
popular data science applications. Image recognition may also be seen on social media platforms such as Facebook,
Instagram, and Twitter.
Applications of Data Science

4. Recommendation Systems: Netflix and Amazon give movie and product recommendations based on what
you like to watch, purchase, or browse on their platforms.
