Lecture No 1 Introduction

The document is an introduction to a Data Science course taught by Dr. Khalid Mahboob, outlining the assessment plan, key concepts, and importance of data science. It covers various topics including types of data, data analysis, proximity measures, and tools used in data science. The course emphasizes the significance of data in decision-making and predictive analysis, highlighting its role in modern business and technology.

Uploaded by

Marium Zehra

Introduction Data Science

Lecture No: 01
Introduction to Data Science
Instructor: Dr. Khalid Mahboob
Email-id: [email protected]
Marking Scheme

Assessment Plan               Marks Distribution
Assignments + Presentation    15
Quizzes                       15
Midterm Exam                  30
Final Exam                    40
Outline

 Data
 Types of Data: Structured, Unstructured, Semi-structured
 Simplest form of Data
 Can Data Speak?
 Objects and features of a data table
 Dimension, Vector, Proximity, Distance and similarity measurements
 Introduction to Data Science
 Importance of Data Science
 Big Data
 Technology and tools
 Data Science Components
 Business Intelligence vs. Data Science
 Applications of Data Science
DATA

 The word “Data” is the plural form of “datum”.

 Data is a piece of information that is collected from different sources for analysis purposes.

 On its own, a piece of data is often described as meaningless information, because it carries no context.

 For example, 13 is data: by itself it is meaningless, because we do not know whether it is a roll number, an age, a weight, a classroom number, etc.


Data

 Data is commonly defined as “facts and statistics collected for reference or analysis.” Based on this definition, data has three aspects:

1. Data comes from facts and statistics

2. Data is collected

3. Data is used for reference or analysis.

 In this course, our major focus will be on the third aspect — the analysis part of the data.
Simplest form of Data

 A table is the simplest form of data. Most data science algorithms still use tabular data as input.

 Data scientists prefer to convert complex data, such as text, images, or time series, to tables so that existing tools can be leveraged for analysis.

 One reason for the popularity of tabular representation is the ease of storing tabular data directly in the main memory of the computer.

Name      Salary ($)   Age (Years)
Jane      90000        52
John      85000        48
Delilah   75000        32
Can Data Speak?

 Closely look at the table for several minutes. Then, write down anything interesting you can find.
Can Data Speak?

 Suppose a company stores its employee data in an MS Excel file.

 Findings

1. Jane and Dave earn the highest salary

2. Delilah earns the least

3. Jane and Dave are the oldest people in the group

4. Delilah is the youngest person.

 Conclusion: Older people earn more in the company from where the data was collected.
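
The findings above can be checked with a few lines of code. The sketch below is only an illustration: it assumes the table in question is the five-employee table (Jane, John, Delilah, Dave, Ellen) that appears on the “Space” slide of this lecture, and all variable names are hypothetical.

```python
# Employee table as a list of (name, salary in $, age in years) rows.
# Assumption: this is the five-row table the findings refer to.
employees = [
    ("Jane", 90000, 52),
    ("John", 85000, 48),
    ("Delilah", 75000, 32),
    ("Dave", 90000, 53),
    ("Ellen", 82000, 44),
]

max_salary = max(salary for _, salary, _ in employees)
min_salary = min(salary for _, salary, _ in employees)

# Everyone who earns the maximum / minimum salary.
highest_paid = [name for name, salary, _ in employees if salary == max_salary]
lowest_paid = [name for name, salary, _ in employees if salary == min_salary]

# Sort by age, oldest first.
by_age = sorted(employees, key=lambda row: row[2], reverse=True)
two_oldest = [name for name, _, _ in by_age[:2]]
youngest = by_age[-1][0]

print("Highest salary:", highest_paid)
print("Lowest salary:", lowest_paid)
print("Two oldest:", two_oldest)
print("Youngest:", youngest)
```

With only five rows, the conclusion is an observation about this company's data rather than a general rule.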
Can Data Speak?

 Now, let us go back to the definition: Data refers to “facts and statistics collected for reference or analysis.”

 This table has facts.

1. This table is collected from a company.

2. We used the table for analysis purposes.

3. We revealed that the company appreciates experienced employees.


Can Data Speak?

 Basically, the data reflects a general trend: experience, wisdom, and money (the salary in this case) come with age.

 Data speaks: data gives us insights. Data gives us those light-bulb moments.
Objects and Features of a data table

 Data is generally described using two things — objects and features.

 An object is explained or defined by features.

 Features are the attributes of an object.

 For example, consider a data table containing records of employees of a company.

Name      Salary ($)   Age (Years)
Jane      90000        52
John      85000        48
Delilah   75000        32

 In the table above, the two columns — salary and age — are features.

 Notice that there are 3 people, whose names are placed in 3 rows. Each row is an object.

 For example, given the data table above, Jane is defined or explained as a person with a salary of 90000 and an age of 52. That is, two features, salary and age, explain Jane.


Space

 In data science, we use the word “space” to refer to the mathematical space.

 For example, if we have a two-dimensional dataset like the following one, a space with two axes is formed.
Name Salary ($) Age (Years)
Jane 90000 52
John 85000 48
Delilah 75000 32
Dave 90000 53
Ellen 82000 44
Dimension

 In general, a dimension refers to a direction.

 In programming, the word “dimension” describes how the cells of an array are indexed, for example by row and column in a 2-dimensional array.

 “Dimension” in data science refers to a dimension of the mathematical space, such as Euclidean space.

 Number of features = number of dimensions

90000   52   10
85000   48   20
75000   32   30

 As an example, the data table above has three columns, or three features.

 In programming, we would say that this table can be stored in a 2-dimensional array of size 3 × 3. That means it has three rows and three columns.

 In data science, this table is called a three-dimensional dataset because it forms a mathematical space of three dimensions.

 If the dataset has 1 feature, it is called 1-dimensional; with 2 features, it is called 2-dimensional; and so forth. With n features (n columns), the data is called n-dimensional.
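
The two meanings of “dimension” can be put side by side in code. This is a minimal sketch using the 3 × 3 table of numbers from this section; the variable names are hypothetical.

```python
# The three-feature table, stored with one inner list per row (object).
table = [
    [90000, 52, 10],
    [85000, 48, 20],
    [75000, 32, 30],
]

n_rows = len(table)         # number of objects
n_features = len(table[0])  # number of columns (features)

# Programming view: a 2-dimensional array, because two indices
# (row, column) are needed to reach a single cell.
cell = table[1][2]
print("Array shape:", (n_rows, n_features))
print("Cell at row 1, column 2:", cell)

# Data science view: the dimensionality equals the number of features.
dataset_dimensions = n_features
print("Dataset dimensionality:", dataset_dimensions)
```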


Vector

 The word “vector” in physics refers to a quantity with a direction.

 A vector in data science is no different from a vector in physics. Each object of tabular data is sometimes referred to as a vector.

 A point in the space = an object of the data table = a vector


Vector

 Why do we call an object a vector?

 Before answering this question, we need to know what an origin is.

 The origin: Regardless of the data points in the space, there is an origin in every data table.

 The origin is the coordinate where the value in every axis is zero.

 That is, the origin in a two-dimensional space is (0, 0). The origin in a three-dimensional space is (0, 0, 0).
Vector

 We have five objects or five data points. Each of these five data points is a vector.

 All five vectors are drawn in the following figure. Notice that the vectors have a direction from the origin toward

the data points.


Proximity

 The concept of nearness or farness in the space is known as proximity.

 Proximity is quantified in two ways:

 by computing the distance between two vectors (i.e., two data points)

 by computing the similarity between two vectors (i.e., two data points)


Proximity

 Distance refers to how far one data point is from another in the space.

 Many of us use the word “distance” when we are really referring to “dissimilarity.”

 Most often, the distance between two points is measured using the Euclidean or the Manhattan distance.
Proximity

A distance formula must satisfy the following four axioms.

1. D(p1, p2) ≥ 0. The distance between two points p1 and p2 cannot be negative.

2. D(p1, p2) = 0 iff p1 = p2. The distance is zero only when the two points are the same.

3. D(p1, p2) = D(p2, p1). The distance from p1 to p2 cannot be different from the distance from p2 to p1.

4. D(p1, p2) ≤ D(p1, p3) + D(p3, p2). The direct distance from one point to another cannot be greater than the distance between the same two points via a third point. This is commonly known as the triangle inequality: the length of one side of a triangle cannot be greater than the sum of the lengths of the other two sides.
Proximity

A distance formula must satisfy the following four axioms.

 Let’s use a simple example with coordinates:

 Point p1: (0, 0)

 Point p2: (4, 0)

 Point p3: (2, 3)
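
The four axioms can be checked numerically for these three points. The sketch below assumes the Euclidean distance as the distance formula D; the function name `euclidean` is hypothetical.

```python
import math
from itertools import permutations

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

p1, p2, p3 = (0, 0), (4, 0), (2, 3)
points = [p1, p2, p3]

for p in points:
    for q in points:
        d = euclidean(p, q)
        assert d >= 0                # axiom 1: non-negativity
        assert (d == 0) == (p == q)  # axiom 2: zero only for identical points
        assert d == euclidean(q, p)  # axiom 3: symmetry

# Axiom 4: triangle inequality for every ordering of the three points.
for a, b, c in permutations(points, 3):
    assert euclidean(a, b) <= euclidean(a, c) + euclidean(c, b)

direct = euclidean(p1, p2)                      # p1 -> p2
via_p3 = euclidean(p1, p3) + euclidean(p3, p2)  # p1 -> p3 -> p2
print(direct, via_p3)
```

Here the direct distance from p1 to p2 is 4, while the route via p3 has length 2√13 ≈ 7.21, so the triangle inequality holds.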
Proximity

Name Salary ($) Age (Years)

Jane 90000 52

John 85000 48

Delilah 75000 32

Dave 90000 53

Ellen 82000 44
Importance of Proximity

1. Clustering: Clustering is the task of grouping similar data points together. This can be done by finding the distance between each pair of data points and then grouping the data points that are closest together.

2. Recommendation systems: Recommendation systems are used to recommend products or services to users. This can be done by finding the similarity between the user's profile and the profiles of other users who have purchased or rated the product or service.

3. Anomaly detection: Anomaly detection is the task of finding data points that are significantly different from the rest of the data. This can be done by finding the distance between each data point and the rest of the data.

4. Pattern recognition: In pattern recognition, distance measures are used to determine how closely a given pattern matches a reference pattern. This is important in applications like fingerprint recognition and speech recognition.

5. Information retrieval: In information retrieval systems such as search engines, similarity measures are crucial for ranking search results. They determine how closely a document matches a user's query, allowing for more relevant search results.
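
As one illustration of the anomaly detection idea, the sketch below scores each employee in the lecture's five-row table by the average Euclidean distance to all the other employees and reports the one with the largest score. The scoring rule and all names are assumptions made for this example, not a method prescribed by the lecture; practical systems usually rescale the features first, because here the salary column dominates the distance.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (salary in $, age in years) for each employee in the lecture's table.
employees = {
    "Jane": (90000, 52),
    "John": (85000, 48),
    "Delilah": (75000, 32),
    "Dave": (90000, 53),
    "Ellen": (82000, 44),
}

def mean_distance_to_others(name):
    """Average distance from one employee to every other employee."""
    others = [vec for other, vec in employees.items() if other != name]
    return sum(euclidean(employees[name], vec) for vec in others) / len(others)

# Hypothetical anomaly score: the larger the average distance, the more unusual.
scores = {name: mean_distance_to_others(name) for name in employees}
most_unusual = max(scores, key=scores.get)
print(most_unusual)
```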
Distance Calculations techniques

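Euclidean Distance (two features): $\|A-B\| = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$

Generalized Euclidean (n features): $\|A-B\| = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

where $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$.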


Distance Calculations techniques

        Feature 1   Feature 2
Row 1   10          3
Row 2   5           4
Row 3   10          4
Row 4   8           6
Row 5   9           2

Generalized Euclidean: $\|A-B\| = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).

Therefore, the distance between Row 2 and Row 5 is equal to $\sqrt{(5-9)^2 + (4-2)^2} = \sqrt{16 + 4} = \sqrt{20} \approx 4.47$.


Distance Calculations techniques

        Feature 1   Feature 2   Feature 3
Row 1   10          3           3
Row 2   5           4           5
Row 3   10          4           6
Row 4   8           6           2
Row 5   9           2           1

Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).

The Euclidean distance between Row 2 and Row 5 of the data above is: $\sqrt{(5-9)^2 + (4-2)^2 + (5-1)^2} = \sqrt{16 + 4 + 16} = \sqrt{36} = 6$.
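
The same calculation can be written as a small function that works for any number of features. This is a minimal sketch; the function name `euclidean_distance` is hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Row 2 and Row 5 of the two-feature table.
d2 = euclidean_distance((5, 4), (9, 2))
# Row 2 and Row 5 of the three-feature table.
d3 = euclidean_distance((5, 4, 5), (9, 2, 1))

print(d2)  # square root of 20
print(d3)  # square root of 36
```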
Class Activity

        Feature 1   Feature 2
Row 1   10          3
Row 2   5           4
Row 3   10          4
Row 4   8           6
Row 5   9           2

Manhattan distance: $\|A-B\|_1 = \sum_{i=1}^{n} |a_i - b_i|$

For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).

Therefore, the Manhattan distance between Row 2 and Row 5 is equal to $|5-9| + |4-2| = 4 + 2 = 6$.


Class Activity

Manhattan

        Feature 1   Feature 2   Feature 3
Row 1   10          3           3
Row 2   5           4           5
Row 3   10          4           6
Row 4   8           6           2
Row 5   9           2           1

Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).

The Manhattan distance between Row 2 and Row 5 of the data above is: $|5-9| + |4-2| + |5-1| = 4 + 2 + 4 = 10$.
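
For the Manhattan distance, the class activity can be checked with a short function that sums the absolute differences feature by feature. This is a minimal sketch; the function name `manhattan_distance` is hypothetical.

```python
def manhattan_distance(a, b):
    """Sum of the absolute differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same number of features")
    return sum(abs(x - y) for x, y in zip(a, b))

# Row 2 and Row 5 of the two-feature table: |5-9| + |4-2|
m2 = manhattan_distance((5, 4), (9, 2))
# Row 2 and Row 5 of the three-feature table: |5-9| + |4-2| + |5-1|
m3 = manhattan_distance((5, 4, 5), (9, 2, 1))

print(m2, m3)
```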
Similarity Calculations techniques

 The Jaccard index, also known as the Jaccard coefficient, is a similarity measurement technique that is used to compute the similarity and dissimilarity between sets or vectors.

 It is the ratio of the items the sets have in common to all the items in the sets.

 If X and Y are two sets, then the Jaccard index between the two sets is the ratio of the size of the intersection to the size of the union of the two sets: $J(X, Y) = |X \cap Y| / |X \cup Y|$.
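
The set version of the Jaccard index maps directly onto Python's built-in set operations. In the sketch below, the two shopping baskets are made-up example data, and returning 1.0 for two empty sets is a convention chosen here, not something the lecture specifies.

```python
def jaccard_index(x, y):
    """Size of the intersection divided by the size of the union."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(x & y) / len(union)

# Hypothetical example data: items bought by two customers.
basket_a = {"bread", "milk", "eggs"}
basket_b = {"milk", "eggs", "butter"}

similarity = jaccard_index(basket_a, basket_b)  # 2 shared items out of 4 distinct items
dissimilarity = 1 - similarity                  # Jaccard dissimilarity
print(similarity, dissimilarity)
```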


Similarity Calculations techniques
Set Theory Technique

 The Jaccard index is a measurement technique that is used to compute the similarity and dissimilarity between sets or vectors.
Weighted Jaccard index/coefficient/similarity

 The Jaccard index can be computed between two vectors too.

 Suppose we have a four-dimensional dataset (Features 1 through 4). Let us compute the Jaccard similarity between Row 1 and Row 3.

        Feature 1   Feature 2   Feature 3   Feature 4
Row 1   10          3           3           5
Row 2   5           4           5           3
Row 3   9           4           6           4
Row 4   8           6           2           6
Row 5   20          15          10          20
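
A common definition of the weighted Jaccard similarity between two non-negative vectors is the sum of the element-wise minima divided by the sum of the element-wise maxima. The sketch below assumes that this is the definition intended here; the function name `weighted_jaccard` is hypothetical.

```python
def weighted_jaccard(a, b):
    """Sum of element-wise minima over sum of element-wise maxima.

    Assumes both vectors contain non-negative values.
    """
    if len(a) != len(b):
        raise ValueError("both vectors must have the same number of features")
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 1.0  # convention chosen here: two all-zero vectors count as identical
    return numerator / denominator

row1 = (10, 3, 3, 5)
row3 = (9, 4, 6, 4)

# minima: 9 + 3 + 3 + 4 = 19; maxima: 10 + 4 + 6 + 5 = 25
similarity_1_3 = weighted_jaccard(row1, row3)
print(similarity_1_3)
```

Under this definition, the similarity between Row 1 and Row 3 is 19/25 = 0.76.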
Data Acquisition

 Data Acquisition is a process of collecting data from a variety of sources

 If data is already available in your organization, use it, whether it is a large amount or a small amount (in which case, collect more).

 If data is not available, collect it (as primary or secondary data) from different sources.
Types of Data

1. Unstructured data: Information retrieval systems work on unstructured data.

2. Structured (tabular) data: Databases work on structured data.

3. Semi-structured (partially structured) data: Web pages are an example of semi-structured data.


Introduction to Data Science

 Data science is a field of study that focuses on techniques and algorithms to extract knowledge from structured and unstructured data.

 In simple words: Data science is the deep study of data using technology to gain insights from data.

 Data science uses the most powerful hardware, programming systems, and efficient algorithms to solve data-related problems.

 It is the future of artificial intelligence.


Importance of Data Science

 Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can convert data into a distinct business advantage.

 Better and faster decisions (should we choose A or B?)

 Predictive analysis (what will happen next?)

 Pattern discovery (finding patterns, or hidden information, in the data)

 Data science can help you detect fraud using advanced machine learning algorithms.

 It allows us to build intelligent capabilities into machines.

 You can perform sentiment analysis to gauge customer brand loyalty.


Technology and tools for Data Science

 Tools and Languages
    Power BI: it works on structured data
    MS Excel: it is not sufficient for big data
    Python Language
    R Language

 Store data
    Apache Hadoop
    Apache Spark

 Data Visualization Tools
    Power BI
    Tableau
    Matplotlib

 Machine Learning & Deep Learning tools
    TensorFlow
    PyTorch
    Scikit-learn
Business Intelligence vs Data Science

Criterion: Data Source
    Business intelligence: Deals with structured data, e.g., a data warehouse.
    Data science: Deals with structured and unstructured data, e.g., weblogs, feedback, etc.

Criterion: Method
    Business intelligence: Analytical (historical data).
    Data science: Scientific (goes deeper to know the reason for the data report).

Criterion: Skills
    Business intelligence: Statistics and visualization are the two skills required for business intelligence.
    Data science: Statistics, visualization, and machine learning are the required skills for data science.

Criterion: Focus
    Business intelligence: Focuses on both past and present data.
    Data science: Focuses on past data, present data, and also future predictions.
Data Science Components

1. Big Data: Every day, humans produce a great deal of data in the form of clicks, orders, videos, images, comments, articles, etc. This data is generally unstructured and is often called Big Data.

2. Machine Learning: Machine learning is the backbone of data science. Machine learning is all about training a machine so that it can act like a human brain.

3. Business Intelligence: Each business holds and produces a large amount of data every day. When this data is analyzed carefully and then presented in visual reports involving graphs, it can support good decision-making.

4. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze large amounts of numerical data and find meaningful insights in it.

5. Domain Expertise: Domain expertise binds data science together. Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.

6. Data Engineering: Data engineering is a part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also includes attaching metadata (data about data) to the data.

7. Visualization: Data visualization means representing data in a visual context so that people can easily understand the significance of the data. Data visualization makes it easy to take in a huge amount of data through visuals.

8. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
Applications of Data Science

1. Healthcare: Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.

2. Fraud Detection: Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.

3. Image Recognition: Identifying patterns in images and detecting objects in an image is one of the most popular data science applications. Image recognition may also be seen on social media platforms such as Facebook, Instagram, and Twitter.

4. Recommendation Systems: Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their platforms.