Lecture No 1 Introduction
Lecture No: 01
Introduction to Data Science
Instructor: Dr. Khalid Mahboob
Email-id: [email protected]
Marking Scheme
Assignments + Presentation 15
Quizzes 15
Midterm Exam 30
Final Exam 40
Outline
Data
Types of Data Structured
Simplest form of Data
Can Data Speak?
Objects and features of a data table
Dimension, Vector, Proximity, Distance and similarity measurements
Introduction to Data Science
Importance of Data Science
Big Data
Technology and tools
Data Science Components
Business Intelligence vs. Data Science
Applications of Data Science
DATA
Data is a piece of information that is collected from different sources for analysis purposes.
For example, 13 is an example of data: by itself it is meaningless, and we don’t know what it is. It is a roll
2. Data is collected
In this course, our major focus will be on the third aspect — the analysis part of the data.
Simplest form of Data
A table is the simplest form of data. Most data science algorithms still use tabular data as input.
Data scientists prefer to convert any type of complex data — such as text, image, or time series — to tables to
One reason for the popularity of tabular representation is the ease of storing the tabular data directly in the
Closely look at the table for several minutes. Then, write down anything interesting you can find.
Can Data Speak?
Findings
Conclusion: Older people earn more in the company from which the data was collected.
Can Data Speak?
Now, let us go back to the definition: Data refers to “facts and statistics collected for reference or analysis.”
Experience, wisdom, (and money, which is the salary in this case) come with age.
Data Speak: Data gives us insights. Data gives us those light-bulb moments.
Objects and Features of a data table
In the table above, the two columns — salary and age — are features.
Notice that there are 3 people, whose names are placed in 3 rows. Each row is an object
For example, given the data table above, Jane is defined or explained as a person with a salary of 90000 and an age of 52.
In data science, we use the word “space” to refer to the mathematical space.
For example, if we have a two-dimensional dataset like the following one, a space with two axes is formed.
Name Salary ($) Age (Years)
Jane 90000 52
John 85000 48
Delilah 75000 32
Dave 90000 53
Ellen 82000 44
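The idea of objects and features can be sketched in a few lines of Python. This is only an illustration using the salary and age table; the variable names (`table`, `features`, `points`) are assumptions made for the example, not part of the lecture.

```python
# Each row (object) of the salary/age table stored as one record.
table = [
    {"name": "Jane", "salary": 90000, "age": 52},
    {"name": "John", "salary": 85000, "age": 48},
    {"name": "Delilah", "salary": 75000, "age": 32},
    {"name": "Dave", "salary": 90000, "age": 53},
    {"name": "Ellen", "salary": 82000, "age": 44},
]

# The columns that describe each object (hypothetical variable name).
features = ["salary", "age"]

# Turn every object into a point in the feature space.
points = {row["name"]: tuple(row[f] for f in features) for row in table}

n_objects = len(points)       # number of rows
n_dimensions = len(features)  # number of features = number of axes

print(n_objects, n_dimensions)  # 5 2
print(points["Jane"])           # (90000, 52)
```

Each person becomes one point with two coordinates, so the space has two axes: one for salary and one for age.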
Dimension
“Dimension” in data science refers to the mathematical space, such as Euclidean space.
Dimension
As an example, the given data table has three columns or three features.
In programming, we will say that this table can be stored in a 2-dimensional array of size 3 times 3. That
means it has three rows and three columns.
In data science, this table is called a three-dimensional dataset because its three features form a mathematical space of
three dimensions.
90000 52 10
85000 48 20
75000 32 30
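The difference between the programming view and the data science view of the same table can be sketched as follows. The nested list is an assumed representation, and the third column is left unnamed because the table does not label it.

```python
# The three-row, three-column table as a nested list (a 2-dimensional array).
rows = [
    [90000, 52, 10],
    [85000, 48, 20],
    [75000, 32, 30],
]

n_rows = len(rows)     # number of objects
n_cols = len(rows[0])  # number of features

# Programming view: a 2-dimensional array of size 3 x 3.
array_shape = (n_rows, n_cols)

# Data science view: the number of features is the number of dimensions.
dataset_dimension = n_cols

print(array_shape)        # (3, 3)
print(dataset_dimension)  # 3
```

The array always needs two indices (row and column), while the dimension of the dataset grows with the number of features.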
Dimension
If the dataset has 1 feature, it is called 1-dimensional; with 2 features it is called 2-dimensional, and so on.
A vector in data science is no different than a vector in physics. Each object of tabular data is sometimes referred
to as a vector.
The origin: Regardless of the data points in the space, there is an origin in every data table.
The origin is the coordinate where the value in every axis is zero.
That is, the origin in a two-dimensional space is (0, 0). The origin in a three-dimensional space is (0, 0, 0).
Vector
We have five objects or five data points. Each of these five data points is a vector.
All five vectors are drawn in the following figure. Notice that the vectors have a direction from the origin toward
Distance refers to how far one data point is from another in the space.
Many of us use the word “distance” when we are really referring to “dissimilarity.”
The distance between two points is most commonly measured using the Euclidean or the Manhattan distance.
Proximity
1. D(p1, p2)≥ 0. This indicates that the distance between two points p1 and p2 cannot be negative.
3. D(p1, p2)=D(p2, p1). This indicates that the distance from a point p1 to p2 cannot be different than the distance
from p2 to p1.
4. D(p1, p2) ≤ D(p1, p3) + D(p3, p2). The distance from one point to another cannot be greater than the distance
between the same two points via another point. This is commonly known as the triangle inequality property — the
length of one side of a triangle cannot be greater than the sum of the other two sides.
Proximity
Point p1 : (0, 0)
Point p2 : (4, 0)
Point p3 : (2, 3)
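As a quick check, the properties of a distance measure can be tested on the three points p1, p2, and p3. The sketch below assumes Euclidean distance as the measure D; the helper name `euclidean` is chosen for this example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

p1, p2, p3 = (0, 0), (4, 0), (2, 3)

d12 = euclidean(p1, p2)  # 4.0
d13 = euclidean(p1, p3)  # sqrt(13)
d32 = euclidean(p3, p2)  # sqrt(13)

non_negative = d12 >= 0                           # D(p1, p2) >= 0
symmetric = math.isclose(d12, euclidean(p2, p1))  # D(p1, p2) = D(p2, p1)
triangle = d12 <= d13 + d32                       # D(p1, p2) <= D(p1, p3) + D(p3, p2)

print(non_negative, symmetric, triangle)  # True True True
```

Going from p1 to p2 through p3 covers about 7.21 units, which is longer than the direct distance of 4, as the triangle inequality requires.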
Proximity
Jane 90000 52
John 85000 48
Delilah 75000 32
Dave 90000 53
Ellen 82000 44
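Proximity between the people in this table can be sketched as below. The choice of Euclidean distance and the helper names (`euclidean`, `closest_to`) are assumptions made for illustration only.

```python
import math

# (salary, age) for each person in the table.
people = {
    "Jane": (90000, 52),
    "John": (85000, 48),
    "Delilah": (75000, 32),
    "Dave": (90000, 53),
    "Ellen": (82000, 44),
}

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_to(name):
    """Return the other person whose point lies nearest to `name`."""
    others = (n for n in people if n != name)
    return min(others, key=lambda n: euclidean(people[name], people[n]))

print(euclidean(people["Jane"], people["Dave"]))  # 1.0
print(closest_to("Jane"))                         # Dave
```

Jane and Dave have the same salary and differ by one year of age, so their distance is 1. Because salary values are much larger than age values, salary differences dominate the raw distances in this table.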
Importance of Proximity
1. Clustering: Clustering is the task of grouping similar data points together. This can be done by finding the
distance between each data point and then grouping the data points that are closest together.
2. Recommendation systems: Recommendation systems are used to recommend products or services to users. This
can be done by finding the similarity between the user's profile and the profiles of other users who have
3. Anomaly detection: Anomaly detection is the task of finding data points that are significantly different from
the rest of the data. This can be done by finding the distance between each data point and the rest of the data
Importance of Proximity
1. Pattern Recognition: In pattern recognition, distance measures are used to determine how closely a given pattern
matches a reference pattern. This is important in applications like fingerprint recognition and speech recognition.
2. Information Retrieval: In information retrieval systems such as search engines, similarity measures are crucial for ranking
search results. They determine how closely a document matches a user's query, allowing for more relevant search results.
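One of the uses listed above, anomaly detection, can be sketched with nothing more than a distance function. The points below are hypothetical, and scoring each point by its mean distance to all the others is just one simple approach assumed for this example.

```python
import math

# Hypothetical 2-dimensional data: four points close together, one far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]

def mean_distance_to_others(i):
    """Average Euclidean distance from points[i] to every other point."""
    p = points[i]
    dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
    return sum(dists) / len(dists)

scores = [mean_distance_to_others(i) for i in range(len(points))]

# The point with the largest mean distance is the most unusual one.
most_unusual = points[scores.index(max(scores))]

print(most_unusual)  # (9, 9)
```

Clustering and recommendation rely on the same building block: compute distances, then group or rank the points by them.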
Distance Calculation Techniques
Feature 1 Feature 2
Row 1 10 3
Row 2 5 4
Row 3 10 4
Row 4 8 6
Row 5 9 2
For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).
Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).
The Euclidean distance between Row 2 and Row 5 of the data above is:
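The calculation can be worked through in Python. This is a minimal sketch using the standard definition of Euclidean distance (the square root of the sum of squared differences); the function name `euclidean` is chosen for this example.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two features: sqrt((5-9)^2 + (4-2)^2) = sqrt(16 + 4) = sqrt(20)
d_2d = euclidean((5, 4), (9, 2))

# Three features: sqrt(16 + 4 + (5-1)^2) = sqrt(36) = 6
d_3d = euclidean((5, 4, 5), (9, 2, 1))

print(round(d_2d, 3))  # 4.472
print(d_3d)            # 6.0
```

The same function works for any number of features, because it simply adds one squared difference per dimension.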
Class Activity
Feature 1 Feature 2
Row 1 10 3
Row 2 5 4
Row 3 10 4
Row 4 8 6
Row 5 9 2
Manhattan distance: $D(A, B) = \lVert A - B \rVert_1 = \sum_{i=1}^{n} |a_i - b_i|$
For example, consider Row 2 and Row 5 of the table above. Row 2 has (5, 4), and Row 5 has (9, 2).
Manhattan
Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1).
The Manhattan distance between Row 2 and Row 5 of the data above is:
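For the class activity, the Manhattan distance between the same two rows can be checked with a short sketch. It uses the standard definition (the sum of absolute coordinate differences); the function name `manhattan` is chosen for this example.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (the L1 norm of a - b)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two features: |5-9| + |4-2| = 4 + 2
m_2d = manhattan((5, 4), (9, 2))

# Three features: |5-9| + |4-2| + |5-1| = 4 + 2 + 4
m_3d = manhattan((5, 4, 5), (9, 2, 1))

print(m_2d)  # 6
print(m_3d)  # 10
```

For these rows the Manhattan distance (6 and 10) is larger than the Euclidean distance (about 4.472 and 6), since moving along the axes is never shorter than the straight line.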
Similarity Calculation Techniques
The Jaccard index is a similarity measurement technique that is used to compute the similarity and
If X and Y are two sets, then the Jaccard index between the two sets is computed using the ratio of the size of the
The Jaccard index is a measurement technique that is used to compute the similarity and dissimilarity between
sets or vectors.
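A minimal sketch of the Jaccard index for two sets is given below, using the standard definition: the size of the intersection divided by the size of the union. The example sets and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_index(x, y):
    """Size of the intersection divided by the size of the union."""
    union = x | y
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(x & y) / len(union)

# Hypothetical example sets.
X = {"a", "b", "c"}
Y = {"b", "c", "d"}

similarity = jaccard_index(X, Y)  # 2 shared items / 4 distinct items
dissimilarity = 1 - similarity    # Jaccard distance

print(similarity)     # 0.5
print(dissimilarity)  # 0.5
```

A value of 1 means the sets are identical and a value of 0 means they share nothing.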
Weighted Jaccard index/coefficient/similarity
Suppose, we have a four-dimensional dataset (Features 1 through 4). Let us compute the Jaccard similarity between
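The weighted Jaccard similarity for numeric vectors can be sketched as follows. It assumes the commonly used definition, the sum of element-wise minimums divided by the sum of element-wise maximums, with non-negative feature values. The two four-dimensional rows are hypothetical and are not taken from the lecture.

```python
def weighted_jaccard(x, y):
    """Sum of element-wise minimums over sum of element-wise maximums."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors count as identical.
    return numerator / denominator if denominator else 1.0

# Hypothetical rows with Features 1 through 4.
row_a = (1, 0, 2, 3)
row_b = (1, 2, 2, 0)

# Minimums: 1 + 0 + 2 + 0 = 3; maximums: 1 + 2 + 2 + 3 = 8
print(weighted_jaccard(row_a, row_b))  # 0.375
```

When every value is 0 or 1, this reduces to the ordinary Jaccard index on the sets of features that are present.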
If data is already available in your office, use it (if the amount is small, collect more).
If data is not available, collect it (as primary or secondary data) from different sources.
Types of Data Structured
1. Unstructured Data: Information retrieval systems work on this unstructured data.
Data science is a field of study that focuses on techniques and algorithms to extract knowledge from
In simple words: Data science is the deep study of data using technology to gain insights from data.
Data science uses the most powerful hardware, programming systems, and efficient algorithms to solve
data-related problems.
Data is the oil of today’s world. With the right tools, technologies, and algorithms, we can convert data
into a distinct business advantage.
Pattern discoveries (find the pattern, or maybe hidden information in the data)
Data Science can help you to detect fraud using advanced machine learning algorithms
Tableau
MS Excel: it is not sufficient for big data
Matplotlib
Python Language
Machine Learning & Deep Learning tools
R Language
TensorFlow
Store data
PyTorch
Apache Hadoop
scikit-learn
Apache Spark
Business Intelligence vs Data Science
Data Source
  Business Intelligence: Deals with structured data, e.g., a data warehouse.
  Data Science: Deals with structured and unstructured data, e.g., weblogs, feedback, etc.
Method
  Business Intelligence: Analytical (historical data).
  Data Science: Scientific (goes deeper to know the reason for the data report).
Skills
  Business Intelligence: Statistics and Visualization are the two required skills.
  Data Science: Statistics, Visualization, and Machine Learning are the required skills.
Focus
  Business Intelligence: Focuses on both past and present data.
  Data Science: Focuses on past data, present data, and also future predictions.
Data Science Components
1. Big Data: Every day, humans are producing so much data in the form of clicks, orders, videos, images,
comments, articles, etc. This data is generally unstructured and is often called Big Data.
2. Machine Learning: Machine learning is the backbone of data science. Machine learning is all about
3. Business Intelligence: Every business holds and produces a large amount of data every day. When this data is analyzed
carefully and presented in visual reports with graphs, it supports good decision-making.
Data Science Components
4. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect
and analyze large amounts of numerical data and find meaningful insights in it.
5. Domain Expertise: In data science, domain expertise binds data science together. Domain expertise means
specialized knowledge or skills in a particular area. In data science, there are various areas for which we need
domain experts.
6. Data engineering: Data engineering is a part of data science that involves acquiring, storing, retrieving,
and transforming data. Data engineering also includes adding metadata (data about data) to the data.
Data Science Components
7. Visualization: Data visualization means representing data in a visual context so that people can easily
understand its significance. It makes a huge amount of data easy to take in through visuals.
8. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity,
structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
Applications of Data Science
1. Healthcare: Healthcare companies are using data science to build sophisticated medical instruments to detect and
cure diseases.
Applications of Data Science
2. Fraud Detection: Banking and financial institutions use data science and related algorithms to detect
fraudulent transactions.
Applications of Data Science
3. Image Recognition: Identifying patterns in images and detecting objects in an image is one of the most
popular data science applications. Image recognition may also be seen on social media platforms such as Facebook,
Instagram, and Twitter.
Applications of Data Science
4. Recommendation Systems: Netflix and Amazon give movie and product recommendations based on what
you like to watch, purchase, or browse on their platforms.