3 Datasets

iitm ds

Uploaded by

DEBENDRA KUMAR DHIR


Datasets

1. Datasets
2. Whence comes data?
2.1. Scenario-1
2.2. Scenario-2
2.3. Scenario-3
3. Supervision
4. Partitioning the dataset
5. Summary

Question: What is a dataset and why is it important?

1. Datasets

There are different kinds of datasets. The housing dataset that we saw
right at the beginning is a tabular dataset: the data comes in the form of
a table. Each column of this table is called an attribute or a feature and
each row represents one record or observation. Recall that we also use the
term data-point to refer to each row of the table. By far, tabular datasets
are the most common form in which data is represented. Tabular data can be
neatly packed into comma-separated values (CSV) files. A few other forms of
data are:

• image
• text
• speech

Image, text and speech data cannot be packed into simple CSVs and are often
called unstructured data.
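
To make the tabular picture concrete, here is a minimal sketch that reads a small CSV and separates the features (columns) from the data-points (rows). The column names and values are made up for illustration and are not the housing dataset referred to above.

```python
import csv
import io

# Hypothetical housing data; column names and values are assumptions.
CSV_TEXT = """area_sqft,bedrooms,price
1200,2,75000
1500,3,98000
900,1,52000
"""

def load_table(text):
    """Parse CSV text into (feature names, list of rows)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)              # each row is one data-point
    return reader.fieldnames, rows   # the header gives the features

features, rows = load_table(CSV_TEXT)
print("features:", features)
print("number of data-points:", len(rows))
print("first data-point:", rows[0])
```

Note that the csv module reads every value as a string; converting columns to numbers is a separate step.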

2. Whence comes data?

How do we obtain data? Where does data come from? This seems like a simple
question but it doesn't have a simple answer. Here are some scenarios that
are arranged in increasing order of complexity:

2.1. Scenario-1
An FMCG company has given you some historical data concerning its sales
over the last three years. It wants you to predict the average sales in
the coming quarter.

Here we are lucky. Someone comes to our doorstep and gives us the data. It
might be the case that the company has neatly arranged the data in a
tabular format. In addition, we also have a very precise definition of the
problem statement. We have to predict a real number by looking at the data.
It is a regression problem.
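
As a rough illustration of what "predicting a real number" can look like, the sketch below fits a straight-line trend to quarterly sales and extrapolates one quarter ahead. The sales figures are invented, and a linear trend is only one possible assumption; the scenario itself does not prescribe any particular model.

```python
def fit_line(ys):
    """Fit y = a + b*t by ordinary least squares, with t = 0, 1, ..., n-1.

    Assumes at least two observations.
    """
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    sxx = sum((t - t_mean) ** 2 for t in ts)
    b = sxy / sxx            # slope: change in sales per quarter
    a = y_mean - b * t_mean  # intercept: fitted sales at t = 0
    return a, b

# Hypothetical average sales for 12 quarters (three years).
quarterly_sales = [100.0, 104.0, 103.0, 110.0, 112.0, 115.0,
                   113.0, 120.0, 124.0, 123.0, 128.0, 131.0]

a, b = fit_line(quarterly_sales)
next_quarter = len(quarterly_sales)   # index of the coming quarter
prediction = a + b * next_quarter     # a single real number
print(f"predicted average sales for the next quarter: {prediction:.1f}")
```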

2.2. Scenario-2

Twitter is developing an algorithm to detect tweets that contain
offensive content. As a data scientist, you are given a dump of one
million tweets and asked to develop an algorithm to solve the problem.

This is a more challenging problem compared to scenario-1. First, this is
an instance of what is called a binary classification problem. Instead of
predicting a real number, we have to predict one of two (binary) outcomes
for each tweet:

• offensive
• not-offensive

In order to train a computer to distinguish between the two kinds of
tweets, we need to give it examples of tweets of both kinds. Unfortunately,
we don't have that information. If that information is absent, how can we
teach the computer to differentiate between the two? So, the first task
here is to get the dataset labeled. That is, for each tweet, mark it as
"offensive" or "not-offensive". This process is time-consuming and requires
considerable manpower, especially if the dataset is large.
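
The labeling step can be pictured with a small sketch. The tweets, the dictionary layout and the helper names below are all hypothetical; the point is only that a dump of tweets starts out with no labels and a human annotator fills them in one by one.

```python
# Hypothetical tweets; in the scenario there would be a million of them.
tweets = ["have a great day", "you are an idiot", "lovely weather today"]

# An unlabeled dataset: only the text is known, the label is missing (None).
dataset = [{"text": t, "label": None} for t in tweets]

def add_label(dataset, index, label):
    """Record a human annotator's decision for one tweet."""
    if label not in ("offensive", "not-offensive"):
        raise ValueError(f"unknown label: {label!r}")
    dataset[index]["label"] = label

def unlabeled_count(dataset):
    """Number of tweets that still await annotation."""
    return sum(1 for row in dataset if row["label"] is None)

add_label(dataset, 0, "not-offensive")
add_label(dataset, 1, "offensive")
print("still unlabeled:", unlabeled_count(dataset))
```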

2.3. Scenario-3
You are a research scientist at a manufacturing company. You want to set
up a facility that automates the segregation of defective products from
non-defective ones. Come up with an end-to-end ML solution.

This is by far the most challenging scenario. We don't have access to the
data. We need to gather data in the first place. Once we have the data, we
need to label it or annotate it. Only then can we start thinking about
training ML models on top of the data.

3. Supervision

Labeling a dataset is an important part of the data preparation process.
However, there may be situations where labeling is not practically
feasible. In such cases, we have to settle for unlabeled data. Therefore,
datasets in ML can be classified into two categories:

• labeled dataset
• unlabeled dataset

Techniques that work with labeled data fall under the category of
supervised learning. Those that work with unlabeled data come under
unsupervised learning. What is so special about the term "supervised"?

The Cambridge Dictionary defines the verb supervise as follows: to watch a
person or activity to make certain that everything is done correctly,
safely, etc. By a slight extension of this definition, we could say that a
supervisor is a teacher who tells us whether we are right or wrong. In this
sense, the label performs the role of a supervisor for the machine as it is
learning. With unlabeled data, there is no supervision available.
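
One way to see the label acting as a supervisor is to compare a model's guesses against the labels and count how often they agree. The sketch below assumes a non-empty list of labels; the predictions are made-up values standing in for the output of some model.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that agree with the labels (labels non-empty)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# The labels play the supervisor: they say which guesses are right or wrong.
labels = ["offensive", "not-offensive", "not-offensive", "offensive"]
# Hypothetical guesses from some model; the third one is wrong.
predictions = ["offensive", "not-offensive", "offensive", "offensive"]

print("accuracy:", accuracy(predictions, labels))
```

Without the labels list there is nothing to compare the guesses against, which is the situation unsupervised methods have to work with.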

4. Partitioning the dataset

As humans, how do we know if someone has learnt a skill or not? Tests or
exams are the way to go. Exams are so ubiquitous that we often conflate
learning with scoring well in exams. However, for a machine, getting a good
score in an exam is a good enough proxy for learning. For almost every
skill that we can think of, there is some test or exam to evaluate our
competency in that skill. Take the analogy that we have been working with:
three-digit addition. To know if kids have learnt addition, teachers
conduct tests that have problems on three-digit addition.

An important feature of a test is that it should be challenging. If we ask
the same questions that are there in the textbook, kids might score high
marks. But chances are that a lot of them would have memorized the answers,
so a high score would not tell us whether they have actually learnt to add.
The same is true of a model that is tested on the data it was trained on.
Therefore, whenever we have a dataset, we always partition it into two
parts:

• train-dataset
• test-dataset

We train the model on the train-dataset and evaluate its performance on the
test-dataset. But often we don't stop with two partitions; we go for three:

• train-dataset
• validation-dataset
• test-dataset

Think about the validation-dataset as additional problems for practice or a
mock exam that helps the machine learn a good model. The test-dataset is
not shown to the model during the learning stage. The learning algorithm
has access to only the train-dataset and the validation-dataset. Once the
learning process is complete, the model is evaluated on the test-dataset.
The test-dataset is sacred in any ML problem. It should be kept hidden and
used only at the end. This is analogous to the effort taken by the
administration of colleges and universities to seal exam papers and keep
them secure until the day of the examination. If the exam paper somehow
gets leaked, the exam can no longer be conducted in a fair manner!
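
The three-way partition can be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name are assumptions chosen for illustration; the text does not fix any particular split sizes.

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows and cut them into train, validation and test parts."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # fixed seed makes the split repeatable
    n = len(rows)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = rows[:n_test]                 # kept hidden until the very end
    val = rows[n_test:n_test + n_val]    # the "mock exam"
    train = rows[n_test + n_val:]        # everything else is for training
    return train, val, test

data = list(range(100))   # stand-in for 100 data-points
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Shuffling before cutting avoids the parts inheriting whatever order the data was stored in, and every data-point lands in exactly one part.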

5. Summary
Datasets come in different types: tabular data, image, text, speech data
and so on. The source of data varies from situation to situation. Sometimes
the data could be given to us in a well-formatted and usable condition. At
other times, we would have to expend effort in gathering data and making it
suitable for further processing. Datasets could either be labeled or
unlabeled. ML algorithms that deal with labeled data are called supervised
learning methods; those that deal with unlabeled data are called
unsupervised learning methods. To evaluate the performance of any ML model,
it is important to partition the data into at least two parts, train and
test, and often a third, validation. The model is trained on the training
data and evaluated on the test data.
