3 Datasets
1. Datasets
2. Whence comes data?
2.1. Scenario-1
2.2. Scenario-2
2.3. Scenario-3
3. Supervision
4. Partitioning the dataset
5. Summary
1. Datasets
There are different kinds of datasets. The housing dataset that we saw
right at the beginning is a tabular dataset. Data comes in the form of a
table. Each column of this table is called an attribute or a feature and
each row represents one record or observation. Recall that we also use the
term data-point to refer to each row of the table. Tabular datasets are by
far the most common form in which data is represented. Tabular data can be
neatly packed into comma-separated value (CSV) files. A few other forms of
data are:
• image
• text
• speech
Image, text and speech data cannot be packed into simple CSVs and are often
called unstructured data.
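To make the vocabulary of tabular data concrete, here is a minimal sketch that reads a tiny CSV with Python's standard csv module. The column names (area, bedrooms, price) and the values are made up for illustration; they are not taken from the housing dataset mentioned above.

```python
import csv
import io

# Hypothetical housing records; names and numbers are illustrative only.
csv_text = """area,bedrooms,price
1200,2,50.5
1500,3,72.0
900,1,35.25
"""

# DictReader treats the first line as the header, i.e. the feature names.
reader = csv.DictReader(io.StringIO(csv_text))
features = reader.fieldnames   # one entry per column (attribute/feature)
rows = list(reader)            # one dict per row (record/data-point)

print("features:", features)
print("number of data-points:", len(rows))
print("first data-point:", rows[0])
```

Each dictionary in rows is one data-point, and each key is one feature. Note that the csv module returns every value as a string, so numeric columns have to be converted before any modelling.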
2. Whence comes data?
How do we obtain data? Where does data come from? This seems like a simple
question but it doesn't have a simple answer. Here are some scenarios that
are arranged in increasing order of complexity:
2.1. Scenario-1
An FMCG company has given you some historical data concerning its sales
over the last three years. It wants you to predict the average sales in
the coming quarter.
Here we are lucky. Someone comes to our doorstep and gives us the data. It
might be the case that the company has neatly arranged the data in a
tabular format. In addition, we also have a very precise definition of the
problem statement. We have to predict a real number by looking at the data.
It is a regression problem.
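As a rough illustration of what "predicting a real number" can look like, the sketch below fits a straight line to quarterly sales using ordinary least squares and extrapolates one quarter ahead. The sales figures are invented and perfectly linear on purpose, and a linear trend is only one assumed choice of model; the company's real data would call for a more careful approach.

```python
# Twelve quarters (three years) of made-up sales figures.
quarters = list(range(12))
sales = [100.0 + 5.0 * q for q in quarters]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(quarters, sales)

# Predict sales for the coming quarter (index 12).
prediction = intercept + slope * 12
print(f"predicted sales for next quarter: {prediction:.2f}")
```

The output of the model is a real number rather than a category, which is what makes this a regression problem.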
2.2. Scenario-2
• offensive
• not-offensive
2.3. Scenario-3
You are a research scientist at a manufacturing company. You want to set
up a facility that automates the segregation of defective products from
non-defective ones. Come up with an end-to-end ML solution.
This is by far the most challenging scenario. We don't have access to the
data. We need to gather data in the first place. Once we have the data, we
need to label it or annotate it. Only then can we start thinking about
training ML models on top of the data.
3. Supervision
• labeled dataset
• unlabeled dataset
Techniques that work with labeled data fall under the category of
supervised learning. Those that work with unlabeled data come under
unsupervised learning. What is so special about the term "supervised"?
4. Partitioning the dataset
• train-dataset
• test-dataset
We train the model on the train-dataset and evaluate its performance on the
test-dataset. Often, however, we don't stop at two partitions; we go for
three:
• train-dataset
• validation-dataset
• test-dataset
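A three-way partition can be sketched in a few lines of Python. The function name split_dataset, the 60/20/20 proportions and the fixed seed are all assumptions chosen for illustration; the notes do not prescribe particular fractions.

```python
import random


def split_dataset(records, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the records and cut them into train, validation and test parts."""
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n = len(shuffled)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in data-points: the integers 0..99.
records = list(range(100))
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before cutting guards against any ordering in the original file leaking into the partitions, and the three parts are disjoint by construction, so no data-point used for training is also used for evaluation.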
5. Summary
Datasets come in different types: tabular, image, text, speech and so on.
The source of data varies from situation to situation. Sometimes the data
is given to us in a well-formatted and usable condition. At other times, we
have to expend effort in gathering the data and making it suitable for
further processing. Datasets can be either labeled or unlabeled. ML
algorithms that deal with labeled data are called supervised learning
methods. To evaluate the performance of any ML model, it is important to
partition the data into at least two parts, train and test: the model is
trained on the training data and evaluated on the test data.