Data Science Course From Packt
As the name implies, data science projects require data, but it is actually more important to
have defined a clear business problem to solve first. If it's not framed correctly, a project may
lead to incorrect results as you may have used the wrong information, not prepared the data
properly, or led a model to learn the wrong patterns. So, it is absolutely critical to properly
define the scope and objective of a data science project with your stakeholders.
Machine learning algorithms are commonly grouped into three types:
Supervised learning
Unsupervised learning
Reinforcement learning
Supervised Learning
Supervised learning refers to a type of task where an algorithm is trained to learn patterns
based on prior knowledge. That means this kind of learning requires the labeling of the
outcome (also called the response variable, dependent variable, or target variable) to be
predicted beforehand. For instance, if you want to train a model that will predict whether a
customer will cancel their subscription, you will need a dataset with a column (or variable)
that already contains the churn outcome (cancel or not cancel) for past or existing customers.
This outcome has to be labeled by someone prior to the training of a model. If this dataset
contains 5,000 observations, then all of them need to have the outcome populated. The
objective of the model is to learn the relationship between this outcome column and the other
features (also called independent variables or predictor variables). Following is an example of
such a dataset:
The Cancel column is the response variable. This is the column you are interested in, and you
want the model to predict accurately the outcome for new input data (in this case, new
customers). All the other columns are the predictor variables.
The model, after being trained, may find the following pattern: a customer is more likely to
cancel their subscription after 12 months and if their average monthly spend is over $50. So,
if a new customer has gone through 15 months of subscription and is spending $85 per
month, the model will predict this customer will cancel their contract in the future.
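The pattern described above can be written out as a hand-coded rule. This is only a sketch of
what a trained model might have learned, not a trained model itself. The function name
predict_cancel and its parameter names are hypothetical; the thresholds come from the example
in the text.

```python
def predict_cancel(months_subscribed, avg_monthly_spend):
    """Return True if the illustrative rule predicts a cancellation."""
    # Thresholds taken from the example pattern: more than 12 months
    # of subscription and an average monthly spend above $50.
    return months_subscribed > 12 and avg_monthly_spend > 50


# New customer from the text: 15 months of subscription, $85 per month
new_customer_prediction = predict_cancel(15, 85)
print(new_customer_prediction)
```

A real model learns such thresholds from the labeled data rather than having them typed in by
hand.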
When the response variable contains a limited number of possible values (or classes), it is a
classification problem (you will learn more about this in Chapter 3, Binary Classification,
and Chapter 4, Multiclass Classification with RandomForest). The model will learn how to
predict the right class given the values of the independent variables. The churn example we
just mentioned is a classification problem as the response variable can only take two different
values: yes or no.
On the other hand, if the response variable is a continuous numeric value that can take any
value from an infinite number of possibilities, it is called a regression problem.
An example of a regression problem is where you are trying to predict the exact number of
mobile phones produced every day for some manufacturing plants. This value can potentially
range from 0 to an infinite number (or a number big enough to have a large range of potential
values), as shown in Figure 1.2.
In the preceding figure, you can see that the values for Daily output can take any value
from 15000 to more than 50000. This is a regression problem, which we will look at
in Chapter 2, Regression.
Unsupervised Learning
Unsupervised learning is a type of algorithm that doesn't require any response variables at all.
In this case, the model will learn patterns from the data by itself. You may ask what kind of
pattern it can find if there is no target specified beforehand.
This type of algorithm usually can detect similarities between variables or records, so it will
try to group those that are very close to each other. This kind of algorithm can be used for
clustering (grouping records) or dimensionality reduction (reducing the number of variables).
Clustering is very popular for performing customer segmentation, where the algorithm will
look to group customers with similar behaviors together from the data. Chapter 5,
Performing Your First Cluster Analysis, will walk you through an example of clustering
analysis.
Reinforcement Learning
Reinforcement learning is another type of algorithm that learns how to act in a specific
environment based on the feedback it receives. You may have seen some videos where
algorithms are trained to play Atari games by themselves. Reinforcement learning techniques
are being used to teach the agent how to act in the game based on the rewards or penalties it
receives from the game.
For instance, in the game Pong, the agent will learn to not let the ball drop after multiple
rounds of training in which it receives high penalties every time the ball drops.
Note: Reinforcement learning algorithms are out of scope and will not be covered in this
course.
Overview of Python
As mentioned earlier, Python is one of the most popular programming languages for data
science. But before diving into Python's data science applications, let's have a quick
introduction to some core Python concepts.
Types of Variable
In Python, you can handle and manipulate different types of variables. Each has its own
specificities and benefits. We will not go through every single one of them but rather focus
on the main ones that you will have to use in this book. For each of the following code
examples, you can run the code in Google Colab to view the given output.
Numeric Variables
The most basic variable type is numeric. This can contain integer or decimal (or float)
numbers, and some mathematical operations can be performed on top of them.
Let's use an integer variable called var1 that will take the value 8 and another one
called var2 with the value 160.88, and add them together with the + operator, as shown
here:
var1 = 8
var2 = 160.88
var1 + var2
Very simple, right? In Python, you can perform other mathematical operations on numerical
variables, such as multiplication (with the * operator) and division (with /).
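As a minimal sketch, the other operators can be tried on the same two variables. The names
total, product, and quotient are introduced here for illustration only.

```python
var1 = 8
var2 = 160.88

total = var1 + var2      # addition: int + float gives a float
product = var1 * var2    # multiplication
quotient = var2 / var1   # division with / always returns a float

print(total)
print(product)
print(quotient)
print(type(var1), type(var2), type(total))
```

Mixing an integer with a float produces a float, so no explicit conversion is needed here.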
Text Variables
Another interesting type of variable is string, which contains textual information. You can
create a variable with some specific text using the single or double quote, as shown in the
following example:
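var3 = 'Hello, '  # assumed example value, single quotes
var4 = "World"  # assumed example value, double quotes
print(var3)
print(var4)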
Python also provides a syntax called f-strings (formatted string literals) for printing text
together with the values of defined variables. It is very handy when you want to print results
with additional text that makes them more readable and easier to interpret. It is also quite
common to use f-strings to print logs. You need to add f before the opening quote (single or
double) to specify that the text is an f-string. You can then place an existing variable inside
the quotes, wrapped in curly brackets, {}, and its value is displayed as part of the text. For
instance, to print Text: before the values of var3 and var4, write an f-string that starts with
Text: and is followed by {var3} and {var4}.
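A minimal sketch of such an f-string follows. The values assigned to var3 and var4 are
assumptions made for illustration, as is the message variable.

```python
# Assumed example values for var3 and var4 (illustrative only)
var3 = 'Hello, '
var4 = "World"

# The f prefix makes Python replace {var3} and {var4} with their values
message = f"Text: {var3}{var4}"
print(message)
```

Without the f prefix, the curly brackets and variable names would be printed literally.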
You can also perform some text-related transformations with string variables, such as
capitalizing text, replacing characters, or joining strings. For instance, you can concatenate
the two variables together with the + operator:
var3 + var4
This returns a single string made up of the text of var3 followed by the text of var4.
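The transformations mentioned above can be sketched with the built-in string methods .upper()
and .replace(). The values of var3 and var4 are assumed example values, and the names
combined, shouted, and swapped are hypothetical.

```python
# Assumed example values (illustrative only)
var3 = 'Hello, '
var4 = 'World'

combined = var3 + var4                          # concatenation with +
shouted = combined.upper()                      # new string in upper case
swapped = combined.replace('World', 'Python')   # replace a substring

print(combined)
print(shouted)
print(swapped)
```

Strings cannot be modified in place, so each method returns a new string and leaves the
original unchanged.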
Python List
Another very useful type of variable is the list. It is a collection of items that can be changed
(you can add, update, or remove items). To declare a list, you will need to use square
brackets, [], like this:
A list can have different item types, so you can mix numerical and text variables in it:
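A minimal sketch of both declarations follows. The list contents are assumed example values;
only the name var6 and the item 15019 appear in the later examples, and var5 is a hypothetical
name.

```python
# A list of text items (var5 and its contents are illustrative)
var5 = ['I', 'love', 'data', 'science']

# A list mixing text and numeric items (contents are assumed values)
var6 = ['Packt', 15019, 2020, 'Data Science']

print(var5)
print(var6)
print(len(var6))  # number of items in the list
```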
An item in a list can be accessed by its index (its position in the list). To access the first
(index 0) and third elements (index 2) of a list, you do the following:
print(var6[0])
print(var6[2])
Python provides a slicing syntax to access a range of items using the : operator. You just need
to specify the starting index on the left side of the operator and the ending index on the
right side. The ending index is always excluded from the range. So, if you want to get the
first three items (index 0 to 2), you should do as follows:
print(var6[0:3])
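A few more slices, as a sketch on an assumed list (the contents of var6 here are example
values, and the result names are hypothetical):

```python
# Assumed example contents for var6 (illustrative only)
var6 = ['Packt', 15019, 2020, 'Data Science']

first_three = var6[0:3]   # indexes 0, 1 and 2; index 3 is excluded
from_second = var6[1:]    # omitting the end index runs to the end of the list
up_to_second = var6[:2]   # omitting the start index begins at index 0

print(first_three)
print(from_second)
print(up_to_second)
```

A slice returns a new list, so var6 itself is left unchanged.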
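You can also iterate through every item of a list using a for loop. If you want to print every
item of the var6 list, you should do this:
for item in var6:
    print(item)
You can add an item to the end of a list using the .append() method:
var6.append('Python')
print(var6)
To delete an item from the list, you use the .remove() method:
var6.remove(15019)
print(var6)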
Python Dictionary
Another very popular Python variable type used by data scientists is the dictionary. For
example, it can be used to load JSON data into Python so that it can then be converted into a
DataFrame (you will learn more about the JSON format and DataFrames in the following
sections). A dictionary contains multiple elements, like a list, but each element is
organized as a key-value pair. A dictionary is not indexed by numbers but by keys. So, to
access a specific value, you will have to call the item by its corresponding key. To define a
dictionary in Python, you will use curly brackets, {}, and specify the keys and values
separated by :, as shown here:
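A minimal sketch of such a definition follows, assuming a dictionary named var7 whose
'Language' key holds 'Python', as the lookup in the text implies. The 'Topic' entry is an
assumed example.

```python
# 'Language': 'Python' matches the lookup used in the text;
# the 'Topic' entry is an assumed example
var7 = {'Topic': 'Data Science', 'Language': 'Python'}

print(var7)
print(var7['Language'])
```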
To access a specific value, you need to provide the corresponding key name. For instance, if
you want to get the value Python, you do this:
var7['Language']
Python provides a method to access all the key names from a dictionary, .keys(), which is
used as shown in the following code snippet:
var7.keys()
There is also a method called .values(), which is used to access all the values of a
dictionary:
var7.values()
You can iterate through all items from a dictionary using a for loop and
the .items() method, as shown in the following code snippet:
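Such a loop can be sketched as follows. The contents of var7 are assumed example values, and
the names key, value, and pairs are hypothetical.

```python
# Assumed example contents for var7 (illustrative only)
var7 = {'Topic': 'Data Science', 'Language': 'Python'}

# .items() yields one (key, value) pair per element
for key, value in var7.items():
    print(f"{key}: {value}")

# The same pairs collected into a list
pairs = list(var7.items())
```

Unpacking each pair into key and value avoids a second lookup inside the loop.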
You can add a new element to a dictionary by assigning a value to a new key name, like this:
var7['Publisher'] = 'Packt'
print(var7)
You can delete an element from a dictionary with the del keyword:
del var7['Publisher']
print(var7)
Exercise 1.01 puts the concepts we have just covered into practice.
Note: If you are interested in exploring Python in more depth, head over to our website to get
yourself the Python Workshop.