2 Books in 1:
Python Machine Learning and
Data Science.
The Complete Guide for Beginners to
Master Neural Networks, Artificial
Intelligence, and Data Science with
Python
Andrew Park
Density Estimation
Machine Learning allows a system to use the data available to it to suggest
similar products. For instance, if you were to pick up a copy of Pride and
Prejudice from a bookstore and run its text through a machine, Machine
Learning would help it estimate the density of the words it contains and come
up with other books that are similar to Pride and Prejudice.
Latent Variables
When you are working with latent variables, the machine uses clustering to
determine whether any of the variables present in the data are related to one
another. This comes in handy when you are not certain what caused a change in
the variables and are not aware of the relationships between them. When a
large quantity of data is involved, it is easier to look for latent variables,
because they help build a better understanding of the data obtained.
Reduction of Dimensionality
Usually, the data that is obtained has many variables and dimensions. If there
are more than three dimensions involved, the human mind cannot visualize the
data. In such situations, Machine Learning helps reduce the data to manageable
proportions so that the user can easily understand the relationships between
the variables.
Machine Learning models train machines to learn from all the available data
and offer services like prediction or classification, which in turn have many
real-life applications: self-driving cars, smartphones that recognize the
user's face, and assistants like Google Home or Alexa that recognize your
accent and voice. The accuracy of these machines improves the longer they have
been learning.
Services Offered by Machine Learning Models
Predict a category
The Machine Learning model analyzes the input data and then predicts a
category under which the output will fall. The prediction in such cases is
usually a binary answer based on "yes" or "no." For instance, it helps answer
questions like "Will it rain today or not?", "Is this a fruit?", or "Is this
mail spam or not?" This is achieved by referencing a set of data that indicates
whether a certain email falls under the category of spam or not based on
specific keywords. This process is known as classification.
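As a rough illustration of classification, the short sketch below trains a
scikit-learn decision tree on a few invented feature values for emails (the
feature names and numbers are made up purely for illustration):
from sklearn.tree import DecisionTreeClassifier

# each message is described by two invented features:
# [number of suspicious words, sender is a known contact (1/0)]
X = [[8, 0], [7, 0], [1, 1], [0, 1]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = DecisionTreeClassifier()
model.fit(X, y)
print(model.predict([[6, 0]]))  # most likely [1], i.e. spam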
Predict a quantity
This kind of system is usually used to predict a value, like predicting
rainfall from different attributes of the weather such as the temperature,
percentage of humidity, air pressure and so on. This sort of prediction is
referred to as regression. Regression algorithms have various subdivisions
like linear regression, multiple regression, etc.
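A minimal sketch of this kind of regression with scikit-learn, using invented
weather readings (all numbers are illustrative only), might look like this:
from sklearn.linear_model import LinearRegression

# columns: temperature (deg C), humidity (%), air pressure (hPa)
X = [[30, 80, 1005], [25, 60, 1012], [20, 90, 1000], [35, 40, 1020]]
y = [12.0, 4.0, 20.0, 0.5]  # rainfall in mm

model = LinearRegression()
model.fit(X, y)
print(model.predict([[28, 75, 1008]]))  # predicted rainfall for new conditions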
Clustering Systems
These systems are still in their initial stages, but their applications are
numerous and can drastically change the way business is conducted. In this
kind of system, users are grouped into different clusters according to various
behavioral factors like their age group, the region they live in or the kind
of programs they like to view. Based on this clustering, a business can
suggest programs or shows a user might be interested in watching according to
the cluster that the user was placed in during classification.
Categories of Machine Learning Systems
In the case of traditional machines, the programmer gives the machine a set of
instructions and the input parameters, which the machine uses to make some
calculations and derive an output using specific commands. In the case of
Machine Learning systems, however, the system is not restricted to the
commands that the engineer provides; the machine chooses an algorithm that it
can use to process the data set and decides the output with high accuracy. It
does this by using the training data set, which consists of historical data
and outputs.
Therefore, in the classical world, we tell the machine to process data based
on a set of instructions, while in the Machine Learning setup we do not
instruct the system step by step. The computer has to interact with the data
set, develop an algorithm using the historical data, make decisions like a
human being would, analyze the information and then provide an output. The
machine, unlike a human being, can process large data sets in short periods
and provide results with high accuracy.
There are different types of Machine Learning algorithms, and they are
classified based on the purpose of that algorithm. There are three categories
in Machine Learning systems:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
In this model, the engineers feed the machine with labeled data. In other
words, the engineer will determine what the output of the system or specific
data sets should be. This type of algorithm is also called a predictive
algorithm.
For example, consider the following table:
Currency (Label)    Weight (Feature)
1 USD               10 gm
1 EUR               5 gm
1 INR               3 gm
1 RU                7 gm
In the above table, each currency is given an attribute of weight. Here, the
currency is the label, and the weight is the attribute or feature.
The supervised Machine Learning system is first fed with this training data
set, and when it comes across any input of 3 grams, it will predict that the
coin is a 1 INR coin. The same can be said for a 10-gram coin, which it will
predict to be a 1 USD coin.
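A small sketch of this coin example, using a one-nearest-neighbor classifier
from scikit-learn on the weights taken from the table above:
from sklearn.neighbors import KNeighborsClassifier

weights = [[10], [5], [3], [7]]              # grams, from the table
labels = ["1 USD", "1 EUR", "1 INR", "1 RU"]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(weights, labels)
print(model.predict([[3]]))   # ['1 INR']
print(model.predict([[10]]))  # ['1 USD']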
Classification and regression algorithms are types of supervised Machine
Learning algorithms. Regression algorithms are used to predict values such as
match scores or house prices, while classification algorithms identify which
category a data point belongs to.
We will discuss some of these algorithms in detail in the later parts of the
book, where you will also learn how to build or implement these algorithms
using Python.
Unsupervised Learning
In this type of model, the system is more sophisticated in the sense that it
learns to identify patterns in unlabeled data and produce an output. This kind
of algorithm is used to draw meaningful inferences from large data sets. The
model is also called a descriptive model, since it uses the data and
summarizes it to generate a description of the data sets. It is often used in
data mining applications that involve large volumes of unstructured input
data.
For instance, if a system is fed input consisting of each player's name, runs
and wickets, the system can visualize that data on a graph and identify the
clusters. Two clusters will be generated: one cluster for the batsmen and the
other for the bowlers. When any new input is fed in, the player will fall into
one of these clusters, which helps the machine predict whether the player is a
batsman or a bowler.
Name     Runs    Wickets
Rachel   100     3
John     10      50
Paul     60      10
Sam      250     6
Alex     90      60
Sample data set for a match. Based on this, the cluster model can group the
players into batsmen or bowlers.
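A minimal sketch of clustering this match data with k-means in scikit-learn;
the cluster assignments suggest which players are batsmen and which are
bowlers:
from sklearn.cluster import KMeans

players = ["Rachel", "John", "Paul", "Sam", "Alex"]
stats = [[100, 3], [10, 50], [60, 10], [250, 6], [90, 60]]  # [runs, wickets]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(stats)
for name, group in zip(players, groups):
    print(name, "is in cluster", group)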
Some common algorithms that fall under unsupervised Machine Learning include
density estimation, clustering, data reduction and compression.
The clustering algorithm summarizes the data and presents it differently; this
is a technique used in data mining applications. Density estimation is used
when the objective is to visualize a large data set and create a meaningful
summary. This brings us to the concepts of data reduction and dimensionality.
These concepts state that the analysis or output should always deliver a
summary of the data set without the loss of any valuable information. In
simple words, the complexity of the data can be reduced as long as the derived
output remains useful.
Reinforcement Learning
This type of learning is similar to how human beings learn, in the sense that
the system learns to behave in a specific environment and takes actions based
on that environment. For example, human beings do not touch fire because they
know it will hurt, or because they have been told it will hurt. Sometimes, out
of curiosity, we may put a finger into the fire and learn that it burns. This
means that we will be careful with fire in the future.
The table below will summarize and give an overview of the differences
between supervised and unsupervised Machine Learning. This will also list
the popular algorithms that are used in each of these models.
Supervised Learning
- Works with labeled data
- Takes direct feedback
- Predicts output based on input data; therefore also called a "Predictive
Algorithm"
- Some common classes of supervised algorithms include: Logistic Regression,
Linear Regression (numeric prediction), Polynomial Regression, Regression
Trees (numeric prediction), Gradient Descent, Random Forest, Decision Trees
(classification), K-Nearest Neighbors (classification), Naive Bayes and
Support Vector Machines
Unsupervised Learning
- Works with unlabeled data
- No feedback loop
- Finds the hidden structure/pattern from input data; sometimes called a
"Descriptive Model"
- Some common classes of unsupervised algorithms include: clustering,
compression, density estimation and data reduction, K-Means Clustering
(clustering), Association Rules (pattern detection), Singular Value
Decomposition, Fuzzy C-Means, Partial Least Squares, Hierarchical Clustering
and Principal Component Analysis
Table: Supervised vs Unsupervised Learning
We will look at each of these algorithms briefly and learn how to implement
them in Python. It is always a good idea to identify which type of Machine
Learning model you should use by working through examples, and we will do that
shortly. First, the general steps of building a Machine Learning system are
explained in the next section:
Collect Data
This is perhaps the most time-consuming step of building a Machine Learning
system. You must collect all the relevant data that you will use to train the
algorithm.
Prepare Data
This is an important step that is usually overlooked. Overlooking this step can
prove to be a costly mistake. The cleaner and the more relevant the data you
are using is, the more accurate the prediction or the output will be.
Select an Algorithm
There are different algorithms that you can choose from, like Support Vector
Machines (SVM), k-nearest neighbors, Naive Bayes, Apriori, etc. The algorithm
that you use will primarily depend on the objective you wish to attain with
the model.
Train Model
Once you have all the data ready, you must feed it into the machine and the
algorithm must be trained to predict.
Test Model
Once your model is trained, test it on input data it has not seen before to
check that it generates appropriate outputs.
Predict
Multiple iterations will be performed and you can also feed the feedback into
the system to improve its predictions over time.
Deploy
Once you test the model and are satisfied with the way it is working, the
model can be serialized and integrated into any application you want. This
means that it is ready to be deployed.
All these steps can vary according to the application and the type of
algorithm (supervised or unsupervised) you are using. However, these steps
are generally involved in all processes of designing a system of Machine
Learning. There are various languages and tools that you can use in each of
these stages. In this book, you will learn about how you can design a system
of Machine Learning using Python.
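As a condensed, illustrative sketch of these steps with scikit-learn (using
the library's built-in iris data set in place of data you would collect
yourself):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # collect data
X_train, X_test, y_train, y_test = train_test_split(  # prepare data
    X, y, test_size=0.3, random_state=0)
model = GaussianNB()                                   # select an algorithm
model.fit(X_train, y_train)                            # train the model
predictions = model.predict(X_test)                    # test the model
print("Accuracy:", accuracy_score(y_test, predictions))
# a satisfactory model can then be serialized (for example with joblib) and deployed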
Let us understand the scenarios from the previous section below.
Scenario One
In a picture from a tagged album, Facebook recognizes the photo of the
friend.
Explanation: This is an instance of supervised learning. In this case,
Facebook is using tagged photographs to recognize the person. The tagged
photos will become the labels of the pictures. Whenever a machine is
learning from any form of labeled data, it is referred to as supervised
learning.
Scenario Two
Suggesting new songs based on someone’s past music preferences.
Explanation: This is an instance of supervised learning. The model is trained
on classified, pre-existing labels, in this case the genre of songs. This is
precisely what Netflix, Pandora, and Spotify do: they collect the songs or
movies that you like, evaluate their features based on your preferences and
then suggest songs or movies with similar features.
Scenario Three
Analyzing the bank data to flag any suspicious or fraudulent transactions.
Explanation: This is an instance of unsupervised learning. The suspicious
transaction cannot be fully defined in this case and therefore, there are no
specific labels like fraud or not a fraud. The model will try to identify any
outliers by checking for anomalous transactions.
Scenario Four
Combination of various models.
Explanation: The surge pricing feature of Uber is a combination of different
models of Machine Learning like the prediction of peak hours, the traffic in
specific areas, the availability of cabs and clustering is used to determine the
usage pattern of users in different areas of the city.
Chapter 3 Linear Regression with Python
Linear regression with one variable
The first part of linear regression that we are going to focus on is the case
where we have just one variable. This makes things a bit easier to work with
and ensures that we can get some of the basics down before we try some of the
things that are a bit harder. We are going to focus on problems that have just
one independent and one dependent variable.
To get started, we are going to use the car_price.csv data set to learn what
the price of a car is going to be. The price of the car will be our dependent
variable, and the year of the car will be the independent variable. You can
find this information in the Data sets folder that we talked about before. To
make a good prediction of car prices, we will use the Scikit-Learn library for
Python to get the right linear regression algorithm. When we have all of this
set up, we can use the following steps to help out.
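As a quick preview, here is a minimal sketch of that approach, assuming
car_price.csv has columns named Year and Price (those column names are
assumptions, not taken from the data set itself):
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("car_price.csv")
X = data[["Year"]]   # independent variable
y = data["Price"]    # dependent variable

model = LinearRegression()
model.fit(X, y)
print(model.predict(pd.DataFrame({"Year": [2015]})))  # estimated price of a 2015 car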
Exercise
Write a program that captures the following in a list: “Best”, 26, 89, 3.9
Nested Lists
A nested list is a list as an item in another list.
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_mine = ["carrot", [9, 3, 6], ['g']]
Exercise
Write a nested list for the following elements: [36, 2, 1], "Writer", 't',
[3.0, 2.5]
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_mine = ['b', 'e', 's', 't']
print(list_mine[0])  # the output will be b
print(list_mine[2])  # the output will be s
print(list_mine[3])  # the output will be t
Exercise
Given the following list:
your_collection = ['t', 'k', 'v', 'w', 'z', 'n', 'f']
a. Write a Python program to display the second item in the list
b. Write a Python program to display the sixth item in the list
c. Write a Python program to display the last item in the list.
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_mine = ['c', 'h', 'a', 'n', 'g', 'e', 's']
print(list_mine[3:5])  # picks the fourth and fifth elements: ['n', 'g']
Example
Picking elements from start to the fifth
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
print(list_mine[:-2])  # ['c', 'h', 'a', 'n', 'g']
Example
Picking the third element to the last.
print(list_mine[2:])
Exercise
Given class_names = ['John', 'Kelly', 'Yvonne', 'Una', 'Lovy', 'Pius',
'Tracy']
a. Write a Python program using the slice operator to display from the second
student onwards.
b. Write a Python program using the slice operator to display the first student
to the third using the negative indexing feature.
c. Write a Python program using the slice operator to display the fourth and
fifth students only.
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_yours = [4, 6, 5]
list_yours.extend([13, 7, 9])
print(list_yours)  # The output will be [4, 6, 5, 13, 7, 9]
The plus operator (+) can also be used to combine two lists. The * operator
can be used to iterate a list a given number of times.
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_yours = [4, 6, 5]
print(list_yours + [13, 7, 9])  # Output: [4, 6, 5, 13, 7, 9]
print(['happy'] * 4)  # Output: ['happy', 'happy', 'happy', 'happy']
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_mine = ['t', 'r', 'o', 'g', 'r', 'a', 'm']
del list_mine[1]
print(list_mine)  # ['t', 'o', 'g', 'r', 'a', 'm']
Example
The del keyword can also remove a slice of items at once. Continuing from the
list above:
del list_mine[0:4]
print(list_mine)  # ['a', 'm']
The remove() method or pop() method can be used to remove the specified
item. The pop() method will remove and return the last item if the index is
not given and helps implement lists as stacks. The clear() method is used to
empty a list.
Example
Start IDLE.
Navigate to the File menu and click New Window.
Type the following:
list_mine = ['t', 'k', 'b', 'd', 'w', 'q', 'v']
list_mine.remove('t')
print(list_mine)  # output will be ['k', 'b', 'd', 'w', 'q', 'v']
print(list_mine.pop(1))  # output will be 'b'
print(list_mine.pop())  # output will be 'v'
Exercise
Given list_yours = ['K', 'N', 'O', 'C', 'K', 'E', 'D']
a. Pop the third item in the list, save the program as list1.
b. Remove the fourth item using remove() method and save the program as
list2
c. Delete the second item in the list and save the program as list3.
d. Pop the list without specifying an index and save the program as list4.
Exercise
Use list access methods to display the following items in reverse order
list_yours=[4,9,2,1,6,7]
Use the list access method to count the elements in list_yours .
Use the list access method to sort the items in list_yours in an ascending
order/default.
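A short sketch of the list methods these exercises refer to, using the same
numbers:
list_yours = [4, 9, 2, 1, 6, 7]
list_yours.reverse()        # reverses the list in place
print(list_yours)           # [7, 6, 1, 2, 9, 4]
print(len(list_yours))      # number of elements: 6
print(list_yours.count(6))  # how many times 6 appears: 1
list_yours.sort()           # ascending order is the default
print(list_yours)           # [1, 2, 4, 6, 7, 9]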
Chapter 5 Modules In Python
Using “AND”
If you want to verify that two expressions are both true at the same time, the
keyword "and" serves that purpose. The expression evaluates to true when both
conditions return true. However, if one of the conditions fails, the
expression returns false. For instance, suppose you want to ascertain whether
two students in a class have scored over 45 marks.
>>> score_1 = 46
>>> score_2 = 30
>>> score_1 >=45 and score_2 >= 45
False
>>> score_2 = 47
>>> score_1 >= 45 and score_2 >= 45
True
The program looks complicated, but let me explain it step by step. In the
first two lines, we define two scores, score_1 and score_2. In line 3, we
perform a check to ascertain whether both scores are equal to or above 45. The
condition on the left-hand side is true, but the one on the right-hand side is
false, so the whole expression is False. Then, in the line after the False
result, we change the value of score_2 from 30 to 47. In this instance, the
value of score_2 is now greater than 45; therefore, both conditions evaluate
to true.
To make the code more readable, we can use parentheses in each test.
However, it is not compulsory to do such but makes it simpler. Let us use
parentheses to demonstrate the difference between the previous code and the
one below.
(score_1 >= 45) and (score_2 >=45)
Using “OR”
The "or" keyword allows you to check multiple conditions just like the "and"
keyword. However, the difference is that the "or" keyword is used when you
want at least one of the expressions to be true. In this situation, if one of
the expressions is true, the condition returns true. It returns false only
when both conditions are false.
Let us consider our previous example using the “OR” keyword. For instance,
you want to ascertain if two students in a class have over 45 score mark.
>>> score_1 = 46
>>> score_2 = 30
>>> score_1 >=45 or score_2 >= 45
True
>>> score_1 = 30
>>> score_1 >= 45 or score_2 >= 45
False
We began by declaring the two variables score_1 and score_2 and assigning
values to them. In the third line, we test the or condition using the two
variables. The test in that line satisfies the condition because one of the
expressions is true. We then change the value of score_1 to 30; now both
expressions fail, and the condition therefore evaluates to False.
Besides using the "and" and "or" keywords to check multiple conditions, we can
also test whether a value is present in a particular list. For instance, you
may want to verify whether a requested username already exists in a list of
usernames before completing an online registration on a website.
To do this, we can use the "in" keyword. For instance, let us use a list of
animals in the zoo and check whether a given animal is already on the list.
>>> animals = ["zebra", "lion", "crocodile", "monkey"]
>>> "monkey" in animals
True
>>> "rat" in animals
False
In the second and fourth lines, we use the "in" keyword to test whether the
quoted word exists in our list of animals. The first test confirms that
"monkey" exists in the list, whereas the second test returns False because
"rat" is not in the animals list. This method is significant because we can
generate lists of important values and check for the existence of values in
them.
There are situations where you want to check that a value isn't in a list. In
such a case, instead of using the "in" keyword, we can use the "not in"
keywords. For instance, let us consider a list of Manchester United players
before allowing someone to be part of the next match. In other words, we want
to scan the real players and ensure that the club does not field an ineligible
player.
united_player = ["Rashford", "Young", "Pogba", "Mata", "De Gea"]
player = "Messi"
if player not in united_player:
    print(f"{player.title()}, you are not qualified to play for Manchester United.")
The line "if player not in united_player:" reads quite clearly. If the value
of player isn't in the list united_player, Python evaluates the expression to
True and then executes the line indented under it. The player "Messi" isn't
part of the list united_player; therefore, he will receive a message about his
qualification status. The output will be as follows:
Messi, you are not qualified to play for Manchester United.
Exercises to Try
Conditional Testing: Write various conditional expressions. Furthermore, print
a statement to describe each condition and what the likely output of each test
will be. For instance, your code can look like this:
car = "Toyota"
print("Is car == 'Toyota'? My prediction is True.")
print(car == "Toyota")
print("\nIs car == 'KIA'? My prediction is False.")
print(car == "KIA")
Test the following conditions, evaluating each to either True or False, using
any items of your choice to form a list.
1. Test for both inequality and equality using strings and numbers
2. Test conditions using the "or" and "and" keywords
3. Test whether an item exists in the above list
4. Test whether an item doesn't exist in the list.
If Statements
Since you now know conditional tests, it will be easier for you to understand
if statements. There are various types of if statements to use in Python,
depending on your needs. In this section, you will learn the different if
statements possible and the best situation in which to apply each of them.
Simple If Statements
In any programming language, the “if statement” is the simplest to come
across. It only requires a test or condition with a single action following it,
respectively. The syntax for this statement is as follows:
if condition:
    perform action
The first line can contain any conditional statement, and the second line
gives the action to take. Ensure that you indent the second line for clarity.
If the conditional statement is true, the code under the condition is
executed. However, if it is false, the code is ignored.
For instance, we have set a standard that the minimum age for a person to
qualify for a football match is 20. We want to test whether such a person is
qualified to participate.
person = 21
if person >= 20:
    print("You are qualified for the football match against Valencia.")
In the first line, we set person to 21. The second line then evaluates whether
person is greater than or equal to 20. Because the condition is fulfilled,
Python executes the indented statement below it.
You are qualified for the football match against Valencia.
Indentation is very significant when using the if statement, just as it was in
the for loop situations. All indented lines after the if statement are
executed once the condition is satisfied. However, if the statement returns
false, the whole block of code under it is ignored and skipped.
We can also include more code inside the if statement to display what we want.
Let us add more lines to say that the match is between Arsenal and Valencia at
the Emirates Stadium.
person = 21
if person >= 20:
    print("You are qualified for the football match against Valencia.")
    print("The match is between Arsenal and Valencia.")
    print("The venue is the Emirates Stadium in England.")
The conditional statement checks the condition and prints the indented actions
once the condition is satisfied. The output will be as follows:
You are qualified for the football match against Valencia.
The match is between Arsenal and Valencia.
The venue is the Emirates Stadium in England.
Assuming the age were less than 20, there would not be any output for this
program. Let us try another example before going on to another conditional
statement.
name = "Abraham Lincoln"
if name == "Abraham Lincoln":
    print("Abraham Lincoln was a great United States President.")
    print("He is an icon that many presidents try to emulate in the world.")
The output will be:
Abraham Lincoln was a great United States President.
He is an icon that many presidents try to emulate in the world.
If-else Statements
At times, you may want to take certain actions if a particular condition isn't
met. For example, you may decide what should happen if a person isn't
qualified to play a match. Python provides the if-else statement to make this
possible. The syntax is as follows:
if conditional_test:
    perform statement_1
else:
    perform statement_2
Let us use our football match qualification to illustrate how to use the if-else
statement.
person = 18
if person >= 20:
    print("You are qualified for the football match against Valencia.")
    print("The match is between Arsenal and Valencia.")
    print("The venue is the Emirates Stadium in England.")
else:
    print("Unfortunately, you are not qualified to participate in the match.")
    print("Sorry, you have to wait until you are qualified.")
The conditional test (if person >= 20) is evaluated first to ascertain whether
the person is 20 or older before control passes to the first indented line of
code. If it is true, Python prints the statements beneath the condition.
However, in our example, the conditional test evaluates to false, so control
passes to the else section. Finally, the statements below it are printed,
since that part of the condition is fulfilled:
Unfortunately, you are not qualified to participate in the match.
Sorry, you have to wait until you are qualified.
This program works because of the two possible scenarios to evaluate – a
person must be qualified to play or not play. In this situation, the if-else
statement works perfectly when you want Python to execute one action in
two possible situations.
Let us try another.
station_numbers = 10
if station_numbers >= 12:
    print("We need an additional 3 stations in this company.")
else:
    print("We need an additional 5 stations to meet the demands of our audience.")
The output will be:
We need an additional 5 stations to meet the demands of our audience.
Exercise to Try
Consider the list of colors we have in the world. Create a variable named
colors and assign the following colors to it: blue, red, black, orange, white,
yellow, indigo, green.
Use an if statement to check whether a selected color is blue. If the color is
blue, print a message indicating a score of 5 points.
Write a program using the if-else chain to print whether a particular selected
color is green.
Write another program using the if-elif-else chain to determine the scores of
students in a class. Set a variable score to store the student's score.
If the student's score is below 40, output a message that the student has
failed.
If the student's score is above 41 but less than 55, print a message that the
student has passed.
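The last part of the exercise uses the if-elif-else form; a minimal sketch of
that form, with an extra else branch added here only to complete the example,
looks like this:
score = 48
if score < 40:
    print("The student has failed.")
elif score < 55:
    print("The student has passed.")
else:
    print("The student has passed with a distinction.")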
Chapter 8 Essential Libraries for Machine
Learning in Python
Many developers nowadays prefer to use Python for their data analysis. Python
is applied not only in data analysis but also in statistical techniques.
Scientists, especially those dealing with data, also prefer using Python for
data integration: the integration of web apps and other production
environments.
The features of Python have encouraged scientists to use it in Machine
Learning. Examples of these qualities include a consistent syntax, flexibility
and shorter development times. Python can also be used to develop
sophisticated models and prediction engines.
As a result, Python boasts a set of very extensive libraries. Remember,
libraries refer to collections of routines and functions written in a given
language. A robust library makes it possible to tackle more complex tasks
without writing many lines of code over again. It is good to note that Machine
Learning relies heavily on mathematics: mathematical optimization, elements of
probability and statistics. Therefore, Python's libraries let you perform
these complex tasks without much extra effort.
The following are examples of essential libraries in use at present.
Scikit-Learn
Scikit-learn is one of the best and most popular libraries in Machine
Learning. It supports many learning algorithms, both unsupervised and
supervised.
Examples of algorithms available in scikit-learn include the following:
k-means
decision trees
linear and logistic regression
clustering
This library builds on major components from NumPy and SciPy. Scikit-learn
adds sets of algorithms that are useful for Machine Learning and for tasks
related to data mining: it helps with classification, clustering and
regression analysis. There are also other tasks that this library can
efficiently deliver, for example ensemble methods, feature selection and data
transformation. Beginners can apply it easily, while experts can implement the
more complex and sophisticated parts of the algorithms.
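A minimal sketch of the scikit-learn workflow, fitting one of the algorithms
listed above (logistic regression) to the library's built-in iris data set:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
model.fit(X, y)
print(model.score(X, y))  # accuracy on the training data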
TensorFlow
TensorFlow is a library for algorithms that involve deep learning. Such
algorithms are not always necessary, but one good thing about them is their
ability to give correct results when done right. TensorFlow also enables you
to run your computations on a CPU or GPU: you write the program in Python,
compile it, and run it on your processor, which gives you an easy time
performing your analysis without having to write these pieces at the C++ or
CUDA level.
TensorFlow uses multi-layered graphs of nodes. The nodes perform several tasks
within the system, which include building artificial neural networks, training
them, and handling high volumes of data. Several products from companies such
as Google depend on this library. One main application is the identification
of objects; it also helps in different apps that deal with voice recognition.
Theano
Theano, too, forms a significant part of the Python library ecosystem. Its
vital task is to help with anything related to numerical computation, and it
is closely related to NumPy. It also plays other roles, such as the efficient
definition, optimization and evaluation of mathematical expressions involving
arrays.
Pandas
Pandas is a very popular library that provides high-level, high-quality data
structures. The data structures provided are simple, easy to use and
intuitive. The library contains various sophisticated built-in methods, which
make it capable of performing tasks such as grouping and timing analysis.
Another function is that it helps in combining data and also offers filtering
options. Pandas can read data from sources such as Excel, CSV and even SQL
databases, and it can manipulate the collected data to carry out its
operational roles within industry. Pandas consists of two structures that
enable it to perform its functions correctly: Series, which have only one
dimension, and DataFrames, which are two-dimensional. Pandas has been regarded
as one of the most powerful Python libraries for some time. Its main function
is to help in data manipulation, and it can export or import a wide range of
data. It is applicable in various sectors, such as the field of Data Science.
Pandas is effective in the following areas:
Splitting of data
Merging of two or more types of data
Data aggregation
Selecting or subsetting data
Data reshaping
Diagrammatic explanation: a one-dimensional Series
A    7
B    8
C    9
D    3
E    6
F    9
You can quickly delete or add columns within the DataFrame
It will help you in data conversion
Pandas can recover misplaced or missing data
It has a powerful ability to group data according to functionality
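A small sketch of the two pandas structures described above: a one-dimensional
Series (mirroring the A to F example) and a two-dimensional DataFrame with a
simple grouping operation:
import pandas as pd

series = pd.Series([7, 8, 9, 3, 6, 9], index=["A", "B", "C", "D", "E", "F"])
print(series)

frame = pd.DataFrame({"product": ["fruit", "grain", "fruit"],
                      "output": [120, 80, 150]})
print(frame.groupby("product")["output"].sum())  # aggregate output per product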
Matplotlib
Matplotlib is another sophisticated and helpful data analysis library, used
for data visualization. Its main objective is to show the industry where it
stands using various inputs. You will realize that your production figures are
of little use when you fail to share them with different stakeholders, and to
do that, Matplotlib comes in handy with the kind of visual analysis required.
For a long time it has been the Python plotting library that scientists,
especially those dealing with data, prefer. This library produces good-looking
graphics and images, and many prefer using it to create graphs for data
analysis, even as the technological landscape keeps changing with new,
advanced libraries entering the field.
It is also flexible, and because of this you are capable of making whatever
graphs you may need with only a few commands. In this Python library you can
create diverse graphs, charts of all kinds, histograms and scatterplots. You
can also make non-Cartesian charts using the same principles.
Diagrammatic explanation
The graph referred to here highlights the overall production of a company over
three years and demonstrates the use of Matplotlib in data analysis. By
looking at such a diagram, you can see in which year production was highest
compared with the other two. The company also performs well in the production
of fruits, since fruits lead in both years 1 and 2, with a tie in year 3. From
the figure, you realize that presentation, representation and analysis are
made easier by using this library. Matplotlib will eventually enable you to
produce good graphic images, accurate data summaries and much more. With its
help, you will be able to note the year your production was highest and thus
be in a position to maintain the high-productivity season.
It is good to note that this library can export graphics and can change these
graphics into PDF, GIF, and so on. In summary, tasks such as creating line
plots, bar charts, histograms and scatterplots, and exporting them in these
formats, can be undertaken with much ease.
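A sketch of the kind of grouped bar chart described above, with invented
production numbers used purely for illustration:
import numpy as np
import matplotlib.pyplot as plt

years = ["Year 1", "Year 2", "Year 3"]
fruits = [120, 150, 100]
vegetables = [90, 110, 100]
x = np.arange(len(years))

plt.bar(x - 0.2, fruits, width=0.4, label="Fruits")
plt.bar(x + 0.2, vegetables, width=0.4, label="Vegetables")
plt.xticks(x, years)
plt.ylabel("Units produced")
plt.legend()
plt.savefig("production.pdf")  # charts can be exported to PDF, PNG and so on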
Seaborn
Seaborn is also among the popular libraries within the Python category. Its
main objective is to help with visualization. It is important to note that
this library builds its foundation on Matplotlib. Because it works at a higher
level, it is capable of generating various plots, such as heat maps, violin
plots and time series plots.
Diagrammatic illustration
The line graph described here shows the performance of the different machines
the company is using. Following such a diagram, you can deduce and conclude
which machines the company should keep using to get the maximum yield. On most
occasions, this kind of evaluation with the help of the Seaborn library will
let you assess the abilities of your different inputs. Again, this information
can serve as a reference when purchasing more machines. The Seaborn library
can also help you examine the performance of other input variables within the
company; for example, the number of workers within the company can be easily
compared with their corresponding working rates.
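A minimal sketch of a seaborn line plot comparing two machines over time,
again with invented numbers:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({
    "month": [1, 2, 3, 1, 2, 3],
    "output": [50, 55, 60, 40, 42, 41],
    "machine": ["A", "A", "A", "B", "B", "B"],
})
sns.lineplot(data=data, x="month", y="output", hue="machine")
plt.savefig("machines.png")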
NumPy
This is a very widely used Python library. Its features enable it to perform
multidimensional array processing as well as matrix processing, with the help
of an extensive collection of mathematical functions. This Python library is
highly useful for solving the most significant computations in the scientific
sector. NumPy is also applicable in areas such as linear algebra, the
generation of the random numbers used within industries and, more so, Fourier
transforms. NumPy is also used by other high-end Python libraries, such as
TensorFlow, for tensor manipulation. In short, NumPy is mainly for
calculations and data storage; you can also export or load data to Python,
since it has features that enable these functions. This Python library is also
known as Numerical Python.
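A short sketch of the NumPy features mentioned above: multidimensional arrays,
linear algebra, random numbers and storage:
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
print(matrix.T)                       # transpose
print(matrix @ matrix)                # matrix multiplication
print(np.linalg.det(matrix))          # a linear algebra routine: determinant
print(np.random.default_rng(0).normal(size=3))  # random numbers
np.save("matrix.npy", matrix)         # arrays can be stored and reloaded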
SciPy
This is among the most popular libraries used in industry today. It comprises
different modules that are applicable to optimization in data analysis. It
also plays a significant role in integration, linear algebra and other areas
of mathematical statistics.
In many cases, it plays a vital role in image manipulation, a process that is
widely applied in day-to-day activities; Photoshop-style edits and much more
are examples of what SciPy supports. Many organizations prefer SciPy for
manipulating images, especially pictures used for presentations. For instance,
a wildlife society can take a picture of a cat and then manipulate it using
different colors to suit their project. The example below helps explain this
in a straightforward way. The original input image was a cat that the wildlife
society took; after manipulating and resizing the image according to our
preferences, we get a tinted image of the cat.
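A hedged sketch of that idea with scipy.ndimage; the file name cat.png is an
assumption, and the "tint" is just a scaling of the red channel:
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage

image = plt.imread("cat.png")                  # load the image as an array
smaller = ndimage.zoom(image, (0.5, 0.5, 1))   # resize to half height and width
tinted = smaller.copy()
tinted[..., 0] = np.clip(tinted[..., 0] * 1.5, 0, 1)  # boost the red channel
plt.imsave("cat_tinted.png", tinted)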
Keras
Keras is also part and parcel of the Python library ecosystem, especially
within Machine Learning. It belongs to the group of high-level neural network
libraries. It is significant to note that Keras has the capability of working
on top of other libraries, especially TensorFlow and Theano, and it can
operate nonstop without mechanical failure. In addition to this, it works well
on both the GPU and the CPU. For most beginners in Python programming, Keras
offers a secure pathway towards understanding neural networks: they will be in
a position to design a network and even to build it. Its ability to prototype
quickly makes it one of the best Python libraries among learners.
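A minimal sketch of defining and compiling a small network with Keras running
on top of TensorFlow; the layer sizes are arbitrary:
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()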
PyTorch
This is another accessible, open-source Python library. It offers extensive
choices when it comes to tools, and it is applicable in areas involving
computer vision; computer vision and visual displays play an essential role in
several types of research. It also aids in the processing of natural language.
More so, PyTorch can undertake technical tasks for developers: enormous
calculations and data analysis using tensor computations. It can also help in
creating the graphs used for computational purposes. Since it is an
open-source Python library, it can work with other tensor libraries, and in
combination with GPU acceleration its speed increases.
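A small sketch of PyTorch tensors and automatic gradients; the GPU transfer
only runs when CUDA is available:
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()      # compute gradients through the computation graph
print(x.grad)     # tensor([2., 4., 6.])

if torch.cuda.is_available():
    x = x.detach().to("cuda")  # tensor computations can be GPU-accelerated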
Scrapy
Scrapy is another library, used for creating crawling programs, that is,
spider bots and the like. Spider bots frequently help with data retrieval and
with following the URLs that make up the web. In the beginning, Scrapy was
built to assist with data scraping. However, it has undergone several
evolutions that have expanded its general purpose, so the main task of the
Scrapy library in our present day is to act as a general-purpose crawler. The
library promotes general usage, the application of reusable universal code,
and so on.
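A minimal sketch of a Scrapy spider; the site and CSS selector below are the
ones used in Scrapy's own tutorial, and the spider is run from the command
line (for example with scrapy runspider quotes_spider.py):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quotation found on the page
        for quote in response.css("span.text::text").getall():
            yield {"quote": quote}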
Statsmodels
Statsmodels is a library whose aim is data exploration using several methods
of statistical computation and assertions about the data. It has many
features, such as result statistics, which it provides with the help of
various models such as linear regression, multiple estimators, time series
analysis and other linear models. Other models, such as discrete choice
models, are also applicable here.
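A short sketch of an ordinary least squares regression fitted with
statsmodels, which then reports rich result statistics; the data is randomly
generated for illustration:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, p-values, R-squared and more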
Chapter 9 What is the TensorFlow Library
The next thing that we need to spend some time looking at is the TensorFlow
library. This is another option that comes from the Python ecosystem, and it
can help you get some Machine Learning done. It offers a few different options
for what you can do with Machine Learning, so it is definitely worth your time
to learn how to use it along with the algorithms that we talked about with the
Scikit-Learn library.
TensorFlow is another framework that you can work with in Python Machine
Learning, and it offers the programmer a few different features and tools to
get your project done compared to the others. The framework behind TensorFlow
comes from Google, and it is helpful when you are trying to work on models
that are related to deep learning. TensorFlow relies on graphs of data flow
for numerical computation, and it can make sure that many of the different
things you can do with Machine Learning are easier than ever before.
TensorFlow is going to help us out in many different ways. First, it can help
us with acquiring the data, training the models of Machine Learning that we
are trying to use, helps to make predictions, and can even modify a few of the
future results that we have to make them work more efficiently. Since each of
these steps is going to be important when it comes to doing some Machine
Learning, we can see how TensorFlow can come into our project and ensure
we reach that completion that we want even better.
First, let’s take a look at what TensorFlow is all about and some of the
background that comes with this Python library. The Brain team from Google
was the first to develop TensorFlow to use on large scale options of Machine
Learning. It was developed in order to bring together different algorithms for
both deep learning and Machine Learning, and it is going to make them more
useful through what is known as a common metaphor. TensorFlow works
along with the Python language that we talked about before. In addition to
this, it is going to provide the users with a front-end API that is easy to use
when working on a variety of building applications.
It goes a bit further, though. Even though you work with TensorFlow and it
matches up with the Python coding language while you write the code and the
algorithms, it is able to change these up. All of the applications that you
build with the help of TensorFlow are executed using the C++ language instead,
giving them an even higher level of performance than before.
TensorFlow can be used for a lot of different actions that you would need to
do to make a Machine Learning project a success. Some of the things that
you can do with this library, in particular, will include running, training, and
building up the deep neural networks, doing some image recognition,
working with recurrent neural networks, digit classification, natural language
processing, and even word embedding. And this is just a few of the things
that are available for a programmer to do when they work with TensorFlow
with Machine Learning.
Installing TensorFlow
With this in mind, we need to take some time to learn how to install
TensorFlow on a computer before we can use this library. Just like we did with
Scikit-Learn, we need to go through and set up the environment and everything
else so that this library is going to work. You will find that this library
already ships with APIs for a few programming languages (we will take a look
at these in more depth later on), including Rust, Go, C++ and Java, to name a
few. We are going to spend our time here looking at the way the TensorFlow
library works on the Windows system, but the steps that you have to use to add
this library to other operating systems are going to be pretty much the same.
Now, when you are ready to set up and download the TensorFlow library on
your Windows computer, you will be able to go through two choices on how
to download this particular library. You can choose either to work with the
Anaconda program to get it done, or a pip is going to work well, too. The
native pip is helpful because it takes all of the parts that go with the
TensorFlow library and will make sure that it is installed on your system.
And you get the bonus of the system doing this for you without needing to
have a virtual environment set up to get it done.
However, this one may seem like the best choice, but it can come with some
problems along the way. Installing the TensorFlow library using a pip can be
a bit faster and doesn’t require that virtual environment, but it can come with
some interference to the other things that you are doing with Python.
Depending on what you plan to do with Python, this can be a problem so
consider that before starting.
The good thing to remember here is that if you do choose to work with a pip
and it doesn’t seem like it is going to interfere with what you are doing too
much, you will be able to get the whole TensorFlow library to run with just
one single command. And once you are done with this command, the whole
library, and all of the parts that you need with it, are going to be set up and
ready to use on the computer with just one command. And the pip even
makes it easier for you to choose the directory that you would like to use to
store the TensorFlow library for easier use.
In addition to using the pip to help download and install the TensorFlow
library, it is also possible for you to use the Anaconda program. This one is
going to take a few more commands to get started, but it does prevent any
interference from happening with the Python program, and it allows you to
create a virtual environment that you can work with and test out without a ton
of interference or other issues with what is on your computer.
Though there are a few benefits to using the Anaconda program instead of a
pip, it is often recommended that you install this program right along with a
pip, rather than working with just the conda install. With this in mind, we will
still show you some of the steps that it takes to just use the conda install on its
own so you can do this if you choose.
One more thing that we need to consider here before moving on is that you
need to double-check which version of Python is working. Your version
needs to be at Python 3.5 or higher for this to work for you. Python 3 uses the
pip 3 program, and it is the best and most compatible when it comes to
working with a TensorFlow install. Working with an older version is not
going to work as well with this library and can cause some issues when you
try to do some of your Machine Learning code.
You can work with either the CPU or the GPU version of this library based
on what you are the most comfortable with. The first code below is the CPU
version and the second code below is going to be the GPU version.
pip3 install --upgrade tensorflow
pip3 install --upgrade tensorflow-gpu
Both of these commands are going to be helpful because they are going to
ensure that the TensorFlow library is going to be installed on your Windows
system. But another option that you can use is with the Anaconda package
itself. The methods above were still working with the pip installs, but we
talked about how there are a few drawbacks when it comes to this one.
Pip is the program that is already installed automatically when you install
Python onto your system as well. But you may find out quickly that
Anaconda is not. This means that if you want to ensure that you can get
TensorFlow to install with this, then you need to first install the Anaconda
program. To do this, just go to the website for Anaconda and then follow the
instructions that come up to help you get it done.
Once you have had the time to install the Anaconda program, then you will
notice that within the files there is going to be a package that is known as
conda. This is a good package to explore a bit at this time because it is going
to be the part that helps you manage the installation packages, and it is
helpful when it is time to manage the virtual environment. To help you get
the access that you need with this package, you can just start up Anaconda
and it will be there.
When Anaconda is open, you can go to the main screen on Windows, click
the Start button, and then choose All programs from here. You need to go
through and expand things out to look inside of Anaconda at the files that are
there. You can then click on the prompt that is there for Anaconda and then
get that to launch on your screen. If you wish to, it is possible to see the
details of this package by opening the command line and writing in “conda
info.” This allows you to see some more of the details that you need about the
package and the package manager.
The virtual environment that we talk about with the Anaconda program is
going to be pretty simple to use, and it is pretty much just an isolated copy of
Python. It will come with all of the capabilities that you need to maintain all
of the files that you use, along with the directories and the paths that go with
it too. This is going to be helpful because it allows you to do all of your
coding inside the Python program, and allows you to add in some different
libraries that are associated with Python if you choose.
These virtual environments may take a bit of time to adjust to and get used to,
but they are good for working on Machine Learning because they allow you
to isolate a project, and can help you to do some coding, without all of the
potential problems that come with dependencies and version requirements.
Everything you do in the virtual environment is going to be on its own, so
you can experiment and see what works and what doesn’t, without messing
up other parts of the code.
From here, our goal is to take the Anaconda program and get it to work on
creating the virtual environment that we want so that the package from
TensorFlow is going to work properly. The conda command is going to come
into play here again to make this happen. Since we are going through the
steps that are needed to create a brand new environment now, we will need to
name it tensorenviron, and then the rest of the syntax to help us get this new
environment created includes:
conda create -n tensorenviron
After you type this code into the compiler, the program is going to stop and
ask you whether you want to create the new environment, or if you would
rather cancel the work that you are currently doing. This is where we are
going to type in the “y” key and then hit enter so that the environment is
created. The installation may take a few minutes as the compiler completes
the environment for you.
Once the new environment is created, you have to go through the process of
actually activating it. Without this activation in place, you will not have the
environment ready to go for you. You just need to use the command of
“activate” to start and then list out the name of any environment that you
want to work with to activate. Since we used the name of tensorenviron
earlier, you will want to use this in your code as well. An example of how
this is going to look includes:
activate tensorenviron
Now that you have been able to activate the TensorFlow environment, it is
time to go ahead and make sure that the package for TensorFlow is going to
be installed too. You can do this by using the command below:
conda install tensorflow
When you get to this point, you will be presented with a list of all the
packages that are available to install in case you want to add in a few others
along with TensorFlow. You can then decide if you want to install one or
more of these packages, or if you want to just stick with TensorFlow for right
now. Make sure to agree that you want to do this and continue through the
process.
The installation of this library is going to get to work right away. But it is
going to be a process that takes some time, so just let it go without trying to
backspace or restart. The speed of your internet is going to make a big
determinant of whether you will see this take a long time or not.
Soon though, the installation process for this library is going to be all done,
and you can then go through and see if this installation process was
successful or if you need to fix some things. The good news is the checking
phase is going to be easy to work with because you can just use the import
statement of Python to set it up.
This statement that we are writing is then going to go through the regular
terminal that we have with Python. If you are still working here, like you
should, with the prompt from Anaconda, then you would be able to hit enter
after typing in the word Python. This will make sure that you are inside the
terminal that you need for Python so you can get started. Once you are in the
right terminal for this, type in the code below to help us get this done and
make sure that TensorFlow is imported and ready to go:
import tensorflow as tf
At this point, the program should be on your computer and ready to go and
we can move on to the rest of the guidebook and see some of the neat things
that you can do with this library. There may be a chance that the TensorFlow
package didn’t end up going through the way that it should. If this is true for
you, then the compiler is going to present you with an error message for you
to read through and you need to go back and make sure the code has been
written in the right format along the way.
The good news is if you finish doing this line of code above and you don’t
get an error message at all, then this means that you have set up the
TensorFlow package the right way and it is ready to use! With that said, we
need to explore some more options and algorithms that a programmer can do
when it comes to using the TensorFlow library and getting to learn how they
work with the different Machine Learning projects you want to implement.
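As a quick, hedged sketch of what a first TensorFlow session might look like
once the import succeeds: build two constant tensors and multiply them.
import tensorflow as tf

print(tf.__version__)                 # confirm the installed version
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
print(tf.matmul(a, b))                # matrix product computed by TensorFlow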
Chapter 10 Artificial Neural Networks
This chapter discusses the integral aspects of artificial neural networks. It
also covers their components, in particular activation functions, how to train
an artificial neural network, and the different advantages of using an
artificial neural network.
The output Y can be of any value. The neuron does not have any information
on the reasonable range of values that Y can take. For this purpose, the
activation function is implemented in the neural network to check Y values
and make a decision on whether the neural connections should consider this
neuron activated or not.
There are different types of activation functions. The most intuitive is the
step function. This function sets a threshold and decides whether or not to
activate a neuron depending on whether Y exceeds that threshold. In other
words, the output of this function is 1 if Y is greater than the threshold and
0 otherwise. Formally, the activation function is:
f(Y) = 1 if Y >= threshold, and f(Y) = 0 otherwise
The ReLu activation function, by contrast, is not bounded and takes values
from 0 to +inf. Although it has a shape similar to a linear function (it is
equal to the identity for positive values), the ReLu function has a usable
derivative. The drawback of the ReLu is that the derivative (i.e., the
gradient) is 0 when the inputs are negative. This means that, as for linear
functions, backpropagation cannot proceed and the neural network cannot learn
unless the inputs are greater than 0. This aspect of the ReLu, a gradient
equal to 0 when the inputs are negative, is called the dying ReLu problem.
To prevent the dying ReLu problem, two ReLu variations can be used, namely the Leaky ReLu function and the Parametric ReLu function. The Leaky ReLu function returns as output the maximum of X and 0.1 times X. In other words, the Leaky ReLu is equal to the identity function when X is greater than 0 and is equal to the product of 0.1 and X when X is less than zero. This function has a small positive gradient, equal to 0.1, when X has negative values, which makes it support backpropagation for negative values. However, it may not provide a consistent prediction for these negative values.
The Parametric ReLu function is similar to the Leaky ReLu function, but the slope used for negative values is a parameter that the neural network learns during training. In other words, it returns X when X is greater than 0 and the product of a learned coefficient and X otherwise.
There are other variations of the ReLu function, such as the exponential linear unit (ELU). Unlike the Leaky ReLu and Parametric ReLu variations, this function has a smooth exponential curve for negative values of X instead of a straight line. The downside of this function is that it saturates for large negative values of X. Other variations exist, and they all rely on the same concept of defining a gradient greater than 0 when X has negative values.
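As a quick illustration of the activation functions just discussed, here is a small NumPy sketch. The 0.1 slope for the Leaky ReLu and the sample input values are only illustrative choices, not values taken from the text:
import numpy as np

def step(x, threshold=0.0):
    # 1 if the input exceeds the threshold, 0 otherwise
    return np.where(x > threshold, 1.0, 0.0)

def relu(x):
    # identity for positive values, 0 for negative values
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    # identity for positive values, a small slope for negative values
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):
    # like the Leaky ReLu, but the slope a is a learned parameter
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # smooth exponential curve for negative values, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # [-0.2  -0.05  0.    0.5   2.  ]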
Given all these activation functions, each with its pros and cons, the question now is: which one should be used in a neural network? The answer is that a better understanding of the problem at hand will point to a specific activation function, especially if the characteristics of the function being approximated are known beforehand. For instance, a sigmoid function is a good choice for a classification problem. If the nature of the function being approximated is unknown, it is highly recommended to start with a ReLu function before trying other activation functions. Overall, the ReLu function works well for a wide range of applications. Activation functions remain an area of ongoing research, and you may even try designing your own.
An important aspect of choosing an activation function is the sparsity of the activation. Sparsity means that not all neurons are activated. This is a desired characteristic in a neural network because it makes the network learn faster and makes it less prone to overfitting. Imagine a large neural network with many neurons: if all neurons were activated, every one of them would be processed to describe the final output. This makes the neural network very dense and computationally exhausting to process. The sigmoid and the tanh activation functions have this property of activating almost all neurons, which makes them computationally inefficient, unlike the ReLu function and its variations, which leave the neurons with negative inputs inactive. That is the reason why it is recommended to start with the ReLu function when approximating a function with unknown characteristics.
What are the types of artificial neural networks?
Several categories of artificial neural networks with different properties and
complexities exist. The first and simplest neural network developed is the
perceptron. The perceptron computes the sum of the inputs, applies an
activation function, and provides the result to the output layer.
Another old and simple approach is the feedforward neural network. This type of artificial neural network has only one hidden layer, and each layer is fully connected to the following layer, where every node is attached to the others. It propagates the information in one direction, from the inputs to the outputs through the hidden layer. This process is known as forward propagation and usually relies on an activation function, which processes the data in each node of the layers. This neural network returns a weighted sum of the inputs computed according to the hidden layer's activation function. The feedforward neural network usually uses the backpropagation method for the training process and the logistic function as the activation function.
Several other neural networks are derived from this type of network, for example, the radial-basis-function neural network. This is a feedforward neural network that relies on the radial basis function instead of the logistic function. This type of neural network has two layers: in the inner layer, the features and the radial basis function are combined. The radial function computes the distance of each point from a reference center. This neural network is useful for continuous values, to evaluate the distance from the target value.
In contrast, the logistic function is used for mapping to arbitrary binary values (i.e., 0 or 1; yes or no). Deep feedforward neural networks are multilayer feedforward neural networks. They became the most commonly used neural network type in Machine Learning because they yield better results. A new type of learning, called deep learning, has emerged from these networks.
Recurrent neural networks are another category that uses a different type of
nodes. Like a feedforward neural network, each hidden layer processes the
information to the next layer. However, outputs of the hidden layers are
saved and fed back to the previous layer. The first layer, the input layer, is processed as the sum of the weighted features. The recurrent process is applied in the hidden layers. At each step, every node saves information from the previous step, using memory while the computation runs. In short, the recurrent neural network uses
forward propagation and backpropagation to self-learn from the previous
time steps to improve the predictions. In other words, information is
processed in two directions, unlike the feedforward neural networks.
A multilayer perceptron, or multilayer neural network, is a neural network
that has at least three or more layers. This category of networks is fully
connected where every node is attached to all other nodes in the following
layers.
Convolutional neural networks are typically useful for image classification or
recognition. The processing used by this type of artificial neural network is
designed to deal with pixel data. The convolutional neural networks are a
multi-layer network that is based on convolutions, which apply filters for
neuron activation. When the same filter is applied to a neuron, it leads to an
activation of the same feature and results in what is called a feature map. The
feature map reflects the strength and importance of a feature of input data.
Modular neural networks are formed from more than one connected neural network. These networks rely on the concept of ‘divide and conquer.’ They are handy for very complex problems because they allow combining different types of neural networks. Therefore, they combine the strengths of different neural networks to solve a complex problem, with each neural network handling a specific task.
The first layer computes the weighted sum W1·X + b1 from the input X, where W1 and b1 are the parameters of the neural network, namely the weights and bias of the first layer, respectively. Next, we apply the activation function F1, which can be any of the activation functions presented previously in this chapter, giving F1(W1·X + b1). The result is the output of the first layer, which is then fed to the next layer as W2·F1(W1·X + b1) + b2, where W2 and b2 are the weights and bias of the second layer, respectively. To this result, we apply an activation function F2, which produces the output of the network.
The parameter α is the learning rate. It determines the rate at which the weights are updated: at each iteration, every weight is adjusted in the opposite direction of the gradient of the error, by an amount proportional to α. The process that we have just described is called the gradient descent algorithm. It is repeated until a pre-fixed maximum number of iterations is reached. In chapter 4, we will develop an example that illustrates a perceptron and a multi-layer neural network by following similar steps in Python. We will develop a classifier based on an artificial neural network. Now, let's explore the pros of using an artificial neural network for Machine Learning applications.
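To make these steps concrete, here is a small sketch, not the book's own example, of a two-layer forward pass followed by the kind of update that gradient descent performs. The layer sizes, the random data, and the choice of ReLu and sigmoid as F1 and F2 are assumptions made only for illustration:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(5, 3)                  # 5 samples, 3 input features (assumed shapes)
W1, b1 = rng.randn(3, 4), np.zeros(4)
W2, b2 = rng.randn(4, 1), np.zeros(1)

# Forward pass: first layer, activation F1, second layer, activation F2
Z1 = X.dot(W1) + b1
A1 = np.maximum(0.0, Z1)             # F1: ReLu
Z2 = A1.dot(W2) + b2
A2 = sigmoid(Z2)                     # F2: sigmoid, outputs between 0 and 1
print(A2.shape)                      # (5, 1)

# Gradient descent update for one parameter, in the form W = W - alpha * gradient
alpha = 0.01
fake_gradient = np.ones_like(W2)     # a placeholder; backpropagation would compute this
W2 = W2 - alpha * fake_gradient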
Reducing the costs that the business has to deal with.
Helping to launch a brand new service or product and knowing it
will do well.
To help gauge the effectiveness that we see in a new marketing
campaign.
To help tap into some different demographics along the way.
To ensure that we can get into a new market and see success.
Of course, this is not an exhaustive list, but knowing the right steps and the benefits that come with working in data science can help us see improvements and make the business grow. No
matter what items or services you sell, what your geographic location is, or
what industry you are in, you can use data science to help your business
become more successful.
Sometimes, it is hard for companies to see how they can use data science to
help improve themselves. We may assume that this is just a bunch of hype, or
that only a few companies have been able to see success with it. However,
there are a ton of companies that can use this kind of information to get
themselves ahead, including some of the big names like Amazon, Visa, and
Google. While your business may or may not be on the same level as those
three, it is still possible for you to put data science to work for your needs,
improving what you can offer on the market, how you can help customers
out, and so much more.
It is important to note that data science is a field that is already taking over
the world, and it is helping companies in many different areas. For example,
it is showing companies the best way to grow, how to reach their customers
correctly and most efficiently, how to find new sources of value, and so much
more. It often depends on the overall goal of the company for using this
process of data science to determine what they will get out of it.
With all of the benefits that come with using this process of data science, and
all of the big-name companies who are jumping on board and trying to gain
some of the knowledge and benefits as well, we need to take a look at the life
cycle that comes with data science, and the steps that it takes to make this
project a big success. Let’s dive into some of the things that we need to know
about the data life cycle, so we know the basics of what needs to happen to
see success with data science.
Data discovery
The first step that we are going to see with this life cycle is the idea that
companies need to get out there and discover the info they want to use. This
is the phase where we will search in a lot of different sources in order to
discover the data that we need. Sometimes the data is going to be structured,
such as in text format but other times it may come in a more unstructured
format like videos and images. There are even some times when the data we
find comes to us as a relational database system instead.
These are going to be considered some of the more traditional ways that you
can collect the info that you need, but it is also possible for an organization to
explore some different options as well. For example, many companies are
relying on social media to help them reach their customers and to gain a
better understanding of the mindset and buying decisions of these customers
through this option.
Often this phase is going to include us starting with a big question that we
would like answered, and then searching either for the data in the first place
or if we already have the data, searching through the info that we have
already collected. This makes it easier for us to get through all of that data
and gain the insights that we are looking for.
Mathematical models
When working with data science, all of the projects that you will want to
work with will need to use mathematical models to help them get it all done.
These are models that we can plan out ahead of time and then the data
scientist is going to build them up to help suit the needs of the business or the
question that they would like answered. In some cases, it is possible to draw on a few different areas of mathematics, including linear regression, statistics, and logistic regression, to get these models done.
To get all of this done, we also have to make sure we are using the right tools and methods. The statistical computing tools that come with R can help, as can other advanced analytical tools such as SQL and Python, along with any visualization tool you need to make sure the data makes sense.
Also, we have to make sure that we are getting results that are satisfactory out
of all the work and sometimes that means we need to bring in more than one
algorithm or model to see the results. In this case, the data scientist has to go
through and create a group of models that can work together to go through
that info and answer any of the questions that the business has.
After measuring out the models that they would like to use, the data scientist
can then revise some of the parameters that are in place, and do the fine-
tuning that is needed as they go through the next round of modeling. This
process is going to take up a few rounds to complete because you have to test
it out more than once to make sure that it’s going to work the way that you
would like it to.
What Is Python?
Python is an object-oriented and interpreted programming language. Its
syntax is simple and contains a set of standard libraries with complete
functions, which can easily accomplish many common tasks. Speaking of
Python, its birth is also quite interesting. During the Christmas holidays in
1989, Dutch programmer Guido van Rossum stayed at home and found
himself doing nothing. So, to pass the "boring" time, he wrote the first
version of Python.
Python is widely used. According to statistics from GitHub, an open-source community, it has been one of the most popular programming languages of the past 10 years and is more popular than the traditional C and C++ languages and than C#, which is very commonly used on Windows systems. After using Python for some time, Estella thinks it is a programming language specially designed for non-professional programmers.
Its grammatical structure is very concise, encouraging everyone to write code that is easy to understand while writing as little of it as possible.
Functionally speaking, Python has a large number of standard libraries and
third-party libraries. Estella develops her application based on these existing
programs, which can get twice the result with half the effort and speed up the
development progress.
More conveniently, Python programs can be moved across platforms. For example, Estella often writes Python code on her familiar Windows system and then deploys the developed program to a Linux server. To sum up in one sentence: Python is easy to learn and easy to use.
The Role of Python in Data Science
After mastering Python as a programming language, Estella can do many
interesting things, such as writing a web crawler, collecting needed data from
the Internet, developing a task scheduling system, updating the model
regularly, etc.
Below we will describe how Python is used by Estella for Data Science applications:
Data Cleaning
After obtaining the original data, Estella will first do preliminary processing
on the data, such as unifying the case of the string, correcting the wrong data,
etc. This is also the so-called "clean up" of "dirty" data to make the data more
suitable for analysis. With Python and its third-party library pandas, Estella
can easily complete this step of work.
Data Visualization
Estella uses Matplotlib to display data graphically. Before extracting the
features, Estella can get the first intuitive feeling of the data from the graph
and enlighten the thinking. When communicating with colleagues in other
departments, information can be clearly and effectively conveyed and
communicated with the help of graphics, so those insights can be put on
paper.
Feature Extraction
In this step, Estella usually associates relevant data stored in different places, for example, integrating basic customer information and customer shopping information through a customer ID. She then transforms the data and extracts the variables useful for modeling. These variables are called features. In this process, Estella will use Python's NumPy, SciPy, pandas, and PySpark.
Model Building
The open-source libraries scikit-learn, StatsModels, Spark ML, and
TensorFlow cover almost all the commonly used basic algorithms. Based on
these algorithm bases and according to the data characteristics and algorithm
assumptions, Estella can easily build the basic algorithms together and create
the model she wants.
The above four things are also the four core steps in Data Science. No wonder Estella, like most other data scientists, chose Python as the tool to complete her work.
Python Installation
After introducing so many advantages of Python, let's quickly install it and
feel it for ourselves.
Python has two major versions: Python 2 and Python 3. Python 3 is the newer version, with features that Python 2 does not have. However, because Python 3 was not designed with backward compatibility in mind, Python 2 was still the main version in actual production use (even though Python 3 had been released for almost 10 years at the time of writing this book). Therefore, it is recommended that readers still use Python 2 when installing. The code accompanying this book is compatible with both Python 2 and Python 3.
The following describes how to install Python and the libraries listed in
section
It should be noted that the distributed Machine Learning library Spark ML
involves the installation of Java and Scala, and will not be introduced here for
the time being.
Conda
It is a management system for the Python development environment and open-source libraries. If readers are familiar with Linux, Conda is roughly equivalent to pip plus virtualenv under Linux. Readers can list the installed Python libraries by entering "conda list" on the command line.
Spyder
It is an integrated development environment (IDE) specially designed for scientific computing with Python. If readers are familiar with the mathematical analysis software MATLAB, they will find that Spyder and MATLAB are very similar in syntax and interface.
Install Python
install [insert command here]
Pip is a Python software package management system that makes it easy for us to install the required third-party libraries. The steps for installing pip are as follows.
The value 'wooden house' is a string type object, which includes a sequence
of 12 characters. This value is assigned to the variable s, which refers to the
same object. We access the first element with the index 0 (letter_1 = s [0]).
As we indicated in the introduction of the lists, remember that in Python, the
first element of the sequences is at position 0 when indexed (accessed). To
calculate the number of elements, or length, of a sequence of structured data, we use the built-in function len(). The string has 12 elements and its last element is at position 11 (length - 1) or -1.
Character:        w   o   o   d   e   n       h   o   u   s   e
Index:            0   1   2   3   4   5   6   7   8   9  10  11
Negative index: -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
String elements and the indexes used to access them (positive and negative)
An empty string can be created: s = '' (two single quotes without space),
len(s) = 0.
In the cutting operator, if the first index [: m] (before the colon) is omitted,
trimming starts from the first element. If the second index [n:] is omitted, it is
trimmed to the end of the sequence. Negative indexes are useful for accessing
the last element [-1] or last, without requiring the use of the len () function.
The other operators, such as concatenation (+) or repetition (*) of strings, are applicable to any sequence of data:
>>> s1 = 'house'
>>> s2 = 'big ' + s1
>>> s2
'big house'
>>> s3 = 3 * s1 + '!'
>>> s3
'househousehouse!'
The in operator is a Boolean operator over two strings and returns True if the string on the left is a segment (or substring) of the one on the right. If it is not, it returns False. The not in operator returns the opposite logical result. Examples:
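The examples themselves are not reproduced above; a minimal sketch, reusing the s = 'wooden house' string from earlier, might look like this:
>>> s = 'wooden house'
>>> 'wood' in s
True
>>> 'stone' in s
False
>>> 'stone' not in s
True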
This action of capitalizing the first letter of the string can be done automatically, as shown in the following section, by creating a new variable.
String methods
Python is an object-oriented language and the data in Python is held in objects. In object-oriented programming, objects have associated methods to manipulate their data. The methods are similar to functions since they receive arguments and return values. Strings have methods of their own. For example, the upper method takes a string and returns another string, but with uppercase letters.
The upper method, instead of being applied to the string s = 'wooden house' as a function, upper(s), is applied in the form s.upper(). That is, a method is applied to the object's values. Let's look at several methods of strings (there are methods with and without arguments):
>>> s = 'wooden house'
>>> sM = s.upper() # converts the letters to uppercase
>>> sM
'WOODEN HOUSE'
>>> sM.lower() # converts the letters to lowercase
'wooden house'
>>> s.capitalize() # first letter of the string in uppercase
'Wooden house'
>>> s.title() # first letter of each word in uppercase
'Wooden House'
>>> i = s.find('e') # searches for the index (position) of the first 'e'
>>> i # if it doesn't find the string it returns -1
4
>>> s.count('a') # counts how many times the element or string appears
0
>>> s.replace('o', 'e') # replaces the first string with the second
'weeden heuse'
>>> s.split(' ') # splits s using the string ' ', producing a list
['wooden', 'house']
>>> s1 = 'Hello'
>>> s1.isupper() # True if all characters in s1 are uppercase
False # False otherwise
>>> s1[0].isupper()
True
>>> s1.islower() # True if all characters in s1 are lowercase
False # False otherwise
>>> s1[1].islower()
True
The search for a character or substring in a string can be done with the
structures for, while or directly with some method of the strings. Let's look
first at the classic search options with for and while.
In Python, we can exit the function within a loop, so the search can be:
def search(s, c1):
    """Search for letter c1 in string s.
    Examples:
    >>> search('hot potato', 'a')
    True
    >>> search('potato', 'u')
    False
    """
    for c in s:
        if c == c1:
            return True
    return False
But this search can be done with the methods count (), find () or simply with
the Boolean operator in:
def search(s, c1):
    """Search for letter c1 in string s.
    Examples:
    >>> search('hot potato', 'a')
    True
    >>> search('potato', 'u')
    False
    """
    return c1 in s
    #return s.count(c1) > 0
    #return s.find(c1) >= 0
Tuples
Tuples, like strings, are a sequence of elements arranged in a Python object. Unlike strings (whose elements are characters), tuples can contain elements of any type, including elements of different types. The elements are indexed the same way as in strings, by an integer. The syntax of tuples is a sequence of values separated by commas. Although parentheses are not necessary, the values are usually enclosed in them:
# Example of tuples
>>> a = 1, 2, 3
>>> a
(1, 2, 3)
>>> b = (3, 4, 5, 'a')
>>> b
(3, 4, 5, 'a')
>>> type(a)
<class 'tuple'>
>>> type(b)
<class 'tuple'>
The objects assigned to variables a and b are tuples type. The important thing
is to include commas between the elements. For example,
>>> t = 'k',
>>> t
('k',)
>>> type (t)
<class 'tuple'>
>>> t2 = 'k'
>>> t2
'k'
>>> type (t2)
<class 'str'>
The object 'k', (with the trailing comma) is a tuple, whereas 'k' alone is a string. An empty tuple can be created using parentheses without including anything: (). We can also use the built-in tuple() function to convert an iterable sequence, such as a string or list, into a tuple, or to create an empty tuple:
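The examples are not reproduced above; a brief sketch of what such conversions look like (the particular values are only illustrative):
>>> tuple('abc')
('a', 'b', 'c')
>>> tuple([1, 2, 3])
(1, 2, 3)
>>> tuple()
()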
The iterative Python for - in composition can use any iterable sequence,
including tuples:
>>> games = ('tennis', 'baseball', 'football', 'volleyball', 'swimming')
>>> for sport in games:
... print (sport)
tennis
baseball
football
volleyball
swimming
Also, as in string sequences, in tuples, you can use the operations to
concatenate (+) and repeat (*) tuples and the in and not in operators of
membership of elements in tuples.
In the first tuple of variables (a, b, c), the variables receive integer values. Although this object is of a structured type, tuple, its elements are variables of integer type. Similarly, in the tuple of variables (d, e, f), each variable receives a value of type string and will therefore be of string type.
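The assignments that this paragraph refers to are not shown above; a minimal reconstruction might look like the following, where the specific values are assumptions chosen only for illustration:
>>> (a, b, c) = (1, 2, 3)
>>> b
2
>>> (d, e, f) = ('x', 'y', 'z')
>>> f
'z'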
This feature of tuple assignment easily solves the typical problem of exchanging two variables, without requiring an auxiliary variable. For example, if we want to exchange the values of the variables x = 5 and y = 7, in classical languages it would be done like this:
>>> x = 5
>>> y = 7
>>> temp = x # use of auxiliary (temporary) variable temp
>>> x = y
>>> y = temp
>>> print (x, y)
7 5
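With tuple assignment, the same exchange needs no auxiliary variable; a short sketch of how that looks:
>>> x = 5
>>> y = 7
>>> x, y = y, x    # the tuple on the right is built first, then unpacked into x and y
>>> print(x, y)
7 5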
Functions can also return multiple results, which can be assigned to multiple variables with the use of tuples. Strictly speaking, functions return only one result, but if that value is a tuple, then it can be assigned to a tuple of variables. The number of elements must match. Let's look at the following function as an example:
def myFunction(x):
    """Returns 2 values: x increased and decreased by 1"""
    return x + 1, x - 1

a, b = myFunction(10)
print(a, b)
print(myFunction(20))
>>>
11 9
(21, 19)
The function returns a tuple of two values. In the first instruction of the main
body of the program, these values are assigned to the tuple with the variables
a and b. Each of these variables is of the integer type and, for argument 10 of
the function, they receive the values 11 and 9, respectively. These values are
shown by the first print (). The second print () directly shows the tuple that
the function returns.
Functions with an arbitrary number of parameters, using tuples
In the previous topic, we analyzed functions with keyword arguments. There is also the option of defining a function with an arbitrary (variable) number of parameters by using the * operator before the parameter name. Let's look at the function in the following example and its different calls.
def mean(*pair):
    total = 0
    for elem in pair:
        total = total + elem
    return total / len(pair)

print(mean(3, 4))
print(mean(10.2, 14, 12, 9.5, 13.4, 8, 9.2))
print(mean(2))
>>>
3.5
10.9
2.0
The function calculates the average value of the sequence of numbers sent as arguments, which the parameter collects into a tuple. The function could be improved to avoid dividing by 0 in case an empty tuple is passed.
Tuple Methods
As with strings, there are methods associated with tuple-type objects and lists, but tuples support only two methods: s.index(x) and s.count(x). You can also use the built-in functions max and min when the tuples (or lists) contain numerical values. If the elements are strings, they return the largest or smallest element according to the position of the first character in the ASCII table. Let's see some examples:
a = (2, 3, 4, 5, 79, -8, 5, -4)
>>> a.index (5) # index of the first occurrence of 5 in a
3
>>> a.count (5) # total occurrences of 5 in a
2
>>> max (a)
79
>>> min (a)
-8
>>> b = ('az', 'b', 'x')
>>> max (b)
'x'
>>> min (b)
'az'
Zip function
It is an iterator that operates on several iterables and creates tuples by combining elements of the iterable sequences (strings, tuples, lists, etc.). Example:
def AccountElemsSamePosition(s1, s2):
    """Tell how many equal letters are in the same position in the
    two words s1 and s2. You can use lists or tuples.
    >>> AccountElemsSamePosition('Hello', 'Casting')
    0
    """
    counter = 0
    for c1, c2 in zip(s1, s2):
        if c1 == c2:
            counter += 1
    return counter
The repeated elements (25 and 'a') that we included in the frozen set were
discarded.
The mutability of the lists can be observed. Lists can be concatenated and
repeated with the + and * operators, respectively, such as strings and tuples,
>>> v1 = [2, 4, 6, 8, 10]
>>> v3 = [3, 5, 7]
>>> v1 + v3
[2, 4, 6, 8, 10, 3, 5, 7]
>>> 3*v3
[3, 5, 7, 3, 5, 7, 3, 5, 7]
The iterative Python for - in composition has already been used with lists in
the subject of iterative compositions. With the previously defined games list,
we get:
>>> for sport in games:
... print (sport)
tennis
baseball
football
volleyball
swimming
In addition, as in the string and tuple sequences, the lists can be concatenated
with the + operator and repeated with the * operator. Boolean operators in
and not in, as in strings and tuples, evaluate whether or not an element
belongs to a sequence (string, tuple or list). Examples:
>>> v2 = [7, 8, 'a', 'Hello', (2,3), [11, 12]]
>>> 8 in v2
True
>>> 'Hello' in v2
True
>>> 'HELLO' not in v2
True
You can see that both variables a and b refer to the same object, which has the value 'house' and occupies the memory position 123917904 (this position is arbitrary). The expression a is b evaluates to True.
a ──► 'house' ◄── b
With string data types, which are immutable, Python creates only one object in order to save memory, and both variables refer to the same object. However, with lists, which are mutable, even though two lists with the same values are formed, Python creates two objects that occupy different memory locations:
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> id(a)
123921992
>>> id(b)
123923656
>>> a is b
False
a ──► [1, 2, 3]     b ──► [1, 2, 3]
The lists assigned to variables a and b, although they have the same value, are different objects. But you have to be careful with assigning several variables to the same mutable object. In the following example, when copying a variable, no other object is created; the copy refers to the same object:
>>> a = [1, 2, 3]
>>> b = a
>>> id(b) # a and b --> [1, 2, 3]
123921992
>>> a is b
True
It can be said that the variable b is an alias of a and that both reference the same object. Therefore, if we modify or add a value to the object [1, 2, 3] through one of the variables, then we modify the other as well. Let's see:
>>> b[0] = 15
>>> a
[15, 2, 3]
This effect can give unexpected results if not handled carefully. However, this property is used to pass parameters by reference to functions that behave like procedures. If we want to copy one variable from another, we have the copy method, which is presented below.
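A minimal sketch of that copy, reusing the list from the previous example; b = list(a) or b = a[:] would work as well:
>>> a = [1, 2, 3]
>>> b = a.copy()    # creates a new object with the same values
>>> a is b
False
>>> b[0] = 15
>>> a               # a keeps its original values
[1, 2, 3]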
The class has a special main function, __init__(), that builds the element of the Star class (called an object) and is executed when a new object or instance of that class is created; we have put name as the only mandatory parameter, but it does not have to have any. The mysterious self variable, with which each function begins (functions defined on objects are called methods), refers to the specific object we are creating; this will become clearer with an example. Now we can create Star type objects:
# import the star library (star.py), which includes the Star class
import star
# New instance (object) of Star, with a parameter (the name), mandatory
star1 = star.Star('Altair')
# What is printed for the object, according to the method __str__
print(star1) # Star Altair
print(star1.name) # Altair
When creating the object with name star1, which in the class definition we
call self, we have a new data type with the name property. Now we can add
some methods that can be applied to the Star object:
class Star:
    """Star class
    Example classes with Python
    File: star.py
    """
    # Total number of stars
    num_stars = 0

    def __init__(self, name):
        self.name = name
        Star.num_stars += 1

    def set_mag(self, mag):
        self.mag = mag

    def set_pair(self, pair):
        """Assigns the parallax in arc seconds"""
        self.pair = pair

    def get_mag(self):
        print("The magnitude of {} is {}".format(self.name, self.mag))

    def get_dist(self):
        """Calculates the distance in parsecs from the parallax"""
        print("The distance of {} is {:.2f} pc".format(self.name, 1 / self.pair))

    def get_stars_number(self):
        print("Total number of stars: {}".format(Star.num_stars))
Now we can do more things with a Star object:
import star
# I create a star instance
altair = star.Star('Altair')
altair.name
# Returns 'Altair'
altair.set_pair(0.195)
altair.get_stars_number()
# Returns: Total number of stars: 1
# I use a general function of the star module
star.pc2ly(5.13)
# Returns: 16.73406
altair.get_dist()
# Returns: The distance of Altair is 5.13 pc
# I create another star instance
other = star.Star('Vega')
other.get_stars_number()
# Returns: Total number of stars: 2
altair.get_stars_number()
# Returns: Total number of stars: 2
Isn't all this familiar? It is similar to the methods and properties of Python elements such as strings or lists, which are also objects defined in classes with their own methods.
Objects have an interesting property called inheritance that allows you to reuse the properties of other objects. Suppose we are interested in a particular type of star called a white dwarf. White dwarfs are Star objects with some special properties, so we will need all the properties of the Star object plus some new ones that we will add:
class WBStar(Star):
    """Class for White Dwarfs (WD)"""

    def __init__(self, name, type):
        """WD type: dA, dB, dC, dO, dZ, dQ"""
        self.name = name
        self.type = type
        Star.num_stars += 1

    def get_type(self):
        return self.type

    def __str__(self):
        return "White Dwarf {} of type {}".format(self.name, self.type)
Neural Networks
It is hard to have a discussion about Machine Learning and data analysis
without taking some time to talk about neural networks and how these forms
of coding are meant to work. Neural networks are a great addition to any
Machine Learning model because they can work similarly to the human
brain. When they get the answer right, they can learn from that, and some of
the synapses that bring it all together will get stronger. The more times that
this algorithm can get an answer right, the faster and more efficient it can
become with its job as well.
With neural networks, each of the layers that you go through will spend a bit
of time at that location, seeing if there is any pattern. This is often done with
images or videos so it will go through each layer of that image and see
whether or not it can find a new pattern. If the network does find one of these
patterns, then it is going to instigate the process that it needs to move over to
the following layer. This is a process that continues, with the neural network
going through many layers until the algorithm has created a good idea of
what the image is and can give an accurate prediction.
There are then going to be a few different parts that can show up when we
reach this point, and it depends on how the program is set up to work. If the
algorithm was able to go through the process above and could sort through all
of the different layers, then it is going to make a prediction. If the prediction
it provides is right, the neurons in the system will turn out stronger than ever.
This is because the program is going to work with artificial intelligence to
make the stronger connections and associations that we need to keep this
process going. The more times that our neural network can come back with
the correct answer, the more efficient this neural network will become in the
future when we use it.
If the program has been set up properly, it is going to make the right
prediction that there is a car in the picture. The program can come up with
this prediction based on some of the features that it already knows belong to a car, including the color, the number on the license plate, the placement of the doors, the headlights, and more.
When you are working with some of the available conventional coding
methods, this process can be really difficult to do. You will find that the
neural network system can make this a really easy system to work with.
For the algorithm to work, you would need to provide the system with an
image of the car. The neural network would then be able to look over the
picture. It would start with the first layer, which would be the outside edges
of the car. Then it would go through some other layers that help the neural
network understand if any unique characteristics are present in the picture
that outlines that it is a car. If the program is good at doing the job, it is going
to get better at finding some of the smallest details of the car, including things
like its windows and even wheel patterns.
There could potentially be a lot of different layers that come with this one,
but the more layers and details that the neural network can find, the more
accurately it will be able to predict what kind of car is in front of it. If your
neural network is accurate in identifying the car model, it is going to learn
from this lesson. It will remember some of these patterns and characteristics
that showed up in the car model and will store them for use later. The next
time that they encounter the same kind of car model, they will be able to
make a prediction pretty quickly.
When working with this algorithm, you will often choose it when you want to go through a large number of pictures and find the defining features inside them. For example, this kind of approach is often used in face recognition software, where all of the information isn't available ahead of time and you can instead teach the computer how to recognize the right faces. It is also highly effective when you want the system to recognize different animals, identify car models, and more.
As you can imagine, there are several advantages that we can see when we
work with this kind of algorithm. One of these is that we can work with
this method, and we won’t have to worry as much about the statistics that
come with it. Even if you need to work with the algorithm and you don’t
know the statistics or don’t have them available, the neural network can be
a great option to work with to ensure that any complex relationship will
show up.
Naïve Bayes
We can also work with an algorithm that is known as the Naïve Bayes
algorithm. This is a great algorithm to use any time that you have people who
want to see some more of the information that you are working on, and who
would like to get more involved in the process, but they are uncertain about
how to do this, and may not understand the full extent of what you are doing.
It is also helpful if they want to see these results before the algorithm is all
the way done.
As you work through some of the other algorithms on this page and see what
options are available for handling the data, you will notice that they often
take on hundreds of thousands of points of data. This is why it takes some
time to train and test the data, and it can be frustrating for those on the
outside to find out they need to wait before they can learn anything about the
process. Showing information to the people who make the decisions and the
key shareholders can be a challenge when you are just getting started with the
whole process.
This is where the Naïve Bayes algorithm comes in. It is able to simplify some of the work that you are doing. It will usually not be the final algorithm that you use, but it can often give others outside the process a good idea of what you are doing. It can answer questions, put the work that you are doing into a much easier-to-understand form, and make sure that everyone is on the same page.
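As a quick illustration, here is a rough sketch of a Naïve Bayes classifier using scikit-learn's GaussianNB; the feature values and labels are made-up toy data, not from the book:
from sklearn.naive_bayes import GaussianNB

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]   # e.g., age and income
y = [0, 1, 1, 0]                                           # e.g., did not buy / bought
model = GaussianNB()
model.fit(X, y)
print(model.predict([[30, 50000]]))   # predicted class for a new customer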
Clustering algorithms
One of the best types of algorithms that you can work with is going to be the
clustering algorithm. There are a variety of clustering algorithms out there to
focus on, but they are going to help us ensure that the program can learn
something on its own, and will be able to handle separating the different data
points that we have. These clustering algorithms work best when you keep things simple. The algorithm takes some of the data that you are working with and then forms clusters from it. Before we start with the program, though, we can choose the number of clusters that we want to fit the information to.
The number of clusters that you go with is going to depend on what kind of
information you are working with as well. If you just want to separate your
customers by gender, then you can work with just two clusters. If you would
like to separate the customers by their age or some other feature, then you
may need some more clusters to get this done. You can choose the number of
clusters that you would like to work with.
The nice thing that comes with the clustering algorithms is that they will
handle most of the work of separating and understanding the data for you.
This is because the algorithm is in charge of how many points of data go into
each of the clusters you choose, whether there are two clusters or twenty that
you want to work with. When you take a look at one of these clusters, you
will notice that with all of the points inside, it is safe to assume that these data
points are similar or share something important. This is why they fell into the
same cluster with one another.
Once we can form some of these original clusters, it is possible to take each
of the individual ones and divide them up to get some more sets of clusters
because this can sometimes provide us with more insights. We can do this a
few times, which helps to create more division as we go through the steps. In
fact, it is possible to go through these iterations enough times that the
centroids will no longer change. This is a sign that it is time to be done with
the process.
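A short sketch of this idea with scikit-learn's KMeans; the toy points and the choice of two clusters are assumptions made for illustration:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each point fell into
print(kmeans.cluster_centers_)  # the final centroids, which stop moving at convergence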
Decision Trees
Decision trees are also a good option that we can work with when we want
to take a few available options, and then compare them to see what the
possible outcome of each option is all about. We can even combine a few of
these decision trees to make a random forest and get more results and
predictions from this.
The decision tree is going to be one of the best ways to compare a lot of
options, and then choose the path that is going to be the best for your needs.
Sometimes there are a whole host of options that we can choose from, and
many times they will all seem like great ideas. For businesses who need to
choose from the best option out of the group, and need to know which one is
likely to give them the results that they are looking for, the decision tree is
the best option.
With the decision tree, we can place the data we have into it and then see the likely outcome of making a certain decision. This prediction can help us make smart business decisions based on what we see. If we take a few different options and compare the likely outcomes of each one, it is much easier to determine which course of action is the best one for us to take.
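A minimal sketch of a decision tree in scikit-learn; the toy features and outcomes below are invented purely for illustration:
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]    # e.g., two yes/no business factors
y = [0, 1, 1, 0]                        # e.g., outcome of each past decision
tree = DecisionTreeClassifier()
tree.fit(X, y)
print(tree.predict([[1, 1]]))           # predicted outcome for a new option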
K-Nearest Neighbors
The next algorithm that we can look at is known as the K-Nearest Neighbors algorithm, or KNN. When we work with this algorithm, the goal is to search through all of the data that we have for the k most similar examples of the instance that we want to classify. Once this is complete, the algorithm looks through all of that information and provides you with a summary. Then it takes those results and gives you the predictions you need to make good business decisions.
With this learning algorithm, you will notice that the learning you are
working with becomes more competitive. This works to your advantage
because there will be a big competition going on between the different
elements or the different parts in the models so that you can get the best
solution or prediction based on the data you have at hand.
There are several benefits that we can receive when it comes to working with
this algorithm. For example, it is a great one that cuts through all of that noise
that sometimes shows up in our data. This noise, depending on the set of data
that you use, can be really loud, and cutting this down a bit, can help make a
big difference in the insights that you can see.
And if you are trying to handle and then go through some of the larger
amounts of data that some companies have all at once, then this is a great
algorithm to go with as well. Unlike some of the others that need to limit the
set of data by a bit, the KNN algorithm is going to be able to handle all of
your data, no matter how big the set is. Keep in mind that sometimes the
computational costs are going to be higher with this kind of method, but in
some cases, this is not such a big deal to work with.
To make the K-Nearest neighbors algorithm work the way that you want,
there are going to be a few steps that will make this process a little bit easier.
Working with this algorithm can help us to get a lot done when it is time to
work with putting parts together, and seeing where all of our data is meant to
lie. If you follow the steps that we have above, you will be able to complete
this model for yourself, and see some of the great results in the process when
it is time to make predictions and good business decisions.
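A rough sketch of KNN with scikit-learn; the sample points and the choice of k = 3 are assumptions for illustration only:
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]]
y = [0, 0, 0, 1, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 1]]))   # the 3 nearest neighbors vote on the class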
Logistic Regression
Logistic regression comprises the logistic model, the logistic function, statistical modeling, and more. Many organizations therefore apply logistic regression in their day-to-day activities, which mainly consist of data prediction and analysis. You can conduct this kind of regression analysis whenever the dependent variable is binary, that is, dichotomous.
Just like other types of regression analyses, logistic regression is widely applied in any analysis dealing with prediction. Its primary function, in this case, is to describe data. Logistic regression can also be used to explain or illustrate the relationship between the binary dependent variable and the independent variables. This regression might look challenging to interpret, but with the help of specific tools such as Intellectus Statistics, you can easily undertake your data analysis.
Logistic regression can easily be applied in statistics with the help of the logistic model. In this case, the primary function of the logistic model is to produce predictions of certain classes with the help of probability. Probability works best in areas where you are only required to predict which of the existing outcomes will occur: healthy or sick, win or lose, alive or dead, or an analysis of a test where someone either fails or passes. Within this model, you will be able to fine-tune your results primarily through probability. In the case of an image, you will be able to extend the model to cover several classes; you could detect whether the image in your analysis is a lion or a cat, and so on. In this case, each class considered for the image will have a probability between 0 and 1, and the probabilities should sum to one.
Therefore, logistic regression refers to a basic statistical model that makes use of the logistic function, regardless of the more complex extensions that might exist. Logistic regression is part and parcel of regression analysis, and on many occasions it is applied in analyses where the parameters of a logistic model are estimated. Remember, the logistic model is a type of binary regression: a binary logistic model has a dependent variable with two possible values, which can be represented as pass/fail, alive/dead, good/bad, and so on. Note that an indicator variable denotes these possible values, and they are always labeled 0 and 1. Within this logistic model, the logarithm of the odds, the log-odds, for the value labeled 1 is a linear combination of one or more independent variables, which are called predictors here.
Moreover, in logistic regression analysis, each independent variable may be a binary variable or a continuous variable. In the case of a binary variable, there must be two classes or events, coded by an indicator variable. A continuous variable, on the other hand, takes any real value. In logistic regression analysis, the corresponding probability of the value labeled 1 always varies between 0 and 1, as noted above. In this analysis, the log-odds are converted into a probability by the logistic function. Log-odds are measured in a unit called the logit, a name derived from logistic unit. You can also use a probit model, which relies on a different sigmoid function, to convert log-odds into a probability for analysis; the probit model is an analogous model built around its own sigmoid function.
All in all, you will find that the logistic model is the preferred choice for this conversion because of its defining characteristics. One such feature of the logistic model is that each independent variable scales the odds multiplicatively. As a result, the model produces an outcome with a parameter assigned to each independent variable at a constant rate, and for a binary independent variable that parameter generalizes the odds ratio.
It is also good to note that there are extensions of binary logistic regression for dependent variables with two or more levels. These include multinomial logistic regression, which works with categorical outputs having several unordered values, and ordinal logistic regression, which deals with multiple ordered categories. A good example of the latter is the proportional odds ordinal logistic model. Note that the logistic model itself only performs modeling and not classification; it is not a classifier on its own, since it only converts its input into a probability as output. Following this, let us discuss the applications of logistic regression in real-life situations.
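As a brief illustration of fitting a binary logistic model, here is a rough scikit-learn sketch; the pass/fail data and the hours-studied feature are made up for the example:
from sklearn.linear_model import LogisticRegression

hours_studied = [[1], [2], [3], [4], [5], [6]]
passed = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression()
clf.fit(hours_studied, passed)
print(clf.predict([[3.5]]))          # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))    # probabilities for each class, summing to 1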
Machine learning changes the game because it can keep up. The algorithms
that you can use with it can handle all of the work while getting the results
back that you need, in almost real-time. This is one of the big reasons that
businesses find that it is one of the best options to go with to help them make
good and sound decisions, to help them predict the future, and it is a welcome
addition to their business model.
Chapter 7 Data Aggregation and Group Operations
Taking the time to categorize our set of data, and giving a function to each of
the different groups that we have, whether it is transformation or aggregation,
is often going to be a critical part of the workflow for data analysis. After we
take the time to load, merge, and prepare a set of data, it is then time to
compute some more information, such as the group statistics or the pivot
tables. This is done to help with reporting or with visualizations of that data.
There are a few options that we can work with here to get this process done.
But Pandas is one of the best because it provides us with a flexible interface.
We can use this interface to slice, dice, and then summarize some of the sets
of data we have more easily.
One reason that SQL and relational databases of all kinds are so popular is that they make it easy to join, filter, transform, and aggregate the data that we have. However, some query languages, including SQL, are more constrained in the kinds of group operations that we can perform with them.
As we are going to see, with the expressiveness of the Pandas library, and of Python in general, we can perform much more complex group operations, simply by using any function that accepts a NumPy array or a Pandas object.
Each of the grouping keys that you want to work with can take a variety of forms, and the keys do not all have to be of the same type. Some of the forms that these grouping keys can take include:
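The full list is not reproduced here, but as a brief illustration, a grouping key can be a column name or a separate sequence of labels. The following sketch uses an invented DataFrame to show both:
import pandas as pd

df = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'sales':  [100, 80, 120, 90],
})
# group by a column name and aggregate
print(df.groupby('region')['sales'].sum())
# group by a separate sequence of keys with the same length as the data
keys = pd.Series(['A', 'B', 'A', 'B'])
print(df.groupby(keys)['sales'].mean())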
elif answers == 5:
    print("Concentrate and ask again")
elif answers == 6:
    print("Reply hazy, try again.")
elif answers == 7:
    print("My reply is no")
elif answers == 8:
    print("My sources say no")
Remember, in this program, we chose to go with eight options because it is a
Magic 8 ball and that makes the most sense. But if you would like to add in
some more options, or work on another program that is similar and has more
options, then you would just need to keep adding in more of the elif statement
to get it done. This is still a good example of how to use the elif statement
that we talked about earlier and can give us some good practice on how to use
it. You can also experiment a bit with the program to see how well it works
and make any changes that you think are necessary to help you get the best
results.
How to make a Hangman Game
The next project that we are going to take a look at is creating your own
Hangman game. This is a great game to create because it has a lot of the
different options that we have talked about throughout this guidebook and
can be a great way to get some practice on the various topics that we have
looked at. We are going to see things like a loop present, some comments,
and more and this is a good way to work with some of the conditional
statements that show up as well.
Now, you may be looking at this topic and thinking it is going to be hard to
work with a Hangman game. It is going to have a lot of parts that go together
as the person makes a guess and the program tries to figure out what is going
on, whether the guesses are right, and how many chances the user gets to
make these guesses. But using a lot of the different parts that we have already
talked about in this guidebook can help us to write out this code without any
problems. The code that you need to use to create your very own Hangman
game in Python includes:
# importing the time module
import time
# welcoming the user
name = input("What is your name? ")
print("Hello, " + name + ". Time to play hangman!")
print()
# wait for 1 second
time.sleep(1)
print("Start guessing...")
time.sleep(0.05)
# here we set the secret word
word = "secret"
# creates a variable with an empty value
guesses = ''
# determine the number of turns
turns = 10
# create a while loop
# check if the turns are more than zero
while turns > 0:
    # make a counter that starts with zero
    failed = 0
    # for every character in the secret word
    for char in word:
        # see if the character is in the player's guess
        if char in guesses:
After this part, we take the time to define a function that runs the k-means algorithm and then plots the result. This ends up as a scatterplot where the color represents how strongly each point belongs to a particular cluster. We can do that with the following code:
def plot_k_means(X, K, max_iter=20, beta=1.0):
    N, D = X.shape
    M = np.zeros((K, D))
    R = np.ones((N, K)) / K
    # initialize M to random
    for k in xrange(K):
        M[k] = X[np.random.choice(N)]
    grid_width = 5
    grid_height = max_iter / grid_width
    random_colors = np.random.random((K, 3))
    plt.figure()
    costs = np.zeros(max_iter)
    for i in xrange(max_iter):
        # moved the plot inside the for loop
        colors = R.dot(random_colors)
        plt.subplot(grid_width, grid_height, i+1)
        plt.scatter(X[:,0], X[:,1], c=colors)
        # step 1: determine assignments / responsibilities
        # is this inefficient?
        for k in xrange(K):
            for n in xrange(N):
                R[n,k] = np.exp(-beta*d(M[k], X[n])) / np.sum( np.exp(-beta*d(M[j], X[n])) for j in xrange(K) )
        # step 2: recalculate means
        for k in xrange(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()
        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 10e-5:
                break
    plt.show()
Notice here that both the M and the R are going to be matrices. The R is
going to become the matrix because it holds onto 2 indices, the k and the n.
M is also a matrix because it is going to contain the K individual D-
dimensional vectors. The beta variable is going to control how fuzzy or
spread out the cluster memberships are and will be known as the
hyperparameter. From here, we are going to create a main function that will
create random clusters and then call up the functions that we have already
defined above.
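The code above relies on the imports and on the helper functions d() and cost() defined earlier in the chapter, which are not reproduced here; a minimal sketch of what they might look like, under the usual soft k-means definitions, is:
import numpy as np
import matplotlib.pyplot as plt

def d(u, v):
    # squared Euclidean distance between two points
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    # responsibility-weighted sum of distances between points and cluster means
    total = 0
    for k in range(len(M)):
        for n in range(len(X)):
            total += R[n, k] * d(M[k], X[n])
    return total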
def main():
    # assume 3 means
    D = 2 # so we can visualize it more easily
    s = 4 # separation so we can control how far apart the means are
    mu1 = np.array([0, 0])
    mu2 = np.array([s, s])
    mu3 = np.array([0, s])
    N = 900 # number of samples
    X = np.zeros((N, D))
    X[:300, :] = np.random.randn(300, D) + mu1
    X[300:600, :] = np.random.randn(300, D) + mu2
    X[600:, :] = np.random.randn(300, D) + mu3
    # what does it look like without clustering?
    plt.scatter(X[:,0], X[:,1])
    plt.show()
    K = 3 # luckily, we already know this
    plot_k_means(X, K)
    # K = 5 # what happens if we choose a "bad" K?
    # plot_k_means(X, K, max_iter=30)
    # K = 5 # what happens if we change beta?
    # plot_k_means(X, K, max_iter=30, beta=0.3)

if __name__ == '__main__':
    main()
Yes, this process takes some time to write out, and it is not always easy to
work through all the different parts that come with Machine Learning and how
they affect your code. But when you are done, you will be able to import some
of the data that your company has been collecting and then explore how it
groups together using the k-means algorithm as well.
Chapter 9 Functions and Modules in Python
In Python programming, functions refer to any group of related statements
that perform a given activity. Functions are used in breaking down programs
into smaller and modular bits. In that sense, functions are the key factors that
make programs easier to manage and organize as they grow bigger over time.
Functions are also helpful in avoiding repetition during coding and make
code reusable.
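As a small sketch of this idea (the greet_user function here is only an illustration, not an example from the text), a function is written once and can then be reused as often as needed:
def greet_user(name):
    # a group of related statements that performs one task
    print("Hello, " + name + "!")
    print("Welcome to the program.")

greet_user("Alice")   # the same code is reused for different inputs
greet_user("Bob")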
Docstring
The docstring is the first string that comes after the function header. The
docstring is short for documentation string and is used in explaining what a
function does briefly. Although it is an optional part of a function, the
documentation process is a good practice in programming. So, unless you
have got an excellent memory that can recall what you had for breakfast on
your first birthday, you should document your code at all times. In the
example shown below, the docstring is used directly beneath the function
header.
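The definition of greet itself is not printed in this excerpt; a version consistent with the calls and output below might look like this:
def greet(name):
    """
    This function greets to
    the person passed into
    the name parameter
    """
    print("Hello, " + name + ". Good morning!")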
>>> greet("Amos")
Hello, Amos. Good morning!
Triple quotation marks are typically used when writing docstrings so they can
extend over several lines. The string is stored as the __doc__ attribute of
the function. Take the example below.
You can run the following lines of code in a Python shell and see what it
outputs:
>>> print(greet.__doc__)
This function greets to the person passed into the name parameter
Syntax of return
The return statement can hold an expression, which is evaluated and whose
value is handed back to the caller. A function returns the None object if its
return statement has no expression, or if the return statement is absent from
the function altogether.
For instance:
>>> print(greet('Amos'))
Hello, Amos. Good morning!
None
In this case, the returned value is None.
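By contrast, when the return statement does carry an expression, the evaluated value is handed back to the caller. A minimal sketch (the absolute_value function is just an illustration, not an example from the text):
def absolute_value(num):
    """Return the absolute value of the given number."""
    if num >= 0:
        return num
    return -num

print(absolute_value(-4))   # prints 4
print(absolute_value(7))    # prints 7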
Chapter 10 Interaction with Databases
Data management is not a scientific discipline per se. Increasingly, however,
it permeates the activities of basic scientific work. The growing volume and
complexity of data long ago exceeded what can be managed with simple
spreadsheets.
Currently, it is very common to need to store quantitative data, qualitative
data, and media of different formats (images, videos, sounds) in an integrated
platform from which they can be easily accessed for analysis, visualization,
or simple consultation.
The Python language has simple solutions for this need at the most varied
levels of sophistication. In keeping with Python's "batteries included"
philosophy, its standard library gives us the pickle and cPickle modules and,
starting with version 2.5, the sqlite3 relational database module.
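The load example that follows assumes an object was previously pickled into a file named pictest. That step is not shown in this excerpt, but a minimal sketch of it (the Hello class and its say_hi method are hypothetical stand-ins) could look like this:
import pickle

class Hello:
    def __init__(self, name):
        self.name = name
    def say_hi(self):
        print("hi " + self.name + " !")

a = Hello("alex")
f = open('pictest', 'wb')   # pickle files should be opened in binary mode
pickle.dump(a, f)
f.close()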
f = open('pictest', 'rb')
b = pickle.load(f)
b.say_hi()
hi alex !
This way we can modify the class, and the stored instance will recognize the
new code as it is restored from the file, as we can see above. This feature
means that pickles do not become obsolete when the code they are based on
is updated (of course this is only for modifications that do not remove
attributes already included in the pickles).
The pickle module is not built simply for data storage, but for complex
computational objects that may themselves contain data. The price of this
versatility is that the storage format can only be read back by the pickle
module itself, from within a Python program.
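The query below assumes that a sqlite3 connection, a cursor c, and a table of specimens already exist; a minimal setup, with the column names guessed from the output shown further down, might be:
import sqlite3

conn = sqlite3.connect('specimens.db')   # hypothetical database file
c = conn.cursor()
c.execute('create table if not exists specimens (name text, length real, weight real)')
c.executemany('insert into specimens values (?, ?, ?)',
              [('jerry', 5.1, 0.2), ('tom', 12.5, 2.3), ('butch', 42.4, 10.3)])
conn.commit()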
The cursor object can also be used as an iterator to get the result of a query.
c.execute('select * from specimens order by weight')
for reg in c:
    print(reg)
('jerry', 5.1, 0.2)
('tom', 12.5, 2.2999999999999998)
('butch', 42.4, 10.3)
The SQLite module is really versatile and useful, but it requires the user to
know at least the rudiments of the SQL language. The following solution
seeks to solve this problem in a more Pythonic way.
Mapreduce Technique
Data mining applications constantly manage vast amounts of data, and you
need a different software stack to tackle such applications. This stack has
its own file system, called a distributed file system, which is used to
exploit the parallelism of a computing cluster and which replicates data to
guard against media failures. On top of this file system sits a higher-level
programming system developed to ease the process: MapReduce.
MapReduce is a style of computing implemented in various systems, including
Hadoop and Google's own infrastructure, and a MapReduce implementation is
used to tackle large-scale computations. It is easy to use in the sense that
you only have to write two functions, Map and Reduce; the system
automatically handles parallel execution and task coordination.
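As a rough illustration of the model (a toy, single-machine sketch rather than code for Hadoop or any particular framework), a word count can be expressed as one map function and one reduce function:
from collections import defaultdict

def map_fn(document):
    # emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum all the counts emitted for the same word
    return (word, sum(counts))

def run_mapreduce(documents):
    # simulate the shuffle step: group all emitted values by key
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return [reduce_fn(k, v) for k, v in grouped.items()]

print(run_mapreduce(["the cat sat", "the cat ran"]))
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]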
Distance Measures
A recurring problem in data mining is finding similar data items. Consider,
for example, having to detect duplicate websites or duplicated web content
while crawling many sites, or discovering similar images in a large database.
To handle such problems, the distance measure technique is available to you:
a distance measure helps you search for the nearest neighbors of an item in a
high-dimensional space. It is very important to define precisely what
similarity means; Jaccard similarity is one common example (a small sketch of
it follows the list below). The methods used to identify similar items and
define a distance measure include:
Shingling (representing a document as its set of k-shingles)
Min-hashing
Locality-sensitive hashing
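For instance, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union; a minimal sketch:
def jaccard(a, b):
    # Jaccard similarity: |A intersection B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))   # prints 0.5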
Link Analysis
Link analysis is performed to detect spam. Most of the early search engines
were unable to defend against spam; as the technology matured, however,
Google was able to introduce techniques to overcome this problem.
Pagerank
PageRank works by simulation: it simulates a random surfer who keeps
following links from page to page. The process works iteratively, and pages
that the simulated surfer reaches more often are ranked higher than pages
that are rarely or never visited.
The Content
The content of a page is characterized by specific phrases used on the page
and in the external pages that link to it. It is a piece of cake for spammers
to modify the internal pages where they are administrators, but it is
difficult for them to modify the external pages that link in. Every page is
assigned a real number by a ranking function; a page with a higher rank is
considered more important than a page with a lower one. There is no single
fixed algorithm for assigning ranks to pages, but for strongly connected web
graphs the ranking can be derived from a transition matrix, and this
principle is used to calculate the rank of a page; a small sketch follows
below.
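As a sketch of the transition-matrix idea (using a hypothetical three-page web, not an example from the text), the rank vector can be computed by repeatedly applying the column-stochastic transition matrix, with a damping factor modeling a random surfer who occasionally teleports to any page:
import numpy as np

# hypothetical 3-page web: entry M[i, j] is the probability of moving
# from page j to page i (each column sums to 1)
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
beta = 0.85                      # damping (teleport) factor
n = M.shape[0]
rank = np.ones(n) / n            # start with equal rank for every page
for _ in range(50):              # power iteration
    rank = beta * M.dot(rank) + (1 - beta) / n
print(rank)                      # the PageRank vector of the toy web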
Data Streaming
At times, it is not possible to know a dataset in advance; instead, the data
arrives as a stream and has to be processed before it disappears. The data
can arrive so fast that it is impractical to keep all of it in active
storage. This is where data streaming comes into the picture. In a
data-stream management system, any number of streams can enter the system,
and each stream delivers its elements on its own schedule; the rates and
arrival times do not have to be uniform across streams. Streams can be
archived into a large archival store, but it is difficult to answer queries
quickly from that archive, so such situations are handled by specific
retrieval methods. There is also a working store that holds summaries of, or
parts of, the streams and is used to answer specific queries. Typical data
streaming problems include the following.
Filtering Streams
To select only the tuples that meet a particular criterion, there is a
filtering process in which the accepted tuples are passed on as a new stream
while the rest are dropped. A technique known as Bloom filtering allows you
to filter out elements that do not belong to a chosen set: the members of the
set are hashed into positions in a bit array, and those bits are set to 1
while all the others stay 0. Arriving elements are then hashed the same way
and tested against those bits; a small sketch of the idea follows below.
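As a small sketch of the idea (the hash scheme and sizes here are chosen arbitrarily, not taken from the text), a Bloom filter sets several bits per key and later tests those same bits:
import hashlib

class BloomFilter:
    # a tiny Bloom filter: each item sets num_hashes bits in a bit array
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # derive num_hashes positions from independent hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256((str(i) + ":" + item).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("bob@example.com"))     # almost certainly False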
Network
As specified earlier, a network can connect a small group of computers or a
very large one; the largest network of all is the Internet. Small groups
include home networks, such as a Wi-Fi network or a Local Area Network
limited to certain computers in one locality. Networks are used to share
resources such as media, web pages, application servers, data storage,
printers, and scanners. Each computer on a network is referred to as a node,
and communication between nodes is established using protocols, the
agreed-upon rules that computers follow when they talk to each other;
protocols like HTTP, TCP, and IP are used on a large scale. Information could
be kept on individual computers, but it becomes difficult to search every
computer for it each time, so such information is usually stored in a data
centre, which is designed to provide security and protection for the data.
Since the cost of computers and storage has decreased substantially, many
organizations choose to scale out by having multiple computers work
together, rather than scaling up by buying a single, more powerful machine.
The intent behind this is to keep the work going continuously: if one
computer fails, another continues the operation. Some cloud applications
need to scale in this way as well. Consider applications like YouTube,
Netflix, and Facebook, which require enormous scaling; we rarely see them
fail, because they run their systems on the cloud. There, a network cluster
connects many computers to the same network to accomplish similar tasks, and
you can think of the cluster as a single source of information, or a single
computer, that manages everything to improve performance, scalability, and
availability.