SUBMITTED BY
NALLALA HARI VIJAY RAM GOPAL
Regd.no:208297603033
A.Y: 2020-2024
ADIKAVI NANNAYA UNIVERSITY: RAJAMAHENDRAVARAM
CERTIFICATE
This is to certify that this internship report entitled “MACHINE LEARNING USING PYTHON” is a bonafide work of NALLALA HARI VIJAY RAM GOPAL, Regd. No: 208297603033, submitted in partial fulfillment of the requirements for the award of the Degree of B.Tech (ECE) during the period 2020-2024. This work was carried out at EASY SHIKSHA.
COURSE COORDINATOR
INTERNSHIP CERTIFICATE
DECLARATION
I, NALLALA HARI VIJAY RAM GOPAL, Regd. No: 208297603033, hereby declare that the internship project report entitled “MACHINE LEARNING USING PYTHON”, done by me under the guidance of Mr. B. Krishna, M.Tech, Assistant Professor, Department of Electronics and Communication Engineering, University College of Engineering, Adikavi Nannaya University, Rajamahendravaram, is submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering in the academic year 2020-2024.
208297603033
ACKNOWLEDGEMENT
I would like to take this opportunity to express my deep gratitude to everyone who assisted me directly and indirectly in the completion of this project work.
I feel fortunate to pursue my Bachelor's degree on the campus of Adikavi Nannaya University, which provided all the facilities needed in the area of Electronics and Communication Engineering.
I profusely thank Dr. V PERSIS, Principal, University College of Engineering, for all the encouragement and support.
I also thank Mr. B. SUDHA KIRAN, M.Tech, Course Coordinator, Electronics and Communication Engineering, for his guidance throughout the academic year.
It is a genuine pleasure to express my deep sense of thanks and gratitude to my mentor and project guide, Mr. B. Krishna, M.Tech, Assistant Professor, Department of Electronics and Communication Engineering, for his excellent guidance right from the selection of the internship.
I wish to express my deep sense of gratitude to the management of EASY SHIKSHA for giving me the opportunity to complete my MACHINE LEARNING WITH PYTHON internship in partial fulfillment of my degree of Bachelor of Technology in Electronics and Communication Engineering.
A great deal of thanks goes to the review committee members and the entire faculty for their support throughout the project.
208297603033
ABSTRACT
Machine learning (ML) is the scientific study of algorithms and statistical models that
computer systems use to perform a specific task without being explicitly programmed. Every
time a web search engine like Google is used to search the internet, one of the reasons it works so well is a learning algorithm that has learned how to rank web pages. These algorithms are used for various purposes such as data mining, image processing, and predictive analytics, to name a few. The main advantage of using machine learning is that, once an
algorithm learns what to do with data, it can do its work automatically.
This report covers:
1. An introduction to machine learning and the structure of the data that is fed to an ML model.
2. Basics of Python needed to get started with machine learning, implementing the concepts using hands-on examples.
COMPANY PROFILE
INTRODUCTION:
Mr. Sunil Sharma, our founder & CEO, started Easy Shiksha as a blog with a mission to bring a culture of meaningful internships to India. For the first two years, he hired only virtual interns.
After building a small team, we then launched our website with just one goal - to
equip every student in India with their dream internship. And we did it all for free.
The next big step could not have been anything other than launching our very own Android app, bringing Easy Shiksha into the ‘hands’ of the students.
After many successful years as an internship platform, our motivation to upskill the
students only increased, and that’s when we kickstarted a new journey with Easy Shiksha
Trainings.
With Fresher jobs, we embarked on a journey filled with newer challenges, which
allowed us to provide bigger & better opportunities to graduates with 0-2 years of
experience.
With the insight that more than 90% of graduates in India start their careers with a job that pays less than 3 LPA, we came up with Job-Oriented Specialization programs to help students start their careers in their dream profiles.
Sunil Sharma is the one who started it all. It takes a lot of guts to create a start-up, and he is the one with all of it. Meet the CEO. Yes, he has loads of patience and humility, but on that once-in-a-blue-moon day when he breathes fire, no one dares to tread his territory. He is a master manager and rightfully at the helm of affairs at Easy Shiksha.
INDEX
2.3 ADVANTAGES
3.1 NUMPY
3.2 PANDAS
3.3 MATPLOTLIB
CHAPTER – 6: DECISION TREE
CHAPTER – 8: CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
In the real world, we are surrounded by humans who can learn everything from their experiences, and we have computers or machines that work on our instructions. But can a machine also learn from experiences or past data as a human does? This is where Machine Learning comes in.
Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. It can be defined in a summarized way as:
“Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.”
A machine learning system learns from historical data, builds prediction models, and predicts the output whenever it receives new data. The accuracy of the predicted output depends on the amount of data: a large amount of data helps to build a better model that predicts the output more accurately.
Suppose we have a complex problem in which we need to make predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output.
[Block diagram: working of a machine learning algorithm]
1.2 CLASSIFICATION OF MACHINE LEARNING
I. Supervised learning
II. Unsupervised learning
III. Reinforcement learning
Some everyday examples of machine learning:
Facebook Newsfeed
Facebook photo auto-tagging
Product recommendations by shopping portals
Weather prediction
CONCLUSION
Machine learning is a field of artificial intelligence that deals with the design and
development of algorithms that can learn from and make predictions on data. The aim of machine
learning is to automate analytical model building and enable computers to learn from data without
being explicitly programmed to do so.
CHAPTER 2
PYTHON
INTRODUCTION
Python is a dynamically typed language: it does not require explicit declarations of variables, parameters, functions, or methods in source code. This makes the code short and flexible, though you lose the compile-time type checking of the source code. Python tracks the types of all values at runtime and flags code that does not make sense as it runs.
Data types determine whether an object can do something, or whether an operation just would not make sense. Other programming languages often determine whether an operation makes sense for an object by making sure the object can never be stored somewhere the operation will be performed on it (this type system is called static typing). Python does not do that. Instead, it stores the type of an object with the object and, when an operation is performed, checks whether that operation makes sense for that object. Python has many native data types. Here are the important ones:
• Numbers can be integers (1 and 2), floats (1.1 and 1.2), fractions, or even complex numbers.
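A minimal sketch of this dynamic typing, using only the standard library; the variable names are illustrative:

```python
from fractions import Fraction

# Python stores the type with the value, not with the variable name.
x = 1                 # int
x = 1.2               # the same name may now hold a float
f = Fraction(1, 3)    # an exact rational number
z = 2 + 3j            # a complex number

print(type(x), type(f), type(z))
print(1 + 1.2)        # int + float is checked, and allowed, at runtime
```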
2.2 STRUCTURED DATA AND UNSTRUCTURED DATA
Structured Data:
This type of data consists of numbers or words arranged in a tabular format. A categorical field may take numerical values, but mathematical operations cannot meaningfully be performed on it.
Unstructured Data:
This type of data does not have a fixed format and is therefore known as unstructured data. It comprises textual data, sounds, images, videos, etc.
2.3 ADVANTAGES
Simple Syntax
Robust Standard Libraries
Open Source
Compatible with different platforms like Windows, Mac, Linux, Raspberry Pi, etc.
An IDE (Integrated Development Environment) is a software suite that consolidates the basic tools developers need to write and test software. We need not only Python but also an IDE to manage code, files, etc. An IDE consolidates basic tools like a code editor, a compiler, and a debugger.
Some of the IDE tools are Jupyter Notebook (Anaconda), PyCharm, and Spyder (Anaconda).
Notebook-style IDEs such as Jupyter add the ability to execute a specific part of the code without having to execute the entire program.
CONCLUSION
Python is both a free and an open-source programming language. Python is among the easiest languages to learn: its English-like structure helps non-technical students pick up the language quickly, and a newcomer can always turn to the very active Python community for assistance. Python does not even necessitate the installation of any software; it can be run in a browser using Google Colab or Kaggle. Python generates a considerable number of jobs as its popularity and usability grow.
CHAPTER – 3
PYTHON LIBRARIES
INTRODUCTION
A Python library is a reusable chunk of code that you may want to include in your programs or projects. Unlike in languages such as C or C++, a library in Python does not pertain to any specific context; here, a ‘library’ loosely describes a collection of core modules. Essentially, then, a library is a collection of modules. A package is a library that can be installed using a package manager (for Python, pip; analogous to rubygems or npm in other languages).
3.1 NUMPY
• NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
• It is a library for scientific computing.
NUMPY FUNCTIONS:
1. np.array()
2. np.ones()
3. np.zeros()
4. np.arange(10)
5. np.linspace(1,3,5)
6. np.sum()
7. np.max()
8. np.min()
9. np.mean()
NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices.
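A short sketch exercising the functions listed above (NumPy is assumed to be installed):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])    # array from a Python list
ones = np.ones(3)                # [1. 1. 1.]
zeros = np.zeros(3)              # [0. 0. 0.]
r = np.arange(10)                # integers 0..9
ls = np.linspace(1, 3, 5)        # 5 evenly spaced values from 1 to 3

print(np.sum(a), np.max(a), np.min(a), np.mean(a))   # 15 5 1 3.0
```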
3.2 PANDAS
Pandas is a software library written for the Python programming language for data
manipulation and analysis. It offers data structures and operations for manipulating numerical
tables and time series. It is free software released under the three-clause BSD license.
Example:
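The original example figure is not reproduced here; the snippet below is a minimal, hypothetical illustration of the same idea:

```python
import numpy as np
import pandas as pd

# A small table; NaN marks a missing value.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Kiran"],
    "marks": [82.5, np.nan, 74.0],
})

print(df.head())                                      # inspect the first rows
print(df["marks"].mean())                             # NaN is skipped automatically
df["marks"] = df["marks"].fillna(df["marks"].mean())  # impute the missing value
```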
Advantages:
Fast and efficient for manipulating and analysing data.
Data from different file objects can be loaded.
Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
Data set merging and joining.
Flexible reshaping and pivoting of data sets.
Provides time-series functionality.
Powerful groupby functionality for performing split-apply-combine operations on data sets.
3.3 MATPLOTLIB
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
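A minimal plotting sketch using the pyplot interface on a NumPy array:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")   # line plot of the array
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()
```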
3.4 SCIKIT LEARN
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support-vector machines.
scikit-learn allows us to define machine learning algorithms and compare them to one another, and it also offers tools to preprocess data. K-means clustering, random forests, support vector machines, and many other machine learning models that we might want to develop are all included in scikit-learn.
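As a sketch of that compare-models workflow, the snippet below fits two of the algorithms named above on the iris dataset bundled with scikit-learn; the choice of models and split is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit each model and compare accuracy on the held-out data.
for model in (RandomForestClassifier(random_state=42), SVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```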
CONCLUSION
Python libraries are very important for carrying out specific operations in Python. No single library contains all the commands, so the libraries discussed above are used together, each for its own purpose.
CHAPTER – 4
MACHINE LEARNING
The types of machine learning algorithms differ in their approach, the type of data they input
and output, and the type of task or problem that they are intended to solve. Broadly Machine
Learning can be categorized into four categories.
1. Supervised
2. Unsupervised
3. Reinforcement
4. Semi-supervised
Machine learning enables the analysis of massive quantities of data. While it generally delivers faster, more accurate results for identifying profitable opportunities or dangerous risks, it may also require additional time and resources to train properly.
1. Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type, the
machine learning algorithm is trained on labelled data. Even though the data needs to be labelled
accurately for this method to work, supervised learning is extremely powerful when used in the
right circumstances.
In supervised learning, the ML algorithm is given a small training dataset to work with. This
training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic
idea of the problem, solution, and data points to be dealt with. The training dataset is also very
similar to the final dataset in its characteristics and provides the algorithm with the labelled
parameters required for the problem.
The algorithm then finds relationships between the parameters given, essentially establishing
a cause-and-effect relationship between the variables in the dataset. At the end of the training,
the algorithm has an idea of how the data works and the relationship between the input and
the output.
2. Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with unlabelled data. This means that human labour is not required to make the dataset machine-readable, allowing much larger datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of the relationship
between any two data points. However, unsupervised learning does not have labels to work off
of, resulting in the creation of hidden structures. Relationships between data points are perceived
by the algorithm in an abstract manner, with no input required from human beings.
[Fig. 4.1.3: Working of unsupervised learning]
The creation of these hidden structures is what makes unsupervised learning algorithms versatile.
Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to
the data by dynamically changing hidden structures.
3. Reinforcement Learning
In reinforcement learning, we model an environment after the problem statement. The model interacts with this environment and comes up with solutions all on its own, without human interference.
4. Semi-supervised
Semi-supervised learning sits between supervised and unsupervised learning: it trains on a small amount of labelled data together with a large amount of unlabelled data.
Supervised learning problems can be further divided into two types:
I. Regression
II. Classification
Regression:
Regression is a technique for investigating the relationship between independent
variables or features and a dependent variable or outcome. It's used as a method for predictive
modelling in machine learning, in which an algorithm is used to predict continuous outcomes.
Linear Regression
Decision tree regression
Random forest Regression
Classification:
Classification is a technique for determining which class the dependent variable belongs to based on one or more independent variables.
Logistic Regression
Decision Tree
Random Forest
KNN
Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of the target or dependent variable is dichotomous, which means there would be only two possible classes.
In simple words, the dependent variable is binary in nature, having data coded as either 1 (standing for success/yes) or 0 (standing for failure/no).
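A minimal sketch with scikit-learn, using its bundled breast-cancer dataset, whose target is exactly such a dichotomous (0/1) variable; the scaling step is an implementation choice that helps the solver converge:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary target: 0 (malignant) or 1 (benign).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # predicted probability of each class
print(clf.score(X_test, y_test))       # accuracy on unseen data
```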
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering
Association
Clustering (k-means)
Clustering or Cluster Analysis is a machine learning technique, which groups the unlabeled dataset.
It can be defined as:
"A way of grouping the data points into different clusters, consisting of similar data points. The
objects with the possible similarities remain in a group that has less or no similarities with another
group."
Clustering Algorithms
The choice of clustering algorithm depends on the kind of data we are using. Some algorithms require guessing the number of clusters in the given dataset, whereas others require finding the minimum distance between observations of the dataset.
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems. It groups the unlabelled dataset into different clusters. “It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.”
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
It determines the best value for the K center points or centroids by an iterative process.
It assigns each data point to its closest k-center; the data points near a particular k-center form a cluster.
[Diagram: working of the K-Means clustering algorithm]
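A minimal k-means sketch on synthetic, unlabelled 2-D points; the data and the choice k = 2 are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of unlabelled points.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the two centroids found iteratively
print(km.labels_[:10])       # cluster assignment of the first 10 points
```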
4.4 APPLICATIONS OF MACHINE LEARNING:
Machine learning is a buzzword for today’s technology, and it is growing very rapidly day by
day. We are using machine learning in our daily life even without knowing it such as Google
Maps, Google assistant, Alexa, etc. Below are some most trending real-world applications of
Machine Learning:
Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is the automatic friend-tagging suggestion.
Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in which machine learning plays a significant role. Tesla, a popular car manufacturer, is working on self-driving cars and is using an unsupervised learning method to train the car models to detect people and objects while driving.
Medical Diagnosis:
Machine learning is also used in medical science for disease diagnosis, helping to analyse patient data and medical images so that diseases can be detected earlier and more reliably.
CONCLUSION
Machine learning is a powerful tool for making predictions from data. However, it is
important to remember that machine learning is only as good as the data that is used to train the
algorithms. In order to make accurate predictions, it is important to use high-quality data that is
representative of the real-world data that the algorithm will be used on.
CHAPTER – 5
DATA EXPLORATION AND PREPROCESSING
INTRODUCTION
Data exploration is a key aspect of data analysis and model building. Without spending significant time on understanding the data and its patterns, one cannot expect to build efficient predictive models. Data exploration, comprising data cleaning and preprocessing, takes a major chunk of the time in a machine learning project. Its key steps are:
Identify variables
Variable analysis
Handling missing values
Handling outliers
Feature engineering
1. Identifying Variables:
Variables can be of different types, such as character, numeric, categorical, and continuous. Identifying the predictor and target variables is also a key step in model building. The target is the dependent variable, and the predictor is the independent variable on which the prediction is based. Categorical or discrete variables are those that cannot be mathematically manipulated; they are made up of fixed values such as 0 and 1.
2. Variable Analysis:
Variable analysis can be done in three ways, univariate analysis, bivariate analysis, and multivariate
analysis
Univariate analysis
It is used to highlight missing and outlier values. Here each variable is analyzed on its own
for range and distribution.
For categorical variables, you can use a frequency table to understand the distribution of each category.
For continuous variables, you have to understand the central tendency and spread of the variable, which can be measured using the mean, median, mode, etc., as in the short sketch below.
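A small pandas sketch of both kinds of univariate analysis; the columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A"],               # categorical variable
    "income": [30000, 52000, 41000, 99000, 38000],   # continuous variable
})

print(df["city"].value_counts())   # frequency table per category
print(df["income"].describe())     # mean, spread, quartiles
```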
Bivariate Analysis
o A scatter plot is suitable for analyzing two continuous variables. It indicates the linear or non-linear relationship between the variables.
o The Matplotlib and Seaborn libraries can be used to plot different relational graphs that help visualize the bivariate relationship between different types of variables.
3. Handling Missing Values:
Missing values in the dataset can reduce model fit. They can lead to a biased model because the data cannot be analysed completely, behaviour and relationships with other variables cannot be deduced correctly, and wrong predictions or classifications can result. Missing values can be treated by deletion, mean/mode/median imputation, transformation, and binning.
4. Handling Outliers:
Outliers can occur naturally in data or can be due to data entry errors. They can drastically change the results of data analysis and statistical modelling. Outliers are easily detected by visualization methods like box plots, histograms, and scatter plots. Outliers are handled like missing values: by deleting them, transforming them, binning or grouping them, treating them as a separate group, or imputing values.
5. Feature Engineering:
Feature engineering is the process of extracting more information from existing data. Feature selection can also be part of it. Two common techniques of feature engineering are variable transformation and variable creation. In variable transformation, an existing variable is transformed using certain functions; for example, a number can be replaced by its logarithmic value.
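A small sketch of both techniques on a hypothetical income column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000, 52000, 41000, 990000]})

# Variable transformation: replace a skewed number by its logarithmic value.
df["log_income"] = np.log(df["income"])

# Variable creation: derive a new feature from an existing one.
df["is_high_income"] = (df["income"] > 100000).astype(int)
print(df)
```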
5.2 SPLITTING OF DATA
It is suggested to divide your dataset into three parts, called the training, validation, and test sets, to avoid overfitting and model selection bias.
5.3 DATA PREPROCESSING
Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step in creating a machine learning model. Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly by machine learning models. Data preprocessing is the task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. Its main tasks, sketched in code after the list below, are:
Importing libraries
Importing datasets
Finding Missing Data
Encoding Categorical Data
Splitting dataset into training and test set
Feature scaling
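A sketch of these steps with pandas and scikit-learn; the file name data.csv and the target column are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")                      # importing the dataset (hypothetical file)
print(df.isnull().sum())                          # finding missing data per column
df = df.fillna(df.median(numeric_only=True))      # impute numeric gaps with the median
df = pd.get_dummies(df)                           # encode categorical columns

X, y = df.drop(columns=["target"]), df["target"]  # assumes a 'target' column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)          # splitting into training and test sets

scaler = StandardScaler()                         # feature scaling
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```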
CONCLUSION
Data exploration is very helpful whenever the need is to gain new insights from data. It can
be the most effective – and sometimes only – approach when the data suffers from quality problems
(e.g., gaps, anomalies, outliers) and other real-world complexities such as process changes or when it
is not practical to formulate a precise question beforehand that can be answered by a numeric result.
In this respect, research & development, engineering as well as data science are among those fields
that can benefit a lot from data exploration. With today’s computing power and the support of
modern analytics, interactive data exploration can be an exciting and engaging experience for
everyone to discover unexpected value in large amounts of complex data.
CHAPTER 6
DECISION TREE
INTRODUCTION
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. “It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.”
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further once a leaf node is reached.
Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called child nodes.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
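As a brief sketch of these terms in practice, the snippet below fits a small tree on the iris dataset bundled with scikit-learn and prints its root node, splits, and leaf nodes; the depth limit is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# criterion may be "entropy" (information gain) or "gini" (Gini index).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree))   # textual view of the root, splits, and leaves
```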
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
1. Information Gain
2. Gini Index
1. Information Gain:
Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision tree.
A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
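A small sketch computing this quantity directly; it assumes the usual definition Entropy(S) = -Σ p_i log2(p_i) over the class proportions, and the example split is hypothetical:

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    # Entropy(S) minus the weighted average entropy of each split subset.
    n = sum(len(s) for s in subsets)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no"]
# A hypothetical attribute splits the dataset into two pure subsets:
print(information_gain(parent, [["yes", "yes", "yes"], ["no", "no"]]))  # ~0.971
```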
2. Gini Index:
The Gini index is a measure of impurity used while creating a decision tree (for example, in the CART algorithm). It is calculated as Gini = 1 − Σ (p_i)², and an attribute with a low Gini index is preferred over one with a high Gini index.
CONCLUSION
As one of the most important supervised algorithms, the decision tree plays a vital role in decision analysis in real life. As a predictive model, it is used in many areas for its splitting approach, which helps identify solutions based on different conditions by either the classification or the regression method.
CHAPTER – 7
PROJECT
Multiple linear regression is a statistical technique that models more complex relationships between two or more independent variables and one dependent variable. It is used when there are two or more x variables.
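A minimal sketch using the diabetes dataset bundled with scikit-learn, which has ten independent variables and one continuous dependent variable:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print(reg.coef_)                  # one coefficient per independent variable
print(reg.score(X_test, y_test))  # R^2 on unseen data
```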
I was instructed to perform a project on SENTIMENT ANALYSIS. For the project, the company provided a dataset of restaurant reviews, which is a classic example of sentiment analysis. I developed the code to predict the sentiment of the reviews given by the customers and found the accuracy score of the prediction. Finally, I designed a web app for analyzing a review.
Sentiment analysis:
Sentiment analysis is the process of computationally identifying and categorizing opinions from a piece of text and determining whether the writer’s attitude towards a particular topic or product is positive, negative, or neutral. Sentiment analysis follows the steps below:
Tokenization
Cleaning the data (remove the special characters)
Removing stop words
Classification (apply supervised algorithm for classification)
For example, “the android system is so good.” This sentence will undergo the steps above, and the result will be positive, negative, or neutral (+1, -1, 0).
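A minimal sketch of this pipeline with scikit-learn and NLTK; the file name Restaurant_Reviews.tsv and its Review/Liked columns are assumptions about the provided dataset:

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("Restaurant_Reviews.tsv", sep="\t")  # hypothetical file
stop = set(stopwords.words("english"))

def clean(text):
    text = re.sub(r"[^a-zA-Z ]", " ", text).lower()            # remove special characters
    return " ".join(w for w in text.split() if w not in stop)  # tokenize, drop stop words

X = CountVectorizer().fit_transform(df["Review"].apply(clean))
X_train, X_test, y_train, y_test = train_test_split(X, df["Liked"], random_state=0)

clf = MultinomialNB().fit(X_train, y_train)         # supervised classification step
print(accuracy_score(y_test, clf.predict(X_test)))  # accuracy of the predictions
```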
There is also Twitter sentiment analysis, in which Twitter comments are classified in the same way as the restaurant reviews.
The Naive Bayes algorithms were tied for the highest accuracy of the 12 Twitter sentiment analysis approaches tested. Sentiment analysis of Twitter data has many organizational benefits, such as understanding your brand more deeply, growing your influence, and improving your customer service.
Training data and test data are two important concepts in machine learning.
Training Data
The observations in the training set form the experience that the algorithm uses to learn. In
supervised learning problems, each observation consists of an observed output variable and one or
more observed input variables.
Test Data
The test set is a set of observations used to evaluate the performance of the model using some
performance metric. It is important that no observations from the training set are included in the test
set. If the test set does contain examples from the training set, it will be difficult to assess whether
the algorithm has learned to generalize from the training set or has simply memorized it.
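A minimal sketch of such a held-out split using scikit-learn; the toy observations are illustrative:

```python
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 25% of the observations; none of them appear in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))   # 7 3
```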
Some training sets may contain only a few hundred observations; others may include millions.
Inexpensive storage, increased network connectivity, the ubiquity of sensor-packed smartphones, and
shifting attitudes towards privacy have contributed to the contemporary state of big data, or training
sets with millions or billions of examples.
Many supervised training sets are prepared manually, or by semi-automated processes. Creating a
large collection of supervised data can be costly in some domains. Fortunately, several datasets are
bundled with scikit-learn, allowing developers to focus on experimenting with models instead.
CHAPTER – 8
CONCLUSION
We now know that machine learning is a powerful tool for making predictions from data. It is a technique for training machines to perform the activities a human brain can do, a bit faster and better than an average human being. However, it is important to remember that machine learning is only as good as the data used to train the algorithms. To make accurate predictions, it is important to use high-quality data that is representative of the real-world data on which the algorithm will be used.
I understand that machine learning can be supervised or unsupervised. If there is a smaller amount of clearly labelled data for training, opt for supervised learning. Unsupervised learning would generally give better performance and results for large data sets.
REFERENCES
www.tutorials.com
www.javatpoint.com
www.geeksforgeeks.org