
UNIT 1

INTRODUCTION
SYLLABUS
➢ Why Machine Learning?
➢ Why Python?
➢ Essential libraries and tools
➢ Experiment: Basics of Python libraries

MACHINE LEARNING
➢ Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to
“self-learn” from training data and improve over time, without being explicitly programmed.
➢ Machine learning is about extracting knowledge from data.

WHY MACHINE LEARNING?

Problems Machine Learning Can Solve


Supervised Learning
➢ It takes a known set of input data (the training set) and the known responses to that data (the
output), and forms a model to generate reasonable predictions for the response to new
input data.
Examples of Supervised Machine Learning tasks:
1. Identifying the zip code from handwritten digits on an envelope.
➢ Here the input is a scan of the handwriting, and the desired output is the actual digits in
the zip code.
➢ To create a dataset for building an ML model, collect many envelopes, then read the zip
codes and store the digits as the desired outcomes.
2. Determining whether a tumor is benign based on a medical image
➢ Here the input is the image, and the output is whether the tumor is benign.
➢ To create a dataset for building a model, a database of medical images is needed.
➢ An expert opinion is needed, so a doctor needs to look at all of the images and decide
which tumors are benign and which are not.
➢ It might even be necessary to do additional diagnosis beyond the content of the image to
determine whether the tumor in the image is cancerous or not.
3. Detecting fraudulent activity in credit card transactions
➢ Input is a record of the credit card transaction, and the output is whether it is likely to be
fraudulent or not.
➢ Collecting a dataset means storing all transactions and recording if a user reports any
transaction as fraudulent.
Unsupervised Learning
➢ It is a type of machine learning in which models are trained on an unlabeled dataset and are
allowed to act on that data without any supervision.
Examples of unsupervised learning include:
1. Identifying topics in a set of blog posts
➢ Given a large collection of text data, we may want to summarize it and find the prevalent themes
in it. The topics are not known beforehand, and there may not even be a clue about how many
topics there are. Therefore, there are no known outputs.
2. Segmenting customers into groups with similar preferences
➢ Given a set of customer records, it is required to identify which customers are similar, and
whether there are groups of customers with similar preferences. There is no information about
the groups.
3. Detecting abnormal access patterns to a website
➢ To identify abuse or bugs, it is often helpful to find access patterns that are different from the
norm. Each abnormal pattern might be very different, and there might not be any recorded
instances of abnormal behavior.

Representation of input data that a computer can understand


➢ The data should be thought of as a table.
➢ Each data point is a row, and each property that describes that data point is a column.
➢ Each entity or row is known as a Sample (or data point) in machine learning, while the
columns, the properties that describe these entities, are called Features.
➢ Building a good representation of the data is called Feature Extraction or Feature
Engineering.
➢ No machine learning algorithm will be able to make a prediction on data for which it has no
information.
➢ Ex: If the only feature available for a patient is their last name, no algorithm will be able to
predict their gender.
Adding another feature that contains the patient’s first name will give much better luck, as it
is often possible to tell the gender from a person’s first name.

Knowing the Task and Knowing the Data


➢ Important part in the machine learning process is understanding the data and how it relates to
the task.
➢ It will not be effective to randomly choose an algorithm and throw data at it.
➢ It is necessary to understand what is going on in dataset before building a model.
➢ Each algorithm is different in terms of what kind of data and what problem setting it works
best for.
➢ While building a machine learning solution, answer the following questions:
1. What question(s) am I trying to answer?
2. Do I think the data collected can answer that question?
3. What is the best way to phrase my question(s) as a machine learning problem?
4. Have I collected enough data to represent the problem I want to solve?
5. What features of the data did I extract, and will these enable the right predictions?
6. How will I measure success in my application?
7. How will the machine learning solution interact with other parts of my research or
business product?

WHY PYTHON?
➢ Python combines the power of general-purpose programming languages with the ease of use
of domain-specific scripting languages like MATLAB or R.
➢ Python has libraries for data loading, visualization, statistics, natural language processing,
image processing, and more.
➢ It provides data scientists with a large array of general- and special-purpose functionality.
➢ An advantage of using Python is the ability to interact directly with the code, using a terminal
or other tools like the Jupyter Notebook.
➢ Machine learning and data analysis are iterative processes, in which the data drives the
analysis. It is essential to have tools that allow quick iteration and easy interaction.
➢ As a general-purpose programming language, Python also allows for the creation of complex
graphical user interfaces (GUIs) and web services, and for integration into existing systems.
scikit-learn
➢ scikit-learn is an open source project, it is free to use and distribute, and source code is easily
available. The scikit-learn project is constantly being developed and improved.
➢ It has a very active user community.
➢ It contains a number of state-of-the-art machine learning algorithms, as well as
comprehensive documentation about each algorithm.
➢ It is a very popular tool, and the most prominent Python library for machine learning.
➢ It is widely used in industry and academia, and tutorials and code snippets are available online.
➢ It works well with a number of other scientific Python tools.
➢ scikit-learn depends on two other Python packages, NumPy and SciPy.
➢ For plotting and interactive development, install matplotlib, IPython, and the Jupyter
Notebook.
➢ It is recommended to use one of the following prepackaged Python distributions, which will
provide the necessary packages:
1. Anaconda
2. Enthought Canopy
3. Python(x,y)
1. Anaconda
➢ A Python distribution made for large-scale data processing, predictive analytics, and
scientific computing.
➢ Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook, and
scikit-learn.
➢ Available on Mac OS, Windows, and Linux.
➢ It is a very convenient solution for users who do not have an existing installation of the
scientific Python packages.
➢ Anaconda includes the commercial Intel MKL library for free.
➢ MKL can give significant speed improvements for many algorithms in scikit-learn.
2. Enthought Canopy
➢ Python distribution for scientific computing.
➢ This comes with NumPy, SciPy, matplotlib, pandas, and IPython, but the free version does
not come with scikit-learn.
➢ Users at an academic, degree-granting institution can request an academic license and get free
access to the paid subscription version of Enthought Canopy.
➢ Enthought Canopy is available for Python 2.7.x, and works on Mac OS, Windows, and Linux.
3. Python(x,y)
➢ A free Python distribution for scientific computing, specifically for Windows.
➢ Python(x,y) comes with NumPy, SciPy, matplotlib, pandas, IPython, and scikit-learn.

ESSENTIAL LIBRARIES AND TOOLS


➢ scikit-learn is built on top of the NumPy and SciPy scientific Python libraries.
➢ In addition to NumPy and SciPy, pandas and matplotlib can be used.
➢ The Jupyter Notebook is a browser-based interactive programming environment.

Jupyter Notebook
➢ It is an interactive environment for running code in the browser.
➢ It is a great tool for exploratory data analysis and is widely used by data scientists.
➢ It supports many programming languages; here, only the Python support is needed.
➢ The Jupyter Notebook makes it easy to incorporate code, text, and images.

NumPy
➢ It is one of the fundamental packages for scientific computing in Python.
➢ It contains functionality for multidimensional arrays, high-level mathematical functions such
as linear algebra operations and the Fourier transform, and pseudorandom number generators.
➢ In scikit-learn, the NumPy array is the fundamental data structure.
➢ scikit-learn takes in data in the form of NumPy arrays.
➢ Any data you want to use has to be converted to a NumPy array.
➢ The core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional)
array.
➢ All elements of the array must be of the same type.

➢ A NumPy array looks like this:
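The original figure here showed a small example array; a minimal sketch along those lines:

import numpy as np

# Create a 2x3 array of integers
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
# x:
# [[1 2 3]
#  [4 5 6]]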

➢ Objects of the NumPy ndarray class are referred to as “NumPy arrays” or just “arrays”.
SciPy
➢ SciPy is a collection of functions for scientific computing in Python.
➢ It provides advanced linear algebra routines, mathematical function optimization, signal
processing, special mathematical functions, and statistical distributions.
➢ The most important part of SciPy here is scipy.sparse: this provides sparse matrices, which are
another representation used for data in scikit-learn.
➢ Sparse matrices are used to store a 2D array that contains mostly zeros:
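A minimal sketch of such a matrix, assuming the usual scipy.sparse API: create a dense identity matrix and convert it to SciPy’s CSR sparse format:

import numpy as np
from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
# Convert the NumPy array to a SciPy sparse matrix in CSR format;
# only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy sparse CSR matrix:\n{}".format(sparse_matrix))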

➢ Usually it is not possible to create dense representations of sparse data (they would not fit
into memory), so sparse representations need to be created directly.
➢ To create the same matrix using the COO format:
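A sketch of the same idea in COO format, where the matrix is built directly from the coordinates and values of its nonzero entries:

import numpy as np
from scipy import sparse

# Coordinates and values of the nonzero entries of a 4x4 identity matrix
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))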
matplotlib
➢ It is the primary scientific plotting library in Python.
➢ It provides functions for making publication-quality visualizations such as line charts,
histograms, scatter plots, and so on.
➢ Visualizing data and different aspects of analysis can give important insights.
➢ When working inside the Jupyter Notebook, figures can be shown directly in the browser by
using the %matplotlib notebook and %matplotlib inline commands.
➢ Using %matplotlib notebook provides an interactive environment.
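A minimal plotting sketch (inside Jupyter, one of the %matplotlib commands above would be run first):

import numpy as np
import matplotlib.pyplot as plt

# Generate a sequence of 100 numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Compute the sine of each value
y = np.sin(x)
# plot makes a line chart of one array against the other
plt.plot(x, y, marker="x")
plt.show()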

pandas
➢ pandas is a Python library for data wrangling and analysis.
➢ It is built around a data structure called the DataFrame that is modeled after the R
DataFrame.
➢ A pandas DataFrame is a table, similar to an Excel spreadsheet.
➢ pandas provides a great range of methods to modify and operate on this table; it allows SQL-
like queries and joins of tables.
➢ pandas allows each column to have a separate type (for example, integers, dates, floating-
point numbers, and strings).
➢ pandas can ingest data from a great variety of file formats and databases, like SQL databases,
Excel files, and comma-separated values (CSV) files.
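A minimal sketch of building a DataFrame and running a SQL-like selection on it (the names and values are purely illustrative):

import pandas as pd

# A simple dataset of people, as a dictionary of columns
data = {"Name": ["John", "Anna", "Peter", "Linda"],
        "Location": ["New York", "Paris", "Berlin", "London"],
        "Age": [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)
print(data_pandas)

# SQL-like query: select all rows whose Age column is greater than 30
print(data_pandas[data_pandas.Age > 30])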

mglearn
➢ mglearn is a library of utility (helper) functions written to accompany the textbook examples;
it is used mainly for plotting and for loading example datasets.

Application: Classifying Iris Species


Assume that a hobby botanist is interested in distinguishing the species of some iris flowers that she
has found. She has collected some measurements associated with each iris: the length and width of the
petals and the length and width of the sepals, all measured in centimeters. She also has the
measurements of some irises that have been previously identified by an expert botanist as belonging
to the species setosa, versicolor, or virginica. For these measurements, she can be certain of which
species each iris belongs to. Let’s assume that these are the only species the hobby botanist will
encounter in the wild. The goal is to build a machine learning model that can learn from the
measurements of these irises whose species is known, so that it can predict the species for a new iris.
Meet the Data:
➢ It is included in scikit-learn in the datasets module.
➢ It can be loaded by calling the load_iris function:
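A minimal sketch of the call (iris_dataset is just a variable name chosen here):

from sklearn.datasets import load_iris

iris_dataset = load_iris()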

➢ It contains keys and values:
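The returned object behaves much like a dictionary, so its keys can be listed (a sketch; the exact set of keys depends on the scikit-learn version):

print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))
# Typically includes 'data', 'target', 'target_names', 'DESCR', and 'feature_names'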

➢ The value of the key DESCR is a short description of the dataset.
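A sketch of printing the start of that description:

# Print the first part of the dataset description
print(iris_dataset['DESCR'][:200] + "\n...")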


➢ The value of the key target_names is an array of strings, containing the species of flower that
is to be predicted:
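A sketch:

print("Target names: {}".format(iris_dataset['target_names']))
# Target names: ['setosa' 'versicolor' 'virginica']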

➢ The value of feature_names is a list of strings, giving the description of each feature:
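A sketch:

print("Feature names:\n{}".format(iris_dataset['feature_names']))
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']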

➢ The data itself is contained in the target and data fields. data contains the numeric
measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

➢ The rows in the data array correspond to flowers, while the columns represent the four
measurements that were taken for each flower:

➢ Here are the feature values for the first four samples:
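A sketch inspecting the data array (its shape is 150 samples by 4 features):

print("Type of data: {}".format(type(iris_dataset['data'])))
# <class 'numpy.ndarray'>
print("Shape of data: {}".format(iris_dataset['data'].shape))
# Shape of data: (150, 4)
print("First four rows of data:\n{}".format(iris_dataset['data'][:4]))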

➢ The target array contains the species of each of the flowers that were measured, also as a
NumPy array:

➢ target is a one-dimensional array, with one entry per flower:


➢ The species are encoded as integers from 0 to 2:
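A sketch inspecting the target array:

print("Type of target: {}".format(type(iris_dataset['target'])))
# <class 'numpy.ndarray'>
print("Shape of target: {}".format(iris_dataset['target'].shape))
# Shape of target: (150,)
print("Target:\n{}".format(iris_dataset['target']))
# The meanings of the integers are given by target_names:
# 0 = setosa, 1 = versicolor, 2 = virginica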

Measuring Success: Training and Testing Data:


➢ To measure how well a model generalizes, the labeled data is split into two parts: a training set
used to build the model, and a test set held back to evaluate it.
➢ Call train_test_split on the data and assign the outputs using the standard nomenclature
(X_train, X_test, y_train, y_test):
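A sketch of the split; by default train_test_split puts 75% of the samples into the training set and 25% into the test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

print("X_train shape: {}".format(X_train.shape))  # (112, 4)
print("X_test shape: {}".format(X_test.shape))    # (38, 4)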

Steps involved in Machine Learning modelling:


1. Collecting Data:
➢ Machines initially learn from the data. Collect reliable data so that the machine learning
model can find the correct patterns.
➢ The quality of the data that is fed to the machine will determine how accurate the model
is.
➢ If the data is incorrect or outdated, the model gives wrong outcomes or predictions which
are not relevant.
➢ Data should be from a reliable source, as it will directly affect the outcome of the model.
➢ Good data is relevant, contains very few missing and repeated values, and has a good
representation of the various subcategories/classes present.
2. Preparing the Data:
➢ Putting together all the data and randomizing it, so that the data is evenly distributed and the
ordering does not affect the learning process.
➢ Cleaning the data: removing unwanted data, handling missing values, dropping or fixing rows
and columns, removing duplicate values, converting data types, and so on. It may also be
necessary to restructure the dataset, changing the rows and columns or their index.
➢ Visualize the data to understand how it is structured and understand the relationship
between various variables and classes present.
➢ Splitting the cleaned data into two sets: a training set and a testing set. The training set is
the set the model is trained on; the testing set is used to check the accuracy of the model after
training.
3. Choosing a Model:
➢ It is important to choose a model which is relevant to the task. Also check whether the
model is suitable for numerical or categorical data and choose accordingly.
4. Training the Model:
➢ Pass the prepared data to the machine learning model to find patterns and make
predictions. The model learns from the data and accomplishes the task it was set. Over time,
with training, the model gets better at predicting.
5. Evaluating the Model:
➢ Test the performance of the model on previously unseen data: the testing set. If testing were
done on the same data the model was trained on, it would not give an accurate measure,
because the model is already used to that data and finds the same patterns in it as it
previously did; this gives a disproportionately high accuracy.
➢ Evaluating on the testing data gives an accurate measure of how the model will perform,
and of its speed.
6. Parameter Tuning:
➢ Accuracy can be improved by tuning the parameters present in the model.
➢ Parameters here are the variables in the model that the programmer generally sets by hand
(often called hyperparameters).
➢ At particular values of these parameters, the accuracy will be the maximum. Parameter
tuning refers to finding these values.
7. Making Predictions
➢ Finally, use the model on new, unseen data to make predictions accurately.
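To tie the steps together, a minimal end-to-end sketch on the iris data; the choice of a k-nearest neighbors model here is an assumption made for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Steps 1-2: collect and prepare the data (load iris, split into train/test sets)
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Step 3: choose a model - a k-nearest neighbors classifier with k=1 (an assumed choice)
knn = KNeighborsClassifier(n_neighbors=1)

# Step 4: train the model on the training set
knn.fit(X_train, y_train)

# Step 5: evaluate the model on the held-out test set
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

# Step 6: parameter tuning would repeat steps 4-5 for different values of n_neighbors

# Step 7: make a prediction for a new, unseen iris (measurements are illustrative)
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
prediction = knn.predict(X_new)
print("Predicted species: {}".format(iris_dataset['target_names'][prediction][0]))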
