Unit 1-1
INTRODUCTION
SYLLABUS
➢ Why Machine Learning?
➢ Why Python?
➢ Essential libraries and tools
➢ Experiment : Basics of python libraries
MACHINE LEARNING
➢ Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to
“self-learn” from training data and improve over time, without being explicitly programmed.
➢ Machine learning is about extracting knowledge from data.
WHY PYTHON?
➢ Python combines the power of general-purpose programming languages with the ease of use
of domain-specific scripting languages like MATLAB or R.
➢ Python has libraries for data loading, visualization, statistics, natural language processing,
image processing, and more.
➢ It provides data scientists with a large array of general- and special-purpose functionality.
➢ One of the main advantages of using Python is the ability to interact directly with the code,
using a terminal or other tools like the Jupyter Notebook.
➢ Machine learning and data analysis are iterative processes, in which the data drives the
analysis. It is essential to have tools that allow quick iteration and easy interaction.
➢ As a general-purpose programming language, Python also allows for the creation of complex
graphical user interfaces (GUIs) and web services, and for integration into existing systems.
scikit-learn
➢ scikit-learn is an open source project: it is free to use and distribute, and the source code is
easily available. The scikit-learn project is constantly being developed and improved.
➢ It has a very active user community.
➢ It contains a number of state-of-the-art machine learning algorithms, as well as
comprehensive documentation about each algorithm.
➢ It is a very popular tool, and the most prominent Python library for machine learning.
➢ It is widely used in industry and academia, and many tutorials and code snippets are available online.
➢ It works well with a number of other scientific Python tools.
➢ scikit-learn depends on two other Python packages, NumPy and SciPy.
➢ For plotting and interactive development, install matplotlib, IPython, and the Jupyter
Notebook.
➢ It is recommended to use one of the following prepackaged Python distributions, which will
provide the necessary packages:
1. Anaconda
2. Enthought Canopy
3. Python(x,y)
1. Anaconda
➢ A Python distribution made for large-scale data processing, predictive analytics, and
scientific computing.
➢ Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook, and
scikit-learn.
➢ Available on Mac OS, Windows, and Linux.
➢ It is a very convenient solution for anyone without an existing installation of the scientific
Python packages.
➢ Anaconda includes the commercial Intel MKL library for free.
➢ MKL can give significant speed improvements for many algorithms in scikit-learn.
2. Enthought Canopy
➢ Python distribution for scientific computing.
➢ This comes with NumPy, SciPy, matplotlib, pandas, and IPython, but the free version does
not come with scikit-learn.
➢ Members of an academic, degree-granting institution can request an academic license and get
free access to the paid subscription version of Enthought Canopy.
➢ Enthought Canopy is available for Python 2.7.x, and works on Mac OS, Windows, and Linux.
3. Python(x,y)
➢ A free Python distribution for scientific computing, specifically for Windows.
➢ Python(x,y) comes with NumPy, SciPy, matplotlib, pandas, IPython, and scikit-learn.
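Whichever distribution is used, it can help to confirm that the packages listed above are actually importable. The following is a minimal sketch, not part of any distribution's tooling: the helper name installed_versions is hypothetical, and the package list simply mirrors the names mentioned in these notes (scikit-learn is imported as sklearn).

```python
import importlib

# Import names of the packages mentioned in these notes.
packages = ["numpy", "scipy", "matplotlib", "pandas", "IPython", "sklearn"]


def installed_versions(names):
    """Return a dict mapping each package name to its version string,
    or to None if the package cannot be imported (hypothetical helper)."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


versions = installed_versions(packages)
for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print("{:<12} {}".format(name, status))
```

A package reported as NOT INSTALLED can usually be added with the distribution's own package manager or with pip.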
Jupyter Notebook
➢ It is an interactive environment for running code in the browser.
➢ It is a great tool for exploratory data analysis and is widely used by data scientists.
➢ It supports many programming languages, but only the Python support is needed here.
➢ The Jupyter Notebook makes it easy to incorporate code, text, and images.
NumPy
➢ It is one of the fundamental packages for scientific computing in Python.
➢ It contains functionality for multidimensional arrays, high-level mathematical functions such
as linear algebra operations and the Fourier transform, and pseudorandom number generators.
➢ In scikit-learn, the NumPy array is the fundamental data structure.
➢ scikit-learn takes in data in the form of NumPy arrays.
➢ Any data in another form has to be converted to a NumPy array before it is passed to scikit-learn.
➢ The core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional)
array.
➢ All elements of the array must be of the same type.
➢ Objects of the NumPy ndarray class are referred to as “NumPy arrays” or just “arrays”.
➢ A NumPy array looks like this:
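A minimal sketch, assuming NumPy is installed: the array below is built with np.array from a nested list. The values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Build a 2-dimensional array (2 rows, 3 columns) from a nested list.
x = np.array([[1, 2, 3], [4, 5, 6]])

print("x:\n{}".format(x))
print("shape: {}".format(x.shape))   # number of rows and columns
print("dtype: {}".format(x.dtype))   # all elements share one type
```

Because every element must have the same type, NumPy picks a single dtype for the whole array when it is created.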
SciPy
➢ SciPy is a collection of functions for scientific computing in Python.
➢ It provides advanced linear algebra routines, mathematical function optimization, signal
processing, special mathematical functions, and statistical distributions.
➢ For scikit-learn users, the most important part of SciPy is scipy.sparse, which provides sparse
matrices, another representation that is used for data in scikit-learn.
➢ Sparse matrices are used to store a 2D array that contains mostly zeros:
➢ It is often not possible to create dense representations of sparse data, because they would not
fit into memory, so sparse representations need to be created directly.
➢ To create the same sparse matrix, using the COO format:
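The two approaches described above can be sketched as follows, assuming NumPy and SciPy are installed. The 4x4 identity matrix is only an illustrative choice of a mostly-zero array: it is first converted from a dense array to CSR format, then built directly in COO format from its nonzero values and their row and column indices.

```python
import numpy as np
from scipy import sparse

# A dense 2D array with ones on the diagonal and zeros everywhere else.
eye = np.eye(4)

# Convert the dense array to a sparse matrix in CSR format;
# only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("CSR representation:\n{}".format(sparse_matrix))

# Build the same matrix directly in COO format, without ever
# creating the dense array: values plus (row, column) positions.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))
```

For real sparse data, the second route is the one that scales, since the dense array never has to exist in memory.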
matplotlib
➢ It is the primary scientific plotting library in Python.
➢ It provides functions for making publication-quality visualizations such as line charts,
histograms, scatter plots, and so on.
➢ Visualizing data and different aspects of analysis can give important insights.
➢ When working inside the Jupyter Notebook, figures can be shown directly in the browser by
using the %matplotlib notebook and %matplotlib inline commands.
➢ Using %matplotlib notebook provides an interactive environment.
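As a small illustration of a line chart, the sketch below plots a sine function. The function and the value range are arbitrary choices made for this example. The magic commands are left out because they only work inside a notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced values between -10 and 10.
x = np.linspace(-10, 10, 100)
# The sine of each value.
y = np.sin(x)

# Draw y against x as a line chart, marking each point with an "x".
fig, ax = plt.subplots()
(line,) = ax.plot(x, y, marker="x")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

# In a notebook with %matplotlib inline the figure is displayed
# automatically; in a plain script, call plt.show() to open it.
```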
pandas
➢ pandas is a Python library for data wrangling and analysis.
➢ It is built around a data structure called the DataFrame that is modeled after the R
DataFrame.
➢ A pandas DataFrame is a table, similar to an Excel spreadsheet.
➢ pandas provides a great range of methods to modify and operate on this table; it allows SQL-
like queries and joins of tables.
➢ pandas allows each column to have a separate type (for example, integers, dates, floating-
point numbers, and strings).
➢ pandas can ingest data from a great variety of file formats and databases, like SQL, Excel
files, and comma-separated values (CSV) files.
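A minimal sketch of a DataFrame with columns of different types and an SQL-like row selection. The names, locations, and ages below are made-up sample data, used only for illustration.

```python
import pandas as pd

# Made-up sample data: each key becomes a column, each column has its own type.
data = {
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Age": [24, 13, 53, 33],
}
data_pandas = pd.DataFrame(data)
print(data_pandas)

# Select all rows whose Age column is greater than 30,
# similar to a WHERE clause in SQL.
older_than_30 = data_pandas[data_pandas["Age"] > 30]
print(older_than_30)
```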
mglearn
➢ mglearn is a library of helper (utility) functions.
➢ For the iris flower dataset, the value of feature_names is a list of strings, giving the
description of each feature:
➢ The data itself is contained in the target and data fields. data contains the numeric
measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:
➢ The rows in the data array correspond to flowers, while the columns represent the four
measurements that were taken for each flower:
➢ Here are the feature values for the first four samples:
➢ The target array contains the species of each of the flowers that were measured, also as a
NumPy array:
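The fields described above match those of the object returned by scikit-learn's load_iris function. Assuming that is the dataset meant here (the notes do not show how it was loaded), they can be inspected as follows. The variable name iris_dataset is only a label chosen for this sketch.

```python
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn.
iris_dataset = load_iris()

# Descriptions of the four features (columns).
print("Feature names:\n{}".format(iris_dataset["feature_names"]))

# The measurements: one row per flower, one column per feature.
print("Type of data: {}".format(type(iris_dataset["data"])))
print("Shape of data: {}".format(iris_dataset["data"].shape))
print("First four rows of data:\n{}".format(iris_dataset["data"][:4]))

# The species of each flower, encoded as integers.
print("Type of target: {}".format(type(iris_dataset["target"])))
print("Shape of target: {}".format(iris_dataset["target"].shape))
print("Target:\n{}".format(iris_dataset["target"]))
```

The integer codes in target correspond, in order, to the entries of iris_dataset["target_names"].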