
3-numpy_pandas

This document is a crash course on NumPy, Pandas, and data visualization techniques for a Machine Learning course (CS 334). It covers essential concepts such as array creation, indexing, arithmetic operations in NumPy, as well as data manipulation and statistical functions in Pandas, along with visualization methods using Matplotlib and Seaborn. Additionally, it includes coding examples and exercises to reinforce learning.

Uploaded by

hokumura032

NUMPY/PANDAS/VISUALIZATION

CRASH COURSE
CS 334: Machine Learning
COURSE REMINDERS

• In-Class exercise #1 due 1/31

• Read the syllabus

• Homework #1 out and due 2/6

• Python workshops running for the first 3 weeks with M/W offerings

WORKING WITH OTHER LIBRARIES
• Import the module / library of interest

• import <library> as <something>

• from <library> import <some function or class>

• Example:

• from exercise2 import sum_list
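
A minimal sketch of the two import forms. It uses the standard-library math module as a stand-in, because the exercise2 module on the slide is course-specific and is not assumed to be available here:

```python
# Form 1: import the whole module under a short alias.
import math as m

# Form 2: import a single function (or class) by name.
from math import sqrt

# Both names refer to the same function object.
print(m.sqrt(16.0))  # accessed through the alias
print(sqrt(16.0))    # accessed directly
```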


NUMPY

• NumPy (short for Numerical Python) provides an efficient implementation for storing and operating on dense numeric data

• Useful for logical and mathematical calculations on arrays and matrices

• Much faster than Python lists for numeric operations on whole arrays

• Uses less memory than Python lists
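
A rough sketch of the memory point, assuming a hypothetical array of 1,000 64-bit integers. It does not time anything; it only checks that a vectorized NumPy expression gives the same answer as a Python loop and compares approximate storage sizes:

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Same computation two ways: an explicit Python loop vs. one vectorized call.
list_result = sum(v * v for v in py_list)
array_result = int(np.sum(np_arr * np_arr))

# Approximate memory: the array stores raw 8-byte integers, while the list
# stores pointers to separate Python int objects.
array_bytes = np_arr.nbytes
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

print(list_result == array_result)
print(array_bytes, list_bytes)
```

The exact byte counts for the list depend on the Python version, so treat them as an order-of-magnitude comparison only.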


NUMPY ARRAY AND MATRICES

• All elements of a NumPy array or matrix must have the same data type!

• Multiple ways to create arrays and matrices

• Common import: import numpy as np

Code                                 Resulting a
a = np.array([3.14, 4, 2, 3])        array([3.14, 4.  , 2.  , 3.  ])
a = np.array([[3.14, 4], [2, 3]])    array([[3.14, 4.  ],
                                            [2.  , 3.  ]])
a = np.zeros((2,2))                  array([[0., 0.],
                                            [0., 0.]])
a = np.ones(3)                       array([1., 1., 1.])
ARRAY ATTRIBUTES
• ndim - Number of array dimensions

• shape - Tuple of array dimensions

• size - Number of elements in the array

• dtype – Array element types

https://nustat.github.io/DataScience_Intro_python/NumPy.html
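
A small illustration of the four attributes on a hypothetical 2 x 3 array (deliberately not one of the exercise arrays):

```python
import numpy as np

# Hypothetical example array: 2 rows, 3 columns of floats.
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(m.ndim)   # number of dimensions
print(m.shape)  # length along each dimension
print(m.size)   # total number of elements
print(m.dtype)  # element type shared by every entry
```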
CODING EXAMPLE

• Try ndim, shape, size, dtype on the following arrays (x1-x3):
import numpy as np
rng = np.random.default_rng(334)
x1 = rng.random(4)
x2 = rng.integers(0, 20, size=(5,4))
x3 = rng.random((5,4,3))
x4 = rng.integers(0, 10, size=(5,4))

Exercise #3 involves questions related to this
NUMPY ARRAY INDEXING
• Array

• Indexing specific element: a[5]

• Indexing range of elements: a[start:stop:step], unspecified values default to start=0, stop=size of dimension, step=1

• Matrix

• Indexing dimensions is separated by comma (e.g., a[5, 4])
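
The indexing rules above can be sketched on two hypothetical arrays built with np.arange, so the values are easy to follow by eye:

```python
import numpy as np

a = np.arange(10)                 # 0, 1, ..., 9
m = np.arange(12).reshape(3, 4)   # rows: 0-3, 4-7, 8-11

elem = a[5]          # single element
stepped = a[2:8:3]   # start=2, stop=8 (exclusive), step=3
head = a[:3]         # start defaults to 0, step defaults to 1
cell = m[1, 2]       # row 1, column 2
part = m[0, 1:3]     # row 0, columns 1 and 2

print(elem, stepped, head, cell, part)
```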


CODING EXAMPLE

• For x2 from above, what do the following lines do?
• x2[:2, :]

• x2[1:4, ::2]

• x2[:, 1]
NUMPY ARITHMETIC OPERATORS

Operation        Python operator    Example

Addition         +                  x1 + x5
Subtraction      -                  x1 - x5
Division         /                  x1 / x5
Multiplication   *                  x1 * x5
Modulo           %                  x1 % x5
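
A short sketch of how these operators act element by element. The slides do not define x5, so both arrays here are hypothetical same-shape examples:

```python
import numpy as np

# Hypothetical arrays of the same shape (x5 is not defined on the slides).
x1 = np.array([7.0, 8.0, 9.0])
x5 = np.array([2.0, 4.0, 5.0])

added = x1 + x5       # element-wise addition
subtracted = x1 - x5  # element-wise subtraction
divided = x1 / x5     # element-wise true division
multiplied = x1 * x5  # element-wise (not matrix) multiplication
remainder = x1 % x5   # element-wise modulo

print(added, subtracted, divided, multiplied, remainder)
```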

NUMPY AGGREGATION FUNCTIONS

Operation               NumPy function    NaN-safe version

Sum of elements         np.sum            np.nansum
Product of elements     np.prod           np.nanprod
Mean of elements        np.mean           np.nanmean
Max of elements         np.max            np.nanmax
Min of elements         np.min            np.nanmin
Index of max value      np.argmax         np.nanargmax
Index of min value      np.argmin         np.nanargmin
All elements are true   np.all            -
Any element is true     np.any            -
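
To illustrate why the NaN-safe versions exist, here is a small sketch on a hypothetical array with one missing value:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

plain_mean = np.mean(values)        # NaN propagates into the result
safe_mean = np.nanmean(values)      # NaN skipped: (1 + 3) / 2
safe_sum = np.nansum(values)        # NaN treated as 0
safe_argmax = np.nanargmax(values)  # position of the largest non-NaN value

flags = np.array([True, False, True])
print(plain_mean, safe_mean, safe_sum, safe_argmax)
print(np.all(flags), np.any(flags))
```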

NUMPY MATRIX OPERATIONS

Operation                  NumPy function

Transpose                  a.T
Matrix multiplication      a@b or np.matmul(a,b)
Determinant of array       np.linalg.det(a)
Inverse of square matrix   np.linalg.inv(a)

There are many more useful methods! Read the NumPy documentation if you need a particular mathematical function.
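
A minimal sketch of the four matrix operations on two hypothetical 2 x 2 matrices, small enough to check by hand:

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([[1.0, 0.0],
              [2.0, 1.0]])

a_t = a.T                  # transpose
product = a @ b            # matrix product, same as np.matmul(a, b)
det_a = np.linalg.det(a)   # 2*3 - 1*1 = 5 (up to floating-point rounding)
inv_a = np.linalg.inv(a)   # inverse, so inv_a @ a is close to the identity

print(product)
print(det_a)
print(inv_a @ a)
```

Because these routines work in floating point, compare results with np.isclose or np.allclose rather than ==.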
CODING EXAMPLE

• For x1, x2, and x4 from above, what do the following lines do?
• x2*x4

• x2@x1

Exercise #3 involves questions for this example
PANDAS

• Library for computation with tabular data

• Mixed types of data allowed in a single table

• Columns and rows of data can be named

• Advanced data aggregation and statistical functions

PANDAS: SERIES + DATAFRAME

• Series is a one-dimensional object with a sequence of values

• DataFrame is a two-dimensional object (often thought of as tabular data)

https://nustat.github.io/DataScience_Intro_python/Pandas.html
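
A small sketch of the two objects, using hypothetical column names and values:

```python
import pandas as pd

# A Series: one dimension, one sequence of values (optionally named).
s = pd.Series([0.5, 1.5, 2.5], name="x1")

# A DataFrame: two dimensions, columns may hold different types.
df = pd.DataFrame({"x1": [0.5, 1.5, 2.5], "label": ["a", "b", "a"]})

# Selecting a single column of a DataFrame gives back a Series.
col = df["x1"]

print(s.ndim, df.ndim)
print(type(col).__name__)
```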
PANDAS READING FROM FILE

• Comma-separated values (CSV) files are a common file format for data

• Given the filepath, you can read a CSV file using the function read_csv

• Example:
import pandas as pd
foo = pd.read_csv("foo.csv")
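
The contents of foo.csv are not shown on the slides, so the sketch below invents a tiny table with columns x1, x2, and y (an assumption) and reads it through io.StringIO, which stands in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical file contents; the first line supplies the column names.
csv_text = "x1,x2,y\n0.5,1.0,0\n1.5,2.0,1\n2.5,3.0,0\n"

# read_csv accepts a path or any file-like object.
foo = pd.read_csv(io.StringIO(csv_text))
print(foo)
```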
PANDAS ATTRIBUTES

• columns – Column labels of dataframe

• index – Row labels or indices

• shape – Number of rows and columns in dataframe

• size - Number of elements in the dataframe

• dtypes – Data types of the columns

PANDAS METHODS

• head(n) – Returns the first n rows (n defaults to 5)

• tail(n) – Returns the last n rows (n defaults to 5)

• describe() – Summary statistics of the dataframe
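
A brief sketch of these attributes and methods on a hypothetical four-row dataframe (the column names x1, x2, and y are assumptions, not the iris data):

```python
import pandas as pd

foo = pd.DataFrame({
    "x1": [0.5, 1.5, 2.5, 3.5],
    "x2": [1.0, 2.0, 3.0, 4.0],
    "y": [0, 1, 0, 1],
})

print(list(foo.columns))  # column labels
print(foo.index)          # row labels
print(foo.shape)          # (rows, columns)
print(foo.size)           # rows * columns
print(foo.dtypes)         # one dtype per column

first_two = foo.head(2)   # first 2 rows as a new dataframe
last_one = foo.tail(1)    # last row as a new dataframe
summary = foo.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```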

CODING EXAMPLE

• Read in the file iris.csv (Exercises -> Ex 3)

• How many rows and columns are there?

• Print the first 5 rows

• What are the summary statistics of the dataframe?
PANDAS STATISTICAL CALCULATIONS
Operation                                    pandas method
Sum of series                                a.sum()
Product of series                            a.prod()
Mean of series                               a.mean()
Median of series                             a.median()
Standard deviation                           a.std()
Max of series                                a.max()
Min of series                                a.min()
Index (integer position) of max of series    a.argmax()
Index (integer position) of min of series    a.argmin()
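
A quick sketch of these methods on a hypothetical four-value Series. One detail worth knowing: pandas std() uses the sample formula (divides by n - 1) by default, whereas np.std divides by n.

```python
import math
import numpy as np
import pandas as pd

a = pd.Series([2.0, 4.0, 4.0, 10.0])

total = a.sum()
product = a.prod()
mean = a.mean()
median = a.median()
sample_std = a.std()                   # ddof=1 by default in pandas
population_std = np.std(a.to_numpy())  # ddof=0 by default in NumPy
pos_max = a.argmax()                   # integer position of the maximum
pos_min = a.argmin()                   # integer position of the minimum

print(total, product, mean, median, sample_std, population_std)
print(a.max(), a.min(), pos_max, pos_min)
```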

PANDAS INDEXING
• loc – Use axis labels (row labels and column names) to subset data

• foo.loc[0:1, "x1"]

• foo.loc[3:5, ["x1", "y"]]

• iloc – Use position of rows and columns (i.e., integers)

• foo.iloc[0:1, 0]

• foo.iloc[3:5, [0,2]]

Main difference is loc uses the names (these can be integers) and iloc must be integers!
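
The difference is easiest to see when the row labels are integers that do not match the positions. The sketch below assumes a hypothetical dataframe with row labels 10 to 13. Note also that a loc slice includes its end label, while an iloc slice excludes its end position:

```python
import pandas as pd

foo = pd.DataFrame(
    {"x1": [0.1, 0.2, 0.3, 0.4],
     "x2": [1.0, 2.0, 3.0, 4.0],
     "y": [0, 1, 0, 1]},
    index=[10, 11, 12, 13],  # integer labels that differ from positions 0-3
)

by_label = foo.loc[10:11, "x1"]          # labels 10 and 11 (end included)
by_position = foo.iloc[0:1, 0]           # position 0 only (end excluded)

label_block = foo.loc[12:13, ["x1", "y"]]  # rows by label, columns by name
position_block = foo.iloc[2:4, [0, 2]]     # same cells, selected by position

print(by_label)
print(by_position)
```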
PANDAS SUBSETTING

• Use comparison to find samples with specific values in columns

• foo.loc[foo["y"] == 0, :]

• foo.loc[foo["x1"] < 1, :]

• Finding the samples with the largest/smallest value in a column

• foo.iloc[foo["x1"].argmax(), :]

• foo.iloc[foo["x2"].argmin(), :]
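
A sketch of both patterns on a hypothetical dataframe (values invented for illustration). The comparison produces a boolean Series, which loc uses as a row mask; argmax and argmin return integer positions, which is why they are paired with iloc:

```python
import pandas as pd

foo = pd.DataFrame({
    "x1": [0.5, 1.5, 2.5, 0.2],
    "x2": [4.0, 3.0, 1.0, 2.0],
    "y": [0, 1, 0, 1],
})

mask = foo["y"] == 0                # boolean Series, one entry per row
class_zero = foo.loc[mask, :]       # rows where y equals 0
small_x1 = foo.loc[foo["x1"] < 1, :]

largest_x1 = foo.iloc[foo["x1"].argmax(), :]   # row with the largest x1
smallest_x2 = foo.iloc[foo["x2"].argmin(), :]  # row with the smallest x2

print(class_zero)
print(largest_x1)
```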

CODING EXAMPLE
• Given iris dataframe

• What is the median of the petal length?

• What is the mean of the sepal width for Virginica species samples?

• What is the standard deviation of the sepal length for Setosa species samples?

Exercise #3 involves questions for this example


VISUALIZATION

• 3 common ways to visualize data

• Matplotlib – low-level graph plotting library

• Pandas (via matplotlib)

• Seaborn (via matplotlib) – high-level plotting library

Many other libraries are available; see this article for a top 10:
https://www.projectpro.io/article/python-data-visualization-libraries/543
COMPONENTS OF MATPLOTLIB

• The Figure object is the outermost container

• An Axes corresponds to an individual plot/graph

• Lines, tick marks, text boxes, and legends are drawn within an Axes

https://matplotlib.org/stable/gallery/showcase/anatomy.html
MATPLOTLIB: USEFUL METHODS

• import matplotlib.pyplot as plt

• plt.show() – Display all open figures

• plt.savefig(filename) – Save the current figure to the specified filename
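
A minimal sketch tying the Figure/Axes vocabulary to savefig. Two assumptions are made so it runs anywhere: the non-interactive Agg backend is selected, and an in-memory buffer stands in for a filename on disk.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots()         # Figure = outer container, Axes = one plot
ax.plot([0, 1, 2], [0, 1, 4])    # adds one line to the Axes
ax.set_title("Hypothetical example")

buffer = io.BytesIO()            # stand-in for a path such as "plot.png"
plt.savefig(buffer, format="png")  # saves the current figure
plt.close(fig)                   # release the figure once it is saved
```

With a real path, plt.savefig("plot.png") infers the format from the file extension. In an interactive session, call plt.show() instead of (or after) saving.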

MATPLOTLIB SCATTERPLOT
• Use pyplot module to make plots

• scatter() makes the scatterplot

• color code points by setting c to the appropriate categorical variable

• set_xlabel(), set_ylabel(), set_title() label your plot

• legend() is necessary to plot the legend explicitly


MATPLOTLIB FOO SCATTERPLOT

• import matplotlib.pyplot as plt
fig, ax = plt.subplots()
scatter = ax.scatter(foo['x1'], foo['x2'], c=foo['y'])
# produce a legend with the unique colors from the scatter
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper left", title="Classes")
ax.add_artist(legend1)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title("Foo x1 vs x2 scatterplot")
# if in terminal need to show
plt.show()

MATPLOTLIB BOXPLOT

• boxplot() makes a boxplot (demonstrates locality, spread, and skewness of numerical data through their quartiles)

https://matplotlib.org/stable/gallery/statistics/boxplot_demo.html
MATPLOTLIB FOO BOXPLOT
• import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data = [foo.loc[foo['y'] == 0, 'x1'], foo.loc[foo['y'] == 1, 'x1']]
ax.boxplot(data)
ax.set_title('Boxplot of x1 distribution based on y')
ax.set_xticklabels([0, 1])
ax.set_xlabel("y")
ax.set_ylabel("x1")
# if in terminal need to show
plt.show()

PANDAS PLOTTING
• Provides a mechanism to generate plots directly from dataframes

• Call dataframe.plot(kind=<method>) or dataframe.plot.<method>

• box()

• scatter()

• hist()
PANDAS FOO SCATTERPLOT
• import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
colors = {0: 'orange', 1: 'purple'}
color_list = [colors[group] for group in foo['y']]
# Create a scatter plot with color-coding based on the categorical variable y
ax = foo.plot.scatter('x1', 'x2', c=color_list)
# Create legend handles and labels for each group and add the legend to the plot
legend_handles = [
    mpatches.Patch(color=colors[0], label=0),
    mpatches.Patch(color=colors[1], label=1)]
ax.legend(handles=legend_handles, loc='upper left')
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_title("Foo x1 vs x2 scatterplot")
# if in terminal need to show
plt.show()
PANDAS FOO BOXPLOT

• foo.boxplot(column='x1', by='y')
# if in terminal need to show
plt.show()

SEABORN PLOTTING

• A high-level interface for drawing attractive and informative statistical graphics

• Call different methods to obtain different types of plots

• boxplot()

• scatterplot()

SEABORN FOO SCATTERPLOT

• import seaborn as sns
snsax = sns.scatterplot(data=foo, x='x1', y='x2', hue='y')
snsax.set(title="Foo x1 vs x2 scatterplot")
plt.show()

SEABORN FOO BOXPLOT

• snsax = sns.boxplot(data=foo, x="y", y="x1")
snsax.set(title='Boxplot of x1 distribution based on y')
plt.show()

CODING EXAMPLE
• Given iris dataframe

• Generate a boxplot of the petal length for the iris dataset where the x-axis contains the species label and the y-axis is the petal length.

• Generate a scatterplot of the petal length versus petal width for the iris dataset where the x-axis contains the petal length and the y-axis is the petal width. Each point should be colored according to the species label.

Exercise #3 involves questions for this example
