Python Libraries
Python – Lambda
A lambda function can take any number of arguments, but can only have
one expression.
x = lambda a : a + 10
print(x(5))
Output:
15
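Because a lambda is limited to a single expression, it tends to fit best where a short throwaway function is needed, such as a sort key. The sketch below is only an illustration: the names words, by_length and add_ten are made up for this example and are not part of the slides.

```python
# Illustrative names only; not taken from the slides.
words = ["banana", "kiwi", "apple", "fig"]

# The lambda's single expression computes the sort key for each word.
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# The same function as the slide's lambda, written with def for comparison.
def add_ten(a):
    return a + 10

x = lambda a: a + 10
print(x(5) == add_ten(5))
```

Anything that needs several statements is usually clearer as a regular def function.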
x = lambda a, b, c : a + b + c
print(x(5, 6, 2))
Output:
13
Example
Multiply argument a with argument b and return the result:
x = lambda a, b : a * b
print(x(5, 6))
Output:
30
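A two-argument lambda like the one above can also be handed to a function that expects a callable. As a small sketch, functools.reduce from the standard library applies it cumulatively to a list; the list values are made up for illustration.

```python
from functools import reduce

numbers = [5, 6, 2]  # illustrative values

# reduce applies the two-argument lambda cumulatively: (5 * 6) * 2
product = reduce(lambda a, b: a * b, numbers)
print(product)
```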
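A lambda can also be returned from another function. Here, myfunc returns a lambda that multiplies its argument by n:
def myfunc(n):
    return lambda a : a * n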
Use that function definition to make a function that always doubles the
number you send in:
def myfunc(n):
    return lambda a : a * n
mydoubler = myfunc(2)
print(mydoubler(11))
Output:
22
Or, use the same function definition to make both functions, in the
same program:
def myfunc(n):
    return lambda a : a * n
mydoubler = myfunc(2)
mytripler = myfunc(3)
print(mydoubler(11))
print(mytripler(11))
Output:
22
33
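Each call to myfunc returns a separate lambda that remembers its own value of n. A minimal sketch of that idea, building several multipliers from the same definition; the dictionary and its labels are hypothetical and only used for illustration.

```python
def myfunc(n):
    # Each call returns a new lambda that remembers its own n.
    return lambda a: a * n

# Hypothetical table of multipliers built from the same factory function.
multipliers = {
    name: myfunc(n)
    for name, n in [("double", 2), ("triple", 3), ("quadruple", 4)]
}

results = {name: f(11) for name, f in multipliers.items()}
print(results)
```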
NumPy
Installation of NumPy
Once NumPy is installed (for example with pip install numpy), import it:
import numpy
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # example values, assumed here
print(arr)
print(type(arr))
import numpy as np
# using a tuple
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Output:
[1 2 3 4 5]
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested
arrays).
• 0-D arrays, or scalars, are the elements in an array.
Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
print(type(arr))
Output:
42
<class 'numpy.ndarray'>
1-D Arrays
An array that has 0-D arrays as its elements is called a
uni-dimensional or 1-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # example values, assumed here
print(arr)
print(type(arr))
2-D Arrays
• An array that has 1-D arrays as its elements is
called a 2-D array.
• These are often used to represent matrices or 2nd-order
tensors.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Output:
[[1 2 3]
 [4 5 6]]
Check the number of dimensions of an array with the ndim attribute:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]],
              [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Output:
0
1
2
3
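Alongside ndim, the shape attribute reports the size along each dimension, which can help when checking how deeply an array is nested. The sketch below also uses the ndmin argument of np.array; the array e is an extra example that is not in the slides.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]])
print(c.ndim)   # number of dimensions
print(c.shape)  # size along each dimension

# ndmin pads the shape with leading dimensions of size 1.
e = np.array([1, 2, 3, 4], ndmin=5)
print(e.ndim, e.shape)
```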
Access an array element by its index number. Indexes start at 0.
import numpy as np
arr = np.array([1, 2, 3, 4])  # example values, assumed here
print(arr[0])
print(arr[2] + arr[3])
To access an element of a 2-D array, give the row index and then the element index, separated by a comma:
import numpy as np
arr = np.array([[1,2,3,4,5],
                [6,7,8,9,10]])
print('2nd element on 1st dim:', arr[0, 1])
print('5th element on 2nd dim:', arr[1, 4])
Output:
2nd element on 1st dim: 2
5th element on 2nd dim: 10
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Output:
6
Example Explained
The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
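The explanation above can be followed one index at a time. This is a small sketch using the same array; the intermediate names first, second and third are made up for illustration.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

first = arr[0]      # first dimension:  [[1, 2, 3], [4, 5, 6]]
second = first[1]   # second dimension: [4, 5, 6]
third = second[2]   # third dimension:  6

# Indexing step by step gives the same element as arr[0, 1, 2].
print(third == arr[0, 1, 2])
```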
Negative Indexing
Use negative indexing to access an array from the end.
import numpy as np
arr = np.array([[1,2,3,4,5],
                [6,7,8,9,10]])
print('Last element from 2nd dim:', arr[1, -1])
Output:
Last element from 2nd dim: 10
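Negative indexes can be used in any position of the index. A short sketch on the same array follows; the variable names are made up for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

last_of_second = arr[1, -1]   # same element as arr[1, 4]
last_of_last = arr[-1, -1]    # -1 also selects the last row
second_to_last = arr[0, -2]   # counts back from the end of the first row

print(last_of_second, last_of_last, second_to_last)
```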
Pandas
Installation of Pandas
Once Pandas is installed (for example with pip install pandas), import it:
import pandas
Example
import pandas as pd
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Output:
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0    1
1    7
2    2
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar[0])
Output:
1
Create Labels:
With the index argument, you can name your own labels.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Output:
x    1
y    7
z    2
You can access an item by referring to its label:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar["y"])
Output:
7
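When a Series has custom labels, it can help to be explicit about whether a lookup is by label or by position. A minimal sketch using the loc and iloc accessors on the same Series; the names by_label and by_position are made up for illustration.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]     # label-based lookup
by_position = myvar.iloc[1]   # position-based lookup (second item)

print(by_label, by_position)
```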
A dictionary can also be used to create a Series. The keys become the labels:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
Output:
day1    420
day2    380
day3    390
To select only some of the items in the dictionary, list their keys in the index argument:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
Output:
day1    420
day2    380
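If the index argument names a key that is not in the dictionary, Pandas keeps the label and fills it with a missing value (NaN). A small sketch of that behaviour; "day4" is a made-up label used only for illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a made-up label that is not a key of the dictionary.
myvar = pd.Series(calories, index=["day1", "day4"])
print(myvar)
```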
DataFrames
• Data sets in Pandas are usually multi-dimensional tables,
called DataFrames.
• A Series is like a column; a DataFrame is the whole table.
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45
• A DataFrame is like a table with rows and columns.
• Pandas uses the loc attribute to return one or more
specified rows.
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df.loc[0])
Output:
calories    420
duration     50
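The loc attribute returns a Series for a single row label and a DataFrame for a list of labels. A short sketch on the same data; the names row and rows are made up for illustration.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

row = df.loc[0]         # single label -> Series
rows = df.loc[[0, 1]]   # list of labels -> DataFrame

print(type(row).__name__)
print(type(rows).__name__)
print(rows)
```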
Named Indexes
• With the index argument, you can name your own
indexes.
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Output:
      calories  duration
day1       420        50
day2       380        40
day3       390        45
Use the named index in the loc attribute to return the specified row:
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df.loc["day2"])
Dictionary as JSON
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
– Empty cells
– Data in wrong format
– Wrong data
– Duplicates
Data Set
Empty Cells
• One way to deal with empty cells is to remove rows that
contain empty cells.
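import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()  # returns a new DataFrame without the rows that contain empty cells
print(new_df.to_string())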
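To see the effect without a CSV file, here is a minimal sketch on a small made-up DataFrame, where np.nan stands in for an empty cell. The values are invented for illustration and are not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Small made-up data set; np.nan marks an empty cell.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, np.nan, 340.0],
})

new_df = df.dropna()   # returns a new DataFrame; df itself is unchanged
print(new_df)
print(len(df), len(new_df))
```

By default dropna() leaves the original DataFrame untouched, so both versions stay available.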
Another way to deal with empty cells is to replace them with a value, using the fillna() method:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
print(df.to_string())
Output (first rows):
   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     130.0
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
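To replace empty values in one column only, pass a dictionary that maps the column name to the value:
import pandas as pd
df = pd.read_csv('data.csv')
# df["Calories"].fillna(130, inplace = True) may not modify df in recent Pandas versions
df.fillna({"Calories": 130}, inplace = True)
print(df.to_string())
Output (first rows):
   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     130.0
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0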
# x is a replacement value computed beforehand, for example the mean of the column
df.fillna({"Calories": x}, inplace = True)
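One common choice for the replacement value is the mean of the column. The slides do not say how x is computed, so the sketch below is an assumption, shown on made-up values rather than on data.csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Calories": [400.0, np.nan, 300.0]})  # made-up values

x = df["Calories"].mean()          # NaN is skipped, so this is (400 + 300) / 2
df = df.fillna({"Calories": x})    # assign the result instead of using inplace
print(df)
```

The median() and mode() methods could be used in the same way if they suit the data better.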
Print the whole DataFrame with the to_string() method:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
To fix wrong data, loop through the rows and set every "Duration" value above 120 to 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
print(df.to_string())
Output (first rows):
   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
Alternatively, remove the rows where "Duration" is above 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())
Output (first rows):
   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
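The row-by-row loops above can usually be written without a loop by using a boolean mask. This is a sketch of that alternative on made-up values, not the approach the slides use; the names capped and kept are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 130]})  # made-up values

# Cap: the boolean mask selects the rows, .loc assigns to them.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Drop: keep only the rows that pass the condition.
kept = df[df["Duration"] <= 120]

print(capped["Duration"].tolist())
print(kept["Duration"].tolist())
```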
Discovering Duplicates
• Duplicate rows are rows that have been registered more
than once.
• The duplicated() method returns True for every row that is a duplicate, and False otherwise.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())
Output:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
...
Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
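A minimal sketch of both duplicate methods on a small made-up DataFrame, so the result can be checked without data.csv. The rows are invented for illustration; the third row repeats the first.

```python
import pandas as pd

# Made-up rows; the row at index 2 repeats the row at index 0.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 109, 110],
})

flags = df.duplicated()          # True for each repeat of an earlier row
deduped = df.drop_duplicates()   # returns a new DataFrame without the repeats
print(flags.tolist())
print(deduped)
```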
Finding Relationships
• A great aspect of the Pandas module is the corr() method.
• The corr() method calculates the relationship between
each column in your data set.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.corr())
Output:
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.059452 -0.250033  0.344341
Pulse    -0.059452  1.000000  0.269672  0.481791
Maxpulse -0.250033  0.269672  1.000000  0.335392
Calories  0.344341  0.481791  0.335392  1.000000
• The corr() method only works on numeric columns. In Pandas 2.0 and later, pass numeric_only = True so that non-numeric columns are ignored.
Results Explained
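Correlation values range from -1 to 1: 1 means the two columns rise together perfectly, -1 means one falls exactly as the other rises, and values near 0 mean little linear relationship. The sketch below uses made-up columns constructed to hit both extremes; the column "Rest" and all values are hypothetical.

```python
import pandas as pd

# Made-up values chosen so the relationships are exact.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150, 225, 300, 450],   # exactly 5 * Duration
    "Rest": [90, 75, 60, 30],           # exactly 120 - Duration
    "Date": ["d1", "d2", "d3", "d4"],   # non-numeric column
})

corr = df.corr(numeric_only=True)   # skips the non-numeric Date column
print(corr)
```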
Pandas - Plotting
Plotting
• Pandas uses the plot() method to create diagrams.
• We can use Pyplot, a submodule of the Matplotlib library, to
visualize the diagram on the screen.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Scatter Plot
• Specify that you want a scatter plot with the kind argument:
• kind = 'scatter'
• A scatter plot needs an x-axis and a y-axis.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Remember: In the previous example, we learned that the correlation between "Duration" and "Calories" was 0.922721, and we concluded that a higher duration means more calories burned.
• Let's create another scatter plot, where there is almost no relationship between the
columns, like "Duration" and "Maxpulse", with the correlation 0.009403:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
Histogram
• Use the kind argument to specify that you want a
histogram:
• kind = 'hist'
• We will use the "Duration" column to create the histogram.
df["Duration"].plot(kind = 'hist')
plt.show()
The histogram tells us that there were over 100 workouts that lasted
between 50 and 60 minutes.
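A histogram is a count of how many values fall into each interval (bin). As a rough sketch of the counting behind such a plot, NumPy's histogram function returns the counts directly; the durations and bin edges below are made up for illustration and do not come from data.csv.

```python
import numpy as np

durations = np.array([45, 52, 55, 58, 60, 30, 20, 59])  # made-up minutes

# Count how many values fall in each 10-minute bin from 20 to 70.
# Each bin includes its left edge; only the last bin includes its right edge.
counts, edges = np.histogram(durations, bins=[20, 30, 40, 50, 60, 70])
print(counts)
print(edges)
```

Reading the counts next to the bin edges is the same kind of statement as the one above, for example "four workouts lasted between 50 and 60 minutes" for this sample.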