PYTHON FOUNDATION FOR DATA SCIENCE
import numpy as np
data = {i : np.random.randn() for i in range(7)}
data
Running the Jupyter Notebook
To start up Jupyter, run the command jupyter notebook in a terminal:
Depending on your installation, you will see some startup output in the terminal.
You will then be redirected to the Jupyter landing page in your browser.
To create a new notebook, click the New button and select the “Python 3” or “conda
[default]” option. You should see something like this. If this is your first time,
try clicking on the empty code “cell” and entering a line of Python code. Then press
Shift-Enter to execute it.
When you save the notebook (see “Save and Checkpoint” under the notebook File
menu), it creates a file with the extension .ipynb. This is a self-contained file format
that contains all the content (including any evaluated code output) currently in the
notebook. These can be loaded and edited by other Jupyter users.
To load an existing notebook, put the file in the same directory where you started the
notebook process (or in a subfolder within it), then double-click the name from the landing
page.
Data Structures and Sequences
Tuple: A tuple is a fixed-length, immutable sequence of Python objects. The easiest way to
create one is with a comma-separated sequence of values, with or without parentheses:
tup = 4, 5, 6
tup = (4, 5, 6)
When you’re defining tuples in more complicated expressions, it’s often necessary to
enclose the values in parentheses, as in this example of creating a tuple of tuples:
nested_tup = (4, 5, 6), (7, 8)
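- Note that an element of a tuple can be any Python object, including a scalar or another sequence.
- Tuples are immutable: once created, you cannot change which object is stored in each position.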
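The two notes above can be checked with a short sketch. The names first, inner, changed and mixed are illustrative and not part of the original notes.

```python
tup = (4, 5, 6)
nested_tup = (4, 5, 6), (7, 8)

first = tup[0]          # elements are accessed by position
inner = nested_tup[1]   # an element can itself be a tuple

# Assigning to a position raises TypeError because tuples are immutable.
try:
    tup[0] = 10
    changed = True
except TypeError:
    changed = False

# A mutable object stored inside a tuple can still be modified in place.
mixed = ('foo', [1, 2], True)
mixed[1].append(3)

print(first, inner, changed, mixed)
```

The tuple itself never changes size or the objects it points to; only the list held inside mixed grows.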
List: Lists are variable-length and their contents can be modified in place.
- You can define a list using square brackets [] or using the list type function:
score_list = [70, 60, 60, None]
grade_tup = ('A', 'B', 'C')
grade_list = list(grade_tup)
You can append to the end of a list or insert at a given position:
grade_list.append('E')
grade_list.insert(3, 'D')
print(grade_list)
['A', 'B', 'C', 'D', 'E']
Other list operations include:
.extend(other_list)   # append all the elements of another list
.pop(index)           # remove and return the value at the given index
.remove('item')       # remove the first occurrence of the given item
'item' in some_list   # True if the item is in the list, otherwise False
Lists also support slicing and work with the built-in functions sorted, zip and reversed.
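As a rough illustration of the operations listed above, the sketch below applies them to the grade list from the earlier example. The scores passed to zip are made-up values.

```python
grade_list = ['A', 'B', 'C', 'D', 'E']

grade_list.extend(['F', 'A'])    # append every element of another list
removed = grade_list.pop(5)      # remove and return the value at index 5
grade_list.remove('A')           # removes only the first 'A'
has_b = 'B' in grade_list        # membership test

first_two = grade_list[:2]                   # slicing returns a new list
ordered = sorted(grade_list)                 # sorted returns a new, sorted list
backwards = list(reversed(grade_list))       # reversed returns an iterator
pairs = list(zip(grade_list, [70, 60, 50]))  # zip stops at the shorter input

print(grade_list, removed, has_b)
print(first_two, ordered, backwards, pairs)
```

Note that sorted and reversed leave grade_list unchanged, while extend, pop and remove modify it in place.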
DICT: The dict is likely the most important built-in Python data structure.
It is a flexibly sized collection of key-value pairs, where key and value are Python objects. For example,
record = {'Level': 100, 'Sex': 'M', 'Programme': 'CSC', 'Entry_Year': 2023}  # a dict holding a student record
You can access, insert, or set elements using the same syntax as for accessing elements
of a list or tuple.
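A minimal sketch of accessing, setting and inserting dict elements, using the student record as an example. The 'State' and 'CGPA' keys are hypothetical and only serve the illustration.

```python
record = {'Level': 100, 'Sex': 'M', 'Programme': 'CSC', 'Entry_Year': 2023}

level = record['Level']         # access a value by its key
record['Level'] = 200           # set the value of an existing key
record['State'] = 'Bauchi'      # assigning to a new key inserts it
has_sex = 'Sex' in record       # membership tests look at the keys
missing = record.get('CGPA')    # get returns None instead of raising KeyError
keys = list(record.keys())      # keys are kept in insertion order

print(level, record['Level'], has_sex, missing, keys)
```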
SET
NUMPY
NumPy is a foundational package for numerical computing in Python. Most computational
packages providing scientific functionality use NumPy’s array objects as the lingua franca
for data exchange.
The NumPy ndarray, a multidimensional array object, is a fast, flexible container for large
datasets in Python. Arrays enable you to perform mathematical operations on whole
blocks of data using similar syntax to the equivalent operations between scalar elements.
Run the following code and explain what happens
import numpy as np
data = np.random.randn(2, 3)
print(data)
data + data
An ndarray is a generic multidimensional container for homogeneous data (that is, all of
the elements must be the same type).
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an
object describing the data type of the array:
print(data.shape)
print(data.dtype)
Creating ndarrays
The easiest way to create an array is to use the array function. For example,
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty
creates an array without initializing its values:
np.zeros(10)
np.zeros((3, 6))
np.empty((2, 3, 2))
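The sketch below puts the creation functions side by side and inspects the results. It assumes NumPy is installed; the names zeros, ones and empty used as variables are illustrative.

```python
import numpy as np

arr1 = np.array([6, 7.5, 8, 0, 1])             # ints mixed with a float become float64
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # nested lists give a 2-D array

zeros = np.zeros((3, 6))      # pass a tuple for more than one dimension
ones = np.ones(4)
empty = np.empty((2, 3, 2))   # contents are uninitialised; only the shape is meaningful

print(arr1.dtype, arr2.shape, arr2.ndim)
print(zeros.shape, ones, empty.shape)
```

Because np.empty does not fill the array, its values should not be read before they are assigned.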
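You can carry out arithmetic on numerical ndarrays without writing loops; operations between
equal-size arrays are applied element-wise. E.g.
arr1 + arr1   # element-wise sum of the two arrays
arr1 - arr1   # array of all zeros
arr1 * arr1   # square of each element
arr1 > 5      # True where an element of arr1 is greater than 5, otherwise False
arr1 * 3      # each element multiplied by 3
Comparing two arrays, as in arr1 > arr2, is also element-wise, but the arrays must have
compatible shapes; arr1 and arr2 as defined above do not.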
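To make the element-wise behaviour concrete, here is a small sketch with two arrays of the same shape. The arrays arr and other are made-up values chosen so the results are easy to check by hand.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
other = np.array([[0.0, 2.0, 4.0], [7.0, 2.0, 12.0]])

total = arr + arr      # element-wise sum
squared = arr * arr    # element-wise product, i.e. each element squared
scaled = arr * 3       # the scalar is applied to every element
greater = other > arr  # boolean array of the same shape

print(total)
print(squared)
print(scaled)
print(greater)
```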
Mathematical and Statistical Methods
- A set of mathematical functions that compute statistics about an entire array or
about the data along an axis are accessible as methods of the array class.
- You can use aggregations like sum, mean, and std (standard deviation) either by
calling the array instance method or using the top-level NumPy function.
Here we generate some normally distributed random data and compute some aggregate statistics:
data = np.random.randn(5, 4)
data.mean()
np.mean(data)
data.sum()
Functions like mean and sum take an optional axis argument that computes the statistic
over the given axis, resulting in an array with one fewer dimension:
data.mean(axis=1)   # mean of each row (computed across the columns)
data.sum(axis=0)    # sum of each column (computed down the rows)
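Since random data makes the axis argument hard to verify by eye, the sketch below uses a small fixed array instead. The array values are made up for the illustration.

```python
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

row_means = data.mean(axis=1)  # one value per row: [2.5, 6.5]
col_sums = data.sum(axis=0)    # one value per column: [6, 8, 10, 12]
total = data.sum()             # no axis: reduce over the whole array
overall = np.mean(data)        # top-level function, same result as data.mean()

print(row_means, col_sums, total, overall)
```

The 2 x 4 input gives a length-2 result for axis=1 and a length-4 result for axis=0, which is the "one fewer dimension" described above.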
Pandas
pandas contains data structures and data manipulation tools designed to make data
cleaning and analysis fast and easy in Python. pandas adopts significant parts of
NumPy’s style of array-based computing, especially array-based functions and a
preference for data processing without for loops. The biggest difference is that pandas is
designed for working with tabular or heterogeneous data.
Throughout the rest of this section, I use the following import convention for pandas:
import pandas as pd
Thus, whenever you see pd. in code, it’s referring to pandas.
pandas Data Structures
To get started with pandas, you will need to understand its two workhorse data structures:
Series and DataFrame.
Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest series is formed from only an array of data:
import pandas as pd
# Alternatively: from pandas import Series, DataFrame
obj = pd.Series([4, 7, -5, 3])
print(obj)
0 4
1 7
2 -5
3 3
dtype: int64
# Explicitly specify the index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
d 4
b 7
a -5
c 3
dtype: int64
# Get the values only
print(obj.values)
[ 4  7 -5  3]
# Get the index only
print(obj2.index)
Index(['d', 'b', 'a', 'c'], dtype='object')
With a Series you can access a single value or a set of values by index label or by a list of
labels. You can also carry out the same mathematical and logical operations as in NumPy.
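A brief sketch of label-based access and NumPy-style operations on the Series obj2 defined earlier. The variable names one, subset, positive and doubled are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

one = obj2['a']                 # a single value by label
subset = obj2[['c', 'a', 'd']]  # several values by a list of labels
obj2['d'] = 6                   # set a value by label

positive = obj2[obj2 > 0]       # boolean filtering keeps the index labels
doubled = obj2 * 2              # scalar arithmetic is applied element-wise

print(one)
print(subset)
print(positive)
print(doubled)
```

In each result the link between a label and its value is preserved.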
You can create a Series from data held in a dict by passing the dict:
stud_pop_data = {'Bauchi': 3000, 'Gombe': 2000, 'Kano': 1600}
stud_pop_series = pd.Series(stud_pop_data)
print(stud_pop_series)
Bauchi 3000
Gombe 2000
Kano 1600
dtype: int64
You can pass an index to select which keys, and in what order, appear in the Series:
states = ['Bauchi', 'Kano', 'Plateau']
stud_pop = pd.Series(stud_pop_data, index=states)
print(stud_pop)
Bauchi 3000.0
Kano 1600.0
Plateau NaN
dtype: float64
Here, two values found in states were placed in the appropriate locations, but since no
value for 'Plateau' was found, it appears as NaN (not a number), which pandas uses
to mark missing or NA values. And since 'Gombe' was not included in states, it is
excluded from the resulting stud_pop object.
The isnull and notnull functions in pandas should be used to detect missing data:
pd.isnull(stud_pop)
Bauchi False
Kano False
Plateau True
dtype: bool
pd.notnull(stud_pop)
Bauchi True
Kano True
Plateau False
dtype: bool
One other useful Series feature is that it automatically aligns by index label in arithmetic
operations. Check this yourself by adding stud_pop_series and stud_pop.
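One way to check the alignment behaviour is to add the two Series built above. This is a minimal sketch; the name combined is illustrative.

```python
import pandas as pd

stud_pop_data = {'Bauchi': 3000, 'Gombe': 2000, 'Kano': 1600}
stud_pop_series = pd.Series(stud_pop_data)
stud_pop = pd.Series(stud_pop_data, index=['Bauchi', 'Kano', 'Plateau'])

# Values are matched by label, not by position. A label missing from
# either operand gives NaN in the result.
combined = stud_pop_series + stud_pop
print(combined)
```

'Bauchi' and 'Kano' appear in both Series and are summed, while 'Gombe' and 'Plateau' each lack a value on one side and therefore come out as NaN.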
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index.
There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:
import pandas as pd
data = {'state': ['Bauchi', 'Gombe', 'Plateau', 'Kano'], 'stud': [6000, 2001, 3102, 880]}
d_frame = pd.DataFrame(data)
print(d_frame)
state stud
0 Bauchi 6000
1 Gombe 2001
2 Plateau 3102
3 Kano 880
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:
print(d_frame['stud'])
0 6000
1 2001
2 3102
3 880
Name: stud, dtype: int64
A row can also be retrieved by specifying its row label with the loc attribute, as follows:
print(d_frame.loc[2])
state Plateau
stud 3102
Name: 2, dtype: object
Columns can be modified by assignment, e.g.:
d_frame['stud'] = 0
print(d_frame['stud'])
0 0
1 0
2 0
3 0
Name: stud, dtype: int64
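The retrieval and assignment operations above can be combined in one short sketch. The new values and the 'large' column name are made up for the illustration.

```python
import pandas as pd

data = {'state': ['Bauchi', 'Gombe', 'Plateau', 'Kano'],
        'stud': [6000, 2001, 3102, 880]}
d_frame = pd.DataFrame(data)

by_key = d_frame['stud']   # dict-like notation
by_attr = d_frame.stud     # attribute notation; needs a valid Python name
row = d_frame.loc[2]       # a row, selected by its index label

# Assigning a list replaces the column; its length must match the number of rows.
d_frame['stud'] = [10, 20, 30, 40]
# Assigning to a name that does not exist yet creates a new column.
d_frame['large'] = d_frame['stud'] > 20

print(row)
print(d_frame)
```

Assigning a scalar, as in d_frame['stud'] = 0, sets every row to that value, whereas a list or array sets the rows one by one.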
Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods, and
these methods have built-in handling for missing data.
Consider the following DataFrame:
import numpy as np
import pandas as pd
score = pd.DataFrame([[25.00, np.nan], [20.50, 35.5], [np.nan, np.nan], [0.5, 12.5]],
                     index=['CSCU230001', 'CSCU230002', 'CSCU230003', 'CSCU230004'],
                     columns=['CA', 'EXAM'])
>>> score
CA EXAM
CSCU230001 25.0 NaN
CSCU230002 20.5 35.5
CSCU230003 NaN NaN
CSCU230004 0.5 12.5
Calling DataFrame’s sum method returns a Series containing column sums:
>>> score.sum()
CA 46.0
EXAM 48.0
dtype: float64
Passing axis='columns' or axis=1 sums across the columns instead:
>>> score.sum(axis='columns')
CSCU230001 25.0
CSCU230002 56.0
CSCU230003 0.0
CSCU230004 13.0
dtype: float64
>>> score.mean()
CA 15.333333
EXAM 24.000000
dtype: float64
>>> score.mean(skipna=False)  # do not skip NaN: a column with any missing value gives NaN
CA NaN
EXAM NaN
dtype: float64
>>> score.cumsum()  # compute cumulative sums down each column
CA EXAM
CSCU230001 25.0 NaN
CSCU230002 45.5 35.5
CSCU230003 NaN NaN
CSCU230004 46.0 48.0
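The effect of missing values on these reductions can be seen in a short sketch using the same score DataFrame. The skipna and min_count parameters are real pandas arguments; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

score = pd.DataFrame([[25.00, np.nan], [20.50, 35.5], [np.nan, np.nan], [0.5, 12.5]],
                     index=['CSCU230001', 'CSCU230002', 'CSCU230003', 'CSCU230004'],
                     columns=['CA', 'EXAM'])

default_sum = score.sum(axis='columns')                 # NaN is skipped, so an all-NaN row sums to 0.0
strict_sum = score.sum(axis='columns', skipna=False)    # any NaN in a row makes the result NaN
at_least_one = score.sum(axis='columns', min_count=1)   # NaN only when a row has no valid value
n_valid = score.count()                                 # number of non-missing values per column

print(default_sum)
print(strict_sum)
print(at_least_one)
print(n_valid)
```

Using min_count=1 is one way to stop a student with no recorded scores from showing a total of 0.0.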
describe is one of the richer methods; it produces multiple summary statistics in one shot:
>>> score.describe()
CA EXAM
count 3.000000 2.000000
mean 15.333333 24.000000
std 13.041600 16.263456
min 0.500000 12.500000
25% 10.500000 18.250000
50% 20.500000 24.000000
75% 22.750000 29.750000
max 25.000000 35.500000
When you run describe on non-numeric data, it produces alternative summary statistics:
>>> data = ['A', 'B', 'A', 'D', 'F', 'A', 'C', 'C']
>>> grade = pd.Series(data * 4)
>>> grade.describe()
count 32
unique 5
top A
freq 12
dtype: object
These are just a few examples.