Python Foundation For Data Science

The document provides an overview of Python's foundational libraries for data science, including NumPy and pandas, detailing their data structures such as arrays, Series, and DataFrames. It covers how to create and manipulate these structures, perform mathematical operations, and handle missing data. Additionally, it explains how to run Jupyter Notebook for executing Python code and saving notebooks.

Uploaded by

hacksbank.net

PYTHON FOUNDATION FOR DATA SCIENCE

import numpy as np
data = {i : np.random.randn() for i in range(7)}
data

Running the Jupyter Notebook

To start up Jupyter, run the command jupyter notebook in a terminal.
Depending on your installation, you will see some startup messages printed in the terminal.

You will then be redirected to the Jupyter landing page in your browser.

To create a new notebook, click the New button and select the “Python 3” or “conda
[default]” option. If this is your first time, try clicking on the empty code “cell” and
entering a line of Python code. Then press Shift-Enter to execute it.

When you save the notebook (see “Save and Checkpoint” under the notebook File
menu), it creates a file with the extension .ipynb. This is a self-contained file format
that contains all the content (including any evaluated code output) currently in the
notebook. These can be loaded and edited by other Jupyter users.

To load an existing notebook, put the file in the same directory where you started the
notebook process (or in a subfolder within it), then double-click the name from the landing
page.

Data Structures and Sequences

Tuple: A tuple is a fixed-length, immutable sequence of Python objects. The simplest way
to create one is with a comma-separated sequence of values, with or without parentheses:
tup = 4, 5, 6
tup = (4, 5, 6)

When you’re defining tuples in more complicated expressions, it’s often necessary to
enclose the values in parentheses, as in this example of creating a tuple of tuples:
nested_tup = (4, 5, 6), (7, 8)

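- Note that an element of a tuple can be any Python object, including a scalar or another tuple
- Tuples are immutable: once a tuple is created, the object stored in each position cannot be replaced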
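
To make the two notes above concrete, here is a minimal sketch. The names mixed and assignment_failed are made up for illustration and are not part of the original notes.

```python
tup = (4, 5, 6)
nested_tup = (4, 5, 6), (7, 8)

# Item assignment on a tuple raises TypeError because tuples are immutable.
try:
    tup[0] = 10
    assignment_failed = False
except TypeError:
    assignment_failed = True

# A tuple can hold any object; a mutable object inside it can still change.
mixed = ('foo', [1, 2], True)
mixed[1].append(3)

# Unpacking assigns each element of the tuple to its own name.
first, second, third = tup

print(assignment_failed, mixed, nested_tup[1], first)
```

Note that immutability applies to the tuple itself, not to the objects it contains, which is why the inner list can still grow.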

Lists are variable-length and their contents can be modified in place.

- You can define a list using square brackets [] or using the list type function:
score_list = [70, 60, 60, None]
grade_tup = ('A', 'B', 'C')
grade_list = list(grade_tup)

You can append to the end of a list or insert an element at a given position:

grade_list.append('E')
grade_list.insert(3, 'D')
print(grade_list)
['A', 'B', 'C', 'D', 'E']

Other list operations include:

.extend(other_list)  # append all the elements of another list
.pop(index)          # remove and return the value at the given index
.remove('item')      # remove the first instance of the given item
'item' in a_list     # True if the item is in the list, otherwise False
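
A short sketch of these operations on the grade list built earlier. The extra values 'F' and 'A' are added only for illustration.

```python
grade_list = ['A', 'B', 'C', 'D', 'E']

grade_list.extend(['F', 'A'])   # append every element of another list
removed = grade_list.pop(2)     # remove and return the value at index 2
grade_list.remove('A')          # remove only the first 'A'

has_f = 'F' in grade_list       # membership test
has_c = 'C' in grade_list

print(grade_list, removed, has_f, has_c)
```

Keep in mind that remove raises ValueError if the item is not in the list, so a membership test first is a common precaution.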

Other operations worth exploring: slicing, sorted, zip, and reversed.
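
The notes only name these operations, so the following is a small sketch of what each one does. The scores and student names are hypothetical.

```python
score_list = [70, 60, 85, 40, 55]
names = ['Ada', 'Bala', 'Chidi', 'Dayo', 'Efe']   # hypothetical names

first_three = score_list[:3]             # slice: elements at index 0, 1 and 2
every_other = score_list[::2]            # slice with a step of 2
ranked = sorted(score_list)              # new sorted list; the original is unchanged
backwards = list(reversed(score_list))   # reversed() returns an iterator
pairs = list(zip(names, score_list))     # pair up elements position by position

print(first_three, every_other, ranked, backwards, pairs[0])
```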

DICT: likely the most important built-in Python data structure.

It is a flexibly sized collection of key-value pairs, where key and value are Python objects. For example,
record = {'Level': 100, 'Sex': 'M', 'Programme': 'CSC', 'Entry_Year': 2023}  # a dict holding a student record

You can access, insert, or set elements using the same syntax as for accessing elements
of a list or tuple.
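
As an illustration of accessing, inserting, and setting elements, here is a minimal sketch using the student record. The 'CGPA' and 'State' keys are hypothetical additions.

```python
record = {'Level': 100, 'Sex': 'M', 'Programme': 'CSC', 'Entry_Year': 2023}

level = record['Level']          # access a value by its key
record['Level'] = 200            # set the value of an existing key
record['CGPA'] = 4.2             # insert a new key-value pair (hypothetical key)

has_sex = 'Sex' in record        # membership tests look at the keys
state = record.get('State', 'unknown')   # get() returns a default instead of raising KeyError

del record['Entry_Year']         # delete a key-value pair

print(level, record, has_sex, state)
```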

SET
NUMPY
NumPy is a foundational package for numerical computing in Python. Most computational
packages providing scientific functionality use NumPy’s array objects as the lingua franca
for data exchange.

The NumPy ndarray, a multidimensional array object, is a fast, flexible container for large
datasets in Python. Arrays enable you to perform mathematical operations on whole
blocks of data using similar syntax to the equivalent operations between scalar elements.

Run the following code and explain what happens

import numpy as np
data = np.random.randn(2, 3)
print(data)
data + data

An ndarray is a generic multidimensional container for homogeneous data (that is, all of
the elements must be the same type).
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an
object describing the data type of the array:

print(data.shape)
print(data.dtype)

Creating ndarrays
The easiest way to create an array is to use the array function. For example,
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

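zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values:

np.zeros(10)
np.zeros((3, 6))
np.ones((2, 3))
np.empty((2, 3, 2))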
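
The following sketch checks the shape and dtype of arrays made with these functions. The variable names are chosen here for illustration only.

```python
import numpy as np

zeros_1d = np.zeros(10)            # one dimension, length 10
ones_2d = np.ones((3, 6))          # pass a tuple for more than one dimension
empty_3d = np.empty((2, 3, 2))     # values are uninitialized, not guaranteed to be 0

# np.array infers a dtype; mixing ints and floats gives a float array.
float_arr = np.array([6, 7.5, 8, 0, 1])
nested_arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(zeros_1d.shape, ones_2d.shape, empty_3d.shape)
print(float_arr.dtype, nested_arr.ndim, nested_arr.shape)
```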

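You can carry out arithmetic on numerical ndarrays without writing loops; operations between arrays are applied element-wise. E.g.

arr1 + arr1   # returns an array that is the element-wise sum of the two arrays
arr1 - arr1   # returns an array of all zeros
arr1 * arr1   # returns an array that is the square of each element
arr1 > arr2   # True where an element of arr1 > the matching element of arr2, otherwise False; the shapes must be compatible (arr1 and arr2 as defined above, with shapes (5,) and (2, 4), are not)
arr1 * 3      # what does this return?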
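
Here is a small sketch of element-wise arithmetic using two arrays of the same shape. The arrays arr and other are made up for this example so that the comparison is well defined.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
other = np.array([[0., 4., 1.], [7., 2., 12.]])

total = arr + arr        # element-wise sum
squared = arr * arr      # element-wise product, not matrix multiplication
scaled = arr * 3         # the scalar is applied to every element
mask = other > arr       # boolean array of the same shape

# Arrays with incompatible shapes cannot be combined element-wise.
arr1 = np.array([6, 7.5, 8, 0, 1])               # shape (5,)
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])    # shape (2, 4)
try:
    arr1 + arr2
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(total, squared, scaled, mask, shapes_compatible, sep='\n')
```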

Mathematical and Statistical Methods

- A set of mathematical functions that compute statistics about an entire array or
about the data along an axis are accessible as methods of the array class.
- You can use aggregations like sum, mean, and std (standard deviation) either by
calling the array instance method or using the top-level NumPy function.
Here we generate some normally distributed random data and compute some aggregate statistics:
data = np.random.randn(5, 4)
data.mean()
np.mean(data)
data.sum()

Functions like mean and sum take an optional axis argument that computes the statistic
over the given axis, resulting in an array with one fewer dimension:
data.mean(axis=1)   # compute the mean across the columns (one value per row)
data.sum(axis=0)    # compute the sum down the rows (one value per column)
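
Because random data changes on every run, a fixed array makes the effect of the axis argument easier to check by hand. This sketch uses a small 3 x 4 array chosen for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

overall_mean = data.mean()        # a single number for the whole array
row_means = data.mean(axis=1)     # one mean per row, shape (3,)
col_sums = data.sum(axis=0)       # one sum per column, shape (4,)

print(overall_mean)
print(row_means)
print(col_sums)
```

The 2-D input of shape (3, 4) gives 1-D results of shape (3,) and (4,), which is the "one fewer dimension" mentioned above.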

Pandas
pandas contains data structures and data manipulation tools designed to make data
cleaning and analysis fast and easy in Python. pandas adopts significant parts of
NumPy’s style of array-based computing, especially array-based functions and a
preference for data processing without for loops. The biggest difference is that pandas is
designed for working with tabular or heterogeneous data.

Throughout the rest of this section, I use the following import convention for pandas:
import pandas as pd

Thus, whenever you see pd. in code, it’s referring to pandas.

pandas Data Structures

To get started with pandas, you will need to understand its two workhorse data structures:
Series and DataFrame.

Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data:
import pandas as pd
# from pandas import Series, DataFrame
obj = pd.Series([4, 7, -5, 3])
print(obj)
0    4
1    7
2   -5
3    3
dtype: int64
# Explicitly specify the index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
d    4
b    7
a   -5
c    3
dtype: int64

# Get the values only

print(obj.values)
[ 4  7 -5  3]

# Get the index only

print(obj2.index)
Index(['d', 'b', 'a', 'c'], dtype='object')

With a Series you can access a single value or a set of values by index label or by a list of labels. You
can also carry out mathematical and logical operations as in NumPy.
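
A minimal sketch of label-based access and NumPy-style operations on the Series defined earlier. The result names are made up for illustration.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = obj2['a']                  # one value, selected by its label
subset = obj2[['c', 'a', 'd']]      # several values, selected by a list of labels
positive = obj2[obj2 > 0]           # boolean filtering keeps the index labels
doubled = obj2 * 2                  # arithmetic is applied to every value

obj2['d'] = 6                       # set a value by its label

print(single, subset, positive, doubled, obj2, sep='\n')
```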

You can create a Series from data held in a dict by passing the dict:
stud_pop_data = {'Bauchi': 3000, 'Gombe': 2000, 'Kano': 1600}
stud_pop_series = pd.Series(stud_pop_data)
print(stud_pop_series)
Bauchi    3000
Gombe     2000
Kano      1600
dtype: int64

You can select which keys to include, and their order, by passing an index:
states = ['Bauchi', 'Kano', 'Plateau']
stud_pop = pd.Series(stud_pop_data, index=states)
print(stud_pop)
Bauchi     3000.0
Kano       1600.0
Plateau       NaN
dtype: float64

Here, two values found in states were placed in the appropriate locations, but since no
value for 'Plateau' was found, it appears as NaN (not a number), which pandas uses to
mark missing or NA values. Since 'Gombe' was not included in states, it is
excluded from the resulting stud_pop object.

The isnull and notnull functions in pandas should be used to detect missing data:
pd.isnull(stud_pop)
Bauchi     False
Kano       False
Plateau     True
dtype: bool

pd.notnull(stud_pop)
Bauchi      True
Kano        True
Plateau    False
dtype: bool

One other useful Series feature is that it automatically aligns by index label in arithmetic
operations. Try it yourself.
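
One way to check the alignment behaviour is the sketch below. The second Series, new_intake, and its numbers are hypothetical.

```python
import pandas as pd

stud_pop_series = pd.Series({'Bauchi': 3000, 'Gombe': 2000, 'Kano': 1600})
new_intake = pd.Series({'Bauchi': 500, 'Kano': 400, 'Plateau': 700})   # hypothetical data

# Values are matched by label, not by position. A label that is missing
# from either Series gives NaN in the result.
combined = stud_pop_series + new_intake

# The add method with fill_value treats a missing label as 0 instead.
combined_filled = stud_pop_series.add(new_intake, fill_value=0)

print(combined)
print(combined_filled)
```

The result index is the union of the two indexes, so 'Gombe' and 'Plateau' both appear even though each is present in only one input.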

DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:

import pandas as pd
data = {'state': ['Bauchi', 'Gombe', 'Plateau', 'Kano'], 'stud': [6000, 2001, 3102, 880]}
d_frame = pd.DataFrame(data)
print(d_frame)
     state  stud
0   Bauchi  6000
1    Gombe  2001
2  Plateau  3102
3     Kano   880

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:
print(d_frame['stud'])
0    6000
1    2001
2    3102
3     880
Name: stud, dtype: int64
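
The example above shows only the dict-like notation. The sketch below compares it with attribute access on the same DataFrame; the names by_key and by_attr are made up for illustration.

```python
import pandas as pd

data = {'state': ['Bauchi', 'Gombe', 'Plateau', 'Kano'], 'stud': [6000, 2001, 3102, 880]}
d_frame = pd.DataFrame(data)

by_key = d_frame['stud']    # dict-like notation works for any column name
by_attr = d_frame.stud      # attribute access needs a valid Python identifier

same = by_key.equals(by_attr)
print(same, by_attr.name)
```

Attribute access is convenient for interactive work, but it fails for column names that contain spaces or that clash with an existing DataFrame method, so the dict-like form is the safer choice in scripts.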

A row can also be retrieved by its index label using the loc attribute, as follows:

print(d_frame.loc[2])
state    Plateau
stud        3102
Name: 2, dtype: object

Columns can be modified by assignment, e.g.

d_frame['stud'] = 0
print(d_frame['stud'])
0    0
1    0
2    0
3    0
Name: stud, dtype: int64
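
Assignment is not limited to a single scalar. As a sketch, a column can also be set from an array of matching length, and assigning to a name that does not exist creates a new column. The 'large' column is hypothetical.

```python
import numpy as np
import pandas as pd

d_frame = pd.DataFrame({'state': ['Bauchi', 'Gombe', 'Plateau', 'Kano'],
                        'stud': [6000, 2001, 3102, 880]})

# The array length must match the number of rows in the DataFrame.
d_frame['stud'] = np.arange(4) * 100

# Assigning to a new name adds a column (hypothetical column 'large').
d_frame['large'] = d_frame['stud'] >= 200

print(d_frame)
```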

Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods.
Unlike the similar methods on NumPy arrays, they have built-in handling for missing data.

Consider the following DataFrame:

import numpy as np
import pandas as pd
score = pd.DataFrame([[25.00, np.nan], [20.50, 35.5], [np.nan, np.nan], [0.5, 12.5]],
                     index=['CSCU230001', 'CSCU230002', 'CSCU230003', 'CSCU230004'],
                     columns=['CA', 'EXAM'])
>>> score
              CA  EXAM
CSCU230001  25.0   NaN
CSCU230002  20.5  35.5
CSCU230003   NaN   NaN
CSCU230004   0.5  12.5

Calling DataFrame’s sum method returns a Series containing column sums:

>>> score.sum()
CA      46.0
EXAM    48.0
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

>>> score.sum(axis='columns')
CSCU230001    25.0
CSCU230002    56.0
CSCU230003     0.0
CSCU230004    13.0
dtype: float64

>>> score.mean()
CA      15.333333
EXAM    24.000000
dtype: float64

>>> score.mean(skipna=False)  # do not skip missing values: any column containing NaN gives NaN

CA     NaN
EXAM   NaN
dtype: float64

>>> score.cumsum()  # compute cumulative sums down each column; NaN values are skipped

              CA  EXAM
CSCU230001  25.0   NaN
CSCU230002  45.5  35.5
CSCU230003   NaN   NaN
CSCU230004  46.0  48.0

describe is one of the richer methods: it produces multiple summary statistics in one shot:

>>> score.describe()
              CA       EXAM
count   3.000000   2.000000
mean   15.333333  24.000000
std    13.041600  16.263456
min     0.500000  12.500000
25%    10.500000  18.250000
50%    20.500000  24.000000
75%    22.750000  29.750000
max    25.000000  35.500000

When you run describe on non-numeric data, it produces alternative summary statistics:
>>> data = ['A', 'B', 'A', 'D', 'F', 'A', 'C', 'C']
>>> grade = pd.Series(data * 4)
>>> grade.describe()
count     32
unique     5
top        A
freq      12
dtype: object

These are just a few examples!!!
