
Python - Data Science Lecture 1

The document outlines a lecture on Python for Data Science at WSB University, covering topics such as Python basics, data analysis, and data mining. It includes information on course materials, assessment criteria, and a bibliography of recommended readings. The lecture will also introduce key libraries like Pandas and NumPy, and discuss data processing and visualization techniques.


Python - Data Science

Python language in data processing and data mining
Lecture 1.
WSB University
Dariusz Badura

Python in Data Science 2023-24 1


Information about lecture materials
Lecture materials are presented on the On-line WSB platform:
• Link: https://online.wsb.edu.pl/course/view.php?id=10797
• Password: PLPBZima23

The course discusses the basics of:
• Python and its libraries,
• data analysis,
• data science.



The criteria for passing the lecture
• At the end of the semester there will be a test of 20–30 questions on the topics presented in the lecture.
• There will be four tests and tasks per semester on lecture topics.
• Lecture materials will be posted on the On-line WSB e-learning platform.



Bibliography
• Wes McKinney: Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, 3rd Edition; O'Reilly Media, Inc., 2022.
• Sarah Guido, Andreas Müller: Introduction to Machine Learning with Python: A Guide for Data Scientists; O'Reilly Media, Inc., 2017.
• Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills: Advanced Analytics with Spark; O'Reilly Media, Inc., June 2017.
• Other Internet sources …
Data processing
• …, manipulation of data by a computer. It includes the conversion of
raw data to machine-readable form, flow of data through the CPU and
memory to output devices, and formatting or transformation of
output.
• … the collection and manipulation of digital data to produce
meaningful information. Data processing is a form of information
processing, which is the modification of information in any manner
detectable by an observer.
• The term "data processing" has also been used to refer to a department within an organization responsible for the operation of data processing programs.
Data processing functions
Data processing may involve various processes, including:
• Validation – Ensuring that supplied data is correct and relevant.
• Sorting – "arranging items in some sequence and/or in different
sets."
• Summarization (statistical or automatic) – reducing detailed data to its main points.
• Aggregation – combining multiple pieces of data.
• Analysis – the "collection, organization, analysis, interpretation and
presentation of data."
• Reporting – list detail or summary data or computed information.
• Classification – separation of data into various categories.



Data mining
• Data mining is the process of sorting through large data sets to identify
patterns and relationships that can help solve business problems through data
analysis.
• Data mining is the process of extracting and discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics,
and database systems. Data mining is an interdisciplinary subfield of computer
science and statistics with an overall goal of extracting information and
transforming the information into a comprehensible structure for further use.
• Data mining is the analysis step of the "knowledge discovery in databases"
process, or KDD. It also involves database and data management aspects, data
pre-processing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.



Lecture topics
• Data exploration and analysis.
  – Pandas; NumPy; SciPy (and others).
• Data visualization.
  – Matplotlib; Seaborn; Datashader; others.
• Classical machine learning.
  – Scikit-Learn, StatsModels.
• Deep learning.
  – Keras, TensorFlow, and a whole host of others.
• Data storage and big data frameworks.
  – Apache Spark; Apache Hadoop; HDFS; Dask; h5py/PyTables.
• Odds and ends.
  – NLTK; spaCy; OpenCV/cv2; scikit-image; Cython.
Plan of the first lecture
• A quick overview of the features of Python,
• NumPy library overview,
• Pandas library overview.



Python
BASICS OF THE LANGUAGE



Why Python?
• Python is a general-purpose programming language; thanks to libraries such as pandas, NumPy, SciPy, Matplotlib, and TensorFlow, it has become a powerful environment for scientific computing.
• The basic features of the language:
  – basic data types,
  – containers,
  – functions,
  – classes.
https://docs.python.org/3/tutorial/index.html



The Interpreter and Its Environment
• Anaconda:
  – Spyder
  – Jupyter
  – InPy
  – Colab

Other development environments:
• PyDev (free) – an integrated development environment built on the Eclipse platform;
• PyCharm by JetBrains;
• Python Tools for Visual Studio for Windows users;
• Spyder (free) – an integrated development environment included with the Anaconda distribution;
• Komodo – a commercial integrated development environment.



Python versions
• Python exists in two major lines: Python 2 (last release 2.7, no longer supported since January 2020) and Python 3, the only line still maintained.
• Python 3.0 introduced many backwards-incompatible changes to the language, so code written for 2.7 may not work in 3.x and vice versa.
• The code presented here targets Python 3 (version 3.5 or later).
• You can check the Python version on the command line by running python --version.
Example: Python implementation of the classic "quicksort" algorithm:

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print(quicksort([3,6,8,10,1,2,1]))
# Prints "[1, 1, 2, 3, 6, 8, 10]"
Basic data types
 Numbers: Integers and floating point numbers
work as they do in other languages:

x = 5
print(type(x))  # Prints "<class 'int'>"
print(x)        # Prints "5"
print(x + 1)    # Addition; prints "6"
print(x - 1)    # Subtraction; prints "4"
print(x * 2)    # Multiplication; prints "10"
print(x ** 2)   # Exponentiation; prints "25"
x += 1
print(x)        # Prints "6"
x *= 2
print(x)        # Prints "12"
y = 3.7
print(type(y))  # Prints "<class 'float'>"
print(y, y + 1, y * 2, y ** 2)  # Prints "3.7 4.7 7.4 13.690000000000001" (floating-point rounding)



Basic data types
Python does not have unary increment (x++) or decrement (x--) operators.
Python also has a built-in complex number type.
• Booleans: Python implements all the usual Boolean logic operators, but uses English words instead of symbols (&&, ||, etc.):

t = True
f = False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f) # Logical OR; prints "True"
print(not t) # Logical NOT; prints "False"
print(t != f) # Logical XOR; prints "True"



Basic data types
 Strings: Python has strong support for strings:

hello = 'hello'    # String literals can use single quotes
world = "world"    # or double quotes; it does not matter.
print(hello)       # Prints "hello"
print(len(hello))  # String length; prints "5"
hw = hello + ' ' + world  # String concatenation
print(hw)          # Prints "hello world"
hw12 = '%s %s %d' % (hello, world, 12)  # sprintf-style string formatting
print(hw12)        # Prints "hello world 12"

Useful methods:
s = "hello"
print(s.capitalize())  # Capitalize a string; prints "Hello"
print(s.upper())       # Convert a string to uppercase; prints "HELLO"
print(s.rjust(7))      # Right-justify a string, padding with spaces; prints "  hello"
print(s.center(7))     # Center a string, padding with spaces; prints " hello "
print(s.replace('l', '(ell)'))  # Replace all instances of one substring with another;
                                # prints "he(ell)(ell)o"
print(' world '.strip())  # Strip leading and trailing whitespace; prints "world"
Containers
Python comes with several built-in container types: lists,
dictionaries, sets, and tuples.
• Lists
A list is the Python equivalent of an array, but it is resizable and can contain elements of different types:
xs = [3, 1, 2] # Create a list
print(xs, xs[2]) # Prints "[3, 1, 2] 2"
print(xs[-1]) # Negative indices count from the end of the list; prints "2"
xs[2] = 'foo' # Lists can contain elements of different types
print(xs) # Prints "[3, 1, 'foo']"
xs.append('bar') # Add a new element to the end of the list
print(xs) # Prints "[3, 1, 'foo', 'bar']"
x = xs.pop() # Remove and return the last element of the list
print(x, xs) # Prints "bar [3, 1, 'foo']"



Containers -> lists
 Slicing: In addition to accessing list elements individually, Python provides a
concise syntax to access sublists; this is known as slicing:

nums = list(range(5))  # range is a built-in that yields a sequence of integers; list() turns it into a list
print(nums)            # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4])       # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:])        # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2])        # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:])         # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1])       # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9]     # Assign a new sublist to a slice
print(nums)            # Prints "[0, 1, 8, 9, 4]"



Containers -> lists -> Loops
• You can loop over the elements of a list like this:

animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
# Prints "cat", "dog", "monkey", each on its own line

• To access the index of each element within the loop body, use the built-in enumerate function:

animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print('#%d: %s' % (idx + 1, animal))
# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line



Containers -> lists -> List comprehension
 When programming, we often want to transform one type of data into
another. As a simple example, consider the following code that calculates
square numbers:
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print(squares)  # Prints [0, 1, 4, 9, 16]

We can simplify this code using a list comprehension:
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print(squares)  # Prints [0, 1, 4, 9, 16]

List comprehensions can also contain conditions:
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)  # Prints "[0, 4, 16]"



Containers -> dictionaries
 A dictionary stores (key, value) pairs, similar to a map in Java or
an object in JavaScript. We can use it this way:

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data
print(d['cat'])       # Get an entry from a dictionary; prints "cute"
print('cat' in d)     # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet'     # Set an entry in a dictionary
print(d['fish'])      # Prints "wet"
# print(d['monkey'])  # KeyError: 'monkey' not a key of d
print(d.get('monkey', 'N/A'))  # Get an element with a default; prints "N/A"
print(d.get('fish', 'N/A'))    # Get an element with a default; prints "wet"
del d['fish']         # Remove an element from a dictionary
print(d.get('fish', 'N/A'))    # "fish" is no longer a key; prints "N/A"



Containers -> dictionaries-> Loops
 Iterating through a dictionary by key:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print('A %s has %d legs' % (animal, legs))
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"

• To access the keys and their corresponding values together, use the items method:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.items():
    print('A %s has %d legs' % (animal, legs))
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"



Containers -> dictionaries comprehensions

• Dictionary comprehensions are similar to list comprehensions, but allow you to easily create dictionaries. For example:

nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print(even_num_to_square) # Prints "{0: 0, 2: 4, 4: 16}"



Sets
 A set is an unordered collection of distinct elements.
Example:

animals = {'cat', 'dog'}
print('cat' in animals)   # Check if an element is in a set; prints "True"
print('fish' in animals)  # Prints "False"
animals.add('fish')       # Add an element to a set
print('fish' in animals)  # Prints "True"
print(len(animals))       # Number of elements in a set; prints "3"
animals.add('cat')        # Adding an element that is already in the set does nothing
print(len(animals))       # Prints "3"
animals.remove('cat')     # Remove an element from a set
print(len(animals))       # Prints "2"



Sets
 Loops: Iterating over a set has the same syntax as iterating over a list;
however, because sets are unordered, assumptions cannot be made about
the order in which the set's elements are visited:
animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print('#%d: %s' % (idx + 1, animal))
# Prints, for example, "#1: fish", "#2: dog", "#3: cat" (the order is arbitrary)

• Set comprehensions: Like lists and dictionaries, sets can easily be constructed using set comprehensions:

from math import sqrt
nums = {int(sqrt(x)) for x in range(30)}
print(nums)  # Prints "{0, 1, 2, 3, 4, 5}"



Tuples
 A tuple is an (immutable) ordered list of values. A tuple is similar
to a list in many ways; one difference is that tuples can be used as
keys in dictionaries and as elements of sets, while lists cannot.
Example:

d = {(x, x + 1): x for x in range(10)}  # Create a dictionary with tuple keys
t = (5, 6)        # Create a tuple
print(type(t))    # Prints "<class 'tuple'>"
print(d[t])       # Prints "5"
print(d[(1, 2)])  # Prints "1"



Functions
 Python functions are defined using the def keyword. Example:

def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print(sign(x))
# Prints "negative", "zero", "positive"



Functions
 We often define functions to take optional keyword arguments, like
this:

def hello(name, loud=False):
    if loud:
        print('HELLO, %s!' % name.upper())
    else:
        print('Hello, %s' % name)

hello('Bob')              # Prints "Hello, Bob"
hello('Fred', loud=True)  # Prints "HELLO, FRED!"



Classes
 Example

class Greeter(object):

    # Constructor
    def __init__(self, name):
        self.name = name  # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print('HELLO, %s!' % self.name.upper())
        else:
            print('Hello, %s' % self.name)

g = Greeter('Fred')  # Construct an instance of the Greeter class
g.greet()            # Call an instance method; prints "Hello, Fred"
g.greet(loud=True)   # Call an instance method; prints "HELLO, FRED!"



NumPy library
• NumPy and pandas are tools for working with datasets, which can come from a wide range of sources and in a wide range of formats, including:
– collections of documents,
– collections of images,
– collections of sound clips,
– collections of numerical measurements, or
– … nearly anything else.
• Data sets can be represented as arrays of numbers:
– digital images can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area;
– sound clips can be thought of as one-dimensional arrays of intensity versus time;
– text can be converted in various ways into numerical representations, such as binary digits representing the frequency of certain words or pairs of words.
• Efficient storage and manipulation of numerical arrays is fundamental to the process of doing data science.
• Install NumPy: http://www.numpy.org/
• Then import NumPy and, for example, double-check the version:
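A minimal sketch of that import and version check, using the conventional alias np. The variable name numpy_version is only illustrative, and the printed value depends on the local installation:

```python
import numpy as np  # "np" is the conventional alias

# The version string depends on what is installed locally.
numpy_version = np.__version__
print(numpy_version)
```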
NumPy module
• The NumPy module is the basic library for scientific computing in Python (including matrix multiplication and addition, diagonalization or inversion, integration, solving equations, etc.).
• It provides specialized data types, operations, and functions that are not available in a typical Python installation.



Creating Arrays from Python Lists
• Unlike Python lists, NumPy arrays can only contain data of the same type. If the types do not match, NumPy will upcast them according to its type promotion rules; in the second example below, integers are upcast to floating point.
• Integer array: np.array([1, 4, 2, 5, 3])
• Floating-point array: np.array([3.14, 4, 2, 3])
• Use the dtype keyword to set the type explicitly: np.array([1, 2, 3, 4], dtype=np.float32)
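The upcasting behaviour described above can be checked by inspecting the dtype attribute. This is a small sketch; the variable names are illustrative, and the exact width of the default integer type depends on the platform:

```python
import numpy as np

int_arr = np.array([1, 4, 2, 5, 3])                 # all integers
mixed_arr = np.array([3.14, 4, 2, 3])               # integers are upcast to float
f32_arr = np.array([1, 2, 3, 4], dtype=np.float32)  # type set explicitly

print(int_arr.dtype)    # an integer dtype (width is platform dependent)
print(mixed_arr.dtype)  # float64
print(f32_arr.dtype)    # float32
```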



Creating Arrays from Scratch
For larger arrays it is more efficient to create arrays from scratch using routines built into NumPy:
• np.zeros(10, dtype=int)       # length-10 integer array filled with zeros
• np.ones((3, 5), dtype=float)  # 3x5 floating-point array filled with ones
• np.full((3, 5), 3.14)         # 3x5 array filled with 3.14
• np.arange(0, 20, 2)           # linear sequence starting at 0, stopping before 20, stepping by 2 (similar to the built-in range function)
• np.linspace(0, 1, 5)          # five values evenly spaced between 0 and 1
• … and others
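A short sketch that runs the routines listed above and shows what they return. The variable names are illustrative only:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)
ones = np.ones((3, 5), dtype=float)
filled = np.full((3, 5), 3.14)
evens = np.arange(0, 20, 2)   # the stop value 20 is excluded
grid = np.linspace(0, 1, 5)   # both endpoints are included

print(zeros.shape, ones.shape, filled.shape)  # (10,) (3, 5) (3, 5)
print(evens)  # [ 0  2  4  6  8 10 12 14 16 18]
print(grid)   # [0.   0.25 0.5  0.75 1.  ]
```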
NumPy Standard Data Types
• Because NumPy is built in C, its types will be familiar to users of C, Fortran, and other related languages.

Data type    Description
bool_        Boolean (True or False) stored as a byte
int_         Default integer type (same as C long; normally either int64 or int32)
intc         Identical to C int (normally int32 or int64)
intp         Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8         Byte (–128 to 127)
int16        Integer (–32768 to 32767)
int32        Integer (–2147483648 to 2147483647)
int64        Integer (–9223372036854775808 to 9223372036854775807)
uint8        Unsigned integer (0 to 255)
uint16       Unsigned integer (0 to 65535)
uint32       Unsigned integer (0 to 4294967295)
uint64       Unsigned integer (0 to 18446744073709551615)
float_       Shorthand for float64
float16      Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32      Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64      Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_     Shorthand for complex128
complex64    Complex number, represented by two 32-bit floats
complex128   Complex number, represented by two 64-bit floats



The Basics of NumPy Arrays
• Attributes of arrays: Determining the size, shape,
memory consumption, and data types of arrays
• Indexing of arrays: Getting and setting the values of
individual array elements
• Slicing of arrays: Getting and setting smaller subarrays
within a larger array
• Reshaping of arrays: Changing the shape of a given array
• Joining and splitting of arrays: Combining multiple
arrays into one, and splitting one array into many
NumPy Array Attributes
Defining random arrays of one, two, and three dimensions:
import numpy as np
rng = np.random.default_rng(seed=1701)  # seed for reproducibility

x1 = rng.integers(10, size=6)          # one-dimensional array
x2 = rng.integers(10, size=(3, 4))     # two-dimensional array
x3 = rng.integers(10, size=(3, 4, 5))  # three-dimensional array

print("x3 ndim: ", x3.ndim)   # number of dimensions: 3
print("x3 shape:", x3.shape)  # size of each dimension: (3, 4, 5)
print("x3 size: ", x3.size)   # total number of elements: 60
print("dtype: ", x3.dtype)    # data type of the elements: int64



Array Indexing
• array Indexing: access to single elements,
• array slicing: access to subarrays:
x[start:stop:step]
• reshaping of arrays
• array concatenation and splitting
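The indexing, slicing, and reshaping operations listed above can be sketched as follows. The arrays x and grid are illustrative examples, not taken from the lecture:

```python
import numpy as np

x = np.arange(10)            # [0 1 2 3 4 5 6 7 8 9]
print(x[0], x[-1])           # single elements: 0 9
print(x[2:8:2])              # x[start:stop:step] -> [2 4 6]
print(x[::-1][:3])           # reversed, first three -> [9 8 7]

grid = x[:6].reshape(2, 3)   # reshape six elements into 2 rows x 3 columns
print(grid[1, 2])            # row 1, column 2 -> 5
print(grid[:, 0])            # first column -> [0 3]
```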



Array Concatenation and Splitting
• Concatenation of arrays:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])  # array([1, 2, 3, 3, 2, 1])

• Splitting of arrays:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])  # split before indices 3 and 5
print(x1, x2, x3)  # [1 2 3] [99 99] [3 2 1]



Computation on NumPy Arrays:
Universal Functions
 Array Arithmetic
 Absolute Value
 Trigonometric Functions
 Specialized Ufuncs (scipy.special)
 Advanced Ufunc Features
 Specifying Output
 Aggregations
 Outer Products
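A brief sketch of several of the ufunc features listed above (arithmetic, absolute value, trigonometry, the out argument, reduce, and outer). The sample arrays are illustrative, and scipy.special is left out to keep the example dependent on NumPy only:

```python
import numpy as np

x = np.arange(4)                          # [0 1 2 3]
print(x + 5)                              # array arithmetic -> [5 6 7 8]
print(x ** 2)                             # [0 1 4 9]
print(np.abs(np.array([-2, -1, 0, 1])))   # absolute value -> [2 1 0 1]

theta = np.linspace(0, np.pi, 3)          # 0, pi/2, pi
sines = np.sin(theta)                     # approximately 0, 1, 0

out = np.empty(4)
np.multiply(x, 10, out=out)               # write the result into an existing array
print(out)                                # [ 0. 10. 20. 30.]

total = np.add.reduce(x)                  # aggregation with a ufunc -> 6
table = np.multiply.outer(x[1:], x[1:])   # 3x3 multiplication table of 1..3
print(total)
print(table)
```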
Aggregations: min, max, between
• Summing the Values in an Array
• Minimum and Maximum
• Multidimensional Aggregates
• Other Aggregation Functions (table):

Function name   NaN-safe version   Description
np.sum          np.nansum          Compute sum of elements
np.prod         np.nanprod         Compute product of elements
np.mean         np.nanmean         Compute mean of elements
np.std          np.nanstd          Compute standard deviation
np.var          np.nanvar          Compute variance
np.min          np.nanmin          Find minimum value
np.max          np.nanmax          Find maximum value
np.argmin       np.nanargmin       Find index of minimum value
np.argmax       np.nanargmax       Find index of maximum value
np.median       np.nanmedian       Compute median of elements
np.percentile   np.nanpercentile   Compute rank-based statistics of elements
np.any          N/A                Evaluate whether any elements are true
np.all          N/A                Evaluate whether all elements are true

http://localhost:8888/notebooks/Downloads/02.04-Computation-on-arrays-aggregates.ipynb
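A small sketch of how the plain and NaN-safe aggregates differ, including aggregation along an axis. The matrix m is a made-up example:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0]])

print(np.sum(m))              # nan: a single NaN poisons the plain sum
print(np.nansum(m))           # 16.0: NaN is ignored
print(np.nanmin(m, axis=0))   # minimum of each column -> [1. 2. 3.]
print(np.nanmax(m, axis=1))   # maximum of each row -> [3. 6.]
print(np.nanmean(m))          # 3.2
print(np.any(np.isnan(m)))    # True
```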



Computation on Arrays: Broadcasting
Broadcasting in NumPy follows a strict set of rules to
determine the interaction between the two arrays:
 Rule 1: If the two arrays differ in their number of
dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
 Rule 2: If the shape of the two arrays does not match in
any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
 Rule 3: If in any dimension the sizes disagree and neither is
equal to 1, an error is raised.
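The three rules above can be traced on small arrays. This is a sketch with illustrative shapes; the name rule3_error is hypothetical and only records that the last case fails:

```python
import numpy as np

a = np.ones((2, 3))                # shape (2, 3)
b = np.arange(3)                   # shape (3,)
# Rule 1 pads b to (1, 3); rule 2 stretches it to (2, 3).
print((a + b).shape)               # (2, 3)

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(3)                 # shape (3,) -> (1, 3)
# Both operands are stretched to (3, 3).
print((col + row).shape)           # (3, 3)

c = np.ones((3, 2))                # shape (3, 2)
rule3_error = None
try:
    c + b                          # (3, 2) vs (1, 3): sizes 2 and 3 disagree
except ValueError as err:          # rule 3: an error is raised
    rule3_error = err
    print("broadcast error:", err)
```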
Comparisons, Masks, and Boolean Logic
 Masking comes up when we want to extract, modify, count, or otherwise
manipulate values in an array based on some criterion.
 In NumPy, Boolean masking is often the most efficient way to accomplish these
types of tasks.
 NumPy implements comparison operators such as < (less than) and > (greater
than) as element-wise ufuncs. The result of these comparison operators is always
an array with a Boolean data type.
 All six of the standard comparison operations are available:
Operator   Equivalent ufunc   Operator   Equivalent ufunc
==         np.equal           !=         np.not_equal
<          np.less            <=         np.less_equal
>          np.greater         >=         np.greater_equal
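A minimal sketch showing that a comparison operator and its equivalent ufunc give the same Boolean array, and how such an array can be used for counting. The array x is an illustrative example:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

below_three = x < 3
print(below_three)                   # [ True  True False False False]
print(np.less(x, 3))                 # same result via the ufunc

n_below = np.count_nonzero(x < 3)    # count the True entries -> 2
n_at_least_four = np.sum(x >= 4)     # True counts as 1 -> 2
print(n_below, n_at_least_four)
```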



Comparisons, Masks, and Boolean Logic

 Boolean Operators
 Boolean Arrays as Masks

Operator   Equivalent ufunc   Operator   Equivalent ufunc
&          np.bitwise_and     |          np.bitwise_or
^          np.bitwise_xor     ~          np.bitwise_not
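A short sketch of combining comparisons with the bitwise operators above and using the result as a mask. The array is illustrative; the parentheses are needed because & and | bind more tightly than the comparison operators:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

mask = (x > 2) & (x < 6)      # element-wise AND of two Boolean arrays
print(mask)                   # [False False  True  True  True False]
print(x[mask])                # values where the mask is True -> [3 4 5]
print(x[~mask])               # inverted mask -> [1 2 6]

small_or_even = x[(x < 2) | (x % 2 == 0)]
print(small_or_even)          # [1 2 4 6]
```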



Fancy Indexing
 Fancy indexing is conceptually simple: it means passing an array of indices to
access multiple array elements at once. For example, consider the following array:

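import numpy as np
rng = np.random.default_rng(seed=1701)

x = rng.integers(100, size=10)
print(x)
[x[3], x[7], x[2]]  # three elements, accessed one at a time
ind = np.array([[3, 7],
                [4, 5]])
x[ind]  # fancy indexing: the result has the shape of ind, here (2, 2)
• Combined indexing: for a two-dimensional array X, fancy indices can be combined with simple indices and with slices:
  – X[2, [2, 0, 1]]
  – X[1:, [2, 0, 1]]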
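Since X is not defined on the slide, here is a sketch with a deterministic x and an assumed 3x4 array X so that the results can be followed by hand:

```python
import numpy as np

x = np.arange(10) * 10            # [ 0 10 20 ... 90]
ind = np.array([[3, 7],
                [4, 5]])
picked = x[ind]                   # result takes the shape of ind
print(picked)                     # [[30 70]
                                  #  [40 50]]

X = np.arange(12).reshape(3, 4)   # assumed example: rows [0..3], [4..7], [8..11]
row_pick = X[2, [2, 0, 1]]        # simple index + fancy index -> [10  8  9]
block = X[1:, [2, 0, 1]]          # slice + fancy index
print(row_pick)
print(block)                      # [[ 6  4  5]
                                  #  [10  8  9]]
```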
Sorting Arrays
 Fast Sorting in NumPy: np.sort and np.argsort
 Sorting Along Rows or Columns

 Partial Sorts: Partitioning

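L = [3, 1, 4, 1, 5, 9, 2, 6]
sorted(L)         # returns a sorted copy
L.sort()          # acts in-place and returns None
print(L)
sorted('python')  # works on any iterable; returns a list of characters

import numpy as np

x = np.array([2, 1, 4, 3, 5])
np.sort(x)        # returns a sorted copy: array([1, 2, 3, 4, 5])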
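The other topics on this slide (np.argsort, sorting along rows or columns, and partitioning) can be sketched as follows. The arrays are illustrative examples:

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])
order = np.argsort(x)          # indices that would sort x
print(order)                   # [1 0 3 2 4]
print(x[order])                # [1 2 3 4 5]

M = np.array([[9, 1, 8],
              [3, 7, 2]])
by_column = np.sort(M, axis=0)   # sort each column independently
by_row = np.sort(M, axis=1)      # sort each row independently
print(by_column)                 # [[3 1 2]
                                 #  [9 7 8]]
print(by_row)                    # [[1 8 9]
                                 #  [2 3 7]]

# Partial sort: the three smallest values end up in the first three
# positions, in no particular order.
part = np.partition(np.array([7, 2, 3, 1, 6, 5, 4]), 3)
print(np.sort(part[:3]))         # [1 2 3]
```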



NumPy's Structured Arrays
 Exploring Structured Array Creation
 More Advanced Compound Types
Character   Description              Example
'b'         Byte                     np.dtype('b')
'i'         Signed integer           np.dtype('i4') == np.int32
'u'         Unsigned integer         np.dtype('u1') == np.uint8
'f'         Floating point           np.dtype('f8') == np.float64
'c'         Complex floating point   np.dtype('c16') == np.complex128
'S', 'a'    String                   np.dtype('S5')
'U'         Unicode string           np.dtype('U') == np.str_
'V'         Raw data (void)          np.dtype('V') == np.void
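A minimal sketch of creating and querying a structured array with a compound dtype built from the type codes above. The field names and the three sample records are made up for illustration:

```python
import numpy as np

# Compound dtype: 10-character Unicode string, 4-byte int, 8-byte float
data = np.zeros(3, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data['name'] = ['Alice', 'Bob', 'Cathy']   # hypothetical records
data['age'] = [25, 45, 37]
data['weight'] = [55.0, 85.5, 68.0]

print(data['name'])                    # one field -> ['Alice' 'Bob' 'Cathy']
print(data[0]['age'])                  # one field of one record -> 25
young = data[data['age'] < 40]['name'] # Boolean mask on a field
print(young)                           # ['Alice' 'Cathy']
print(np.dtype('f8') == np.float64)    # True
```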



Basic features of Pandas library
 Pandas objects
 NumPy and Pandas imports:
import numpy as np
import pandas as pd
 The Pandas Series Object
 Series as Generalized NumPy Array
 Series as Specialized Dictionary

 Constructing Series Objects
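A short sketch of the Series topics listed above: array-like access, dictionary-like access, and the common constructors. All values are illustrative:

```python
import pandas as pd

# Series as a generalized NumPy array: values plus an explicit index
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(s.values)              # [0.25 0.5  0.75 1.  ]
print(s.index.tolist())      # ['a', 'b', 'c', 'd']
print(s['b'])                # 0.5
print(s['a':'c'].tolist())   # label slices include the end -> [0.25, 0.5, 0.75]

# Series as a specialized dictionary: keys become the index
from_dict = pd.Series({'b': 2, 'a': 1, 'c': 3})
print(from_dict['a'])        # 1

# An explicit index selects and orders the dictionary entries
subset = pd.Series({'b': 2, 'a': 1, 'c': 3}, index=['c', 'a'])
print(subset.tolist())       # [3, 1]

# A scalar is repeated to fill the index
scalar = pd.Series(5, index=[100, 200, 300])
print(scalar.tolist())       # [5, 5, 5]
```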



The Pandas DataFrame Object
 DataFrame as Generalized NumPy Array
 DataFrame as Specialized Dictionary
• Constructing DataFrame Objects (in the examples below, population and area are Series objects that share an index):
  – From a single Series object:
    pd.DataFrame(population, columns=['population'])
  – From a list of dicts:
    data = [{'a': i, 'b': 2 * i} for i in range(3)]
    pd.DataFrame(data)
  – From a dictionary of Series objects:
    pd.DataFrame({'population': population, 'area': area})
  – From a two-dimensional NumPy array:
    pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
  – From a NumPy structured array:
    A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
    pd.DataFrame(A)
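The slide does not define population and area, so the sketch below assumes two small Series with made-up values and builds a DataFrame from them:

```python
import pandas as pd

# Made-up values, for illustration only
population = pd.Series({'north': 120, 'south': 340, 'west': 95})
area = pd.Series({'north': 10.0, 'south': 20.0, 'west': 5.0})

df = pd.DataFrame({'population': population, 'area': area})
print(df.index.tolist())      # ['north', 'south', 'west']
print(df.columns.tolist())    # ['population', 'area']
print(df['area']['south'])    # dictionary-style column access -> 20.0
print(df.values.shape)        # array-style view -> (3, 2)

# Columns combine element-wise, aligned on the shared index
df['density'] = df['population'] / df['area']
print(df['density'].tolist()) # [12.0, 17.0, 19.0]
```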



The Pandas Index Object
• Index as Immutable Array
• Index as Ordered Set
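Both views of the Index object can be sketched briefly. The index values are illustrative, and the flag name is_immutable is hypothetical:

```python
import pandas as pd

# Index as immutable array: indexing and slicing work, assignment does not
ind = pd.Index([2, 3, 5, 7, 11])
print(ind[1], ind[::2].tolist())   # 3 [2, 5, 11]

is_immutable = False
try:
    ind[1] = 0
except TypeError as err:
    is_immutable = True
    print("immutable:", err)

# Index as ordered set: set operations between two indices
a = pd.Index([1, 3, 5, 7, 9])
b = pd.Index([2, 3, 5, 7, 11])
print(a.intersection(b).tolist())          # [3, 5, 7]
print(a.union(b).tolist())                 # [1, 2, 3, 5, 7, 9, 11]
print(a.symmetric_difference(b).tolist())  # [1, 2, 9, 11]
```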



Data Indexing and Selection
 Data Selection in Series
 Series as Dictionary
 Series as One-Dimensional Array
 Indexers: loc and iloc
 Data Selection in DataFrames
 DataFrame as Dictionary
 DataFrame as Two-Dimensional Array
 Additional Indexing Conventions
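A minimal sketch of the difference between the loc (explicit label) and iloc (implicit position) indexers, on an illustrative Series with integer labels and a small DataFrame:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(s.loc[1])               # label 1 -> 'a'
print(s.iloc[1])              # position 1 -> 'b'
print(s.loc[1:3].tolist())    # label slice, end included -> ['a', 'b']
print(s.iloc[1:3].tolist())   # position slice, end excluded -> ['b', 'c']

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10.0, 20.0, 30.0]},
                  index=['r1', 'r2', 'r3'])
print(df.loc['r2', 'y'])      # by row and column label -> 20.0
print(df.iloc[0, 0])          # by row and column position -> 1
selected = df.loc[df['x'] > 1, 'y']   # Boolean mask combined with a column label
print(selected.tolist())      # [20.0, 30.0]
```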
Operating on Data in Pandas
 Index Preservation (Ufuncs)
 Index Alignment
 Index Alignment in Series
 Index Alignment in DataFrames
 Operations Between DataFrames and Series
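Index preservation and alignment can be sketched on two small Series with partly overlapping labels. The values are illustrative:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Ufuncs preserve the index
print(np.exp(a).index.tolist())   # ['x', 'y', 'z']

# Binary operations align on the union of the indices;
# labels present in only one operand give NaN
total = a + b
print(total)                      # y -> 12.0, z -> 23.0, x and w -> NaN

# The method form allows a fill value for the missing entries
filled = a.add(b, fill_value=0)
print(filled['w'], filled['x'])   # 30.0 1.0

# DataFrame minus Series: the Series index is aligned with the columns
# and the operation is applied row by row
df = pd.DataFrame([[1, 2], [3, 4]], columns=['p', 'q'])
diff = df - df.iloc[0]
print(diff.values.tolist())       # [[0, 0], [2, 2]]
```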



Handling Missing Data
 None as a Sentinel Value
 NaN: Missing Numerical Data
 NaN and None in Pandas
 Pandas Nullable Dtypes
 Operating on Null Values
 Detecting Null Values
 Dropping Null Values
 Filling Null Values
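A short sketch of detecting, dropping, and filling null values, plus one nullable dtype. The data are illustrative:

```python
import numpy as np
import pandas as pd

# In a float Series, None is converted to NaN
s = pd.Series([1.0, np.nan, 3.0, None])
print(s.isnull().tolist())    # detecting -> [False, True, False, True]
print(s.dropna().tolist())    # dropping -> [1.0, 3.0]
print(s.fillna(0).tolist())   # filling with a constant -> [1.0, 0.0, 3.0, 0.0]
print(s.ffill().tolist())     # forward fill -> [1.0, 1.0, 3.0, 3.0]

# Nullable integer dtype: missing values without a cast to float
nullable = pd.Series([1, None, 3], dtype='Int64')
print(nullable.dtype)             # Int64
print(nullable.isna().tolist())   # [False, True, False]
```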
Hierarchical Indexing
 A Multiply Indexed Series
 The Bad Way
 The Better Way: The Pandas MultiIndex
 MultiIndex as Extra Dimension
 Methods of MultiIndex Creation
 Explicit MultiIndex Constructors
 MultiIndex Level Names
 MultiIndex for Columns
 Indexing and Slicing a MultiIndex
 Multiply Indexed Series
 Multiply Indexed DataFrames
 Rearranging Multi-Indexes
 Sorted and Unsorted Indices
 Stacking and Unstacking Indices
 Index Setting and Resetting
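A minimal sketch of a multiply indexed Series: an explicit MultiIndex constructor with level names, indexing on both levels, unstacking, and resetting the index. The regions, years, and values are made up:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [2010, 2020]],
                                   names=['region', 'year'])
pop = pd.Series([10, 12, 20, 25], index=index)  # made-up values

print(pop.loc['A', 2020])              # one element -> 12
print(pop.loc[:, 2010].tolist())       # all regions for one year -> [10, 20]

# Unstacking turns the inner level into columns (an extra dimension)
wide = pop.unstack()
print(wide.shape)                      # (2, 2)
print(wide.loc['B', 2020])             # 25

# Resetting the index turns the levels back into ordinary columns
flat = pop.reset_index(name='population')
print(flat.columns.tolist())           # ['region', 'year', 'population']
```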
Combining Datasets: concat and append
 Recall: Concatenation of NumPy Arrays
 Simple Concatenation with pd.concat
 Duplicate Indices
 Concatenation with Joins
• The append Method (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
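The concatenation topics above can be sketched with two small frames. The frame contents are illustrative, and the flag name duplicates_rejected is hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

# Simple concatenation: indices are kept, so duplicates can appear;
# the default join='outer' takes the union of the columns
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())     # [0, 1, 0, 1]
print(stacked.columns.tolist())   # ['A', 'B', 'C']

# ignore_index=True builds a fresh integer index
clean = pd.concat([df1, df2], ignore_index=True)
print(clean.index.tolist())       # [0, 1, 2, 3]

# join='inner' keeps only the columns common to all inputs
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print(inner.columns.tolist())     # ['B']

# verify_integrity=True turns duplicate indices into an error
duplicates_rejected = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as err:
    duplicates_rejected = True
    print("duplicate indices:", err)
```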
