0% found this document useful (0 votes)
25 views6 pages

Pandas - Cheat - Sheet

The document provides an overview of key Pandas concepts for working with Series and DataFrames. It describes how Series are 1-dimensional labeled arrays that can have names, indexes, and different data types. DataFrames are 2-dimensional tables that contain rows and columns, with each column being a Series. The document outlines many common Series and DataFrame methods for data manipulation, such as filtering, sorting, descriptive statistics, and transformations.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views6 pages

Pandas - Cheat - Sheet

The document provides an overview of key Pandas concepts for working with Series and DataFrames. It describes how Series are 1-dimensional labeled arrays that can have names, indexes, and different data types. DataFrames are 2-dimensional tables that contain rows and columns, with each column being a Series. The document outlines many common Series and DataFrame methods for data manipulation, such as filtering, sorting, descriptive statistics, and transformations.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 6

Pandas: Python Package | Fast, flexible and powerful data manipulation tool

Series: 1 Dimensional labelled array of any data type


>Bros = pd.Series(data = [‘Soum’ , ’Ayush’], index = [1,2], dtype = ’string’) # Using List for creating Series
Note: When Index are not specified Pandas assign a sequence of integers starting from zero. This is a built-in Pandas
object called ‘RangeIndex’

Method & Attributes:


Method – A function bound to the object (Example: List1.equals(List2)
Attribute – A variable bound to the object (Example: List1.dtype, Bros.name = ‘Cool Dudes!’)
Note: Series can have names and those names can be used as column names in DataFrame. Similarly, the Series Index
can also have its own name

Add_prefix() & Add_suffix() Method: (Doesn’t modify the original Series labels)
>Alphabet.add_prefix(‘_prefix_label’)
>Alphabet.add_suffix(‘_suffix_label’)

Boolean mask and the .loc Indexer:


Boolean Mask - Used to index select items at scale | work with [] and .loc | need to be of same length as Series
>Alphabet.loc[[True if i%2==0 else False for i in range(26)]]

.loc – The prototypical way of doing label based extraction


.iloc – Way of doing integer / location based extraction
Example:
>Alphabet = pd.Series(data = [‘A’,’B’,’C’,’D’], index = [1,2,3,4], dtype=’string’)
>Alphabet.loc[1] ---> Output: A
>Alphabet.iloc[1] ---> Output: B

Selecting with .get():


-‘get()’ works for both ‘index based on labels’ as well as ‘location based extraction’
>Alphabet.get(‘label_A’) ---> Output: A
>Alphabet.get(0) ---> Output: A
>Alphabet.get(‘Anything’, default=’Couldn’t find anything!’) ---> Output: Couldn’t find anything!
Note: default of .get() is set to ‘None’

Series Methods and Handling:


pd.read_csv – Read a comma-separated values (csv) file into the DataFrame
>Alcohol = pd.read_csv(‘https://andybek.com/drinks’, usecols = [‘country’,’wine_servings’], index_col = [‘country’])
>type(Alcohol) ---> pandas.core.frame.DataFrame

If the parsed data only contains one column then setting squeeze parameter as True will return a Series
>Alcohol = pd.read_csv(‘https://andybek.com/drinks’, usecols = [‘country’,’wine_servings’], index_col = [‘country’],
squeeze = True)
>type(Alcohol) ---> pandas.core.series.Series
Size & Shape:
.size - number if elements in the Series
.shape - tuple of the dimension for a Series

Unique values and Series Monotonicity:


.is_unique - Simply checks if the Series contains a sequence of Unique values (True/Values)
.nunique() - Gives exact counts of unique values in a Series (excluding NA) (#)
.nunique(dropna = False) - Gives counts of unique values in a Series (including NA)
.unique() - Gives a list of unique values in the Series
.is_monotonic / .is_monotonic_increasing - Checks if Series is increasing / stagnation
.is_monotonic_decreasing - Checks if Series is decreasing

The count() Method:


Series.count() - Return number of non-NA / Null observations in the Series
.isna().sum() - Return uber of Null elements in the Series

Dropping and Filling NAs:


>Alcohol.fillna(100, inplace = False)
>Alcohol.dropna(inplace = False) - Exclude NAs from the Series
Note: Both methods return a copy of the Series unless the ‘inplace’ parameter is set to True

Descriptive Statistics:
>Alcohol.sum() - Excludes NA’s
>Alcohol.mean()
>Alcohol.median()
>Alcohol.quantile(q=0.5)
IQR (Interquantile Range) -> Alcohol.quantile(0.75) – Alcohol.qunatile(0.25)
>Alcohol.min()
>Alcohol.max()
>Alcohol.std()
>Alcohol.var()
Note: Alcohol.std()**2 = Alcohol.var() | Mode - Item with highest frequency

Describe() Method:
Gives an overall statistical description of the dataset
>Alcohol.describe(percentile = [0.79,0.19], include = float, exclude = object)

Value_counts():
A sorted series containing unique values and their counts
>Alcohol.value_counts(sort = True, ascending = False, dropna = True, normalize = False)
Note: Normalize provides relative frequency

idxmax() & idxmin():


idxmax() - Returns the label of the row with minimum value
idxmin() - Returns the label of the row with minimum value
Note: If ‘multiple’ min/max values are present then only the first label is returned
nlargest() & nsmallest():
>Alcohol.nlargest(n = 10) - Shows largest 10 alcohol consuming countries / values
>Alcohol.nsmallest(n = 10) - Shows smallest 10 alcohol consuming countries / values

Sorting with sort_values():


Returns a new Series, sorted by values
>Alcohol.sort_values(ascending = False, na_position = ‘last’, kind = ‘quicksort’, inplace = False)
Note: Have other sorting options such as ‘mergesort’ or ‘heapsort’, default is ‘quicksort’

Sorting wit sort_index():


Returns a new Series, sorted by index labels
>Alcochol.sort_index(ascending = True, na_postion = ‘last’, inplace = False)
Note: NaN won’t show up in the sorting if not present in the index label

Series Arithmetic and fill_value():


Use to add / subtract / divide / multiply two Series
>Alcohol.add(More_Drinks, fill_value = 0)
>Alcohol.subtract(More_Drinks, fill_value = 0)
>Alcohol.divide(More_Drinks, fill_value = 1)
>Alcohol.multiply(More_Drinks, fill_value = 1)
Note: ‘fill_value’ is set to 0 / 1 so that we don’t end up losing data due to Series misalignment

Calculating Variance and Standard Deviation:


Variance - The average of squared differences from the mean

>(Alcohol.subtract(Alcohol.mean())**2).sum()/(Alcohol.count() -1)
Note: Standard Deviation is square of Variance

Cumulative Operations:
>Alcohol.cumsum(skipna =True) - Calculate a progressive / cumulative sum (of the values preceding in the Series)
>Alcohol.cumprod()
>Alcohol.cummin()
>Alcohol.cummax()
Note: ‘NaN’ are skipped (i.e., ‘skipna’ parameter is set to ‘True’ by default) as sum of any number with ‘NaN’ is ‘NaN’)

Pairwise Difference with Diff():


Used to calculate the discrete difference for pair of elements in a Series
>Ser.diff(periods = 1)
Series Iteration:
>for i in Alcohol:
print(i)
-Prints value / alcohol consumption without labels

>for i in Alcohol.index:
print(i)
-Prints labels without values

>for i in Alcohol.index:
print(i, Alcohol[i])
-Prints both label and value

>for i in Alcohol.items(): / for i in Alcohol.iteritems():


print(i)
-Returns a lazy evaluated iterable of tuple | Behind the scene used ‘zip’ method

Filtering: filter(), where() and mask()


>Alcohol.filter(regex = ‘^V’) - Will filter countries starting with letter ‘V’
>Alcohol.filter(like = ‘stan’) - Will filter countries having ‘stan’ in them
Note: Filtering is done along index labels and not on values on using ‘filter()’ function. For filtering for values, use square
brackets (Example: Alcohol[Alcohol > 200])

pandas.Series.where() - Replace values with ‘NaN / set value’ where the condition is False
>Alcohol.where(lambda x: x > 200, np.nan).dropna()

pandas.Series.mask() - Replace values with ‘NaN / set value’ where the condition is True
>Alcohol.mask(lambda x: x > 200, np.nan).dropna()

Note: Default replacing value is ‘NaN’

Transforming with update(), apply() & map():


update() - Update series with newly added values
>Alcohol.update(pd.Series(data = [200,20], index = [‘Albania’, ‘Algeria’]))

apply() - Applies a transformation to each and every element in a series


>Alcohol.apply(lambda x : x**2) | Alcohol.apply(np.square)

map() - Map values of Series according to input correspondence | Used for substituting each value in a Series with
another value, that may be derived from a function, a dictionary Series
>Ser.map({‘old value’ : ‘new value’})
>Alcohol.map(lambda x : x**2)
DataFrame:
A table of data that contains a collection of rows and columns.
Key Aspects -
1. DataFrames have two dimensions: labeled indices and columns
2. Each column in a DataFrame is a Series and each column must be of same size
3. Unlike Series, DataFrames could be heterogenous (i.e., have multiple data types)

>names = [‘Olga’, ‘Andre’, ‘Brian’]


>ages = [29, 21, 45]
>married = [False, True, True]
>df = pd.DataFrame({‘name’ : names, ‘age’ : ages, ‘married’ : married})

Need to specify index position as well as column position / name to fetch specific value from DataFrame
> df.iloc[2,0] --> Output: ‘Brian’

.dtypes: Shows columns name and their datatypes


Note: ‘.dtype’ will throw an error as DataFrame can have multiple data types

.ndim & .shape:


.ndim - Direct way of comparing dimension of the dataset
.shape - Gives shape of the dataset

More ways for DataFrame:


1. Dict of Tuples (Column-wise)
>tuple_name = tuple(names)
>tuple_ages = tuple(ages)
>tuple_married = tuple(married)
>pd.DataFrame({‘name’ : tuple_name, ‘age’ : tuple_ages, ‘married’ : tuple_married})

2. Dict of Dicts (Column-wise)


>series_name = pd.Series(names)
>series_ages = pd.Series(ages)
>series_married = pd.Series(married)
>pd.DataFrame({‘name’ : series_name, ‘age’ : series_ages, ‘married’ : series_married})

3. Dict of Series (Column-wise)


>dict_names = {k:v for k,v in enumerate(names)}
>dict_ages = {k:v for k,v in enumerate(ages)}
>dict_married = {k:v for k,v in enumerate(married)}
>pd.DataFrame({‘name’ : dict_name, ‘age’ : dict_ages, ‘married’ : dict_married})

4. List of Dicts (Row-wise)


>rowwise = [{‘name’ : name, ‘age’ : ages, ‘married’ : married} for name, ages, married in zip(names, ages, married)]

The info() Method:


Gives overview about the dataset. Only works for DataFrames
>df.info(verbose = True, max_cols = 4, memory_usage = ‘deep)
Note: verbose default is set to True. Setting it to False will remove column specific information
DataFrame:
>Nutrition = pd.DataFrame(‘http://andybek.com/pandas-nutrition’, index_col = [0])

.drop():
Removes specific rows / columns from the DataFrame
>Nutrition.drop(‘Unnamed : 0’, axis = 1)

.set_index():
Set specified column as index of the DataFrame
>Nutrition.set_index(‘Unnamed : 0’)

You might also like