
Pandas Basics

pandas is a Python package that provides flexible data structures, Series and DataFrame, for working with labeled and relational data. A Series is a single column of data, and a DataFrame is a two-dimensional table made up of Series. DataFrames can be created from many data sources, such as CSV files, in a few lines of code, and they provide many methods for basic data analysis and transformation.

Uploaded by

Dhruv Bhardwaj

Pandas

PYTHON FOR DATA ANALYSIS


Package overview

pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labelled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis/manipulation
tool available in any language.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
• Any other form of observational / statistical data sets. The data need not be labelled at all to be placed into a pandas data structure.
Data structures of pandas

• The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

• The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
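The container idea above can be sketched in a few lines. This is a minimal illustration, not taken from the original slides; the column names and values are made up for the example.

```python
import pandas as pd

# A DataFrame behaves like a dictionary of Series keyed by column name.
df = pd.DataFrame({"apples": [3, 2, 0, 1]})

# Insert a new column the same way you would set a dictionary key.
df["oranges"] = [0, 3, 7, 2]

# Remove a column and get it back as a Series.
apples = df.pop("apples")

# A Series is in turn a container for scalars.
first_value = apples.iloc[0]
```

After these steps, df holds only the oranges column, and apples is a standalone Series.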
Install and import

• Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:
conda install pandas
or
pip install pandas

• Alternatively, if you're currently viewing this in a Jupyter notebook, you can run this cell:

!pip install pandas


import

• We usually import pandas under a shorter name, since it's used so much:

import pandas as pd
Core components of pandas: Series and DataFrames

• The two primary components of pandas are the Series and the DataFrame.

• A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.
Creating DataFrames

data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)

purchases
Index in DataFrame

• The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

# Look up a row by its index label
purchases.loc['June']
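As a self-contained sketch of the label lookup above, using the same apples and oranges data, the snippet below shows that .loc returns the selected row as a Series. The comparison with .iloc is an addition for illustration.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

# .loc selects by index label and returns the row as a Series.
june = purchases.loc["June"]
print(june["apples"], june["oranges"])  # prints: 3 0

# .iloc selects by integer position; position 0 is the same row here.
first_row = purchases.iloc[0]
```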
Example:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

df  # displays the DataFrame

# Select only the Age column; the result is a Series
df["Age"]
Create a Series:

ages = pd.Series([22, 35, 58], name="Age")

ages

NOTE: A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
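The note above can be checked directly. The following is a small sketch, not from the original slides: it inspects the row labels and the name of the Series, then converts it to a one-column DataFrame to show where column labels come from.

```python
import pandas as pd

ages = pd.Series([22, 35, 58], name="Age")

# A Series has row labels (its index) and a name, but no column labels.
row_labels = list(ages.index)
series_name = ages.name

# Converting to a DataFrame turns the Series name into a column label.
ages_df = ages.to_frame()
column_labels = list(ages_df.columns)
```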
How to read in data

• It’s quite simple to load data from various file formats into a DataFrame. In the following examples we'll keep using our apples and oranges data, but this time it's coming from various files.
• pandas supports many different file formats and data sources out of the box (csv, excel, sql, json, parquet, ...), each read by a function with the prefix read_*.
Read data from a CSV file or a text file:

df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None)
Explanation:

The read_csv function has a plethora of parameters, and I have specified only a few, the ones you will probably use most often. A few key points:
• a) header=0 means the column names are in the first row of the file. If the file has no header row, specify header=None.
• b) index_col=False means the first column of the data is not used as the index of the DataFrame. If the first column really is an index, set index_col=0 instead.
• c) names=None means you are not supplying the column names and want them inferred from the CSV file, so the row given by header must contain the column names. Otherwise, pass the names here in the same order as the columns appear in the file.
• If you are reading a text file separated by spaces or tabs, simply change sep:
sep=" " or sep='\t'
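To make these parameters concrete, here is a minimal sketch that reads from in-memory text instead of a file, so it runs without any data on disk. The sample rows and the column names passed to names are invented for the illustration.

```python
import io
import pandas as pd

# Comma-separated text whose first row holds the column names.
csv_text = "name,apples,oranges\nJune,3,0\nRobert,2,3\n"
with_header = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, index_col=False)

# Tab-separated text with no header row: supply the names explicitly.
tsv_text = "June\t3\t0\nRobert\t2\t3\n"
no_header = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    header=None,
    names=["name", "apples", "oranges"],
)
```

Both calls produce the same three columns with a default 0-based index; only the way the column names are obtained differs.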
Reading data from CSVs

• With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')

df

• CSVs don't have indexes like our DataFrames, so all we need to do is designate the index_col when reading:

df = pd.read_csv('purchases.csv', index_col=0)

df
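The effect of index_col can be seen with a small round trip. This sketch writes the apples and oranges data to an in-memory buffer rather than to purchases.csv (an assumption made so the example is self-contained) and reads it back both ways.

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# Write the DataFrame, including its index, as CSV text.
buffer = io.StringIO()
purchases.to_csv(buffer)

# Without index_col, the saved index comes back as an ordinary column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With index_col=0, the first column is used as the index again.
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)
```

In plain, the names appear in an extra column that pandas labels "Unnamed: 0"; in indexed, they are the row labels.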
Most important DataFrame operations

• DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.
• Let's load in the IMDB movies dataset to begin:

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")


Viewing your data

• The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

movies_df.head()  # first five rows by default

movies_df.tail(2)  # last two rows
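Since the IMDB file may not be at hand, the behaviour of .head() and .tail() can be sketched on a small stand-in DataFrame. The column name n and its values are made up for the example.

```python
import pandas as pd

# A ten-row stand-in for a real dataset.
df = pd.DataFrame({"n": range(10)})

first_five = df.head()    # default is the first 5 rows
first_three = df.head(3)  # pass a number to change how many rows you get
last_two = df.tail(2)     # rows from the end of the DataFrame
```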
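Getting info about your data

• .info() should be one of the very first commands you run after loading your data:

movies_df.info()  # column names, non-null counts, and dtypes

movies_df.shape  # (number of rows, number of columns)

movies_df.describe()  # summary statistics for the numeric columns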
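The three calls above can be tried on a tiny stand-in DataFrame. The titles and ratings below are invented; with the default settings, describe() summarises only the numeric column.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": [8.1, 7.0, 6.3],
})

df.info()                # prints a summary of columns, non-null counts, dtypes
n_rows, n_cols = df.shape
summary = df.describe()  # count, mean, std, min, quartiles, max
```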
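Handling duplicates

temp_df = pd.concat([movies_df, movies_df])  # DataFrame.append was removed in pandas 2.0

temp_df.shape

temp_df = temp_df.drop_duplicates()

temp_df.shape

temp_df.drop_duplicates(inplace=True)  # alternative: modify temp_df in place instead of reassigning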
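The same steps can be followed on a small stand-in DataFrame to see how the shape changes. The rows are invented, and the keep=False variant is an addition not shown in the slides.

```python
import pandas as pd

df = pd.DataFrame({"title": ["A", "B"], "year": [2014, 2016]})

# Stack the DataFrame on top of itself so that every row appears twice.
doubled = pd.concat([df, df])
doubled_shape = doubled.shape

# drop_duplicates() returns a copy that keeps the first copy of each row.
deduped = doubled.drop_duplicates()

# keep=False drops every row that has a duplicate, leaving nothing here.
none_left = doubled.drop_duplicates(keep=False)

# inplace=True modifies the DataFrame itself and returns None.
result = doubled.drop_duplicates(inplace=True)
```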
Column cleanup

• Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.
• Here's how to print the column names of our dataset:

movies_df.columns
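One possible cleanup, sketched under assumptions: the column names below are hypothetical stand-ins for verbose names, not confirmed names from the IMDB file. The sketch renames two columns with rename() and then lowercases all of them.

```python
import pandas as pd

# Hypothetical verbose column names, for illustration only.
df = pd.DataFrame({
    "Runtime (Minutes)": [121],
    "Revenue (Millions)": [333.13],
    "Rating": [8.1],
})

# Rename selected columns with a mapping from old name to new name.
cleaned = df.rename(columns={
    "Runtime (Minutes)": "runtime",
    "Revenue (Millions)": "revenue_millions",
})

# Lowercase every column name in one pass.
cleaned.columns = [name.lower() for name in cleaned.columns]
```

rename() returns a new DataFrame, so the original df keeps its verbose names.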
