Notes On Pandas.

Pandas is a Python library designed for data manipulation and analysis, providing functions for cleaning, exploring, and analyzing datasets. It includes data structures like Series (one-dimensional) and DataFrames (two-dimensional) for organizing data, and allows for operations such as data cleaning, correlation analysis, and loading data from CSV files. Created by Wes McKinney in 2008, the name 'Pandas' references both 'Panel Data' and 'Python Data Analysis'.

What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was
created by Wes McKinney in 2008.

Why Use Pandas?


Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

What Can Pandas Do?


Pandas can answer questions about the data, such as:

● Is there a correlation between two or more columns?
● What is the average value?
● What is the maximum value?
● What is the minimum value?
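
The questions above map onto a few DataFrame and Series methods. A minimal sketch, assuming a small made-up table (the column names and numbers are for illustration only):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# Pairwise correlation between all numeric columns.
correlations = df.corr()

# Average, maximum, and minimum of a single column.
average_calories = df["calories"].mean()
max_calories = df["calories"].max()
min_calories = df["calories"].min()

print(correlations)
print(average_calories, max_calories, min_calories)
```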

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
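
One common way to remove rows with empty values is the dropna() method. A minimal sketch, assuming a made-up table in which one value is missing:

```python
import pandas as pd

# Hypothetical data with one missing value (None is stored as NaN).
df = pd.DataFrame({
    "calories": [420, None, 390],
    "duration": [50, 40, 45],
})

# dropna() returns a new DataFrame without the rows that contain
# empty values; the original df is left unchanged.
cleaned = df.dropna()

print(cleaned)
```

Because dropna() returns a new DataFrame by default, the result has to be assigned to a variable to be kept.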

What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]

data = pd.Series(a)

print(data)

Labels
If nothing else is specified, the values are labeled with their index number. The first value
has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.
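
A minimal sketch of reading one value through its default label, reusing the list from the example above:

```python
import pandas as pd

a = [1, 7, 2]
data = pd.Series(a)

# With the default labels, 0 refers to the first value in the Series.
first_value = data[0]

print(first_value)
```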

Create Labels
With the index argument, you can name your own labels.
Create your own labels:

import pandas as pd

a = [1, 7, 2]

data = pd.Series(a, index = ["x", "y", "z"])

print(data)

Output:

x    1
y    7
z    2
dtype: int64
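
Once labels have been created, a value can be accessed by referring to its label. A minimal sketch using the same Series:

```python
import pandas as pd

a = [1, 7, 2]
data = pd.Series(a, index=["x", "y", "z"])

# Look up a value by the label given in the index argument.
value_y = data["y"]

print(value_y)
```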

Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when creating a Series.
Create a simple Pandas Series from a dictionary:

import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}

data = pd.Series(calories)

print(data)

Output:
day1    420
day2    380
day3    390
dtype: int64

To select only some of the items in the dictionary, use the index argument and specify only the
items you want to include in the Series.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

data = pd.Series(calories, index = ["day1", "day2"])

print(data)

Output:
day1    420
day2    380
dtype: int64
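
If the index argument names a label that is not a key in the dictionary, Pandas still creates an entry for it and fills it with NaN (a missing value). A minimal sketch, where "day4" is a hypothetical label used only for illustration:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
data = pd.Series(calories, index=["day1", "day4"])

# The missing entry is stored as NaN, which also turns the values into floats.
print(data)
```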

DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a
table with rows and columns.

A Series is like a column; a DataFrame is the whole two-dimensional table.

Create a DataFrame from a dictionary of two lists:

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

#load the dictionary into a DataFrame object:
df = pd.DataFrame(data)

print(df)

Output:
   calories  duration
0       420        50
1       380        40
2       390        45
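
The example above passes plain Python lists. A minimal sketch of building the same table from two actual Series objects, which also shows that a single column of a DataFrame is itself a Series (the variable names here are only for illustration):

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each dictionary key becomes a column name; each Series becomes a column.
df = pd.DataFrame({"calories": calories, "duration": duration})

# Selecting one column gives back a Series.
column = df["calories"]

print(df)
print(type(column))
```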

Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

Example 1

#refer to the row index:

print(df.loc[0])

Output:
calories    420
duration     50
Name: 0, dtype: int64

Example 2

#use a list of indexes:

print(df.loc[[0, 1]])

Output:
   calories  duration
0       420        50
1       380        40

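Note: When a list of indexes is passed inside [], as in Example 2, the result is a Pandas DataFrame. A single index, as in Example 1, returns a Pandas Series.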
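
The difference between the two examples can be checked with type(). A minimal sketch, assuming the same calories/duration table as above:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

single_row = df.loc[0]    # one label: the row comes back as a Series
row_subset = df.loc[[0]]  # a list with one label: a DataFrame with one row

print(type(single_row))
print(type(row_subset))
```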

Named Indexes
With the index argument, you can name your own indexes.
Example

Add a list of names to give each row a name:

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

Output:
      calories  duration
day1       420        50
day2       380        40
day3       390        45

Locate Named Indexes


Use the named index in the loc attribute to return the specified row(s).

Example
Return "day2":

#refer to the named index:

print(df.loc["day2"])

Output:
calories    380
duration     40
Name: day2, dtype: int64
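
Named indexes work with loc in the same way as numeric ones. A minimal sketch that selects several named rows at once, and a single cell by row and column label:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of named indexes returns a DataFrame with just those rows.
selected = df.loc[["day1", "day3"]]

# A row label plus a column label returns a single value.
single_value = df.loc["day2", "calories"]

print(selected)
print(single_value)
```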

Read CSV Files


A simple way to store large data sets is to use CSV files (comma-separated values files).

CSV files contain plain text and are a well-known format that can be read by almost any tool,
including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

Load Files Into a DataFrame


If your data sets are stored in a file, Pandas can load them into a DataFrame.

Example

Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
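
The contents of 'data.csv' are not shown in these notes. To try read_csv without that file, here is a minimal self-contained sketch that first writes a small CSV file with made-up contents to a temporary folder and then loads it:

```python
import os
import tempfile

import pandas as pd

# Hypothetical file contents; the real data.csv is not shown in these notes.
csv_text = "calories,duration\n420,50\n380,40\n390,45\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    # By default, the first line of the file is used as the column names.
    df = pd.read_csv(path)

print(df)
```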
