introduction to pandas

Pandas is a Python library designed for data manipulation and analysis, allowing users to clean, explore, and analyze datasets effectively. It provides data structures like Series and DataFrames for handling one-dimensional and multi-dimensional data, respectively, and supports operations such as loading data from files, handling missing values, and performing statistical analyses. The library is essential for data science, enabling users to derive insights from large datasets through various functions and methods.

Uploaded by

korircaren4
PANDAS

The name "Pandas" comes from "panel data".

It is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

Pandas allows us to analyze large data sets and draw conclusions based on statistical
theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Data science is a branch of computer science in which we study how to store,
use, and analyze data in order to derive information from it.

Pandas gives you answers about the data, such as:

• Is there a correlation between two or more columns?

• What is the average value?

• What is the maximum value?

• What is the minimum value?
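As a rough sketch, the questions above map onto DataFrame methods along the following lines. The column names and values are invented for this illustration and are not part of the data used later in these notes.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
df = pd.DataFrame({
    "duration": [30, 45, 60, 90],
    "calories": [200, 300, 400, 600],
})

# Correlation between every pair of numeric columns
correlations = df.corr()

# Average, maximum and minimum of a single column
average = df["calories"].mean()
highest = df["calories"].max()
lowest = df["calories"].min()

print(correlations)
print(average, highest, lowest)
```

Here corr() returns a table of correlation coefficients, while mean(), max() and min() each return a single number for the chosen column.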

Pandas can also delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
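One possible way to remove rows with empty values is the dropna() method. The sketch below uses a small made-up table, so the column names and numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data; None marks a missing value
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.0, None, 300.0],
})

# dropna() returns a new DataFrame without the rows that contain
# empty values; the original df is left unchanged
cleaned = df.dropna()

print(cleaned)
```

Because dropna() returns a new DataFrame by default, the original data stays available for comparison.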

Once Pandas is installed, import it into your applications with the import keyword:

import pandas

Example

import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example

Create a simple Pandas Series from a list:


import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

Create Labels

With the index argument, you can name your own labels.

Example

Create your own labels:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])


print(myvar)

When you have created labels, you can access an item by referring to the label.

Example

Return the value of "y":

print(myvar["y"])
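As a small illustration, the same item can be reached either by its label or by its position. The sketch below uses the loc and iloc accessors; the variable names by_label and by_position are made up for this example.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # look up by label
by_position = myvar.iloc[1]    # look up by position (counting from 0)

print(by_label, by_position)
```

Both lookups return the same value here, because "y" is the label of the second item.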

Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

Example

Create a simple Pandas Series from a dictionary:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and
specify only the items you want to include in the Series.

Example

Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])


print(myvar)
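One detail worth knowing: if the index argument names a label that is not a key in the dictionary, Pandas does not raise an error but fills that position with NaN (a missing value). The label "day9" below is made up to show this.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day9" is a hypothetical label that is not a key in the dictionary
myvar = pd.Series(calories, index=["day1", "day9"])

print(myvar)
print(myvar.isna())
```

The isna() method returns True for each item that has no value.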
DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Example

Create a DataFrame from two lists of values, one per column:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)
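The example above passes plain lists. As a sketch of the same idea with actual Series objects, each Series can be given a column name in a dictionary. The variable names here are chosen for illustration.

```python
import pandas as pd

# Two Series, each of which becomes one column
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

df_from_series = pd.DataFrame({"calories": calories, "duration": duration})

print(df_from_series)
```

The result has the same shape and values as the DataFrame built from lists.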

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array,
or a table with rows and columns.

Example

Create a simple Pandas DataFrame:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:

df = pd.DataFrame(data)

print(df)

As you can see from the result above, the DataFrame is like a table with rows and
columns.

Pandas uses the loc attribute to return one or more specified rows.

Example

Return row 0:

#refer to the row index:

print(df.loc[0])

Note: This example returns a Pandas Series.

Example

Return row 0 and 1:


#use a list of indexes:

print(df.loc[[0, 1]])

Note: When you pass a list of indexes inside [], the result is a Pandas DataFrame.
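The difference between the two return types can be checked directly. This is a minimal sketch; the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

one_row = df.loc[0]             # a single index gives a Series
two_rows = df.loc[[0, 1]]       # a list of indexes gives a DataFrame
one_row_as_frame = df.loc[[0]]  # a list with one index still gives a DataFrame

print(type(one_row).__name__)           # Series
print(type(two_rows).__name__)          # DataFrame
print(type(one_row_as_frame).__name__)  # DataFrame
```

Wrapping a single index in a list is a handy way to keep the table layout when selecting just one row.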

Named Indexes

With the index argument, you can name your own indexes.

Example

Add a list of names to give each row a name:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

Example

Return "day2":

#refer to the named index:

print(df.loc["day2"])
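As a further sketch, loc also accepts a list of named indexes, and iloc can still be used to select a row by position even when the rows have names. The variable names several and second are assumptions made for this example.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns a DataFrame with just those rows
several = df.loc[["day1", "day3"]]

# iloc selects by position, regardless of the row names
second = df.iloc[1]

print(several)
print(second)
```

Here df.iloc[1] returns the same row as df.loc["day2"], because "day2" is the second row.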

Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame.

Example

Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
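If data.csv is not at hand, the same call can be tried on CSV text held in memory. The sketch below wraps a made-up three-column sample in io.StringIO, which read_csv accepts in place of a file name; the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,109,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The first line of the CSV text becomes the column headers, and each following line becomes one row.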

Read CSV Files

A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text in a well-known format that can be read by
almost any program, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

Load the CSV into a DataFrame:


import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

If you have a large DataFrame with many rows, Pandas will only print the first 5
rows and the last 5 rows:

Example

Print the DataFrame without the to_string() method:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

The number of rows returned is defined in the Pandas option settings.

You can check your system's maximum rows with the pd.options.display.max_rows statement.

Example

Check the number of maximum returned rows:

import pandas as pd

print(pd.options.display.max_rows)

On my system the number is 60, which means that if the DataFrame contains more
than 60 rows, the print(df) statement will return only the headers and the first and
last 5 rows.

You can change the maximum rows number with the same statement.
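A minimal sketch of changing the setting, assuming a made-up DataFrame of 100 rows. It first assigns a new value to pd.options.display.max_rows, then restores the old one, and finally shows pd.option_context as one way to change the setting only for a short stretch of code.

```python
import pandas as pd

# 100 rows of made-up data, more than the usual default limit
df = pd.DataFrame({"n": range(100)})

# Remember the current setting so it can be restored afterwards
original = pd.options.display.max_rows

# Raise the limit: all 100 rows are now included in the printed text
pd.options.display.max_rows = 200
all_rows = str(df)

# Put the old value back
pd.options.display.max_rows = original

# Temporarily lower the limit inside a with-block only
with pd.option_context("display.max_rows", 10):
    truncated = str(df)

print(truncated)
```

Using option_context keeps the change local, so the rest of the program still prints DataFrames with the usual limit.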

Read JSON

Big data sets are often stored or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world
of programming, including Pandas.

In our examples we will be using a JSON file called 'data.json'.

Example

Load the JSON file into a DataFrame:

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

JSON = Python Dictionary

JSON objects have a format very similar to Python dictionaries.

If your JSON data is not in a file but in a Python dictionary, you can load it into a
DataFrame directly.

Example

Load a Python Dictionary into a DataFrame:

import pandas as pd

data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}

df = pd.DataFrame(data)

print(df)
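One thing to watch for with a dictionary like this: the inner keys ("0", "1", ...) become the row labels, and they are strings, not numbers. The sketch below uses a shortened, made-up version of the dictionary to show the effect and one possible way to convert the labels to integers.

```python
import pandas as pd

# Shortened version of the dictionary, for illustration only
data = {
    "Duration": {"0": 60, "1": 60, "2": 45},
    "Pulse": {"0": 110, "1": 117, "2": 109},
}

df = pd.DataFrame(data)

# The row labels are strings, so the lookup uses "0" rather than 0
first = df.loc["0"]

# Converting the labels to integers makes df.loc[0] work as well
df.index = df.index.astype(int)
first_again = df.loc[0]

print(first)
print(first_again)
```

Before the conversion, df.loc[0] would raise a KeyError because no row has the integer label 0.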

Pandas - Analyzing DataFrames

Viewing the Data

One of the most used methods for getting a quick overview of the DataFrame is the
head() method.

The head() method returns the headers and a specified number of rows, starting
from the top.

Example

Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))

Note: if the number of rows is not specified, the head() method will return the top 5
rows.

Example

Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from
the bottom.

Example

Print the last 5 rows of the DataFrame:

print(df.tail())
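To try head() and tail() without data.csv, a small DataFrame built in code works just as well. The column name "value" and the 20 rows below are made up for this sketch.

```python
import pandas as pd

# 20 rows of made-up data
df = pd.DataFrame({"value": range(1, 21)})

top = df.head(3)      # first 3 rows
bottom = df.tail(2)   # last 2 rows
default = df.head()   # first 5 rows when no number is given

print(top)
print(bottom)
```

Both methods return a new DataFrame, so the result can be stored in a variable and used further.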

Info About the Data

The DataFrame object has a method called info() that gives you more information
about the data set.

Example

Print information about the data:

#info() prints its report itself and returns None, so print() is not needed:

df.info()

Null Values

The info() method also tells us how many non-null values are present in
each column. In our data set there are 164 non-null values out of 169
in the "Calories" column.

This means that there are 5 rows with no value at all in the "Calories" column,
for whatever reason.

Empty values, or null values, can be bad when analyzing data, and you should
consider removing rows with empty values. This is a step towards what is called
cleaning data, and you will learn more about that in the next chapters.
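Besides reading the info() report, the empty values can also be counted directly. The sketch below uses a small made-up table rather than data.csv, so the numbers differ from the 164 of 169 mentioned above.

```python
import pandas as pd

# Hypothetical data; None marks a missing value
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [409.1, None, 300.5, None],
})

# Number of empty values in each column
missing_per_column = df.isna().sum()

# Number of non-null values in each column
non_null_per_column = df.count()

print(missing_per_column)
print(non_null_per_column)
```

For each column, the two counts add up to the total number of rows.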
