Importing Files Through Pandas
Pandas
Course Instructor: Anam Shahid
Source: https://www.w3schools.com/python/pandas/default.asp
Pandas Introduction
What is Pandas?
Pandas is a Python library used for working with data sets.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
Data Science is a branch of computer science in which we study how to store, use,
and analyze data in order to derive information from it.
Pandas can also delete rows that are not relevant or that contain wrong
values, such as empty or NULL values. This is called cleaning the data.
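As a minimal sketch of what cleaning can look like, the snippet below drops rows with missing values using dropna(). The column names and values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical data: the None in "calories" marks a missing value.
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.1, None, 195.1],
})

# dropna() returns a new DataFrame without the rows that contain missing values.
cleaned = df.dropna()
print(cleaned)
```

The original DataFrame is left unchanged here; only the returned copy has the incomplete row removed.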
If the installation command (pip install pandas) fails, use a Python distribution
that already has Pandas installed, such as Anaconda or Spyder.
Import Pandas
Once Pandas is installed, import it in your applications by adding
the import keyword:
import pandas
Example
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python, an alias is an alternate name for referring to the same thing.
import pandas as pd
Example
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Output:
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
Pandas Series
What is a Series?
A Pandas Series is like a column in a table.
Example
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0 1
1 7
2 2
dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number. The
first value has index 0, the second value has index 1, and so on. This label can be
used to access a value:
print(myvar[0])
Output: 1
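The default labels can be inspected through the index attribute of the Series. The short sketch below reuses the list from the example above.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2])

# With no index argument, pandas assigns the labels 0, 1, 2, ...
labels = list(myvar.index)
print(labels)   # [0, 1, 2]

# The label selects the value stored under it.
first = myvar[0]
print(first)    # 1
```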
Create Labels
With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Output:
x 1
y 7
z 2
dtype: int64
When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
Output: 7
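Once a Series has its own labels, a value can be looked up either by label or by position. The sketch below uses loc for the label and iloc for the position. It assumes the same x, y, z labels as the example above.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # look up by label
by_position = myvar.iloc[1]    # look up by position (counting from 0)
print(by_label, by_position)   # 7 7
```

Using loc and iloc explicitly makes it clear whether a label or a position is meant.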
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
Output:
day1 420
day2 380
day3 390
dtype: int64
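The keys of the dictionary become the labels. As a small sketch, the index argument can also be used to keep only some of the dictionary items. The choice of "day1" and "day2" is just an illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in index are kept, in the order given.
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
```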
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
Example
Create a DataFrame from two Series:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45
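The example above passes plain lists. The same table can be built from two actual Series objects, as in this sketch, where each Series becomes one column:

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each Series becomes one column; the dictionary key is the column name.
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)
```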
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional
array, or a table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# load the data into a DataFrame object
df = pd.DataFrame(data)
print(df)
Result
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows
and columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
print(df.loc[0])
Result
calories 420
duration 50
Name: 0, dtype: int64
Example
Return rows 0 and 1:
print(df.loc[[0, 1]])
Output:
   calories  duration
0       420        50
1       380        40
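The two results above have different shapes because loc returns a Series for a single label and a DataFrame for a list of labels. A short sketch that checks the returned types:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

one_row = df.loc[0]          # single label -> Series
two_rows = df.loc[[0, 1]]    # list of labels -> DataFrame

print(type(one_row).__name__)    # Series
print(type(two_rows).__name__)   # DataFrame
```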
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
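Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
print(df.loc["day2"])
Result
calories 380
duration 40
Name: day2, dtype: int64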
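Named labels work with loc in the same way as the default numeric ones. The sketch below selects one named row and then two named rows at once. The row labels are the ones from the named-index example.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]                # one named row -> Series
subset = df.loc[["day1", "day3"]]   # several named rows -> DataFrame

print(row.name)   # day2
print(subset)
```

The name attribute of the returned Series is the row label, which is why the printed result ends with that label.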
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
CSV files contain plain text and are a well-known format that can be read by
almost any program, including Pandas.
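Because a CSV file is just plain text, read_csv can be tried without a file on disk. The sketch below wraps a small CSV text in io.StringIO. The text is a hypothetical stand-in for the contents of a file such as data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text: a header line followed by one line per row.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

# read_csv accepts a file path or any file-like object; StringIO wraps the text.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the text supplies the column names, and every later line becomes one row.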
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Output:
If you print a large DataFrame with many rows without to_string(), Pandas displays
only the first 5 rows and the last 5 rows:
Example
Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output:
max_rows
The number of rows returned is defined in Pandas option settings.
import pandas as pd
print(pd.options.display.max_rows)
Output: 60
On my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will display only the headers and
the first and last 5 rows.
You can change the maximum number of rows with the same setting.
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
Output:
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
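Assigning to pd.options.display.max_rows changes the setting for the rest of the program. As a sketch of an alternative, pd.option_context raises the limit only inside a with block and restores the old value afterwards. The 100-row DataFrame is hypothetical and stands in for any table longer than the display limit.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, more than the limit of 60 mentioned above.
df = pd.DataFrame({"n": range(100)})

default_limit = pd.get_option("display.max_rows")

# Inside the with block the limit is raised; it is restored on exit.
with pd.option_context("display.max_rows", 9999):
    full_text = repr(df)

# Outside the block the usual, shortened display applies again.
short_text = repr(df)
print(len(full_text.splitlines()), len(short_text.splitlines()))
```

The first number printed is larger than the second, because the full display contains every row while the shortened one does not.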
The head() method returns the headers and a specified number of rows,
starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Output:
Note: if the number of rows is not specified, the head() method will return the
top 5 rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Output:
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting
from the bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
Output:
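The behaviour of head() and tail() can be checked on a small generated table. The sketch below uses a hypothetical DataFrame with 20 numbered rows instead of data.csv.

```python
import pandas as pd

# Hypothetical data: 20 rows numbered 0 to 19.
df = pd.DataFrame({"n": range(20)})

top = df.head(3)           # first 3 rows
bottom = df.tail(2)        # last 2 rows
default_top = df.head()    # no argument: first 5 rows

print(top["n"].tolist())       # [0, 1, 2]
print(bottom["n"].tolist())    # [18, 19]
print(len(default_top))        # 5
```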
The info() method prints a summary of the data set:
print(df.info())
Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
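The final None in the result appears because info() prints its report itself and returns None, which print() then shows as well. The sketch below captures the report in a buffer to make this visible. The two-column DataFrame is hypothetical, with one missing value in "Calories".

```python
import io
import pandas as pd

# Hypothetical data with one missing value in "Calories".
df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [409.1, None, 282.4],
})

# info() writes its report to a buffer (standard output by default) and returns None.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()

print(result)        # None
print(df.count())    # number of non-null values per column
```

Calling df.info() on its own, without print(), avoids the extra None line.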
Result Explained
The result tells us there are 169 rows and 4 columns: