Introduction to Pandas
Pandas allows us to analyze large data sets and draw conclusions based on
statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Data science is a branch of computer science in which we study how to store,
use, and analyze data in order to derive information from it.
Pandas can also answer questions about the data, such as:
What is the max value?
What is the min value?
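The maximum and minimum of a column can be looked up with the max() and min() methods. The following is only a minimal sketch: the DataFrame and its values are sample data made up for illustration.

```python
import pandas as pd

# Sample data, assumed for illustration only
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

max_calories = df["calories"].max()  # largest value in the column
min_calories = df["calories"].min()  # smallest value in the column
print(max_calories, min_calories)
```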
Pandas is also able to delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
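One possible way to remove rows with empty values is the dropna() method. The sketch below assumes a small made-up DataFrame in which None marks a missing value; it is not the only way to clean data.

```python
import pandas as pd

# Made-up sample data; None marks a missing value
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.0, None, 300.0],
})

# dropna() returns a new DataFrame without the rows that contain missing values
cleaned = df.dropna()
print(cleaned)
```

Note that dropna() leaves the original DataFrame unchanged unless you assign the result back.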
import pandas
Example
import pandas
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
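What is a Series?
A Pandas Series is like a column in a table. It is a one-dimensional array that
can hold data of any type.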
Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Create Labels
With the index argument, you can name your own labels.
Example
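import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])  # "y" is used below; "x" and "z" are assumed labels
print(myvar)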
When you have created labels, you can access an item by referring to the label.
Example
print(myvar["y"])
You can also use a key/value object, like a dictionary, when creating a Series.
Example
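import pandas as pd
# Assumed sample data: the keys of the dictionary become the labels
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)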
To select only some of the items in the dictionary, use the index argument and
specify only the items you want to include in the Series.
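As a minimal sketch of this, assuming a calories dictionary with made-up keys "day1" to "day3", only the keys listed in index end up in the Series:

```python
import pandas as pd

# Assumed sample data for illustration
calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys named in index are included in the Series
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
```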
Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
As you can see from the result above, the DataFrame is like a table with rows and
columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
print(df.loc[0])
Example
Return rows 0 and 1:
print(df.loc[[0, 1]])
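The two forms of loc return different kinds of objects: a single label gives a Series, while a list of labels gives a DataFrame. The sketch below uses the same sample values as the examples above to show the difference.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]        # single label: the row comes back as a Series
two_rows = df.loc[[0, 1]]  # list of labels: the rows come back as a DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
```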
Named Indexes
With the index argument, you can name your own indexes.
Example
import pandas as pd
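data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])  # "day2" is used below; the other labels are assumed
print(df)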
Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
print(df.loc["day2"])
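Named indexes can also be used in a slice. The sketch below assumes the labels "day1" to "day3" and shows that, unlike ordinary Python slices, a label slice in loc includes both the start and the end label.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])  # labels assumed for illustration

# A label slice in loc includes both end points
first_two = df.loc["day1":"day2"]
print(first_two)
```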
If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text and are a well-known format that can be read by
almost any tool, including Pandas.
Example
import pandas as pd
df = pd.read_csv('data.csv')
# to_string() prints the entire DataFrame
print(df.to_string())
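If you do not have a CSV file at hand, read_csv() also accepts a file-like object. The sketch below is only an illustration: the CSV text is made up and stands in for the contents of data.csv, which are not shown in this tutorial.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the contents of a file
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,109,282.4\n"

# io.StringIO wraps the text so that read_csv can treat it like a file
df = pd.read_csv(io.StringIO(csv_text))
print(df.to_string())
```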
If you have a large DataFrame with many rows, Pandas will only print the first 5
rows and the last 5 rows:
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Example
Check the maximum number of rows Pandas will print in full:
import pandas as pd
print(pd.options.display.max_rows)
On my system the number is 60, which means that if the DataFrame contains more
than 60 rows, the print(df) statement will return only the headers and the first and
last 5 rows.
You can change the maximum number of rows with the same statement.
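A minimal sketch of changing the setting is shown below. The value 9999 is an arbitrary choice for illustration, and the original setting is restored at the end so that the rest of your session is not affected.

```python
import pandas as pd

default_rows = pd.options.display.max_rows  # remember the current setting
pd.options.display.max_rows = 9999          # arbitrary large value, chosen for illustration

changed = pd.options.display.max_rows
print(changed)

pd.options.display.max_rows = default_rows  # restore the original setting
```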
Read JSON
JSON is plain text but has the format of an object, and it is widely supported in
the world of programming, including by Pandas.
Example
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
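To try read_json() without a file, you can pass it a file-like object. The sketch below assumes a small made-up JSON string in the column-oriented layout (column name, then row label, then value); the contents of data.json itself are not shown in this tutorial.

```python
import io
import pandas as pd

# Made-up JSON text: each column maps row labels to values
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# io.StringIO wraps the text so that read_json can treat it like a file
df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```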
If your JSON data is not in a file but in a Python dictionary, you can load it into a
DataFrame directly:
Example
import pandas as pd
data = {
    "Duration": {
        "0": 60,
        "1": 60,
        "2": 60,
        "3": 45,
        "4": 45,
        "5": 60
    },
    "Pulse": {
        "0": 110,
        "1": 117,
        "2": 103,
        "3": 109,
        "4": 117,
        "5": 102
    },
    "Maxpulse": {
        "0": 130,
        "1": 145,
        "2": 135,
        "3": 175,
        "4": 148,
        "5": 127
    },
    "Calories": {
        "0": 409,
        "1": 479,
        "2": 340,
        "3": 282,
        "4": 406,
        "5": 300
    }
}
df = pd.DataFrame(data)
print(df)
One of the most used methods for getting a quick overview of the DataFrame is the
head() method.
The head() method returns the headers and a specified number of rows, starting
from the top.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5
rows.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from
the bottom.
Example
print(df.tail())
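The sketch below shows head() and tail() side by side on a made-up DataFrame with 20 rows, so that you can try both methods without a data file.

```python
import pandas as pd

# Made-up data: one column holding the numbers 0 to 19
df = pd.DataFrame({"value": range(20)})

first_three = df.head(3)   # the first 3 rows
last_default = df.tail()   # no argument: the last 5 rows

print(first_three)
print(last_default)
```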
The DataFrame object has a method called info() that gives you more information
about the data set.
Example
# info() prints its report itself and returns None, so no print() is needed
df.info()
Null Values
The info() method also tells us how many Non-Null values are present in
each column, and in our data set there are 164 of 169 Non-Null values
in the "Calories" column.
This means that there are 5 rows with no value at all in the "Calories" column,
for whatever reason.
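The same counts can be checked directly. The sketch below assumes a small made-up DataFrame (not the 169-row data set above) and uses count() for the Non-Null values and isna().sum() for the missing ones.

```python
import pandas as pd

# Made-up sample data; None marks a missing value
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, None, 300.0, None],
})

non_null = df["Calories"].count()     # number of Non-Null values in the column
missing = df["Calories"].isna().sum() # number of missing values in the column
print(non_null, missing)
```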
Empty values, or Null values, can be bad when analyzing data, and you should
consider removing rows with empty values. This is a step towards what is called
cleaning data, and you will learn more about that in the next chapters.