Pandas With Python - DATAhill Solutions
Pandas
Pandas is an open-source Python library providing high-performance
data manipulation and analysis tools built on its powerful data
structures.
Advantages:
===========
Easily handles missing data
It uses Series for one-dimensional data and DataFrame for
two-dimensional (tabular) data
It provides an efficient way to slice the data
It provides a flexible way to merge, concatenate or reshape the
data
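The missing-data claim above can be made concrete with a short sketch (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# A small frame with one missing value (NaN)
df = pd.DataFrame({'name': ['Srinu', 'Vasu', 'Nivas'],
                   'marks': [97, np.nan, 90]})

print(df['marks'].isna().sum())   # how many values are missing
print(df['marks'].fillna(0))      # replace NaN with a default
print(df.dropna())                # or drop rows containing NaN
```

`fillna()` and `dropna()` return new objects by default, so the original frame is left untouched.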
Series
======
import pandas as pd
a = pd.Series([10, 20, 30])
print(a)
a = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(a)
import numpy as np
a = pd.Series([10, 20, np.nan])
print(a)
type(a)
DataFrame
=========
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
a = [['Srinu',97],['Vasu',88],['Nivas',90]]
df = pd.DataFrame(a, columns=['Name','Marks'])
print(df)
type(df) # DataFrame
df['Name']
b = df['Name']
type(b) # Series
c = df[['Name']]
type(c) # DataFrame
## Numpy to pandas
import numpy as np
h = np.array([[1,2],[3,4]])
print(h)
df_h = pd.DataFrame(h)
print(df_h)
## Pandas to numpy
df_h_n = np.array(df_h)
print(df_h_n)
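`np.array(df_h)` works, but pandas also provides `DataFrame.to_numpy()` for the same conversion, which is the more idiomatic spelling:

```python
import numpy as np
import pandas as pd

h = np.array([[1, 2], [3, 4]])
df_h = pd.DataFrame(h)

# Idiomatic conversion back to a NumPy array
arr = df_h.to_numpy()
print(arr)
print(type(arr))   # numpy.ndarray
```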
Range Data
===========
Pandas has a convenient API for creating a range of dates.
Syntax:
pd.date_range(start, periods=..., freq=...)
## Create date
# Days
dates_d = pd.date_range('20191110', periods=10, freq='D')
print(dates_d)
# Months
dates_m = pd.date_range('20191110', periods=10, freq='M')
print(dates_m)
Inspecting data
===============
We can inspect the first or last rows of a dataset with head() or
tail(), called on the pandas DataFrame
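For example, reusing the Name/Marks frame built earlier:

```python
import pandas as pd

df = pd.DataFrame([['Srinu', 97], ['Vasu', 88], ['Nivas', 90]],
                  columns=['Name', 'Marks'])

print(df.head(2))   # first 2 rows
print(df.tail(1))   # last row
```

Both take an optional row count; with no argument they return five rows.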
Slice data:
===========
# Using name
df['A']
df_concat = pd.concat([df1, df2])  # df1, df2: two DataFrames sharing a 'name' column
print(df_concat)
df_concat['name']
df_concat['name'] == "Srinivas" # returns True or False
df_concat[df_concat['name'] == "Srinivas"] # returns data
# drop_duplicates
# If a dataset contains duplicate information, `drop_duplicates()`
is an easy way to exclude duplicate rows. You can see that
`df_concat` has a duplicate observation: `Srinivas` appears twice
in the column `name`.
df_concat.drop_duplicates('name')
Descriptive Statistics
=================
a.sum()
count() Number of non-null observations
sum() Sum of values
mean() Mean of Values
median() Median of Values
mode() Mode of values
std() Standard Deviation of the Values
min() Minimum Value
max() Maximum Value
describe() Summarizing Data Summarizes Numeric columns
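A quick sketch of these methods on a small Series:

```python
import pandas as pd

a = pd.Series([10, 20, 30, 40])

print(a.count())          # 4 non-null observations
print(a.sum())            # 100
print(a.mean())           # 25.0
print(a.median())         # 25.0
print(a.std())            # sample standard deviation
print(a.min(), a.max())   # 10 40
print(a.describe())       # all of the above in one summary
```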
Merging/Joining
==============
pd.merge(left, right, how='inner', on=None, left_on=None,
         right_on=None, left_index=False, right_index=False,
         sort=True)
pd.merge(student_details,course_details,on='s_name')
pd.merge(student_details, course_details,
left_on='student_name', right_on='s_name', how='left')
pd.merge(student_details, course_details,
left_on='student_name', right_on='s_name', how='right')
pd.merge(student_details, course_details,
left_on='student_name', right_on='s_name', how='outer')
pd.merge(student_details, course_details,
left_on='student_name', right_on='s_name', how='inner')
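The frames student_details and course_details are not defined above; a minimal, made-up pair makes the join types concrete (names chosen only so the keys match the calls above):

```python
import pandas as pd

# Hypothetical data for illustration
student_details = pd.DataFrame({'student_name': ['Srinu', 'Vasu', 'Nivas'],
                                'marks': [97, 88, 90]})
course_details = pd.DataFrame({'s_name': ['Srinu', 'Vasu', 'Kiran'],
                               'course': ['Python', 'Java', 'SQL']})

# inner: only keys present in both frames (Srinu, Vasu)
inner = pd.merge(student_details, course_details,
                 left_on='student_name', right_on='s_name', how='inner')
print(inner)

# outer: the union of keys (Srinu, Vasu, Nivas, Kiran)
outer = pd.merge(student_details, course_details,
                 left_on='student_name', right_on='s_name', how='outer')
print(outer)
```

`how='left'` keeps every row of the left frame, `how='right'` every row of the right frame, filling missing matches with NaN.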
String Functions
================
s.str.islower()
s.str.isupper()
s.str.isnumeric()
s.str.lower()
s.str.upper()
s.str.swapcase()
s.str.len()
s.str.cat(sep='_')
s.str.replace('@','$')
s.str.repeat(2)
s.str.count('s')
s.str.startswith('P')
s.str.endswith('s')
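These all assume a Series s of strings; for example:

```python
import pandas as pd

s = pd.Series(['Pandas', 'python', 'stats'])

print(s.str.islower())         # False True True
print(s.str.upper())           # PANDAS PYTHON STATS
print(s.str.len())             # 6 6 5
print(s.str.cat(sep='_'))      # Pandas_python_stats
print(s.str.count('s'))        # 1 0 2
print(s.str.startswith('P'))   # True False False
```

The .str accessor applies each operation element-wise and returns a new Series.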
Function Application:
=====================
Table wise Function Application: pipe()
Row or Column Wise Function Application: apply()
def add(a, b):
    return a + b
df.pipe(type)
# pipe() is used to perform on the whole DataFrame.
df.pipe(add,10)
# By default, the operation performs column wise
df.apply(type)
df.apply(np.mean)
# By passing axis=1, operations are performed row wise
df.apply(type, axis=1)
df.apply(np.mean,axis=1)
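Putting the pieces together on a concrete frame (the column names here are made up):

```python
import numpy as np
import pandas as pd

def add(a, b):
    return a + b

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# pipe() passes the whole DataFrame as the first argument
print(df.pipe(add, 10))            # every value increased by 10

# apply() works column-wise by default ...
print(df.apply(np.mean))           # x 2.0, y 5.0

# ... and row-wise with axis=1
print(df.apply(np.mean, axis=1))   # 2.5, 3.5, 4.5
```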
import os
os.getcwd()   # current working directory, where relative file paths resolve
Instead of [1,2] you can also write range(1,3). Both mean the same
thing, but the range() function is very useful when you want to
skip many rows, since it saves the time of manually listing row
positions.
NOTE:
When skiprows=4, it skips the first four rows from the top.
skiprows=[1,2,3,4] skips the rows at index positions 1 through 4,
i.e. the second through fifth lines. This is because when a list is
passed to skiprows, rows are skipped by index position; when a
single integer is passed, that many rows are skipped from the top.
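The difference between the two forms can be checked on a small in-memory CSV (no file needed):

```python
from io import StringIO
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

# skiprows=2: skip the first two lines entirely, header included,
# so the line "3,4" becomes the new header
df1 = pd.read_csv(StringIO(csv_text), skiprows=2)
print(df1)

# skiprows=[1, 2]: keep line 0 as the header, skip lines 1 and 2
df2 = pd.read_csv(StringIO(csv_text), skiprows=[1, 2])
print(df2)
```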
print(emp1)
# As you can see in the above output, the column ID has been set
as the index column
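The emp1 frame above was presumably read with the index_col option; a self-contained sketch with invented file contents shows the effect:

```python
from io import StringIO
import pandas as pd

csv_text = "ID,Name\n101,Srinu\n102,Vasu\n"

# index_col='ID' makes the ID column the row index instead of a data column
emp1 = pd.read_csv(StringIO(csv_text), index_col='ID')
print(emp1)
print(emp1.index.name)   # ID
```

Rows can then be looked up by ID directly, e.g. emp1.loc[101].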
How to read CSV file from URL without using Pandas package
----------------------------------------------------------
import csv
import requests  # third-party: pip install requests

response = requests.get(
    'https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv').text
lines = response.splitlines()
d = csv.DictReader(lines)
l = list(d)
a = pd.read_csv("http://winterolympicsmedals.com/medals.csv")
a.shape
a.head()
# This DataFrame contains 2311 rows and 8 columns; a.shape gives
this summary.
a = pd.read_table("E:/MLDataSets/demo.txt")
a = pd.read_csv("E:/MLDataSets/demo.csv", sep="\t")
NOTE:
It is difficult to remember all the modules in a package, the
functions in a module, the arguments of a function, and each
function's syntax, so it is better to use the built-in help.
To list available functions, press Tab while typing a function name
Visualization
=============
df['price'].plot.box()
df['price'].plot.box(vert=False)
df['price'].plot()
df['price'].plot.bar()
df['price'].plot.barh()
df['price'].plot.hist(bins=20)
df['price'].diff().hist(bins=20)   # diff() is a method: parentheses required
df['price'].plot.area()
df.plot.scatter(x='a', y='b')   # scatter needs a DataFrame with columns 'a' and 'b'
df['price'].plot.pie(subplots=True)
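These plot calls assume matplotlib is installed and a df with a numeric 'price' column exists; a minimal self-contained sketch:

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, so no display window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data for illustration
df = pd.DataFrame({'price': [10, 12, 9, 15, 11, 14]})

ax = df['price'].plot.hist(bins=3)   # histogram of the price column
plt.figure()                         # start a fresh figure for the next plot

# diff() computes first differences; note the parentheses before .hist()
ax2 = df['price'].diff().hist(bins=3)
```

Each plot call returns a matplotlib Axes object, which can be customized further or saved with ax.figure.savefig(...).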