1. Creating a rank 1 array
import numpy as np
arr= np.array([1,2,3])
print("the array is :", arr)
output :
the array is : [1 2 3]
2. creating rank 2 array
import numpy as np
arr= np.array([[1,2,3],['a','b','c']])
print("the array is : \n", arr)
output :
the array is :
[['1' '2' '3']
['a' 'b' 'c']]
Another example, using only numbers :
import numpy as np
arr= np.array([[1,2,3],[23,56,90]])
print("the array is : \n", arr)
output :
the array is :
[[ 1 2 3]
[23 56 90]]
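The rank of an array and the type of its elements can be checked directly. The short sketch below uses the standard NumPy attributes ndim, shape and dtype. The variable names numbers and mixed are chosen only for illustration.

```python
import numpy as np

# A rank 2 array built from numbers only keeps a numeric data type.
numbers = np.array([[1, 2, 3], [23, 56, 90]])
print(numbers.ndim)        # number of dimensions (the rank)
print(numbers.shape)       # (rows, columns)
print(numbers.dtype.kind)  # 'i' means integer

# Mixing numbers and strings converts every element to a string.
mixed = np.array([[1, 2, 3], ['a', 'b', 'c']])
print(mixed.dtype.kind)    # 'U' means Unicode string
```

This is why the first rank 2 example prints '1' '2' '3' in quotes: an array holds a single data type, so the numbers are stored as strings.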
3. creating an array from the tuple
import numpy as np
arr= np.array((1,2,3))
print("the array is : \n", arr)
output :
the array is :
[1 2 3]
4. Creating a Series from a list of scalar values
import pandas as pd
s1=pd.Series([34,90,12])
print("the series is : \n", s1)
output :
the series is :
0 34
1 90
2 12
dtype: int64
5. Creation of a DataFrame from NumPy arrays
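import numpy as np
import pandas as pd
array1=np.array([90,100,110,120])
# array2 has only 3 values, so its entry in column D will be NaN
array2=np.array([50,60,70])
array3=np.array([10,20,30,40])
marksDF = pd.DataFrame([array1, array2, array3],
                       columns=[ 'A', 'B', 'C', 'D'])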
print(marksDF)
output :
A B C D
0 90 100 110 120.0
1 50 60 70 NaN
2 10 20 30 40.0
6. Creation of a DataFrame from dictionary of array/lists:
import pandas as pd
data = {'Name':['Varun', 'Ganesh', 'Joseph', 'Abdul','Reena'],
'Age':[37,30,38, 39,40]}
df = pd.DataFrame(data)
print(df)
output :
Name Age
0 Varun 37
1 Ganesh 30
2 Joseph 38
3 Abdul 39
4 Reena 40
7. Creation of DataFrame from List of Dictionaries
import pandas as pd
listDict = [{'a':10, 'b':20}, {'a':5,'b':10,'c':20}]
a= pd.DataFrame(listDict)
print(a)
output :
a b c
0 10 20 NaN
1 5 10 20.0
8. Adding a New Column to a DataFrame:
import pandas as pd
ResultSheet={
    'Rajat': pd.Series([90, 91, 97],index=['Maths','Science','Hindi']),
    'Amrita': pd.Series([92, 81, 96],index=['Maths','Science','Hindi']),
    'Meenakshi': pd.Series([89, 91, 88],index=['Maths','Science','Hindi']),
    'Rose': pd.Series([81, 71, 67],index=['Maths','Science','Hindi']),
    'Karthika': pd.Series([94, 95, 99],index=['Maths','Science','Hindi'])}
Result = pd.DataFrame(ResultSheet)
Result['Fathima']=[89,78,76]
print(Result)
output :
Rajat Amrita Meenakshi Rose Karthika Fathima
Maths 90 92 89 81 94 89
Science 91 81 91 71 95 78
Hindi 97 96 88 67 99 76
9. Adding a new row to the above DataFrame :
Result.loc['English'] = [90, 92, 89, 80, 90, 88]
print(Result)
output :
Rajat Amrita Meenakshi Rose Karthika Fathima
Maths 90 92 89 81 94 89
Science 91 81 91 71 95 78
Hindi 97 96 88 67 99 76
English 90 92 89 80 90 88
The DataFrame.loc[] method can also be used to change the data
values of an existing row. For example, to change the marks
in Science :
10. Change the marks of science
Result.loc['Science'] = [92, 84, 90, 72, 96, 88]
print(Result)
output :
Rajat Amrita Meenakshi Rose Karthika Fathima
Maths 90 92 89 81 94 89
Science 92 84 90 72 96 88
Hindi 97 96 88 67 99 76
English 90 92 89 80 90 88
11. Delete the row of “Hindi” from above dataframe
Result = Result.drop('Hindi', axis=0)
print(Result)
output :
Rajat Amrita Meenakshi Rose Karthika Fathima
Maths 90 92 89 81 94 89
Science 92 84 90 72 96 88
English 90 92 89 80 90 88
12. Delete the columns of Rajat, Meenakshi and Karthika
from the above dataframe.
Result = Result.drop(['Rajat','Meenakshi','Karthika'], axis=1)
print(Result)
output :
Amrita Rose Fathima
Maths 92 81 89
Science 84 72 88
English 92 80 88
Attributes of DataFrames :
Consider following dataframe :
import pandas as pd
# the name data is used instead of dict, because dict is a built-in Python name
data = {"Student": pd.Series(["Arnav","Neha","Priya","Rahul"],
                             index=["Data 1","Data 2","Data 3","Data 4"]),
        "Marks": pd.Series([85, 92, 78, 83],
                           index=["Data 1","Data 2","Data 3","Data 4"]),
        "Sports": pd.Series(["Cricket","Volleyball","Hockey","Badminton"],
                            index=["Data 1","Data 2","Data 3","Data 4"])}
df = pd.DataFrame(data)
print(df)
output :
Student Marks Sports
Data 1 Arnav 85 Cricket
Data 2 Neha 92 Volleyball
Data 3 Priya 78 Hockey
Data 4 Rahul 83 Badminton
1. To find out the index :
print(df.index)
output :
Index(['Data 1', 'Data 2', 'Data 3', 'Data 4'], dtype='object')
2. To print names of the columns :
print(df.columns)
output :
Index(['Student', 'Marks', 'Sports'], dtype='object')
3. To print shape of the dataframe :
print(df.shape)
Output :
(4, 3)
4. To print first 5 rows of the dataframe :
print(df.head())
output : head() by default prints the first 5 rows. Here we have
only 4 rows, so it prints all 4 rows.
Student Marks Sports
Data 1 Arnav 85 Cricket
Data 2 Neha 92 Volleyball
Data 3 Priya 78 Hockey
Data 4 Rahul 83 Badminton
Note : head is a method, so the parentheses are required.
print(df.head) without parentheses prints the method object
(<bound method NDFrame.head of ...>) instead of the rows.
If you want to print the first n rows of the dataframe :
print(df.head(n)) where n is any number like 2, 20 or 200.
5. To print last 5 rows of the dataframe :
print(df.tail())
output : tail() by default prints the last 5 rows. Here we have
only 4 rows, so it prints all 4 rows.
Student Marks Sports
Data 1 Arnav 85 Cricket
Data 2 Neha 92 Volleyball
Data 3 Priya 78 Hockey
Data 4 Rahul 83 Badminton
If you want to print the last n rows of the dataframe :
print(df.tail(n)) where n is any number like 2, 20 or 200.
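As a small illustration of passing a number to head() and tail(), the sketch below rebuilds a shortened version of the student dataframe (only two columns, to keep it brief) and takes 2 rows from each end. The names first_two and last_two are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Student": ["Arnav", "Neha", "Priya", "Rahul"],
     "Marks": [85, 92, 78, 83]},
    index=["Data 1", "Data 2", "Data 3", "Data 4"])

first_two = df.head(2)  # first 2 rows: Data 1 and Data 2
last_two = df.tail(2)   # last 2 rows: Data 3 and Data 4

print(first_two)
print(last_two)
```

If n is larger than the number of rows, head(n) and tail(n) simply return all the rows.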
Handling CSV Files :
1. Importing a csv file in/as our dataframe.
import pandas as pd
# mention the entire path of your csv file if your .py file and
# .csv file are in different folders.
df=pd.read_csv("studentsmarks.csv")
print(df)
output :
rno name aimarks mathsmarks
0 1 Akshita 89 91
1 2 Apoorva 91 87
2 3 Bhavik 88 76
3 4 Deepti 78 71
4 5 Farhan 84 84
You can also write :
import pandas as pd
df=pd.read_csv("studentsmarks.csv",sep =",", header=0)
print(df)
Here sep is the separator used in the file. It needs to be
specified only if the separator is something other than ','
(comma), which is the default.
header=0 means that the first row of the csv file holds the
column names (index 0 means the first row).
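To see the effect of sep, the sketch below reads data that uses ';' as the separator. The contents are a made-up sample, and io.StringIO is used only so that the example runs without a file on disk. With a real file you would pass its name instead.

```python
import io
import pandas as pd

# Hypothetical file contents that use ';' instead of ','.
text = "rno;name;aimarks\n1;Akshita;89\n2;Apoorva;91\n"

# io.StringIO lets read_csv treat the string like an open file.
df = pd.read_csv(io.StringIO(text), sep=";", header=0)

print(df)
print(df.columns.tolist())  # column names taken from the first row
```

Without sep=";" the whole line would be read as a single column, because pandas would look for commas.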
2. Exporting our dataframe as a csv file.
df.to_csv(path_or_buf='C:/PANDAS/resultout.csv', sep=',')
# Here the new file resultout.csv is created in the folder C:/PANDAS.
# If you give only a file name without a path, the new file is
# created in the current working folder (usually the folder where
# your .py file is saved). If you want the new file to be created
# at some other place, mention the complete path.
Output (contents of resultout.csv) :
,rno,name,aimarks,mathsmarks
0,1,Akshita,89,91
1,2,Apoorva,91,87
2,3,Bhavik,88,76
3,4,Deepti,78,71
4,5,Farhan,84,84
# index=False leaves out the row index (0, 1, 2, ...) from the file
df.to_csv("resultout1.csv",index=False)
output :
rno,name,aimarks,mathsmarks
1,Akshita,89,91
2,Apoorva,91,87
3,Bhavik,88,76
4,Deepti,78,71
5,Farhan,84,84
# by default (index=True) the row index is written as the first column
df.to_csv("resultout2.csv")
output :
,rno,name,aimarks,mathsmarks
0,1,Akshita,89,91
1,2,Apoorva,91,87
2,3,Bhavik,88,76
3,4,Deepti,78,71
4,5,Farhan,84,84
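The difference between the two forms can also be checked without creating any file. When no path is given, to_csv() returns the CSV text as a string. The sketch below uses a two-row sample dataframe made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"rno": [1, 2], "name": ["Akshita", "Apoorva"]})

with_index = df.to_csv()                # row index written as first column
without_index = df.to_csv(index=False)  # row index left out

print(with_index)
print(without_index)
```

The first line of with_index starts with a comma because the index column has no name.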
Handling Missing Values :
The two most common strategies for handling missing values
explained in this section are:
i) Drop the row having missing values OR
ii) Estimate the missing value
For example, suppose studentsmarks.csv contains the following
data (with some missing values) :
rno,name,aimarks,mathsmarks
1,Akshita,89,91
2,Apoorva,,87
3,Bhavik,88,76
4,Deepti,78,
5,Farhan,84,84
Checking Missing Values :
Pandas provides the function isnull() to check whether any value
is missing in a DataFrame or in a single column. It checks every
value and returns True where the value is missing, otherwise it
returns False.
import pandas as pd
df=pd.read_csv("studentsmarks.csv",sep =",", header=0)
print(df)
a=df.aimarks.isnull()
print(a)
output :
rno name aimarks mathsmarks
0 1 Akshita 89.0 91.0
1 2 Apoorva NaN 87.0
2 3 Bhavik 88.0 76.0
3 4 Deepti 78.0 NaN
4 5 Farhan 84.0 84.0
0 False
1 True
2 False
3 False
4 False
Name: aimarks, dtype: bool
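To check the whole dataframe at once, isnull() can be combined with sum() to count the missing values in each column. The sketch below is one way to do it. It reads the same sample data from a string (through io.StringIO) so that it runs without the csv file.

```python
import io
import pandas as pd

csv_text = """rno,name,aimarks,mathsmarks
1,Akshita,89,91
2,Apoorva,,87
3,Bhavik,88,76
4,Deepti,78,
5,Farhan,84,84
"""
df = pd.read_csv(io.StringIO(csv_text))

# True counts as 1, so sum() gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# A single True/False answer for the whole dataframe.
any_missing = df.isnull().values.any()
print(any_missing)
```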
Drop missing values:
Dropping removes the entire row having the missing value(s).
This strategy reduces the size of the dataset used in data
analysis, so it should be used only when few rows have missing
values. The dropna() function drops every row of the DataFrame
that contains a missing value.
import pandas as pd
df=pd.read_csv("studentsmarks.csv",sep =",", header=0)
print(df)
print(df.dropna())
output :
rno name aimarks mathsmarks
0 1 Akshita 89.0 91.0
1 2 Apoorva NaN 87.0
2 3 Bhavik 88.0 76.0
3 4 Deepti 78.0 NaN
4 5 Farhan 84.0 84.0
rno name aimarks mathsmarks
0 1 Akshita 89.0 91.0
2 3 Bhavik 88.0 76.0
4 5 Farhan 84.0 84.0
You can also store the edited dataframe in a new dataframe.
import pandas as pd
df=pd.read_csv("studentsmarks.csv",sep =",", header=0)
print(df)
a=df.dropna()
print(a)
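Sometimes only one column matters. As a sketch, dropna() accepts a subset parameter that limits the check to the listed columns. The dataframe below is typed in by hand with the same sample values, and the name only_ai is chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rno": [1, 2, 3, 4, 5],
    "name": ["Akshita", "Apoorva", "Bhavik", "Deepti", "Farhan"],
    "aimarks": [89, np.nan, 88, 78, 84],
    "mathsmarks": [91, 87, 76, np.nan, 84]})

# Drop a row only if aimarks is missing; a missing mathsmarks is kept.
only_ai = df.dropna(subset=["aimarks"])
print(only_ai)
```

Here only the row of Apoorva is removed, while plain dropna() would also remove the row of Deepti.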
Estimating the missing values :
Missing values can be filled by using estimations or
approximations, e.g. the value just before (or after) the missing
value, or the average/minimum/maximum of the values of that
column. In some cases, missing values are replaced by
zeros (or ones). The fillna(num) function replaces the
missing value(s) by the value specified in num. For example,
fillna(0) replaces every missing value by 0. Similarly, fillna(1)
replaces every missing value by 1.
import pandas as pd
df=pd.read_csv("studentsmarks.csv",sep =",", header=0)
print(df)
df=df.fillna(0)
print(df)
output :
rno name aimarks mathsmarks
0 1 Akshita 89.0 91.0
1 2 Apoorva NaN 87.0
2 3 Bhavik 88.0 76.0
3 4 Deepti 78.0 NaN
4 5 Farhan 84.0 84.0
rno name aimarks mathsmarks
0 1 Akshita 89.0 91.0
1 2 Apoorva 0.0 87.0
2 3 Bhavik 88.0 76.0
3 4 Deepti 78.0 0.0
4 5 Farhan 84.0 84.0
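The other estimates mentioned above (the column average, or the value just before the missing one) can be sketched as follows. The dataframe is typed in by hand with the same sample values. The names mean_filled and forward_filled are chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rno": [1, 2, 3, 4, 5],
    "name": ["Akshita", "Apoorva", "Bhavik", "Deepti", "Farhan"],
    "aimarks": [89, np.nan, 88, 78, 84],
    "mathsmarks": [91, 87, 76, np.nan, 84]})

# Replace each missing mark by the average of its own column.
mean_filled = df.copy()
mean_filled["aimarks"] = df["aimarks"].fillna(df["aimarks"].mean())
mean_filled["mathsmarks"] = df["mathsmarks"].fillna(df["mathsmarks"].mean())
print(mean_filled)

# Replace each missing mark by the value just before it (forward fill).
forward_filled = df.ffill()
print(forward_filled)
```

mean() ignores missing values, so the average of aimarks is taken over the four marks that are present.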
A program that uses USA_Housing.csv (downloaded from the
internet). Clean the data, identify the dependent and independent
variables, separate them out, train the model, test the model and
compare the predicted values with the actual values.
import pandas as pd
df=pd.read_csv('USA_Housing.csv')
print(df.head()) # print first 5 rows.
print(df.shape) # ( 5000 , 7 )
# describe() gives count, mean, std, min, 25%, 50%, 75%, max
# (a count lower than the number of rows shows missing values)
print(df.describe())
# deleting a column which does not have relevance
df.drop(['Address'],inplace=True,axis=1)
print(df.shape) # (5000,6)
# as price is dependent variable, separating it from df
x=df.drop(['Price'],axis=1)
# (5000,5) as the Price column is dropped and the rest is saved as dataframe x
print(x.shape)
y=df['Price']
# (5000,) as only the Price column is saved in y (a Series)
print(y.shape)
# now using train_test_split to split the data for training and testing
# the ratio will be 80% : 20%
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20)
print(x_train.shape) # (4000,5) i.e. 80% of 5000 rows
print(y_train.shape) # (4000,)
print(x_test.shape) # (1000,5) i.e. 20% of 5000 rows
print(y_test.shape) # (1000,)
# now applying the linear regression algorithm to train the model
# using 80% of the data
from sklearn.linear_model import LinearRegression
m=LinearRegression()
m.fit(x_train,y_train) # training the model with input and output data
y_predict=m.predict(x_test)
# in the above line, y_predict is the NumPy array created by
# predicting the prices for x_test, i.e. the test data
print(y_predict[0:5]) # print the first 5 predicted values
print(y_test.head()) # print the first 5 of the 1000 actual test values
# a DataFrame built from a dictionary with keys Actual and
# Predicted, for comparison
df1=pd.DataFrame({'Actual':y_test,'Predicted':y_predict})
print(df1.head()) # print the first 5 comparisons of actual and predicted
#We observe that there is a difference between the actual and
predicted value.
#Further, we need to calculate the error,
#evaluate the model and test the accuracy of the model.
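As a minimal sketch of the error calculation mentioned in the last comments, the example below computes three common measures with NumPy. The actual and predicted numbers are hypothetical values made up for this illustration, not results from USA_Housing.csv.

```python
import numpy as np

# Hypothetical actual and predicted prices (not taken from the dataset).
actual = np.array([200.0, 150.0, 300.0, 250.0])
predicted = np.array([210.0, 140.0, 290.0, 270.0])

errors = actual - predicted

# Mean absolute error: average size of the error.
mae = np.mean(np.abs(errors))

# Root mean squared error: penalises large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

# R squared: 1 means a perfect fit, 0 means no better than the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

With the housing program, y_test and y_predict would take the place of actual and predicted. scikit-learn also provides mean_absolute_error, mean_squared_error and r2_score in sklearn.metrics for the same calculations.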