PANDAS
What is Pandas?
->It is an open-source Python library built on top of the NumPy library.
->It is designed for data manipulation, data analysis and data cleaning. It can handle missing data as well.
->It provides flexible and powerful data structures such as Series and DataFrame.
->It is fast and offers high performance and productivity.
Features of Pandas
->Fast and efficient data manipulation and analysis
->Provides time-series functionality
->Easy handling of missing data
->Fast data merging and joining
->Flexible reshaping and pivoting of data
->Data can be loaded from different file formats (for example CSV, Excel and JSON)
->Integrates with NumPy
Data Structures in Pandas
->Data structures are used to organize, retrieve and manipulate data
->Pandas has two main data structures: Series and DataFrame
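What is a Series?
->A Series is a one-dimensional labeled array
->It can hold any data type (int, string or Python objects)
->Its axis labels are collectively known as the index
->A Series contains homogeneous data: all elements share one dtype (mixed values are stored with dtype object)
->The values of a Series are mutable, which means we can modify the elements. The size is treated as immutable: methods that add or remove elements, such as drop(), return a new Series rather than resizing the original
->Syntax: pandas.Series(data, index, dtype, copy)
->Parameters: data = the values; it can be a list, an array, a dictionary or a scalar
index (optional) = the labels; defaults to 0, 1, 2, ...
dtype (optional) = the data type of the values; inferred from the data if omitted
copy (optional) = if True, makes a copy of the input data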
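The parameters above can be sketched as follows. This is a minimal illustration, and the names marks and stock and their values are made up for the example.

```python
import pandas as pd

# From a list, with explicit index labels and an explicit dtype
marks = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype="float64")

# From a dictionary: the keys become the index labels
stock = pd.Series({"apples": 3, "pears": 5})

# Values are mutable: an existing element can be changed in place
marks["b"] = 25.0

print(marks)
print(stock)
```

Passing a dictionary is handy when the labels already exist in the data, because no separate index argument is needed.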
Different ways to create a series in pandas
1. Creating an empty series
import pandas as pd
print(pd.Series())
o/p Series([], dtype: object)
2. Creating a series from an array
import numpy as np
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array)
o/p
0         m
1    Mukesh
2        bf
3        gf
dtype: object
3. Creating a series from an array with a custom index
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array,index=[100,'Love',103,'No'])
o/p
100          m
Love    Mukesh
103         bf
No          gf
dtype: object
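->np.random.rand(n) returns n random values drawn uniformly from [0, 1); np.random.randn(n) draws from the standard normal distribution instead
nu_fn=pd.Series(np.random.rand(3))
nu_fn
o/p (the values differ on every run)
0    0.487446
1    0.375540
2    0.011341
dtype: float64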
-> Ex: pos.iloc[1:4:2] (positions 1 and 3; the stop position is excluded)
o/p
B    20
D    40
Retrieve the data using the label (index) name (loc)
->Here we use loc (label-based indexing)
-> Ex: pos.loc['A'] o/p 10
pos.loc['A':'E'] (slicing; both end labels are included) o/p all the elements
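The difference between position-based and label-based access can be sketched as below. The series pos is an assumption: its values and labels are chosen only so that the results match the outputs shown in these notes.

```python
import pandas as pd

# Hypothetical series, chosen to match the outputs shown in the notes
pos = pd.Series([10, 20, 30, 40, 50], index=["A", "B", "C", "D", "E"])

by_position = pos.iloc[1:4:2]   # positions 1 and 3; stop position 4 is excluded
by_label = pos.loc["A":"E"]     # label slice; both end labels are included
single = pos.loc["A"]           # one element selected by its label

print(by_position)
print(by_label)
print(single)
```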
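->The dtype parameter converts the data when the series is created. With dtype=bool, 0 becomes False and every other number becomes True
data=[1,2,3,4,5,0]
s=pd.Series(data,dtype=bool)
s
o/p
0     True
1     True
2     True
3     True
4     True
5    False
dtype: bool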
What is a DataFrame?
->It is a data structure in the pandas library.
->It is a two-dimensional labeled data structure
->It has labeled axes: both rows and columns have labels, which makes it easier to access or manipulate specific data
->It is heterogeneous: a DataFrame can contain columns of different data types (int, float, string, object)
->Its size is mutable: we can add or remove rows and columns in a DataFrame
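A small sketch of the two properties above, mixed column types and a mutable size. The column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical data: each column has its own data type
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [28, 31], "score": [88.5, 92.0]})
print(df.dtypes)

df["passed"] = [True, True]        # add a column
df = df.drop(columns=["score"])    # remove a column
df.loc[2] = ["Meena", 25, False]   # add a row under the label 2
df = df.drop(index=0)              # remove the row labelled 0

print(df)
```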
Different ways to create a DataFrame
1. Creating an empty DataFrame:
print(pd.DataFrame())
o/p
Empty DataFrame
Columns: []
Index: []
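df['programming']     # single brackets: returns the column as a Series
df[['programming']]   # double brackets: returns a one-column DataFrame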
Column Addition
1. Adding a new column from a scalar value:
data={'A':[1,2,3],'B':[4,5,6]}
df=pd.DataFrame(data)
df['C']=10   # the scalar is repeated for every row
df
2. Adding a new column from a list
df['D']=[9,8,7]   # the list length must match the number of rows
df
1. describe(): summary statistics (count, mean, std, min, quartiles, max)
df.describe()
2. mean()
mean_values=df.mean()
mean_values
3. median()
median_values=df.median()
median_values
4. Standard deviation: std()
std_=df.std()
std_
5. Variance: var()
var_=df.var()
var_
6. Skewness: skew()
skew_=df.skew()
skew_
7. Kurtosis: kurt()
kurt_=df.kurt()
kurt_
8. min()
min_=df.min()
min_
9. max()
max_=df.max()
max_
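10. quantile()
quantile_=df.quantile([0.25,0.5,0.75])   # first quartile, median and third quartile of every column
quantile_
q1_A=df['A'].quantile(0.25)   # first quartile of column A
q1_A
q3_D=df['D'].quantile(0.75)   # third quartile of column D
q3_D
11. Covariance: cov()
cov_=df.cov()
cov_
12. Correlation: corr()
corr_=df.corr()
corr_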
13. sum()
sum_=df.sum()
sum_
14. count()
count_=df.count()
count_
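15. cumsum()
->It calculates the cumulative sum of the elements along a given axis
cumsum_=df.cumsum()
cumsum_
->By default the sum runs vertically, down each column (axis=0); pass axis=1 to run it horizontally, across each row
16. cummin(): cumulative minimum
17. cummax(): cumulative maximum
18. cumprod(): cumulative product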
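The cumulative functions listed above have no example in these notes, so here is a minimal sketch. The small DataFrame is made up so that the running values are easy to check by hand.

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
df = pd.DataFrame({"A": [3, 1, 2], "B": [4, 5, 6]})

run_min = df.cummin()          # running minimum down each column
run_max = df.cummax()          # running maximum down each column
run_prod = df.cumprod()        # running product down each column
row_sum = df.cumsum(axis=1)    # running sum across each row

print(run_min)
print(run_max)
print(run_prod)
print(row_sum)
```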
Iteration:
Iterating over a DataFrame:
->iterrows()
Ex:
dic={'stu_id' : ['C1','C2','C3','C4'],
'Tool_Proficcency' : ['Powr bi','Tableau','Excel','Sql'],
'Ratings' : [4,5,4,3]}
df=pd.DataFrame(dic)
-It returns (index, Series) pairs, one per row (each row is converted into a Series object)
-It allows you to access the row's data using the column name. Ex: row['stu_id']
-Index: the index label of the row
-Series: each row of the DataFrame is returned as a Series object
Row as Series:
stu_id                   C1
Tool_Proficcency    Powr bi
Ratings                   4
Name: 0, dtype: object (this is the Series object)
-iterrows() is slower compared to itertuples(), because iterrows() converts each row into a Series object.
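The loop itself is not shown in these notes for iterrows(). A minimal sketch, reusing the same dictionary, might look like this; the variable name lines and the message format are assumptions for illustration.

```python
import pandas as pd

dic = {'stu_id': ['C1', 'C2', 'C3', 'C4'],
       'Tool_Proficcency': ['Powr bi', 'Tableau', 'Excel', 'Sql'],
       'Ratings': [4, 5, 4, 3]}
df = pd.DataFrame(dic)

lines = []
for index, row in df.iterrows():
    # index is the row label; row is a Series keyed by column name
    lines.append(f"{index}: {row['stu_id']} rated {row['Ratings']}")

print("\n".join(lines))
```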
->itertuples()
->It returns each row as a namedtuple, so the values are accessed as attributes (row.stu_id)
for row in df.itertuples():
    print(f"stu_id: {row.stu_id}, Tool_Proficcency: {row.Tool_Proficcency}, Ratings: {row.Ratings}")
->items()
->It iterates over the DataFrame column by column
->For each column we get the column name and a Series (the column data)
for col_name, col_data in df.items():
    print(f"Column: {col_name}")
    print(col_data)
Sorting
->Sorting means arranging data in a specific order, such as ascending or descending
->Sorting can be applied to data types such as numbers, strings and complex objects
->Sorting algorithms: bubble sort, merge sort, quick sort, insertion sort
Sorting by Values :
->Sort the DataFrame by one or more columns
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
})
sort=df.sort_values(by='Name')
sort=df.sort_values(by='Age',ascending=False) # ascending=True is the default
sort=df.sort_values(by=['City','Age'],ascending=False) # sorts by City first, then by Age within each city
Sorting by Indexes :
sort=df.sort_index(ascending=False)
# sorts the rows by their index labels, in descending order
sort=df.sort_index(axis=1)
# sorts the columns by their column labels
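To see what these calls return, here is a short sketch on the same data. The variable names by_age, mixed and by_col_label are made up for the example; passing a list to ascending sets the direction separately for each column.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})

# Oldest first
by_age = df.sort_values(by='Age', ascending=False)
print(by_age['Name'].tolist())

# One flag per column: City A to Z, then Age high to low within a city
mixed = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(mixed['City'].tolist())

# Columns reordered alphabetically by their labels
by_col_label = df.sort_index(axis=1)
print(by_col_label.columns.tolist())
```

Each call returns a new DataFrame, so df itself keeps its original order.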
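Groupby Function :
->Splits the rows into groups that share the same value in a column and applies a function (such as sum(), mean() or count()) to each group
->Syntax : df.groupby('col_name').function()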
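A minimal sketch of grouping, where function() is replaced by real aggregations. The sales data and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    'City': ['Paris', 'Berlin', 'Paris', 'Berlin', 'London'],
    'Amount': [100, 200, 150, 50, 300]
})

# Total Amount for each City
total_per_city = sales.groupby('City')['Amount'].sum()
print(total_per_city)

# Number of rows in each group
rows_per_city = sales.groupby('City').size()
print(rows_per_city)
```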
df2 = pd.DataFrame({
'ID': [4, 5, 6],
'Name': ['David', 'Edward', 'Frank']
})
concat_df=pd.concat([df1,df2])
concat_df
Merge Function :
->Combines DataFrames based on one or more key columns
->Syntax : pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Score': [85, 90, 75, 60]
})
merge=pd.merge(df1,df2, on='ID',how='right') # keeps every row of df2; Name is NaN for IDs 5 and 6
merge
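The how parameter decides which keys survive the merge. The sketch below runs all four common options on the same two DataFrames; the dictionary ids_by_how is an illustrative helper, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6],
                    'Score': [85, 90, 75, 60]})

ids_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='ID', how=how)
    ids_by_how[how] = merged['ID'].tolist()
    print(how, ids_by_how[how])

# With how='right', IDs that exist only in df2 get NaN in the Name column
right = pd.merge(df1, df2, on='ID', how='right')
print(right)
```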
Join function:
->The join function is used to join DataFrames based on their indexes or on a key column
->Syntax : left_df.join(right_df, on=None, how='left', lsuffix='', rsuffix='', sort=False)
# DataFrame 2
df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
}).set_index('ID')
join=df1.join(df2,how='left') # rows are matched on the index, so df1 must also be indexed by ID (or pass on='ID')
join
->Using set_index() while creating the join:
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'ID': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
})
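The join call for these two DataFrames is not shown in the notes. One plausible way to write it, assuming both are meant to be indexed by ID at the moment of the join, is sketched here; the variable name joined is made up.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'ID': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'Score': [85, 90, 75, 60],
    'ID': [3, 4, 5, 6]
})

# Index both sides by ID so that join() aligns rows on the ID values
joined = df1.set_index('ID').join(df2.set_index('ID'), how='left')
print(joined)
```

Because how='left' keeps every row of df1, IDs 1 and 2 appear with NaN in the Score column.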
set_index() is the function used to set one or more DataFrame columns as the index (row labels)
->Syntax : df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Setting a single column as index :
data = {
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 90, 75, 60]
}
df = pd.DataFrame(data)
set_1 = df.set_index('ID')
set_1
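Since set_index() accepts one or more columns, a list of column names can also be passed. The following sketch reuses the same data; set_2 and set_3 are illustrative names.

```python
import pandas as pd

data = {
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 90, 75, 60]
}
df = pd.DataFrame(data)

# Two columns as the index: each row is labelled by an (ID, Name) pair
set_2 = df.set_index(['ID', 'Name'])
print(set_2)
print(set_2.loc[(2, 'Bob'), 'Score'])

# drop=False keeps ID as a regular column as well as using it for the index
set_3 = df.set_index('ID', drop=False)
print(set_3)
```

With the default inplace=False, df itself is left unchanged and a new DataFrame is returned.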
df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
concat=pd.concat([df1,df2],axis=0,ignore_index=True)
concat
df6 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'C': ['C3', 'C4', 'C5']
})
result = pd.concat([df5, df6], axis=0)
result
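When the DataFrames do not share all columns, concat keeps the union of the columns and fills the gaps with NaN. The sketch below assumes a df5 with columns A and B, because its definition is not shown in these notes.

```python
import pandas as pd

# df5 is hypothetical; only df6 appears in the notes
df5 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})
df6 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'C': ['C3', 'C4', 'C5']
})

# Union of columns; the original row labels 0, 1, 2 repeat
result = pd.concat([df5, df6], axis=0)
print(result)

# join='inner' keeps only the shared columns; ignore_index renumbers the rows
both = pd.concat([df5, df6], axis=0, join='inner', ignore_index=True)
print(both)
```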