
PANDAS

 What is Pandas?
->It is an open-source Python library built on top of the NumPy library.
->It is designed for data manipulation, data analysis, and data cleaning.
->It can handle missing data as well.
->It provides flexible and powerful data structures such as Series and DataFrame.
->It is fast, with high performance and productivity.
 Features of Pandas
->Fast and efficient data manipulation and analysis
->Provides time-series functionality
->Easy handling of missing data
->Fast data merging and joining
->Flexible reshaping and pivoting of data
->Can load data from different file formats
->Integrates with NumPy
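The missing-data handling mentioned above can be sketched with a tiny example (a minimal illustration, not from the original notes; variable names are arbitrary):

```python
import numpy as np
import pandas as pd

# A Series with a missing value (NaN)
s = pd.Series([10, np.nan, 30])

print(s.isna())      # boolean mask marking missing entries
print(s.fillna(0))   # replace missing values with 0
print(s.dropna())    # drop missing entries
```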
 Data Structures in Pandas
->Data structures are used to organize, retrieve, and manipulate data.
->In pandas the main data structures are Series and DataFrame.
 What is a Series?
->A Series is a one-dimensional labeled array.
->It can hold any data type (int, string, or Python objects).
->Its axis labels are collectively known as the index.
->A Series contains homogeneous data (one dtype per Series).
->A Series is mutable in its values (we can modify elements), but its size is immutable
(it cannot grow or shrink once declared).
->Syntax: pandas.Series(data, index, dtype, copy)
->Parameters: data (required) = can be a list, ndarray, or dictionary
index (optional)
dtype (optional)
copy (optional) = makes a copy of the input data
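The copy parameter above can be demonstrated with a short sketch (illustrative only): with copy=True, the Series works on its own copy, so modifying it leaves the input array untouched.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
s = pd.Series(arr, copy=True)  # work on a copy of the input data
s.iloc[0] = 99                 # modify the Series only

print(arr[0])  # the original array is unchanged
```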
 Different ways to create a Series in pandas
import pandas as pd
import numpy as np

1. Creating an empty Series
print(pd.Series(dtype=object))    o/p Series([], dtype: object)
2. Creating a series from an Array
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array)
o/p 0 m
1 Mukesh
2 bf
3 gf
dtype: object
3. Create a series from an array with custom index
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array,index=[100,'Love',103,'No'])
O/p 100 m
Love Mukesh
103 bf
No gf
dtype: object

4. Creating a Series from a list

items = ['hi', 100, 'Mukesh', 1000]   # avoid naming a variable `list` (it shadows the built-in)
pd.Series(items)
O/p 0 hi
1 100
2 Mukesh
3 1000
dtype: object

5. Creating a Series from a dictionary

d = {'k1': 1000,   # avoid naming a variable `dict` (it shadows the built-in)
'k2': 2000,
'k3': 3000,
'k4': 4000}
pd.Series(d)
O/p k1 1000
k2 2000
k3 3000
k4 4000
dtype: int64

6. Creating a Series using NumPy functions

->np.linspace(start, stop, num)
nu_fn=pd.Series(np.linspace(3,33,3))
nu_fn
O/p 0 3.0
1 18.0
2 33.0
dtype: float64

-> np.random.rand(x)
nu_fn=pd.Series(np.random.rand(3))
nu_fn
O/p 0 0.487446
1 0.375540
2 0.011341
dtype: float64

7. Creating a Series using the range function

r = pd.Series(range(5))   # avoid naming a variable `range` (it shadows the built-in)
r
O/p 0 0
1 1
2 2
3 3
4 4
dtype: int64
 Accessing data by position in a Series (iloc)
->To access elements of a Series by position we use iloc (integer-based indexing).
->iloc allows you to access/select rows by their integer positions.
->Ex: data=[10,20,30,40,50]
pos=pd.Series(data,index=['A','B','C','D','E'])
pos.iloc[4] o/p 50
pos.iloc[-1] o/p 50
pos.iloc[:] o/p A 10
B 20
C 30
D 40
E 50
dtype: int64

pos.iloc[1:4:2] o/p B 20
D 40
 Retrieve data using the label (index) name (loc)
->Here we use loc (label-based indexing).
->Ex: pos.loc['A'] o/p 10
pos.loc['A':'E'] (slicing) o/p all the elements (note: unlike iloc, loc slicing includes the end label)

 Changing the type of data


data=[1,2,3,4,5,0]
s=pd.Series(data,dtype=object)
O/p 0 1
1 2
2 3
3 4
4 5
5 0
dtype: object

data=[1,2,3,4,5,0]
s=pd.Series(data,dtype=bool)
O/p 0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
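Besides setting dtype at creation time, an existing Series can be converted later with astype() (a small sketch; the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s_float = s.astype(float)  # int64 -> float64
s_str = s.astype(str)      # numbers become strings

print(s_float.dtype)
print(s_str.tolist())
```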
 What is a DataFrame?
->It is a data structure in the pandas library in Python.
->It is two-dimensional labeled data.
->It has labeled axes, meaning both rows and columns have labels,
which makes it easier to access or manipulate specific data.
->It can hold heterogeneous data: a DataFrame can contain different
data types (int, float, string, object).
->Its size is mutable: we can add or remove rows and columns in a DataFrame.
 Different ways to create a DataFrame
1. Creating an empty DataFrame:
print(pd.DataFrame())
O/p : Empty DataFrame
Columns: []
Index: []

2. Creating a DataFrame using a list:

items = ['hii', 1, 2, 3, 'hello']   # avoid naming a variable `list` (it shadows the built-in)
pd.DataFrame(items)
(or)
print(pd.DataFrame(items))

3. Creating a DataFrame using a list of lists:


list_list=[[1,'Mukesh'],[2,'data_Science'],[3,'job']]
pd.DataFrame(list_list,columns=['hii','Bye'])
4. Creating a DataFrame using a dictionary:
dic={'team':['India','South Africa','Australia','England','New Zealand'],
'Ranking':[1,2,4,3,5]}
pd.DataFrame(dic)
5. Creating a DataFrame using a list of dictionaries:
list_dic=[{1:'Mukesh',2:'Bleson',3:'Srinivasan'},
{1:'Safa',2:'Sreya',3:'Fareedha'}]
pd.DataFrame(list_dic)
6. Creating a DataFrame from a pandas Series:
sd=pd.Series(['hhi',1,3,4])
pd.DataFrame(sd)
7. Creating a DataFrame using a dict of ndarrays:
se={1:np.array([1,2,3]),
'hi':np.array(['ji','ki','li']),
3:np.array([4,5,6])}
pd.DataFrame(se)
8. Creating a DataFrame using a dict of lists:
data = { 'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'] }
pd.DataFrame(data)
 Column Selection
->It is a fundamental operation in data manipulation and analysis.
->Methods to select columns:
1. Selecting a single column (using brackets)
dict_data={'programming':['SQL','Python','Java','Html'],
'level_of_proficiency':[4,3,2,1],
'Trainers':['self-learn','Madha_kiran','Akila','self_learn']}
df=pd.DataFrame(dict_data)
df

df['programming']

df[['programming']]

(Or using dot notation)


df.Trainers
2. Selecting multiple columns (using a list of column names)
df[['programming','Trainers']]

3. Selecting columns by label (loc) or by condition

df.loc[ :, 'programming' : 'Trainers']   # df.loc[rows, columns]

4. Selecting columns by index (iloc)


df.iloc[ :,0:3]

5. Selecting columns by data type


df.select_dtypes(include=['int'])

 Column Addition
1. Adding a new column with a scalar value:
data={'A':[1,2,3],'B':[4,5,6]}
df=pd.DataFrame(data)
df['C']=10

df
2. Adding a new column using list
df['D']=[9,8,7]
df

3. Addition with the help of an ndarray


df['E']=np.array(['kii','kalkii','Prabhs'])

4. Addition using arithmetic operations


df['F']=df['A']+df['B']

5. Joining dataframes


dl=[[10,20,30],[40,50,60],[70,80,90]]
ds=pd.DataFrame(dl)
ds
df=df.join(ds)
 Column Deletion:
1. Using the drop function
1.1 Dropping a single column:
data={'A':[1,2,3,4],
'B':[5,6,7,8],
'C':[9,10,11,12],
'D':[10,20,30,40],
'E':[40,50,60,70]}
df=pd.DataFrame(data)
df=df.drop(columns=['E'])
df

1.2 Dropping multiple columns:


df.drop(columns=['D','C'],inplace=True)
df

inplace=True modifies the original DataFrame without creating a copy.


2. Using the del keyword (note: recreate df first, since 'E' was dropped above)
del df['E']
df

3. Using the pop method:

The pop method removes a column and returns it as a Series.
a=df.pop('B')
a
 Descriptive Statistics
data = { 'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'B': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'C': [9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
'D': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22] }
df=pd.DataFrame(data)
df

1. describe()
df.describe()

2. mean()
mean_values=df.mean()
mean_values
3. median()
median_values=df.median()
median_values

4. Standard deviation: std()
std_=df.std()
std_

5. Variance: var()
var_=df.var()
var_

6. Skewness: skew()
skew_=df.skew()
skew_

7. Kurtosis: kurt()
kurt_=df.kurt()
kurt_
8. min()
min_=df.min()
min_

9. max()
max_=df.max()
max_

10. quantile()
quantile_=df.quantile([0.25,0.5,0.75])
quantile_

q1_A=df['A'].quantile(0.25)
q1_A

q3_D=df['D'].quantile(0.75)
q3_D
11. Covariance: cov()
cov_=df.cov()
cov_

12. Correlation: corr()
corr_=df.corr()
corr_

13. sum()
sum_=df.sum()
sum_

14. count()
count_=df.count()
count_
15. cumsum()
->It is used to calculate the cumulative sum of the elements along a given axis.
cumsum_=df.cumsum()
cumsum_

Column A: 0: 1, 1: 1+2=3, 2: 3+3=6, 3: 6+4=10, 4: 10+5=15, ... (a running total down the column)


cumsum_=df.cumsum(axis=1)
cumsum_

# axis=1 computes the cumulative sum horizontally, across the columns of each row
16. cummin()
17. cummax()
18. cumprod()
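The three cumulative functions named above work just like cumsum(); a minimal sketch (illustrative values):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

print(s.cummin())   # running minimum: 3, 1, 1, 1
print(s.cummax())   # running maximum: 3, 3, 4, 4
print(s.cumprod())  # running product: 3, 3, 12, 24
```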
Iteration:
Iterating over a DataFrame:
iterrows()
Ex:
dic={'stu_id' : ['C1','C2','C3','C4'],
'Tool_Proficcency' : ['Powr bi','Tableau','Excel','Sql'],
'Ratings' : [4,5,4,3]}
df=pd.DataFrame(dic)

for index, row in df.iterrows():
    print(f'Index: {index}')
    print(f"stu_id: {row['stu_id']}, Tool_Proficcency: {row['Tool_Proficcency']}, Ratings: {row['Ratings']}")
    print(f"Row as Series:\n{row}\n")

-It returns (index, Series) pairs, one per row (each row is converted into a Series object).
-It allows you to access the row's data using the column name, e.g. row['stu_id'].
-Index: the index label of the row.
-Series: each row of the dataframe is returned as a Series object.
Row as Series:
stu_id C1
Tool_Proficcency Powr bi
Ratings 4
Name: 0, dtype: object (This is the Series object)
-iterrows() is slower compared to itertuples(),
because iterrows() converts each row into a Series object.

->itertuples()
for row in df.itertuples():
    print(f"stu_id: {row.stu_id}, Tool_Proficcency: {row.Tool_Proficcency}, Ratings: {row.Ratings}")

->It returns each row as a named tuple.
->It includes the index by default (as the first field, named Index); pass index=False to exclude it.
->Accessing the row data: use dot notation, e.g. row.stu_id.

items()
->Iterates over the dataframe column by column.
->For each column we get the column name and a Series (the column data).
for col_name,col_data in df.items():
print(f" Column :{col_name}")
print(col_data)
Sorting
->Sorting is arranging data in a specific order, like ascending or descending.
->We can sort data types such as numbers, strings, and complex objects.
->Sorting algorithms: Bubble sort, Merge sort, Quick sort, Insertion sort

Sorting by values:
->Sort the dataframe by one or more columns.
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
})
sort=df.sort_values(by='Name')
sort = df.sort_values(by='Age',ascending=False) # by default ascending=True
sort=df.sort_values(by=['City','Age'],ascending=False) # sorting multiple columns
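The sorting algorithms listed earlier relate to sort_values' optional kind parameter ('quicksort', 'mergesort', 'heapsort', or 'stable'); a small sketch under illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2]})

# 'mergesort' is a stable choice among the available kinds
out = df.sort_values(by='x', kind='mergesort')
print(out['x'].tolist())
```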

Sorting by Indexes :
sort=df.sort_index(ascending=False)
# sorting the index by row
sort =df.sort_index(axis=1)
# sorting the index by column

Sorting values in place:

->We use inplace=True to modify the original dataframe.
df.sort_values(by='City',inplace=True)

Sorting a Series:


s = pd.Series([3, 1, 4, 2], index=['d', 'b', 'a', 'c'])
sorts=s.sort_values(ascending=False)
# Sorting by the values
sorts=s.sort_index(ascending=False)
# Sorting by the indexes
Groupby
->It is used to split the data into groups based on some criteria
and apply a function to each group independently.
->We use groupby for aggregating data (sum, mean, count, max, min).

->Syntax: df.groupby('col_name').function()

Grouping by a single column


df = pd.DataFrame({
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
'Sales': [100, 200, 150, 250, 120, 300]
})
group =df.groupby('Region').sum()

Grouping by multiple columns:


group =df.groupby(['Region','Product']).sum()

Applying multiple aggregation functions


group= df.groupby('Region').agg({'Sales':['sum','mean','max','count']})
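As an alternative to the dictionary style above, pandas also supports named aggregation, which gives flat, readable column names (a sketch; the names total_sales and avg_sales are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 200, 150, 250, 120, 300]
})

# Each keyword becomes an output column: name=(input_column, function)
group = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
)
print(group)
```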

Resetting the index:

->After grouping, the group labels become the index. We can reset the index to get back a regular dataframe.
group= df.groupby('Region').agg({'Sales':['sum','mean','max','count']}).reset_index()
Merging/Joining
->These operations combine multiple dataframes into a single dataframe based on common keys or columns.

Concatenating dataframes:


df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
'ID': [4, 5, 6],
'Name': ['David', 'Edward', 'Frank']
})
concat_df=pd.concat([df1,df2])
concat_df

Merge function:
->Combines multiple dataframes based on one or more keys.
->Syntax : pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Score': [85, 90, 75, 60]
})
merge=pd.merge(df1,df2, on='ID',how='right')
merge
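The how parameter above controls which keys survive the merge; a short sketch comparing 'inner' and 'outer' on the same two frames (illustrative data):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6],
                    'Score': [85, 90, 75, 60]})

inner = pd.merge(df1, df2, on='ID', how='inner')  # only IDs present in both (3, 4)
outer = pd.merge(df1, df2, on='ID', how='outer')  # all IDs 1..6, NaN where missing

print(inner['ID'].tolist())
print(len(outer))
```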

Join function:
->The join function is used to join dataframes based on their indexes or a key column.
->Syntax: left_df.join(right_df, on=None, how='left', lsuffix='', rsuffix='', sort=False)

-> # Using set_index while creating the dataframe


df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'ID': [1, 2, 3, 4]
}).set_index('ID')

# DataFrame 2
df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
}).set_index('ID')
join=df1.join(df2,how='left')
join
-> # Using set_index while performing the join
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'ID': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
})

# Join DataFrames on 'ID' column


join = df1.set_index('ID').join(df2.set_index('ID'),how='outer')
join

set_index() is the function used to set one or more columns of a dataframe as the index (row labels).
Syntax: df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Setting a single column as the index:
data = {
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 90, 75, 60]
}

df = pd.DataFrame(data)

set_1 = df.set_index('ID')
set_1

Setting multiple columns as an index:


set_2 =df.set_index(['ID','Score'])
set_2

Keeping the original column:


set_keep=df.set_index('ID',drop=False)
set_keep

Resetting the index :


set_reset=df.reset_index()
set_reset
Concatenation
->It is used to combine multiple sources into one dataframe.
->Syntax: result = pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None,
names=None, verify_integrity=False, sort=False)

Concatenation along rows (vertical concatenation)


df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
concat=pd.concat([df1,df2],axis=0,ignore_index=True)
concat

ignore_index: it controls the indexes while concatenating.

When we concatenate dataframes (vertical concatenation), the original indexes are kept (default: ignore_index=False).
To renumber the index sequentially, pass ignore_index=True.

Concatenation along columns (horizontal concatenation)


concat=pd.concat([df1,df2],axis=1,ignore_index=True)
concat

Concatenating with their indexes


df3= pd.DataFrame({
'A': ['A0','A1','A2'],
'B':['B0','B1','B2']}, index=[0,1,2])
df4 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
}, index=[3, 4, 5])
result=pd.concat([df3,df4])
result

Concatenating with keys: we can create a hierarchical index in the dataframe


result=pd.concat([df1,df2],keys=['df1','muk'],axis=1) # or axis=0
result
Concatenate with different columns
df5 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})

df6 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'C': ['C3', 'C4', 'C5']
})
result = pd.concat([df5, df6], axis=0)
result
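When concatenated frames have different columns, the default join='outer' keeps all columns and fills the gaps with NaN (as in the example above); join='inner' instead keeps only the columns common to both. A short sketch:

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'C': ['C3', 'C4', 'C5']})

# Keep only the columns shared by both frames (here: just 'A')
inner = pd.concat([df5, df6], axis=0, join='inner', ignore_index=True)
print(inner.columns.tolist())
```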
