
session-1 DataFrame

Data Frame

Uploaded by

Ssk Kumar
Copyright
© All Rights Reserved

EDA

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process.

In [ ]: ====================== Data Analysis =======================


1.Pandas ------- DataFrame read and write operations
2.NumPy ------- Numerical Python, math operations
3.Matplotlib ------- plots, graphs, visualization
4.Seaborn ------- plots
5.Plotly ------- plots
6.Bokeh ------- plots

====================== Machine Learning =====================


7.Scikit-learn (sklearn) ------ Model development
8.stats packages ------ Linear Regression

====================== Web scraping and Database connection ======


9.SQLite ------ SQL connection
10.Beautiful Soup ------ scrape the data
11.websocket ------ scrape the data

====================== Deep Learning ==========================


12.TensorFlow ------ Deep learning model development (Google)
13.Keras
14.PyTorch ------ developed by Meta (Facebook)
15.OpenCV ------ computer vision (reading and writing images)
16.Pillow ------ reading images

====================== NLP ======================================


17.NLTK ----- Natural Language Toolkit
18.spaCy ----- NLP models
19.wordcloud -----

====================== Web development - API ======================


20.Flask
21.Django
22.FastAPI
23.Gradio

====================== Apps creation ==============================


24.Streamlit

====================== Transformers BERT (NLP models) ==============


25.Transformers ------ Hugging Face (BERT itself comes from Google)

====================== DL: Pretrained Models, Object Detection =======


26.VGG16
27.MobileNet
28.YOLO ----- Ultralytics

====================== NLP pretrained Models ========================


29.Word2Vec ----- Google
30.GloVe ----- Stanford University

====================== Model save ==================================


31.Pickle
32.Joblib

====================== GenAI LLM ====================================


33.Azure OpenAI
34.Google Gemini
35.Amazon Bedrock
36.LLaMA (Meta)
37.LangChain framework

====================== Model Deployment ================================


38.MLflow

====================== Cloud Services ==================================


39.Azure ML related packages
40.GCP Vertex AI packages
41.Amazon SageMaker packages

====================== AllenNLP ======================================


42.AllenNLP packages

====================== ML using Pyspark ================================


43.MLlib package

====================== Small packages ==================================


44.random
45.math
46.time
47.logging

Step-1 : Import Packages

In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step-2 : Create a DataFrame using a List

In [7]: import pandas as pd


pd.DataFrame

Out[7]: pandas.core.frame.DataFrame

In [9]: import pandas as pd


pd.DataFrame()

Out[9]:

In [13]: import pandas as pd


data=pd.DataFrame()
data

# We created a DataFrame,
# but it has no data (no rows and no columns).
# We stored the DataFrame under the name 'data'.

Out[13]:
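A quick way to confirm that the DataFrame above really has no rows and no columns is to look at its `shape` and `empty` attributes. This is a minimal sketch that reuses the name `data` from the cell above:

```python
import pandas as pd

data = pd.DataFrame()

# shape is (number of rows, number of columns)
print(data.shape)   # (0, 0)

# empty is True when the DataFrame holds no data
print(data.empty)   # True
```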

Step-3 : Provide The Data

In [16]: name=['Navya','Sneha','Yamu']
pd.DataFrame()
Out[16]:

In [18]: name=['Navya','Sneha','Yamu']
pd.DataFrame(name)

Out[18]: 0

0 Navya

1 Sneha

2 Yamu

In [20]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
pd.DataFrame(zip(name,age))

Out[20]: 0 1

0 Navya 20

1 Sneha 21

2 Yamu 22

In [22]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
pd.DataFrame(zip(name,age,city))

Out[22]: 0 1 2

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune

In [24]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
data=[name,age,city]
pd.DataFrame(data)

Out[24]: 0 1 2

0 Navya Sneha Yamu

1 20 21 22

2 Hyd Delhi Pune
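In the output above each inner list became a row, which is why the names, ages and cities sit on separate lines. As an illustration of the difference from the zip approach, the sketch below transposes the row-wise result with `.T` and compares it with the zipped version. The variable names `by_rows`, `transposed` and `zipped` are chosen here for the example only.

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']

# A list of lists is read row by row: row 0 = names, row 1 = ages, row 2 = cities
by_rows = pd.DataFrame([name, age, city])

# .T swaps rows and columns, giving one person per row
transposed = by_rows.T

# zip groups the i-th item of every list, so each tuple is already one person
zipped = pd.DataFrame(zip(name, age, city))

print(transposed.values.tolist())
print(zipped.values.tolist())

# The values agree, but only the zipped version keeps Age as an integer column
print(transposed[1].dtype, zipped[1].dtype)
```

Because every column of `by_rows` mixes text and numbers, the transposed columns stay generic `object` columns, which is one reason zip is the more convenient route here.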

In [26]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
df=pd.DataFrame(zip(name,age,city))
df
Out[26]: 0 1 2

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune

Step-4 : Provide The Columns

Column names are passed as a list.

The number of names must exactly match the number of columns in the data.

Here the data has 3 columns, so we need a list of 3 names.

In [30]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
df=pd.DataFrame(zip(name,age,city),columns=cols)
df

Out[30]: Names Age City

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune
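To see why the number of column names has to match the data, the sketch below deliberately passes only 2 names for 3 columns. Recent pandas versions are expected to raise a ValueError here; AssertionError is also caught as a precaution for older releases. The names `rows` and `mismatch_raised` are introduced only for this example.

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']
rows = list(zip(name, age, city))   # 3 values per row

try:
    # Only 2 column names for 3 columns of data
    pd.DataFrame(rows, columns=['Names', 'Age'])
    mismatch_raised = False
except (ValueError, AssertionError) as err:
    mismatch_raised = True
    print(type(err).__name__, ':', err)

# With 3 names the same data is accepted
df = pd.DataFrame(rows, columns=['Names', 'Age', 'City'])
print(df.shape)   # (3, 3)
```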

Step-5 : Provide the Index

In [33]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=[1,2,3]
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df

Out[33]: Names Age City

1 Navya 20 Hyd

2 Sneha 21 Delhi

3 Yamu 22 Pune

In [35]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df
Out[35]: Names Age City

A Navya 20 Hyd

B Sneha 21 Delhi

C Yamu 22 Pune
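Once an index is provided, its labels identify the rows. As a small illustration (going slightly beyond the cells above), the sketch below reads one value back using the row label and the column name with `.loc`:

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']
df = pd.DataFrame(zip(name, age, city),
                  index=['A', 'B', 'C'],
                  columns=['Names', 'Age', 'City'])

# The index labels are stored on the DataFrame
print(df.index.tolist())      # ['A', 'B', 'C']

# .loc[row label, column name] selects a single value
print(df.loc['B', 'City'])    # Delhi
```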

Step-6 : How to add a new column to an existing DataFrame

We already have a DataFrame named df.

It has 3 columns.

Now we want to add a new column, Marks.

We need to create a new array or list.

The length of that list must equal the number of rows.

Here we have 3 rows, so the new list must also have 3 values.

In [ ]: # df['<new column name>']=<list>

In [38]: marks=[100,200,300]
df['Marks']=marks
df

Out[38]: Names Age City Marks

A Navya 20 Hyd 100

B Sneha 21 Delhi 200

C Yamu 22 Pune 300
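The length rule for a new column can be checked directly. In this sketch a list with only 2 values is assigned to a 3-row DataFrame, which pandas rejects with a ValueError; the correct 3-value list is then accepted. The name `length_error` is used only for this example.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Navya', 'Sneha', 'Yamu'],
                   'Age': [20, 21, 22]},
                  index=['A', 'B', 'C'])

try:
    df['Marks'] = [100, 200]        # only 2 values for 3 rows
    length_error = None
except ValueError as err:
    length_error = str(err)
    print('ValueError:', length_error)

df['Marks'] = [100, 200, 300]       # one value per row
print(df.shape)                     # (3, 3)
```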

Step-7 : Create a DataFrame from an empty DataFrame

In the steps above we built lists and passed them to pd.DataFrame.

Here we start from an empty DataFrame and add the lists as columns one at a time.

In [41]: df1=pd.DataFrame()
df1

Out[41]:

In [43]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
df1['Name']=name
df1['Age']=age
df1['City']=city
df1
Out[43]: Name Age City

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune

Step-8 : Create a DataFrame using Dictionary

In [50]: dict1={'Names':['Navya','Sneha','Yamu'],'Age':[20,21,22],'City':['Hyd','Delhi','Pune']}
dict1

Out[50]: {'Names': ['Navya', 'Sneha', 'Yamu'],


'Age': [20, 21, 22],
'City': ['Hyd', 'Delhi', 'Pune']}

In [52]: df2=pd.DataFrame(dict1)
df2

Out[52]: Names Age City

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune

In [54]: df2=pd.DataFrame(dict1,index=['A','B','C'])
df2

Out[54]: Names Age City

A Navya 20 Hyd

B Sneha 21 Delhi

C Yamu 22 Pune

Keys become the column names

Values become the column data, one entry per row
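The mapping from dictionary to DataFrame can be verified by comparing the two. This is a minimal sketch using the same `dict1` as above:

```python
import pandas as pd

dict1 = {'Names': ['Navya', 'Sneha', 'Yamu'],
         'Age': [20, 21, 22],
         'City': ['Hyd', 'Delhi', 'Pune']}
df2 = pd.DataFrame(dict1)

# The dictionary keys are the column names
print(list(df2.columns))      # ['Names', 'Age', 'City']

# Each dictionary value is the data of one column
print(df2['Age'].tolist())    # [20, 21, 22]

# The length of the lists gives the number of rows
print(len(df2))               # 3
```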

In [57]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
dict2

Out[57]: {'Name': 'Navya', 'Age': 20, 'City': 'Hyd'}

In [61]: df3=pd.DataFrame(dict2)
df3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[61], line 1
----> 1 df3=pd.DataFrame(dict2)
2 df3

File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:778, in DataFrame.__init__(self, data, index, columns, dtype, copy)
772 mgr = self._init_mgr(
773 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
774 )
776 elif isinstance(data, dict):
777 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 778 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
779 elif isinstance(data, ma.MaskedArray):
780 from numpy.ma import mrecords

File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:503, in dict_to_mgr(data, index, columns, dtype, typ, copy)
499 else:
500 # dtype check to exclude e.g. range objects, scalars
501 arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 503 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:114, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
111 if verify_integrity:
112 # figure out the index, if necessary
113 if index is None:
--> 114 index = _extract_index(arrays)
115 else:
116 index = ensure_index(index)

File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:667, in _extract_index(data)
664 raise ValueError("Per-column arrays must each be 1-dimensional")
666 if not indexes and not raw_lengths:
--> 667 raise ValueError("If using all scalar values, you must pass an index")
669 if have_series:
670 index = union_indexes(indexes)

ValueError: If using all scalar values, you must pass an index

In [63]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
pd.DataFrame(dict2,index=[1])

# If using all scalar values, you must pass an index

Out[63]: Name Age City

1 Navya 20 Hyd

In [65]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
pd.DataFrame(dict2,index=[1,2])
Out[65]: Name Age City

1 Navya 20 Hyd

2 Navya 20 Hyd
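Passing an index is one way around the scalar-values error. Another option, sketched here as an alternative rather than as the notebook's method, is to wrap the dictionary in a list so that pandas treats it as a single row:

```python
import pandas as pd

dict2 = {'Name': 'Navya', 'Age': 20, 'City': 'Hyd'}

# A list of dicts is read as a list of rows, so no index is needed
df3 = pd.DataFrame([dict2])

print(df3.shape)            # (1, 3)
print(df3.index.tolist())   # [0]
```

With this form the row gets the default index 0; pass `index=[...]` as in the cells above when a specific label is wanted.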

Array-like data can be represented in 3 ways:

list : plain Python

numpy : array from the NumPy package

tensor : tensor from TensorFlow

In [68]: l1=[1,2,3]
import numpy as np
np.array(l1)

Out[68]: array([1, 2, 3])

In [70]: l1=[1,2,3]
l2=[11,12,13]
l1+l2

Out[70]: [1, 2, 3, 11, 12, 13]

In [72]: import numpy as np


np.array(l1)
np.array(l2)
np.array(l1+l2)

Out[72]: array([ 1, 2, 3, 11, 12, 13])

In [74]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a+b

Out[74]: array([12, 14, 16])

In [76]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a*b

Out[76]: array([11, 24, 39])

In [78]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a+b,a*b

Out[78]: (array([12, 14, 16]), array([11, 24, 39]))
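The cells above show that `+` joins two lists end to end but adds two NumPy arrays element by element. To join arrays end to end, the way `l1+l2` does for lists, `np.concatenate` can be used. A short sketch (the names `joined` and `summed` are just for this example):

```python
import numpy as np

l1 = [1, 2, 3]
l2 = [11, 12, 13]
a = np.array(l1)
b = np.array(l2)

# Lists: + concatenates
print(l1 + l2)

# Arrays: + adds matching positions
summed = a + b
print(summed)

# Arrays: np.concatenate joins them end to end
joined = np.concatenate([a, b])
print(joined)
```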


Step-9 : Drop the column

To drop a column we use the drop method.

DataFrame methods are called on the DataFrame by name, the same way string methods are called on a string.

It mainly takes 3 arguments:

1.Label: the column name (or row label) to drop

2.axis

axis = 1 refers to columns

axis = 0 refers to rows

3.inplace

Dropping a column produces a modified DataFrame.

That modified DataFrame can be stored under the same name or a different one.

To keep it under the same name, pass inplace=True; otherwise (the default, inplace=False) assign the returned DataFrame to a name.

In [ ]: # create a dataframe and drop any column

In [81]: df4=pd.DataFrame()
df4

Out[81]:

In [ ]: df4.drop()   # incomplete call: drop needs at least one label to remove

In [87]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4

Out[87]: Names Age City

A Navya 20 Hyd

B Sneha 21 Delhi

C Yamu 22 Pune

In [97]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('City',axis=1)
Out[97]: Names Age

A Navya 20

B Sneha 21

C Yamu 22

In [103… name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0)

Out[103… Names Age City

B Sneha 21 Delhi

C Yamu 22 Pune

In [107… name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0,inplace=True)

In [109… df4

Out[109… Names Age City

B Sneha 21 Delhi

C Yamu 22 Pune

In [4]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0,inplace=False)

Out[4]: Names Age City

B Sneha 21 Delhi

C Yamu 22 Pune
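The difference between the default behaviour and `inplace=True` is easiest to see by checking the original DataFrame after each call. The sketch below also shows the `columns=` keyword, which pandas accepts as an alternative to `axis=1`. The names `without_city` and `same_result` are chosen for this example.

```python
import pandas as pd

df4 = pd.DataFrame({'Names': ['Navya', 'Sneha', 'Yamu'],
                    'Age': [20, 21, 22],
                    'City': ['Hyd', 'Delhi', 'Pune']},
                   index=['A', 'B', 'C'])

# Default (inplace=False): a new DataFrame is returned, df4 is left as it was
without_city = df4.drop('City', axis=1)
print(list(without_city.columns))   # ['Names', 'Age']
print(list(df4.columns))            # ['Names', 'Age', 'City']

# columns= does the same job as axis=1
same_result = df4.drop(columns='City')
print(list(same_result.columns))    # ['Names', 'Age']

# inplace=True changes df4 itself, so no assignment is needed
df4.drop('A', axis=0, inplace=True)
print(df4.index.tolist())           # ['B', 'C']
```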

In [ ]: # Create two DataFrames, df1 and df2, and combine them row-wise

############# df1 ###########
Names Age City
Ramesh 20 Hyd

############# df2 ###########
Names Age City
Suresh 21 Blr

############# expected result ###########
Names Age City
Ramesh 20 Hyd
Suresh 21 Blr

# Candidate methods: append (removed in pandas 2.0), concat, join

In [34]: dict1={'Name':'Ramesh','Age':20,'City':'Hyd'}
df5=pd.DataFrame(dict1,index=[1])
dict2={'Name':'Suresh','Age':21,'City':'Blr'}
df6=pd.DataFrame(dict2,index=[2])
result=pd.concat([df5,df6],ignore_index=True)
print(result)

Name Age City


0 Ramesh 20 Hyd
1 Suresh 21 Blr
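The `ignore_index=True` argument in the cell above is what renumbers the rows from 0. A short sketch comparing the result with and without it (the names `kept` and `renumbered` are for this example only):

```python
import pandas as pd

df5 = pd.DataFrame({'Name': 'Ramesh', 'Age': 20, 'City': 'Hyd'}, index=[1])
df6 = pd.DataFrame({'Name': 'Suresh', 'Age': 21, 'City': 'Blr'}, index=[2])

# Without ignore_index the original index labels are kept
kept = pd.concat([df5, df6])
print(kept.index.tolist())         # [1, 2]

# With ignore_index=True the rows are renumbered 0, 1, ...
renumbered = pd.concat([df5, df6], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1]
```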

Step-10 : How to overwrite an existing column

We already have a DataFrame.

Now we want to replace all the values of a specific column with new values.

First create a list with the new values.

Then assign it to the column, the same way we create a new column:

df[new col]=data , to create a new column

df[old col]=new data , to overwrite the old column

In [8]: df4['Age']=[33,44,34]
df4

Out[8]: Names Age City

A Navya 33 Hyd

B Sneha 44 Delhi

C Yamu 34 Pune

In [48]: df4['Names']=['anshu','chinni','adya']
df4

Out[48]: Names Age City Name

A anshu 33 Hyd anshu

B chinni 44 Delhi chinni

C adya 34 Pune adya
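The output above contains an extra Name column next to Names. The cell that created it is not shown, but this is the result one would expect if an earlier cell had assigned to `df4['Name']` instead of `df4['Names']`: assigning to a column name that does not exist adds a new column rather than overwriting the old one. A sketch of that behaviour, under that assumption:

```python
import pandas as pd

df4 = pd.DataFrame({'Names': ['Navya', 'Sneha', 'Yamu'],
                    'Age': [33, 44, 34],
                    'City': ['Hyd', 'Delhi', 'Pune']},
                   index=['A', 'B', 'C'])

new_names = ['anshu', 'chinni', 'adya']

# Existing column name: the values are overwritten, no column is added
df4['Names'] = new_names
print(list(df4.columns))   # ['Names', 'Age', 'City']

# Misspelled column name: a new column is appended at the end
df4['Name'] = new_names
print(list(df4.columns))   # ['Names', 'Age', 'City', 'Name']
```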


Step-11 : How to save the DataFrame

We can save the DataFrame in 2 formats:

csv : comma-separated values

excel

For csv : to_csv , extension = .csv

For excel : to_excel , extension = .xlsx

In [51]: # create a dataframe

name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df

Out[51]: Names Age City

A Navya 20 Hyd

B Sneha 21 Delhi

C Yamu 22 Pune

Csv Format

In [56]: # DataFramename.methodname
# where you want to save
# in what name you want to save

df.to_csv('data12.csv')

Excel sheet

In [61]: df.to_excel('data13.xlsx')
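To see what `to_csv` actually writes, the sketch below saves the same DataFrame into a temporary folder and prints the raw file text. The temporary folder is an assumption made here so the example does not leave files behind; the notebook itself writes to the working directory. Note that `to_excel` additionally needs an Excel writer package such as openpyxl to be installed, so it is left out of this sketch.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Names': ['Navya', 'Sneha', 'Yamu'],
                   'Age': [20, 21, 22],
                   'City': ['Hyd', 'Delhi', 'Pune']},
                  index=['A', 'B', 'C'])

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'data12.csv')
    df.to_csv(csv_path)                 # index is written by default
    with open(csv_path) as handle:
        csv_text = handle.read()

# The first line is the header; its first field is empty because
# the index column has no name
print(csv_text)
```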

Step-12 : Read the data

read_csv

read_excel

Both are available in pandas

In [65]: pd.read_csv('data12.csv')
Out[65]: Unnamed: 0 Names Age City

0 A Navya 20 Hyd

1 B Sneha 21 Delhi

2 C Yamu 22 Pune

In [69]: pd.read_excel('data13.xlsx')

Out[69]: Unnamed: 0 Names Age City

0 A Navya 20 Hyd

1 B Sneha 21 Delhi

2 C Yamu 22 Pune

Step-13 : How to avoid the extra column

The extra Unnamed: 0 column is the DataFrame index, which was written to the file.

When saving, the index argument controls this.

Pass index=False

In [74]: # Save under a different file name and pass index=False

df.to_csv('data21.csv',index=False)
pd.read_csv('data21.csv')

Out[74]: Names Age City

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune

In [76]: df.to_excel('data31.xlsx',index=False)
pd.read_excel('data31.xlsx')

Out[76]: Names Age City

0 Navya 20 Hyd

1 Sneha 21 Delhi

2 Yamu 22 Pune
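Passing `index=False` discards the A, B, C labels. If the labels should survive the round trip, an alternative is to save the index as usual and tell `read_csv` which column holds it with `index_col=0`. This sketch uses an in-memory string instead of a file, which is a choice made for this example only:

```python
import io

import pandas as pd

df = pd.DataFrame({'Names': ['Navya', 'Sneha', 'Yamu'],
                   'Age': [20, 21, 22],
                   'City': ['Hyd', 'Delhi', 'Pune']},
                  index=['A', 'B', 'C'])

# With no file name, to_csv returns the CSV content as a string
csv_text = df.to_csv()

# index_col=0 uses the first column as the index instead of 'Unnamed: 0'
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(restored.index.tolist())   # ['A', 'B', 'C']
print(list(restored.columns))    # ['Names', 'Age', 'City']
```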
