UNIT4
4.1 File handling ( text and CSV files) using CSV module :
➢ A CSV (Comma Separated Values) file is a delimited plain text file that uses a comma (,) to separate values. Because it is plain text, it can contain only actual text data—in other words, printable ASCII or Unicode characters.
➢ A CSV file is used to store tabular data, such as a spreadsheet or a database table.
➢ Python's built-in csv module makes it easy to read, write, and process data from and to CSV files.
When you specify the filename only, Python assumes the file is located in the same folder as your script (the current working directory). If it is somewhere else, you can specify the exact path where the file is located.
f = open(r'C:\Python33\Scripts\myfile.csv')
Remember! While specifying the exact path, character sequences prefaced by \ (like \n, \r, \t, etc.) are interpreted as special escape characters. You can avoid this using:
1. raw strings like r'C:\new\text.txt'
2. double backslashes like 'C:\\new\\text.txt'
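As a quick check that both forms describe the same path (the path itself is only illustrative):

```python
# Two equivalent ways to write a Windows path containing backslashes:
p1 = r'C:\new\text.txt'    # raw string: backslashes are kept literally
p2 = 'C:\\new\\text.txt'   # doubled backslashes escape themselves
print(p1 == p2)  # → True
```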
Because read mode 'r' and text mode 't' are the default modes, you do not need to specify them.
There are two approaches to ensure that a file is closed properly, even in cases of
error.
The first approach is to use the with keyword, which Python recommends, as it
automatically takes care of closing the file once it leaves the with block (even in cases of
error).
The with statement
The with statement was introduced in Python 2.5. It is useful when manipulating files: it is used where a pair of statements (setup and cleanup) must be executed around a block of code.
Syntax:
with open(<file name>, <access mode>) as <file-pointer>:
    # statement suite
➢ The advantage of using the with statement is that it guarantees the file will be closed, regardless of how the nested block exits.
➢ It is always advisable to use the with statement when working with files because, if a break, return, or exception occurs in the nested block of code, it automatically closes the file; we don't need to call the close() function. This prevents file corruption.
Example:
with open('myfile.csv') as f:
    print(f.read())
The second approach is to use the try-finally block:
f = open('myfile.csv')
try:
    # file operations go here
    print(f.read())
finally:
    f.close()
#read_csv.py
import csv
with open('myfile.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
# Prints:
# ['name', 'age', 'job', 'city']
# ['Bob', '25', 'Manager', 'Seattle']
# ['Sam', '30', 'Developer', 'New York']
Write to a CSV File
To write to a CSV file, you must first open the file in one of the writing modes ('w', 'a' or 'r+'). Then create a writer object and use its writerow() method, passing the data as a list of strings.
Example1:
#write_csv.py
import csv
with open('myfile.csv', 'w', newline='') as f:  # newline='' avoids blank lines between rows on Windows
    writer = csv.writer(f)
    writer.writerow(['Bob', '25', 'Manager', 'Seattle'])
    writer.writerow(['Sam', '30', 'Developer', 'New York'])
Example2:
#append_csv.py
import csv
# Open our existing CSV file in append mode and create a file object for it
with open('event.csv', 'a', newline='') as f_object:
    # Pass this file object to csv.writer() and get a writer object
    writer_object = csv.writer(f_object)
    writer_object.writerow(['Annual Meet', '2023'])  # example row; data assumed
While executing this program, ensure that the CSV file is closed (for example, not open in Excel); otherwise the program will raise a permission error.
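When each row is naturally a dictionary rather than a list, the csv module's DictWriter class can be used instead of csv.writer. A minimal sketch (the file name 'people.csv' and the field names are assumptions for illustration):

```python
import csv

# DictWriter writes dictionaries as rows; fieldnames fixes the column order.
with open('people.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age', 'job', 'city'])
    writer.writeheader()  # writes the header row: name,age,job,city
    writer.writerow({'name': 'Bob', 'age': '25', 'job': 'Manager', 'city': 'Seattle'})
    writer.writerow({'name': 'Sam', 'age': '30', 'job': 'Developer', 'city': 'New York'})
```

This is the writing counterpart of the DictReader shown below: both map column names to values instead of relying on positions.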
#read_dict_csv.py
import csv
with open('myfile.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
# OUTPUT:
{'name': 'Bob', 'age': '25', 'job': 'Manager', 'city': 'Seattle'}
{'name': 'Sam', 'age': '30', 'job': 'Developer', 'city': 'New York'}
If the CSV file doesn’t have column names like the file below, you should specify your own
keys by setting the optional parameter fieldnames.
myfile1.csv
Bob,25,Manager,Seattle
Sam,30,Developer,New York
import csv
with open('myfile1.csv') as f:
    keys = ['Name', 'Age', 'Job', 'City']
    reader = csv.DictReader(f, fieldnames=keys)
    for row in reader:
        print(row)
#OUTPUT:
{'Name': 'Bob', 'Age': '25', 'Job': 'Manager', 'City': 'Seattle'}
{'Name': 'Sam', 'Age': '30', 'Job': 'Developer', 'City': 'New York'}
#read_csv2.py
import csv
print('****Data From CSV File****')
with open('empdata.csv') as f:  # file name assumed; the file contains the rows shown below
    for row in csv.reader(f):
        print(row)
OUTPUT:
****Data From CSV File****
['Name', 'Age', 'Designation', 'City']
['Saumil', '25', 'Manager', 'Seattle']
['Raj', '30', 'Developer', 'New York']
Read a CSV File Using pandas
import pandas
df = pandas.read_csv('hrdata.csv')
print(df)
In the above code, three lines are enough to read the file, and only one of them does the actual work: pandas.read_csv().
Write a CSV File Using pandas
#pandas_write_csv.py
import pandas as pd
dict = {'Name': ['Jemil', 'Pratham'], 'ID': [101, 102],
        'Language': ['Python', 'JavaScript']}
df = pd.DataFrame(dict)             # build a DataFrame from the dictionary
df.to_csv('lang.csv', index=False)  # output file name assumed
print(df)
OUTPUT:
      Name   ID    Language
0    Jemil  101      Python
1  Pratham  102  JavaScript
❖ Read excel: Specify the path or URL of the Excel file in the first argument. If there are multiple sheets in the Excel workbook, pandas reads only the first sheet by default. It is read as a DataFrame.
Example : To Read(Load/Extract) Excel File Using Dataframe
import pandas as pd
df = pd.read_excel('sample.xlsx')
print(df)
# Select a sheet by index (or name) with the sheet_name argument
df_sheet_index = pd.read_excel('sample.xlsx', sheet_name=1)
print(df_sheet_index)
# Pass a list to sheet_name to get a dict of DataFrames keyed by sheet
df_sheet_multi = pd.read_excel('sample.xlsx', sheet_name=[0, 1])
print(type(df_sheet_multi[1]))
# <class 'pandas.core.frame.DataFrame'>
❖ Write excel: Use DataFrame.to_excel() to write a DataFrame to an Excel file.
import pandas as pd
df = pd.DataFrame([[11, 21, 31], [12, 22, 32]], columns=['A', 'B', 'C'])  # sample data assumed
print("*******Pandas Dataframe**********")
print(df)
df.to_excel('pandas_to_excel.xlsx')
If you do not need to write the index (row names) or the header (column names), set the arguments index and header to False.
df.to_excel('pandas_to_excel_no_index_header.xlsx', index=False, header=False)
To write multiple DataFrames to separate sheets, create an ExcelWriter object and pass it to to_excel() with a sheet_name.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # sample data assumed
df2 = pd.DataFrame({'X': [5, 6], 'Y': [7, 8]})  # sample data assumed
print("*******Pandas Dataframe1**********")
print(df1)
print("*******Pandas Dataframe2**********")
print(df2)
with pd.ExcelWriter('pandas_to_excel2.xlsx') as writer:
    df1.to_excel(writer, sheet_name='TEST1')
    df2.to_excel(writer, sheet_name='TEST2')
Row Selection: pandas provides dedicated methods to retrieve rows from a DataFrame. The DataFrame.loc[] method retrieves rows by label. Rows can also be selected by passing an integer location to the iloc[] indexer. (Read Sem2-204-Unit5)
The DataFrame.loc attribute accesses a group of rows and columns by label(s) or a boolean array in the given DataFrame.
Example: To retrieve a row from a DataFrame by label.
#sp_rows.py
import pandas as pd
# sample data assumed; the original data file for this example was not shown
df = pd.DataFrame({'Name': ['Bob', 'Sam'], 'Age': [25, 30]}, index=['r1', 'r2'])
print(df)
print(type(df))
row = df.loc['r1']  # select the row labelled 'r1'
print('\n\n', row)
print('\nType of row', type(row))
1. Pandas DataFrame.head(): The head() function is used to get the first n rows.
Syntax:
DataFrame.head(n=5)
Parameters
n: It refers to an integer value that returns the number of rows.
Return
It returns the DataFrame with the top n rows.
2. Pandas DataFrame.tail() :The tail() function is used to get the last n rows.
This function returns last n rows from the object based on position. It is useful for quickly
verifying data, for example, after sorting or appending rows.
Syntax:
DataFrame.tail(n=5)
Parameters
n-number of rows to select
Example: By using head() and tail() in the example below, we show only the top 2 rows and the last 2 rows from the dataset.
# importing pandas module
import pandas as pd
# sample dataset assumed; the original data file was not shown
df = pd.DataFrame({'Name': ['Bob', 'Sam', 'Raj', 'Mihir'],
                   'Age': [25, 30, 28, 35]})
print(df.head(2))  # first 2 rows
print(df.tail(2))  # last 2 rows
Pandas DataFrame.loc:
Syntax:
DataFrame.loc
Raises: KeyError when any items are not found.
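The label-based and boolean-array access described above can be sketched as follows (sample data assumed):

```python
import pandas as pd

# Hypothetical sample DataFrame to illustrate .loc access
df = pd.DataFrame({'age': [25, 30, 28]}, index=['Bob', 'Sam', 'Raj'])

print(df.loc['Sam'])           # single row by label -> a Series
print(df.loc[['Bob', 'Raj']])  # list of labels -> a DataFrame
print(df.loc[df['age'] > 26])  # boolean array -> rows where age > 26
# df.loc['Mihir'] would raise KeyError, since that label does not exist
```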
iloc will raise an IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics).
Example: To retrieve a row by integer position with iloc.
#iloc_rows.py
import pandas as pd
# sample data assumed
df = pd.DataFrame({'Name': ['Bob', 'Sam'], 'Age': [25, 30]})
print(df)
row = df.iloc[0]  # select the first row by position
print('\n\n', row)
print('\nType of row', type(row))
pandas.DataFrame.values: The values attribute returns a NumPy representation of the DataFrame.
Returns: numpy.ndarray
The values of the DataFrame.
Examples: A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [ 5, 30],
'height': [84, 180],
'weight': [36, 95]})
print(df)
print(df.values)
OUTPUT:
   age  height  weight
0    5      84      36
1   30     180      95
[[  5  84  36]
 [ 30 180  95]]
pandas.DataFrame.to_numpy(): The to_numpy() method converts a DataFrame to a NumPy array.
Parameters
➢ dtype – Use this if you need to specify the type of the resulting array. You usually won't need to set this parameter.
➢ copy (Default: False) – Setting copy=True returns a full, exact copy of a NumPy array; copy=False may return a view of the underlying data instead.
➢ na_value – The value to use for missing values. The default depends on the dtypes of the DataFrame columns: by default, pandas returns the NA default for each column's data type. Specify another value here if you want missing entries replaced with it.
import pandas as pd
df = pd.DataFrame([('PVR Cinema', 'Restaurant', pd.NA),
('Jalaram', 'Restaurant', 224.0),
('500 Club', 'Bar', 80.5),
('The Square', pd.NA, 25.30)],
columns=('Name', 'Type', 'AvgBill'))
print('********DataFrame Values**********')
print(df)
x = df.to_numpy()
print('\nArray Values :\n',x)
print(type(x))
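The na_value parameter described above can be sketched like this (the DataFrame here is a smaller, hypothetical variant of the example):

```python
import pandas as pd

# Hypothetical DataFrame with one missing value
df = pd.DataFrame({'Name': ['Bob', 'Sam'], 'AvgBill': [224.0, pd.NA]})

# Replace missing entries with a chosen value during conversion
arr = df.to_numpy(na_value=0)
print(arr)  # the pd.NA in AvgBill becomes 0
```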
7. pandas.DataFrame.describe():
The describe() function is used to generate descriptive statistics that summarize the central
tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Syntax:
DataFrame.describe(percentiles=None, include=None, exclude=None)
Parameters:
➢ percentiles: An optional list-like of numbers that should fall between 0 and 1. Its default value is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
➢ include: An optional list of the data types to include while describing the DataFrame. Its default value is None.
➢ exclude: An optional list of the data types to exclude while describing the DataFrame. Its default value is None.
Returns:
It returns the statistical summary of the Series and DataFrame.
Example :
import numpy as np
import pandas as pd
a1 = pd.Series([1, 2, 3])
print('\n a1 result:\n',a1.describe())
df = pd.DataFrame({'categorical': pd.Categorical(['s','t','u']),
'numeric': [1, 2, 3],
'object': ['p', 'q', 'r'] })
#Describing a DataFrame.
#By default only numeric fields are returned:
print('DataFrame Describe:\n',df.describe())
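The include and percentiles parameters above can be demonstrated with the same DataFrame shape as in the example; include='all' summarizes every column, not just the numeric ones:

```python
import pandas as pd

df = pd.DataFrame({'categorical': pd.Categorical(['s', 't', 'u']),
                   'numeric': [1, 2, 3],
                   'object': ['p', 'q', 'r']})

# include='all' summarizes numeric, categorical and object columns together
print(df.describe(include='all'))

# custom percentiles: request the 10th and 90th (the 50th is always included)
print(df.describe(percentiles=[0.1, 0.9]))
```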