How To Extract Data From Common File Formats in Python?
Last Updated :
13 Jan, 2021
If you have worked with datasets so far, you have probably worked mostly with .csv (Comma Separated Values) files. They are a great starting point for applying Data Science techniques and algorithms. But sooner or later many of us will join Data Science firms or take up real-world projects, and there the data is rarely available as a neat .csv file. Instead, we have to extract it from different sources such as Excel workbooks, images, .docx files and PDF files. This article is a starting point for tackling those situations.
Below we will see how to extract relevant information from multiple such sources.
1. Multiple Sheet Excel Files
Note that pd.read_csv() cannot read an Excel workbook; use pd.read_excel() instead. By default, pd.read_excel('File.xlsx') returns only the first sheet. That is enough for a single-sheet file, but not for a multiple-sheet file such as the one used here, which has 3 sheets (Sheet1, Sheet2, Sheet3). To read the other sheets, we pass the sheet_name parameter.
Example: Let's read the second sheet of this Excel file.
Python3
# import Pandas library
import pandas as pd
# Read our file. Here sheet_name=1
# means we are reading the 2nd sheet or Sheet2
df = pd.read_excel('Sample1.xlsx', sheet_name = 1)
df.head()
Output:
Now let's read only selected columns of the same sheet:
Python3
# Read only columns A, B, C out of
# the four columns A, B, C, D in Sheet2
df = pd.read_excel('Sample1.xlsx',
                   sheet_name=1, usecols='A:C')
df.head()
Output:
Now let's read all the sheets together:
Sheet1 contains columns A, B, C; Sheet2 contains A, B, C, D; and Sheet3 contains B, D. The simple example below reads all 3 sheets and stacks them into one DataFrame. Columns are matched by name, and a column that is missing from a sheet is filled with NaN for that sheet's rows.
Python3
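# sheet_name=None reads every sheet and returns
# a dict mapping sheet name -> DataFrame
sheets = pd.read_excel('Sample1.xlsx', sheet_name=None)
# Stack the sheets row-wise; columns are aligned by name
df2 = pd.concat(list(sheets.values()),
                axis=0, ignore_index=True)
# In a notebook, display(df2) renders the same table
print(df2)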
Output:
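To see how row-wise concatenation lines up sheets with different columns, here is a minimal sketch that uses small in-memory DataFrames instead of the Excel file. The cell values are made up for illustration; only the column layout (A, B, C / A, B, C, D / B, D) follows the description above.

```python
import pandas as pd

# Hypothetical stand-ins for the three sheets; values are invented,
# only the column names mirror Sheet1, Sheet2 and Sheet3
sheets = {
    'Sheet1': pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]}),
    'Sheet2': pd.DataFrame({'A': [7], 'B': [8], 'C': [9], 'D': [10]}),
    'Sheet3': pd.DataFrame({'B': [11], 'D': [12]}),
}

# Stack row-wise; the result has the union of all column names
combined = pd.concat(list(sheets.values()), axis=0, ignore_index=True)

print(list(combined.columns))   # ['A', 'B', 'C', 'D']
print(combined.shape)           # (4, 4)
# Count the NaN cells created where a sheet lacked a column
print(combined.isna().sum().to_dict())
```

Column B appears in every sheet, so it is the only column without gaps; the other columns contain NaN for the rows that came from a sheet lacking them.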

2. Extract Text From Images
Now we will discuss how to extract text from images.
To give our Python program character recognition capabilities, we will use the pytesseract OCR library. It can be installed into our Python environment by running the following command in the OS command line:
pip install pytesseract
pytesseract is only a wrapper around the Tesseract OCR engine, so Tesseract itself must also be installed. On Windows this means installing the tesseract.exe binary. During its installation we are prompted to specify a path for it; note this path, because the code needs it later. For most installations the path is C:\Program Files\Tesseract-OCR\tesseract.exe or C:\Program Files (x86)\Tesseract-OCR\tesseract.exe.
Image for demonstration:
Python3
# We import necessary libraries.
# The PIL Library is used to read the images
from PIL import Image
import pytesseract
# Read the image
image = Image.open(r'pic.png')
# Perform the information extraction from images
# Note below, put the address where tesseract.exe
# file is located in your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
print(pytesseract.image_to_string(image))
Output:
GeeksforGeeks
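Instead of hard-coding the location of the Tesseract binary, the path can be looked up at run time. The sketch below is one possible approach, not part of pytesseract: find_tesseract is a hypothetical helper, and the two Windows locations are assumed defaults that may differ on your system.

```python
import os
import shutil

# Candidate install locations on Windows; these are assumptions,
# adjust them to wherever the installer put tesseract.exe
WINDOWS_CANDIDATES = [
    r'C:\Program Files\Tesseract-OCR\tesseract.exe',
    r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe',
]

def find_tesseract(candidates=WINDOWS_CANDIDATES):
    """Return a path to the tesseract binary, or None if none is found."""
    # Prefer a binary that is already on PATH (typical on Linux/macOS)
    on_path = shutil.which('tesseract')
    if on_path:
        return on_path
    # Otherwise fall back to the candidate locations
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

tesseract_path = find_tesseract()
print(tesseract_path or 'Tesseract not found; install it or set the path manually')
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string, exactly as in the example above.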
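3. Extracting Text From Doc File
Here we will extract text from a Word document using the docx module, which is provided by the python-docx package. It reads .docx files; the legacy .doc format is not supported.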
For installation:
pip install python-docx
Image for demonstration: Aniket_Doc.docx
Example 1: First we'll extract the title:
Python3
# Importing our library and reading the doc file
import docx
doc = docx.Document('csv/g.docx')
# Printing the title
print(doc.paragraphs[0].text)
Output:
My Name Aniket
Example 2: Then we'll extract all the paragraph text in the document (excluding the table).
Python3
# Getting all the text in the doc file
l = [p.text for p in doc.paragraphs]
# There might be many useless empty
# strings present so removing them
l = [i for i in l if i]
print(l)
Output:
['My Name Aniket', '        Hello I am Aniket', 'I am giving tutorial on how to extract text from MS Doc.', 'Please go through it carefully.']
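It can help to know what the docx module is reading. A .docx file is a ZIP archive whose main text lives in word/document.xml, with paragraphs stored as w:p elements and their text in w:t elements. The following standard-library sketch reads paragraphs directly from that XML. docx_paragraphs is a hypothetical helper written for illustration, not part of python-docx, and unlike doc.paragraphs it also picks up paragraphs that sit inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_paragraphs(path):
    """Return the non-empty paragraph texts of a .docx file."""
    # A .docx file is a ZIP archive; the body text is in word/document.xml
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    texts = []
    # Every <w:p> is a paragraph; its visible text is split across <w:t> runs
    for para in root.iter(W_NS + 'p'):
        text = ''.join(node.text or '' for node in para.iter(W_NS + 't'))
        if text:
            texts.append(text)
    return texts

# Usage, assuming the same file as in the examples above:
# print(docx_paragraphs('csv/g.docx'))
```

For everyday work python-docx remains the more convenient choice, since it also exposes styles, tables and other document structure.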
Example 3: Now we'll extract the table:
Python3
# Since there is only one table in our doc file
# we are using index 0. For multiple tables
# you can loop over doc.tables
table = doc.tables[0]
# Initializing an empty list, one inner list per row
table_data = []
# Looping through each row of the table
for row in table.rows:
    # cell.text returns the full text of a cell
    table_data.append([cell.text for cell in row.cells])
print(table_data)
Output:
[['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]
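The nested list above is often easier to work with once the first row is treated as a header. A minimal sketch, using the output shown above as input; rows_to_records is a hypothetical helper name:

```python
# The table extracted above, with the header in the first row
rows = [['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]

def rows_to_records(rows):
    """Treat the first row as the header and map each later row onto it."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

records = rows_to_records(rows)
print(records)
# [{'A': '12', 'B': 'aNIKET', 'C': '@@@'}, {'A': '3', 'B': 'SOM', 'C': '+12&'}]
```

Note that every value is still a string; numeric columns such as A would need an explicit conversion before any arithmetic.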
4. Extracting Data From PDF File
The task is to extract data (text and images) from a PDF in Python. We will read the text and save the embedded images using the PyMuPDF library. First, install PyMuPDF along with Pillow:
pip install PyMuPDF Pillow
Example 1:
Now we will extract data from the pdf version of the same doc file.
Python3
# import module (PyMuPDF is imported as fitz)
import fitz
# Reading our pdf file
docu = fitz.open('file.pdf')
# Initializing an empty list where we will put all text
text_list = []
# Looping through all pages of the pdf file.
# Current PyMuPDF uses snake_case names; the older
# pageCount, loadPage and getText are deprecated
for pg in docu:
    # Extracting plain text from each page
    pg_txt = pg.get_text('text')
    # Appending the lines of the page to the list
    text_list.extend(pg_txt.split('\n'))
# Cleaning the text by removing the unicode character
# '\u200b' and then dropping useless empty strings
text_list = [i.replace('\u200b', '') for i in text_list]
text_list = [i for i in text_list if len(i.strip()) != 0]
print(text_list)
Output:
['My Name Aniket ', '        Hello I am Aniket ', 'I am giving tutorial on how to extract text from MS Doc. ', 'Please go through it carefully. ', 'A ', 'B ', 'C ', '12 ', 'aNIKET ', '@@@ ', '3 ', 'SOM ', '+12& ']
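One subtlety in the cleaning step: '\u200b' is a zero-width space, and Python's str.strip() does not treat it as whitespace. A line that contains nothing else therefore survives a strip-based emptiness check unless the character is removed first. The sketch below shows the order that avoids this; clean_lines is a hypothetical helper and the sample text is invented.

```python
ZWSP = '\u200b'  # zero-width space

def clean_lines(page_text):
    """Split one page of extracted text into cleaned, non-empty lines."""
    # Remove zero-width spaces first, then drop lines that are blank
    cleaned = (line.replace(ZWSP, '') for line in page_text.split('\n'))
    return [line for line in cleaned if line.strip()]

# Invented page text: the second line holds only a zero-width space
sample = 'My Name Aniket \n\u200b\n\nA \n'
print(clean_lines(sample))
# ['My Name Aniket ', 'A ']
```

Applying the same helper to each page's text and extending one list with the results handles multi-page documents as well.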
Example 2: Extract image from PDF.
Python3
# Iterating through the pages
for current_page in range(len(docu)):
    # Getting the images in that page
    for image in docu.get_page_images(current_page):
        # get the XREF of the image. The XREF is the
        # cross-reference number that identifies the
        # image object inside the pdf file
        xref = image[0]
        # extract the object i.e.,
        # the image in our pdf file at that XREF
        pix = fitz.Pixmap(docu, xref)
        # PNG cannot store CMYK, so convert such images to RGB
        if pix.n - pix.alpha >= 4:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Storing the image as .png
        pix.save('page %s - %s.png' % (current_page, xref))
Each image is saved in the current working directory with a name of the form page <page_no> - <xref>.png. In our case, its name is page 0 - 7.png.
Now let's view the image.
Python3
# Import necessary library
import matplotlib.pyplot as plt
# Read and display the image
img = plt.imread('page 0 - 7.png')
plt.imshow(img)
# Needed outside notebooks to open the image window
plt.show()
Output: