
Documentation ML

This document outlines an image processing and entity extraction pipeline that downloads images from URLs, extracts text using OCR, and predicts entities like weight and volume. It details the required libraries, functions for downloading images, extracting text, and making predictions, as well as how to save results to a CSV file. The pipeline is designed for bulk OCR and entity extraction tasks, providing a structured approach to handle image data efficiently.

Uploaded by

Ayushi Vardhan

Documentation: Image Processing and Entity Extraction Pipeline

This project involves downloading images from URLs, extracting text
using OCR (Optical Character Recognition), and predicting certain
entities like weight, volume, or dimensions from the extracted text. It
uses pandas, OpenCV, requests, and pytesseract to accomplish these
tasks.

This script is designed to:

• Download a random sample of images from a dataset.
• Extract text from images using Optical Character Recognition (OCR)
  via pytesseract.
• Extract specific entities (like weight, volume, dimensions) from the
  extracted text using regex patterns.
• Make predictions and save the results to a CSV file.

Team Name
Data_Dynamo_

Anisha Roy
Ayushi Singh
Ayushi Vardhan
Table of Contents

1. Requirements

2. Assumptions

3. Functions

4. download_random_sample_images

5. extract_text_from_image

6. extract_entity

7. make_predictions

8. Usage

9. How it Works

10. Modifications

Requirements

Make sure the following libraries are installed:

pandas: For reading and processing CSV data.
pytesseract: Python wrapper for Google's Tesseract-OCR to extract text
from images.
cv2 (OpenCV): For image processing (converting images to grayscale,
reading image data).
requests: For downloading images from the internet.
numpy: For handling image arrays.
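
Before running the pipeline, it can help to confirm that each library is importable. The following is a minimal sketch, not part of the original script; the helper name missing_packages is hypothetical, and the mapping assumes the usual PyPI package names (cv2 is provided by opencv-python).

```python
import importlib.util

# Import name -> pip package name (cv2 is installed as opencv-python).
REQUIRED = {
    "pandas": "pandas",
    "pytesseract": "pytesseract",
    "cv2": "opencv-python",
    "requests": "requests",
    "numpy": "numpy",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of modules that cannot be found for import."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

The check only looks for the Python packages; the Tesseract engine itself is a separate install.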

Assumptions

CSV file: The input dataset test.csv has the following structure:

index: Unique identifier for each row.
image_link: URL of the image.
entity_name: The name of the entity you're extracting (e.g., weight,
volume, dimensions).

download_images function: Assumes that the function
download_images(url, path) exists in src.utils to download images from
a given URL and save them to the specified path.

ALLOWED_UNITS constant: Assumes the existence of a constant
ALLOWED_UNITS in src.constants, which defines valid units for
extraction.

Tesseract-OCR: Ensure that Tesseract OCR is properly installed and
accessible in your environment. If not, download it.
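
One way to check the Tesseract assumption is to look for the binary on the PATH. This is a small sketch under the assumption that the executable is named tesseract; the helper find_tesseract is hypothetical and not part of the original script.

```python
import shutil

def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)

path = find_tesseract()
if path is None:
    # pytesseract can also be pointed at a binary explicitly by setting
    # pytesseract.pytesseract.tesseract_cmd to its full path.
    print("Tesseract not found on PATH; install it or configure its location.")
else:
    print(f"Tesseract found at: {path}")
```
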

Functions

1. Setup and Data Loading:

import pandas as pd
import pytesseract
import cv2
import requests
import numpy as np
import re

# Load the train and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display the first few rows of the data to understand its structure
print("Train Data:")
print(train_df.head())
print("Test Data:")
print(test_df.head())

Explanation:
train.csv and test.csv: The training and testing datasets containing
image URLs and other relevant columns. These CSV files are loaded into
pandas DataFrames.
Inspection: The first few rows of both datasets are printed so the data
structure can be checked before processing.
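
Since the later steps rely on the index, image_link, and entity_name columns, a header check can catch a malformed file early. The sketch below uses the standard-library csv module so it runs without pandas; the helper missing_columns and the two-line sample are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("index", "image_link", "entity_name")

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # An empty file yields an empty header
    return [name for name in required if name not in header]

# Hypothetical two-line sample standing in for test.csv.
sample = io.StringIO(
    "index,image_link,entity_name\n"
    "0,https://example.com/a.jpg,item_weight\n"
)
print(missing_columns(sample))  # prints []
```

With a DataFrame already loaded, the same idea can be expressed by comparing the required names against test_df.columns.
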
2. Image Downloading

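def download_image(url):
    """Download an image and decode it into a BGR NumPy array, or return None."""
    try:
        # A timeout keeps one slow server from stalling the whole run
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Treat HTTP errors (404, 500, ...) as failures
        buffer = np.frombuffer(response.content, dtype=np.uint8)
        # IMREAD_COLOR always yields 3 channels, which the BGR-to-gray step expects
        img = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
        if img is None:
            raise ValueError("Image could not be decoded")
        print(f"Image downloaded successfully: {url}")
        print(f"Image shape: {img.shape}")
        return img
    except Exception as e:
        print(f"Error downloading or decoding image: {e}")
        return None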

Explanation:
Purpose: Downloads an image from a given URL and decodes it into a
format that can be processed by OpenCV.
Error Handling: If the image cannot be downloaded or decoded, an
exception is caught and an error message is printed.
Return: Returns the image as a NumPy array or None if the download
fails.
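
Because a failed download is signalled by a None return value, transient network errors can be handled by simply trying again. The following is a minimal sketch of that idea; with_retries is a hypothetical helper, and flaky_fetch is a stand-in for download_image so the example runs without network access.

```python
import time

def with_retries(fetch, url, attempts=3, delay_seconds=0.0):
    """Call fetch(url) until it returns something other than None.

    Returns the first non-None result, or None after `attempts` failed tries.
    """
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts:
            time.sleep(delay_seconds)  # Pause before the next try
    return None

# Stand-in for download_image: fails twice, then succeeds.
calls = []
def flaky_fetch(url):
    calls.append(url)
    return "image-data" if len(calls) >= 3 else None

print(with_retries(flaky_fetch, "https://example.com/a.jpg"))  # prints image-data
```

In the pipeline, the call would look like with_retries(download_image, url, delay_seconds=1.0).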

3. Text Extraction Using OCR

def extract_text_from_image(url):
    image = download_image(url)
    if image is not None:
        try:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            extracted_text = pytesseract.image_to_string(gray)
            return extracted_text
        except Exception as e:
            print(f"Error in OCR processing: {e}")
    return ""

Explanation:
Input: Takes the URL of an image.
Download and Preprocessing: The image is downloaded and
converted to grayscale using OpenCV, which helps improve the accuracy
of the OCR process.
OCR: The grayscale image is passed to Tesseract-OCR to extract text.
Error Handling: Any OCR processing issues are caught and logged.
Return: Returns the extracted text from the image or an empty string
if an error occurs.
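
Raw OCR output tends to contain irregular spacing, blank lines, and sometimes a trailing form-feed character. A light clean-up pass before entity extraction might look like the sketch below; normalize_ocr_text is a hypothetical helper, not part of the original script.

```python
import re

def normalize_ocr_text(text):
    """Collapse runs of whitespace and drop blank lines from raw OCR output."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "  Net  Wt.\n\n\n 500   gram \n\x0c"
print(repr(normalize_ocr_text(raw)))  # prints 'Net Wt.\n500 gram'
```
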

4. Entity Extraction from Text

patterns = {
    'item_weight': r'(\d+\.?\d*)\s*(gram|kilogram|ounce|pound)',
    'item_volume': r'(\d+\.?\d*)\s*(milliliter|liter|gallon)',
    'item_dimension': r'(\d+\.?\d*)\s*(centimetre|meter|inch)',
    # Add more patterns as needed
}

def extract_entity(text, entity_name):
    if entity_name in patterns:
        pattern = patterns[entity_name]
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return f'{float(match.group(1))} {match.group(2).lower()}'
    return ""

Explanation:
Patterns: A dictionary of regular expressions for extracting specific
entities such as weight, volume, or dimensions from text.
Example: For weight, the pattern looks for a number followed by a
weight unit like "gram", "kilogram", "ounce", or "pound".
Functionality: The extract_entity function uses these patterns to
search for the relevant entity in the extracted text.
Return: If a match is found, it returns the extracted entity value;
otherwise, it returns an empty string.
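
The patterns above match only full unit names, so text such as "500g" or "2.5 KG" yields no result. One possible extension, sketched here for weights only, maps abbreviations and plurals to a canonical unit. The alias table and the name extract_weight are assumptions for illustration and are not part of the original script.

```python
import re

# Hypothetical alias table: maps spellings and abbreviations that may appear
# in OCR text to one canonical unit name.
WEIGHT_ALIASES = {
    "g": "gram", "gram": "gram", "grams": "gram",
    "kg": "kilogram", "kilogram": "kilogram", "kilograms": "kilogram",
    "oz": "ounce", "ounce": "ounce", "ounces": "ounce",
    "lb": "pound", "lbs": "pound", "pound": "pound", "pounds": "pound",
}

# Longest aliases first so "kilograms" is tried before "kg" or "g".
_alternatives = "|".join(sorted(map(re.escape, WEIGHT_ALIASES), key=len, reverse=True))
# The trailing \b stops "g" from matching the start of an unrelated word.
WEIGHT_PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({_alternatives})\b", re.IGNORECASE)

def extract_weight(text):
    """Return '<value> <canonical unit>' for the first weight in text, else ''."""
    match = WEIGHT_PATTERN.search(text)
    if not match:
        return ""
    unit = WEIGHT_ALIASES[match.group(2).lower()]
    return f"{float(match.group(1))} {unit}"

print(extract_weight("Net wt 500g"))     # prints 500.0 gram
print(extract_weight("2.5 KG pack"))     # prints 2.5 kilogram
print(extract_weight("1000 games won"))  # prints an empty line
```

Whether an alias table like this is appropriate depends on the units listed in ALLOWED_UNITS, which this document does not show.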

5. Predictions Based on Test Data

import random

# Set the sample size (e.g., 100 rows), capped at the number of available rows
sample_size = min(100, len(test_df))

# Pick random row positions from the test data
random_indices = random.sample(range(len(test_df)), sample_size)

# Start with an empty prediction for every row so unsampled rows stay blank
predictions = [""] * len(test_df)
for position in random_indices:
    row = test_df.iloc[position]
    # Ensure this column exists and contains valid URLs
    sample_image_link = row['image_link']
    print(f"Processing image: {sample_image_link}")  # Debugging line
    text = extract_text_from_image(sample_image_link)
    print(f"Extracted Text: {text}")  # Debugging line
    # Replace 'item_weight' with the appropriate entity name,
    # e.g. row['entity_name']
    prediction = extract_entity(text, 'item_weight')
    # Store at the row's own position so it stays aligned with its index
    predictions[position] = prediction

output_df = pd.DataFrame({
    'index': test_df['index'],
    'prediction': predictions
})

Explanation:
The code imports the random module, sets a sample size, and generates
random row positions. It then loops through each sampled position, reads
the image link, extracts text from the image, extracts the entity, and
records the prediction. Rows outside the sample have no prediction, so
the output DataFrame still contains one row per entry in test_df.

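
One detail worth care in a loop like this is keeping each prediction attached to the row it came from, since the sampled positions are in random order. The sketch below shows the idea on plain Python rows; predict_sample, the sample rows, and the stand-in predictor are all hypothetical, and the real pipeline would run OCR where the lambda is.

```python
import random

def predict_sample(rows, predict, sample_size, seed=None):
    """Predict for a random sample of rows; unsampled rows get ''.

    The returned list has one entry per row, in the original row order.
    """
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))  # random.sample rejects oversized requests
    positions = rng.sample(range(len(rows)), sample_size)
    predictions = [""] * len(rows)
    for position in positions:
        predictions[position] = predict(rows[position])
    return predictions

# Hypothetical rows and a stand-in predictor.
rows = [{"index": i, "image_link": f"https://example.com/{i}.jpg"} for i in range(10)]
result = predict_sample(rows, lambda row: f"{row['index']}.0 gram", sample_size=3, seed=0)
print(result)
```

Writing each result to its own position avoids having to pad the list afterwards.
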
6. Saving Predictions to CSV

import pandas as pd
import numpy as np

# Placeholder data for illustration only: a test_df with 'index' and
# 'actual_entity' columns. Skip this when the real test_df is loaded.
test_df = pd.DataFrame({'index': range(131187),
                        'actual_entity': np.random.rand(131187)})

# Placeholder predictions; replace with your actual predictions
predictions = np.random.rand(131187)

# Create a DataFrame with index and prediction columns
output_df = pd.DataFrame({
    'index': test_df['index'],
    'prediction': predictions
})

# Save the output to a CSV file
output_df.to_csv('test_out.csv', index=False)

print("Output saved to test_out.csv")

# Calculate F1 score
def calculate_f1_score(gt, out):
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    true_negatives = 0

    for i in range(len(gt)):
        if out[i] != "" and gt[i] != "" and out[i] == gt[i]:
            true_positives += 1
        elif out[i] != "" and gt[i] != "" and out[i] != gt[i]:
            false_positives += 1
        elif out[i] != "" and gt[i] == "":
            false_positives += 1
        elif out[i] == "" and gt[i] != "":
            false_negatives += 1
        elif out[i] == "" and gt[i] == "":
            true_negatives += 1

    precision = (true_positives / (true_positives + false_positives)
                 if (true_positives + false_positives) > 0 else 0)
    recall = (true_positives / (true_positives + false_negatives)
              if (true_positives + false_negatives) > 0 else 0)

    f1_score = (2 * precision * recall / (precision + recall)
                if (precision + recall) > 0 else 0)

    return f1_score

actual_entities = test_df['actual_entity'].tolist()
f1_score = calculate_f1_score(actual_entities, predictions)
print(f"F1 Score: {f1_score:.4f}")
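
To make the counting rules of calculate_f1_score concrete, here is a small worked example on five hand-written rows. The function f1_from_lists is a compact, hypothetical restatement of the same rules that also returns precision and recall; the sample values are made up for illustration.

```python
def f1_from_lists(gt, out):
    """Return (precision, recall, f1) using the same counting rules."""
    true_positives = false_positives = false_negatives = 0
    for truth, predicted in zip(gt, out):
        if predicted != "" and predicted == truth:
            true_positives += 1
        elif predicted != "":
            false_positives += 1   # Wrong value, or a value where none was expected
        elif truth != "":
            false_negatives += 1   # Nothing predicted although a value exists
    predicted_total = true_positives + false_positives
    actual_total = true_positives + false_negatives
    precision = true_positives / predicted_total if predicted_total else 0
    recall = true_positives / actual_total if actual_total else 0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0
    return precision, recall, f1

gt = ["5.0 gram", "", "2.0 liter", "1.0 kilogram", ""]
out = ["5.0 gram", "3.0 gram", "", "2.0 kilogram", ""]
precision, recall, f1 = f1_from_lists(gt, out)
# 1 true positive, 2 false positives, 1 false negative:
# precision = 1/3, recall = 1/2, F1 = 0.4
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```
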

Usage
Download a random sample of images:

sampled_test_df = download_random_sample_images(test_df, sample_size=100)

Generate predictions for the sampled images:

make_predictions(sampled_test_df)

How it Works
DataFrame Creation: A new DataFrame is created to store the
predictions along with the corresponding index from the test dataset.

CSV Output: The DataFrame is saved to a CSV file (test_out.csv), which
contains the predicted entity values for each test image.
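
After the file is written, a quick structural check can confirm that it has the expected header and one row per test index. This is a sketch using the standard-library csv module; check_output and the inline sample are hypothetical and stand in for reading test_out.csv.

```python
import csv
import io

def check_output(csv_file, expected_indices):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["index", "prediction"]:
        return [f"unexpected header: {reader.fieldnames}"]
    seen = [row["index"] for row in reader]
    problems = []
    missing = set(map(str, expected_indices)) - set(seen)
    if missing:
        problems.append(f"missing indices: {sorted(missing)}")
    if len(seen) != len(set(seen)):
        problems.append("duplicate indices")
    return problems

# Hypothetical file contents; row 1 has an empty prediction.
sample = io.StringIO("index,prediction\n0,5.0 gram\n1,\n2,2.0 liter\n")
print(check_output(sample, range(3)))  # prints []
```

For the real file, the call would be check_output(open('test_out.csv', newline=''), test_df['index']).
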
Modifications

Add new entity extraction: Add new regular expressions to the patterns
dictionary used by extract_entity to support additional entity types.
Change sample size: Adjust the sample_size parameter in
download_random_sample_images to control the number of images
downloaded.
Custom error handling: Modify the error handling within the
download_images or text extraction logic to suit specific needs.
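
As an illustration of the first modification, the sketch below registers one extra pattern and reuses the same lookup logic. The entity name 'voltage' and its unit list are hypothetical examples, not part of the original script.

```python
import re

patterns = {
    'item_weight': r'(\d+\.?\d*)\s*(gram|kilogram|ounce|pound)',
}

# Hypothetical new entity type; longer unit names come first so that
# "kilovolt" is not cut short to "volt".
patterns['voltage'] = r'(\d+\.?\d*)\s*(millivolt|kilovolt|volt)'

def extract_entity(text, entity_name):
    """Return '<value> <unit>' for the first match of the named pattern, else ''."""
    if entity_name in patterns:
        match = re.search(patterns[entity_name], text, re.IGNORECASE)
        if match:
            return f'{float(match.group(1))} {match.group(2).lower()}'
    return ""

print(extract_entity("Input: 230 Volt AC", 'voltage'))  # prints 230.0 volt
```

No change to extract_entity itself is needed, since it looks patterns up by name.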

Summary

The pipeline starts by reading a CSV file containing image URLs,
downloading each image, processing it to extract text using
Tesseract-OCR, and then applying regular expressions to predict specific
entities (e.g., weight, volume, dimension).

The results are saved in a CSV file, making this process useful for bulk
OCR and entity extraction tasks from images.
