text_processor

The LEO Text Processor module processes text files to generate intents by cleaning the text, splitting it into sentences, and extracting key phrases. It includes methods for reading files, basic preprocessing, and utilizing spaCy for advanced phrase extraction, with a fallback to a simple approach if spaCy is unavailable. The module also provides progress and status updates during processing.

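For reference, a minimal usage sketch (assuming the module is saved as text_processor.py and that sample.txt is a text file on disk; both names are illustrative):

from text_processor import TextProcessor

processor = TextProcessor()
processor.on_status = print                           # print status messages as they arrive
processor.on_progress = lambda p: print(f"{p}%")      # print progress percentages

result = processor.process("sample.txt")              # path is illustrative
print(f"{len(result['sentences'])} sentences found")
print("Top key phrases:", result['key_phrases'][:5])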

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
LEO Text Processor

This module processes text files for intent generation.
"""

import os
import logging
import re
from collections import Counter

class TextProcessor:
    """Processes text files for intent generation."""

    def __init__(self):
        """Initialize the text processor."""
        # Callbacks for progress (0-100) and status messages; no-ops by default
        self.on_progress = lambda p: None
        self.on_status = lambda s: None

    def process(self, file_path):
        """
        Process a text file.

        Args:
            file_path (str): Path to the text file

        Returns:
            dict: Cleaned text, sentences, and key phrases
        """
        try:
            self.on_status(f"Processing text file: {os.path.basename(file_path)}")
            self.on_progress(10)

            # Read file
            with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
                text = f.read()

            self.on_progress(30)

            # Basic preprocessing
            self.on_status("Cleaning text...")

            # Remove extra whitespace
            text = re.sub(r'\s+', ' ', text)

            # Split into sentences
            self.on_status("Splitting into sentences...")
            sentences = self._split_into_sentences(text)

            self.on_progress(70)

            # Extract key phrases
            self.on_status("Extracting key phrases...")
            key_phrases = self._extract_key_phrases(sentences)

            self.on_progress(90)

            # Combine results
            result = {
                'text': text,
                'sentences': sentences,
                'key_phrases': key_phrases
            }

            self.on_progress(100)
            self.on_status("Text processing complete")

            return result

        except Exception as e:
            logging.error(f"Error processing text file: {str(e)}", exc_info=True)
            raise

    def _split_into_sentences(self, text):
        """
        Split text into sentences.

        Args:
            text (str): Text to split

        Returns:
            list: List of sentences
        """
        # Simple sentence splitting
        sentences = re.split(r'(?<=[.!?])\s+', text)

        # Filter out empty sentences
        sentences = [s.strip() for s in sentences if s.strip()]

        return sentences

    def _extract_key_phrases(self, sentences):
        """
        Extract key phrases from sentences.

        Args:
            sentences (list): List of sentences

        Returns:
            list: List of key phrases
        """
        # Try to use spaCy if available
        try:
            import spacy

            # Load spaCy model
            nlp = spacy.load("en_core_web_sm")

            key_phrases = []

            for sentence in sentences:
                doc = nlp(sentence)

                # Extract noun phrases
                for chunk in doc.noun_chunks:
                    if len(chunk.text.split()) > 1:  # Only multi-word phrases
                        key_phrases.append(chunk.text)

                # Extract verb phrases
                for token in doc:
                    if token.pos_ == "VERB":
                        phrase = token.text
                        for child in token.children:
                            if child.dep_ in ["dobj", "pobj"]:
                                phrase += " " + child.text
                        key_phrases.append(phrase)

            return key_phrases

        except ImportError:
            # Fallback to simple approach if spaCy is not available
            logging.warning("spaCy not available, using simple key phrase extraction")

            # Tokenize
            words = []
            for sentence in sentences:
                words.extend(sentence.lower().split())

            # Count word frequencies
            word_counts = Counter(words)

            # Get common bigrams
            bigrams = []
            for i in range(len(words) - 1):
                bigrams.append(words[i] + " " + words[i + 1])

            bigram_counts = Counter(bigrams)

            # Return top phrases
            return [phrase for phrase, count in bigram_counts.most_common(20)]
