Problem...
Assume you have a directory full of files in different formats and you need to search all of them for a particular keyword.
Getting ready...
Install the packages below.
1. beautifulsoup4
2. python-docx
How to do it...
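1. Write a function to search for a string in a CSV file. I will use csv.reader to go through the file row by row and cell by cell, returning True as soon as the string is found and False otherwise. Each cell is lowercased before the comparison, so the search string is expected in lowercase.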
Example
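import csv

def csv_stringsearch(input_file, input_string):
    """
    Function: search a string in csv files.
    args: input file , input string (expected in lowercase)
    """
    # newline='' is the recommended way to open files for the csv module
    with open(input_file, newline='') as file:
        for row in csv.reader(file):
            for column in row:
                # Lowercase each cell so the match is case-insensitive
                if input_string in column.lower():
                    return True
    return False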
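To try the CSV search without preparing a file by hand, the sketch below writes a small temporary CSV and searches it twice. The function is repeated so the block runs on its own; the `sample.csv` name, the sample rows and the `demo_csv_search` helper are made up for illustration.

```python
import csv
import os
import tempfile


def csv_stringsearch(input_file, input_string):
    """Return True if input_string (lowercase) occurs in any cell."""
    with open(input_file, newline='') as file:
        for row in csv.reader(file):
            for column in row:
                if input_string in column.lower():
                    return True
    return False


def demo_csv_search():
    """Write a tiny sample CSV (made-up data) and search it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.csv')
        with open(path, 'w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['id', 'greeting'])
            writer.writerow(['1', 'Hello World'])
        found = csv_stringsearch(path, 'hello')      # present in the second row
        missing = csv_stringsearch(path, 'goodbye')  # not present anywhere
    return found, missing


if __name__ == '__main__':
    print(demo_csv_search())
```

Because the cells are lowercased, passing 'Hello' instead of 'hello' would not match, which is why the caller lowercases the search string first.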
2. Write a function to search a text file. This is a bit tricky because we need to deal with encoding. There are a great many encodings, and determining which one a file uses is probably the toughest part. Of course, we could go back to whoever created the text file and ask, but hey, we are automating this, right?
So, we will use UnicodeDammit (from beautifulsoup4) to guess the encoding from the first 1024 bytes of the file, then reopen the file as text with that encoding.
Example
from bs4 import UnicodeDammit

def text_stringsearch(input_file, input_string):
    """
    Function: search a string in text files.
    args: input file , input string (expected in lowercase)
    """
    # Read the first 1 KB as raw bytes and let UnicodeDammit guess the encoding
    with open(input_file, 'rb') as file:
        content = file.read(1024)
    guessencoding = UnicodeDammit(content)
    encoding = guessencoding.original_encoding

    # Open and read the file as text using the guessed encoding
    with open(input_file, encoding=encoding) as file:
        for line in file:
            if input_string in line.lower():
                return True
    return False
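A guessed encoding can be wrong, and a wrong guess makes reading the file raise UnicodeDecodeError. As a rough sketch of one way to cope, the variant below skips the guessing and tries a short list of candidate encodings in order, using only the standard library. The function name, the `demo_fallback` helper and the candidate list are assumptions for illustration, not part of the recipe above.

```python
import os
import tempfile


def text_stringsearch_with_fallback(input_file, input_string,
                                    encodings=('utf-8', 'cp1252', 'latin-1')):
    """Search a text file, trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            with open(input_file, encoding=encoding) as file:
                for line in file:
                    if input_string in line.lower():
                        return True
            return False  # file decoded cleanly, string not present
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    return False  # no candidate could decode the file


def demo_fallback():
    """Write cp1252 bytes that are invalid UTF-8, then search them."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'note.txt')
        with open(path, 'wb') as file:
            file.write('Café Hello\n'.encode('cp1252'))
        found = text_stringsearch_with_fallback(path, 'hello')
        missing = text_stringsearch_with_fallback(path, 'goodbye')
    return found, missing


if __name__ == '__main__':
    print(demo_fallback())
```

Ending the list with latin-1 means some candidate always decodes, since every byte is valid in that encoding; the trade-off is that non-ASCII characters may be decoded incorrectly.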
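3. Write a function to search for a string in an MS Word document. python-docx exposes the document body as a list of paragraphs, so we check the text of each one.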
Example
import docx

def MSDocx_stringsearch(input_file, input_string):
    """
    Function: search a string in MS Word documents.
    args: input file , input string (expected in lowercase)
    """
    doc = docx.Document(input_file)
    # Check the text of every paragraph in the document body
    for paragraph in doc.paragraphs:
        if input_string in paragraph.text.lower():
            return True
    return False
4. Now we need a main function that loops through the files and calls the corresponding search function with the string to search for. Here I assume the code and the input files are in the same directory. You can add a path parameter if your directory is in a different location.
Example
import os

def main(input_string):
    """
    Function: Open the current directory and search for a string in all the files
    args: input string
    """
    # Walk the current directory and all of its subdirectories
    for root, dirs, files in os.walk('.'):
        for file in files:
            # Get the file extension
            extension = file.split('.')[-1]
            # Only search files whose extension has a registered function
            if extension in function_mapping:
                search_file = function_mapping.get(extension)
                full_file_path = os.path.join(root, file)
                if search_file(full_file_path, input_string):
                    print(f' *** Yeah String found in {full_file_path}')
5. Map our functions to the file extensions by creating a dictionary. This is the function_mapping dictionary that main looks up, so it must be defined before main is called.
Example
function_mapping = {
    'csv': csv_stringsearch,
    'txt': text_stringsearch,
    'docx': MSDocx_stringsearch,
}
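The extension lookup can also be written with os.path.splitext, which returns an empty extension for a file with no dot in its name (so a file literally called `csv` is skipped) and makes it easy to ignore case. The sketch below is one possible variant, not the recipe's own code; `pick_search_function` and the two stand-in search functions are hypothetical names used only to keep the block self-contained.

```python
import os


def pick_search_function(filename, mapping):
    """Return the search function registered for filename's extension, or None."""
    # splitext('report.CSV') -> ('report', '.CSV'); strip the dot and lowercase
    extension = os.path.splitext(filename)[1].lstrip('.').lower()
    return mapping.get(extension)


# Stand-ins for the real search functions (hypothetical, for illustration only)
def fake_csv_search(input_file, input_string):
    return False


def fake_text_search(input_file, input_string):
    return False


demo_mapping = {
    'csv': fake_csv_search,
    'txt': fake_text_search,
}

if __name__ == '__main__':
    for name in ('report.CSV', 'notes.txt', 'csv', 'readme.md'):
        chosen = pick_search_function(name, demo_mapping)
        print(name, '->', chosen.__name__ if chosen else None)
```

Returning None for unknown extensions lets the caller skip a file with a single check instead of testing membership and then looking the function up again.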
6. Putting it all together.
Example
import os
import argparse
import csv

import docx
from bs4 import UnicodeDammit


def csv_stringsearch(input_file, input_string):
    """
    Function: search a string in csv files.
    args: input file , input string
    """
    with open(input_file) as file:
        for row in csv.reader(file):
            for column in row:
                if input_string in column.lower():
                    return True
    return False


def MSDocx_stringsearch(input_file, input_string):
    """
    Function: search a string in MS Word documents.
    args: input file , input string
    """
    doc = docx.Document(input_file)
    for paragraph in doc.paragraphs:
        if input_string in paragraph.text.lower():
            return True
    return False


def text_stringsearch(input_file, input_string):
    """
    Function: search a string in text files.
    args: input file , input string
    """
    # Guess the encoding from the first 1 KB of raw bytes
    with open(input_file, 'rb') as file:
        content = file.read(1024)
    guessencoding = UnicodeDammit(content)
    encoding = guessencoding.original_encoding

    # Open and read
    with open(input_file, encoding=encoding) as file:
        for line in file:
            if input_string in line.lower():
                return True
    return False


def main(input_string):
    """
    Function: Open the current directory and search for a string in all the files
    args: input string
    """
    for root, dirs, files in os.walk('.'):
        for file in files:
            # Get the file extension
            extension = file.split('.')[-1]
            if extension in function_mapping:
                search_file = function_mapping.get(extension)
                full_file_path = os.path.join(root, file)
                if search_file(full_file_path, input_string):
                    print(f' *** Yeah String found in {full_file_path}')


# Map each supported file extension to its search function
function_mapping = {
    'csv': csv_stringsearch,
    'txt': text_stringsearch,
    'docx': MSDocx_stringsearch,
}

if __name__ == '__main__':
    string_to_search = 'Hello'
    print(f'Output \n')
    # The search functions lowercase the file contents, so lowercase the input too
    main(string_to_search.lower())
Output
 *** Yeah String found in .\Hello_World.docx
 *** Yeah String found in .\My_Amazing_WordDoc.docx
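7. If you want to run the program from the command line and pass in the search string, use argparse. Replace the `if __name__ == '__main__':` block of the full program with the one below.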
Example
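if __name__ == '__main__':
    # -s takes the string to search for; it defaults to 'Hello'
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', type=str, help='Input string to search', default='Hello')
    args = parser.parse_args()
    # Lowercase the input because the search functions compare against lowercased text
    main(args.s.lower())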
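Step 4 mentions adding a path parameter when the files live in a different directory. One way this might look on the command line is sketched below. The `-d` option and the `build_parser` helper are hypothetical additions, and `main` as written above always walks '.', so it would need an extra parameter before it could use the value.

```python
import argparse


def build_parser():
    """Build a parser with the search string and a directory option."""
    parser = argparse.ArgumentParser(description='Search files for a string')
    parser.add_argument('-s', type=str, default='Hello',
                        help='Input string to search')
    # Hypothetical option: not part of the original program
    parser.add_argument('-d', type=str, default='.',
                        help='Directory to search')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    # Only report what would be searched; wiring into main() is left out
    print(f'Would search {args.d!r} for {args.s.lower()!r}')
```

Keeping the parser in its own function makes it possible to call `parse_args` with an explicit list of arguments, which is convenient for trying out options without touching the real command line.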