You can use the PDFMiner package to convert PDF to text.
Example
You can use it in the following way:
# Python 2 code: cStringIO and the print statement do not exist in Python 3.
from cStringIO import StringIO
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

def pdfparser(data):
    # 'data' is the path to the PDF file.
    fp = open(data, 'rb')
    resource_manager = PDFResourceManager()
    # The converter writes the extracted text into this in-memory buffer.
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(resource_manager, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    # Print the text accumulated from all pages.
    text = retstr.getvalue()
    print text

pdfparser('filename.pdf')

This takes a PDF file and extracts its text page by page, using the process_page method of the PDFPageInterpreter class.
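The code above is written for Python 2. For Python 3, a rough equivalent is sketched below. It assumes the pdfminer.six fork is installed (it keeps the same class names), and the function name pdf_to_text is just an illustrative choice, not part of the library. Treat it as a starting point rather than a definitive implementation.

```python
from io import StringIO


def pdf_to_text(path):
    """Return the text of every page of the PDF at `path` as one string."""
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    # The imports live inside the function so that merely defining it
    # does not require the package.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    output = StringIO()  # the converter writes extracted text here
    device = TextConverter(resource_manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        with open(path, "rb") as fp:
            # Feed the pages to the interpreter one at a time.
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
        return output.getvalue()
    finally:
        device.close()


# Example call (needs a real file on disk):
# print(pdf_to_text("filename.pdf"))
```

Returning the string instead of printing it lets the caller decide what to do with the text, for example writing it to a file or searching it.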
pyPdf is an alternative to PDFMiner with a much simpler API for extracting text. It works fine, assuming that you're working with well-formed PDFs. If all you want is the text (with spaces), you can do the following:
import pyPdf

pdf = pyPdf.PdfFileReader(open('filename.pdf', 'rb'))
for page in pdf.pages:
    print page.extractText()
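The pyPdf snippet is also Python 2 only. On Python 3 the same idea can be sketched with pypdf, the maintained successor of pyPdf. This assumes pypdf is installed, and the helper name pdf_pages_text is hypothetical. As with pyPdf, the quality of the result depends on how well-formed the PDF is.

```python
def pdf_pages_text(path):
    """Return a list holding the extracted text of each page."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() returns the text of a single page as a string.
    return [page.extract_text() for page in reader.pages]


# Example call (needs a real file on disk):
# for text in pdf_pages_text("filename.pdf"):
#     print(text)
```

Keeping the pages as separate list items preserves the page boundaries, which a single concatenated string would lose.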