You can use the PDFMiner package to convert PDF to text.
Example
You can use it in the following way:
import sys from cStringIO import StringIO from pdfminer.pdfpage importPDFPage from pdfminer.pdfinterp importPDFResourceManager, PDFPageInterpreter from pdfminer.layout importLAParams from pdfminer.converter importXMLConverter, HTMLConverter, TextConverter def pdfparser(data): fp = file(data, 'rb') resource_manager = PDFResourceManager() retstr = StringIO() codec = 'utf-8' laparams = LAParams() device = TextConverter(resource_manager,retstr, codec=codec, laparams=laparams) interpreter =PDFPageInterpreter(resource_manager, device) # Process each page contained in thedocument. for page in PDFPage.get_pages(fp): interpreter.process_page(page) data = retstr.getvalue() print data pdfparser('filename.pdf')
This takes in a pdf file and extracts text from it page by page using the process_page function from the PDFPageInterpreter class.
There is an alternative to PDFMiner with a much easier API to use for extracting text. pyPDF works fine(assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can do the following:
import pyPdf pdf = pyPdf.PdfFileReader(open('filename.pdf',"rb")) for page in pdf.pages: print page.extractText()