
Extract Text From PDF, Office Files (.Doc, .PPT, .XLS), Open Office Files, .RTF, and Text - Plain Files in Java - Codezrule's Blog

The document describes code to extract text from various file formats like PDF, Microsoft Office files, OpenOffice files, RTF, and plain text files in Java. It provides code snippets to extract text from PDF, Word (.doc), PowerPoint (.ppt), Excel (.xls), RTF, and plain text files. The code uses various third party Java libraries like PDFBox, POI, JDOM, and FontBox to parse the different file formats and extract the text content.

Uploaded by

Capital Design

01/03/2019 Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java | Codezrule's Blog

Codezrule's Blog

Technology from the viewing point..

MARCH 24, 2010 by CODEZRULE (SHIVA)

Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java

The following code extracts text from .pdf, .doc, .ppt, .xls, .odt, .ods, .odp, .rtf and all
text/plain files. The program uses the following jar files:

FontBox-0.1.0-dev.jar
jdom.jar
log4j-1.2.15.jar
PDFBox-0.7.3-dev.jar
poi-2.5.1-final-20040804.jar
poi-contrib-2.5.1-final-20040804.jar
poi-scratchpad-2.5.1-final-20040804.jar

If you are unable to find these jars, leave a comment and I'll send them to you.

The code is as follows:

https://codezrule.wordpress.com/2010/03/24/extract-text-from-pdf-office-files-doc-ppt-xls-open-office-files-rtf-and-textplain-files-in-java/

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import org.apache.poi.hdf.extractor.WordDocument;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.util.LittleEndian;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;

public class ReadFileFormat {

    // Collects the text found by the PowerPoint listener (see ppt2text).
    StringBuffer sb = new StringBuffer(8192);

    // Collects the text found while walking content.xml (see getOpenOfficeText).
    StringBuffer TextBuffer = new StringBuffer();

    // Extracts the text of a PDF file with PDFBox; returns null on failure.
    public String pdftotext(String fileName) {
        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File f = new File(fileName);
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
            cosDoc.close();
            pdDoc.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            // Release whatever was opened before the failure.
            try {
                if (cosDoc != null) {
                    cosDoc.close();
                }
                if (pdDoc != null) {
                    pdDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
            return null;
        }
        System.out.println("Done");
        return parsedText;
    }

    // Extracts the text of a Word (.doc) file with POI's WordDocument.
    public String doc2text(String fileName) throws IOException {
        WordDocument wd = new WordDocument(fileName);
        StringWriter docTextWriter = new StringWriter();
        wd.writeAllText(new PrintWriter(docTextWriter));
        docTextWriter.close();
        return docTextWriter.toString();
    }

    // Extracts the text of an RTF stream with the JDK's own RTFEditorKit.
    public String rtf2text(InputStream is) throws Exception {
        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
        new RTFEditorKit().read(is, styledDoc, 0);
        return styledDoc.getText(0, styledDoc.getLength());
    }

    // Extracts the text of a PowerPoint (.ppt) file; the listener fills sb.
    public String ppt2text(String fileName) throws Exception {
        sb.setLength(0); // start empty so repeated calls do not accumulate text
        POIFSReader poifReader = new POIFSReader();
        poifReader.registerListener(new ReadFileFormat.MyPOIFSReaderListener());
        poifReader.read(new FileInputStream(fileName));
        return sb.toString();
    }

    // Receives each stream of the .ppt file and appends the text of every
    // record of type 4008 (a text bytes record) to sb.
    class MyPOIFSReaderListener implements POIFSReaderListener {

        public void processPOIFSReaderEvent(POIFSReaderEvent event) {
            char ch0 = (char) 0;
            char ch11 = (char) 11;
            try {
                DocumentInputStream dis = event.getStream();
                byte btoWrite[] = new byte[dis.available()];
                dis.read(btoWrite, 0, dis.available());
                for (int i = 0; i < btoWrite.length - 20; i++) {
                    long type = LittleEndian.getUShort(btoWrite, i + 2);
                    long size = LittleEndian.getUInt(btoWrite, i + 4);
                    if (type == 4008) {
                        try {
                            // This statement is cut off after "(in" in the source; the
                            // length argument and the replace calls are a reconstruction.
                            String s = new String(btoWrite, i + 4 + 1, (int) size + 3)
                                    .replace(ch0, ' ').replace(ch11, ' ');
                            // Skip the placeholder text of empty slide templates.
                            if (!s.trim().startsWith("Click to edit")) {
                                sb.append(s);
                            }
                        } catch (Exception ee) {
                            System.out.println("error: " + ee);
                        }
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
                return;
            }
        }
    }

    // Extracts the text of an Excel (.xls) stream, sheet by sheet and row by row.
    public String xls2text(InputStream in) throws Exception {
        HSSFWorkbook excelWb = new HSSFWorkbook(in);
        StringBuffer result = new StringBuffer(4096);
        int numberOfSheets = excelWb.getNumberOfSheets();
        for (int i = 0; i < numberOfSheets; i++) {
            HSSFSheet sheet = excelWb.getSheetAt(i);
            int numberOfRows = sheet.getPhysicalNumberOfRows();
            if (numberOfRows > 0) {
                if (excelWb.getSheetName(i) != null && excelWb.getSheetName(i).length() != 0) {
                    // append sheet name to content
                    if (i > 0) {
                        result.append("\n\n");
                    }
                    result.append(excelWb.getSheetName(i).trim());
                    result.append(":\n\n");
                }

                Iterator<HSSFRow> rowIt = sheet.rowIterator();
                while (rowIt.hasNext()) {
                    HSSFRow row = rowIt.next();
                    if (row != null) {
                        boolean hasContent = false;
                        Iterator<HSSFCell> it = row.cellIterator();
                        while (it.hasNext()) {
                            HSSFCell cell = it.next();
                            String text = null;
                            try {
                                switch (cell.getCellType()) {
                                    case HSSFCell.CELL_TYPE_BLANK:
                                    case HSSFCell.CELL_TYPE_ERROR:
                                        // ignore all blank or error cells
                                        break;
                                    case HSSFCell.CELL_TYPE_NUMERIC:
                                        text = Double.toString(cell.getNumericCellValue());
                                        break;
                                    case HSSFCell.CELL_TYPE_BOOLEAN:
                                        text = Boolean.toString(cell.getBooleanCellValue());
                                        break;
                                    case HSSFCell.CELL_TYPE_STRING:
                                    default:
                                        text = cell.getStringCellValue();
                                        break;
                                }
                            } catch (Exception e) {
                                // a cell that cannot be read is skipped on purpose
                            }
                            if ((text != null) && (text.length() != 0)) {
                                result.append(text.trim());
                                result.append(' ');
                                hasContent = true;
                            }
                        }
                        if (hasContent) {
                            // append a newline at the end of each row that has content
                            result.append('\n');
                        }
                    }
                }
            }
        }
        return result.toString();
    }

    // Walks the content.xml tree and appends the text of text:* elements to TextBuffer.
    public void processElement(Object o) {
        if (o instanceof Element) {
            Element e = (Element) o;
            String elementName = e.getQualifiedName();
            if (elementName.startsWith("text")) {
                if (elementName.equals("text:tab")) { // add tab for text:tab
                    TextBuffer.append("\t");
                } else if (elementName.equals("text:s")) { // add space for text:s
                    TextBuffer.append(" ");
                } else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();
                    while (iterator.hasNext()) {
                        Object child = iterator.next();
                        // If the child is a Text node, append its text
                        if (child instanceof Text) {
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        } else {
                            processElement(child); // Recursively process the child element
                        }
                    }
                }
                // end each paragraph with a newline
                if (elementName.equals("text:p")) {
                    TextBuffer.append("\n");
                }
            } else {
                // not a text element itself, but its children may be
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);
                }
            }
        }
    }

    // Extracts the text of an OpenOffice file (.odt, .ods, .odp).
    public String getOpenOfficeText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();
        // An OpenOffice document is a zip archive; its text lives in content.xml
        ZipFile zipFile = new ZipFile(fileName);
        try {
            Enumeration entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = (ZipEntry) entries.nextElement();
                if (entry.getName().equals("content.xml")) {
                    SAXBuilder sax = new SAXBuilder();
                    Document doc = sax.build(zipFile.getInputStream(entry));
                    Element rootElement = doc.getRootElement();
                    processElement(rootElement);
                    break;
                }
            }
        } finally {
            zipFile.close(); // the original never closed the archive
        }
        return TextBuffer.toString();
    }

    // Reads a plain text file line by line using the platform default encoding.
    public String fileToStringNow(File f) throws Exception {
        BufferedReader br = new BufferedReader(new FileReader(f));
        StringBuffer sbuff = new StringBuffer();
        try {
            String nextLine;
            while ((nextLine = br.readLine()) != null) {
                sbuff.append(nextLine);
                sbuff.append(System.getProperty("line.separator"));
            }
        } finally {
            br.close(); // the original never closed the reader
        }
        return sbuff.toString();
    }

    public static void main(String[] args) throws Exception {
        ReadFileFormat rff = new ReadFileFormat();
        System.out.print("Enter File Name => ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String fileName = br.readLine();
        File f = new File(fileName);
        if (!f.exists()) {
            System.out.println("Sorry, the file does not exist!");
        } else {
            // The extension checks are cut off in the source; comparing against the
            // lower-cased name covers both the lower-case and upper-case spellings.
            String name = f.getName().toLowerCase();
            if (name.endsWith(".pdf")) {
                System.out.println(rff.pdftotext(fileName));
            } else if (name.endsWith(".doc")) {
                System.out.println(rff.doc2text(fileName));
            } else if (name.endsWith(".rtf")) {
                System.out.println(rff.rtf2text(new FileInputStream(f)));
            } else if (name.endsWith(".ppt")) {
                System.out.println(rff.ppt2text(fileName));
            } else if (name.endsWith(".xls")) {
                System.out.println(rff.xls2text(new FileInputStream(f)));
            } else if (name.endsWith(".odt") || name.endsWith(".ods") || name.endsWith(".odp")) {
                System.out.println(rff.getOpenOfficeText(fileName));
            } else {
                // everything else is treated as plain text
                System.out.println(rff.fileToStringNow(f));
            }
        }
        br.close();
    }
}
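
The OpenOffice path (getOpenOfficeText and processElement) is the only part of the class that needs jdom.jar. The same walk over content.xml can be tried with JDK classes alone. The sketch below is not the code from this post: it swaps JDOM for the JDK's DOM parser and reads from a ZipInputStream so that it can run against an in-memory sample. The class name OdfTextSketch, the helper sampleOdf and the sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative only: a JDK-only variant of getOpenOfficeText/processElement.
final class OdfTextSketch {

    // Finds content.xml in the zipped document and returns the text it contains.
    static String extract(InputStream odf) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("content.xml")) {
                    // The default factory is not namespace aware, so node names
                    // keep their prefix ("text:p"), like getQualifiedName in JDOM.
                    DocumentBuilder builder =
                            DocumentBuilderFactory.newInstance().newDocumentBuilder();
                    Document doc = builder.parse(zip);
                    StringBuilder out = new StringBuilder();
                    walk(doc.getDocumentElement(), false, out);
                    return out.toString();
                }
            }
        }
        return ""; // no content.xml found
    }

    // parentIsText is true when the enclosing element's name starts with "text".
    private static void walk(Node node, boolean parentIsText, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (parentIsText) {
                out.append(node.getNodeValue());
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        String name = node.getNodeName();
        if (name.equals("text:tab")) {
            out.append('\t');
            return;
        }
        if (name.equals("text:s")) {
            out.append(' ');
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), name.startsWith("text"), out);
        }
        if (name.equals("text:p")) {
            out.append('\n'); // one line per paragraph
        }
    }

    // Builds a tiny zip archive holding only a made-up content.xml.
    static byte[] sampleOdf() throws IOException {
        String xml = "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
                + "<office:body><office:text>"
                + "<text:p>Hello<text:tab/>ODF</text:p>"
                + "<text:p>Second<text:s/>line</text:p>"
                + "</office:text></office:body></office:document-content>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract(new ByteArrayInputStream(sampleOdf())));
    }
}
```

For a real file, something like extract(new FileInputStream("some.odt")) should behave much like getOpenOfficeText, although the two parsers may differ in details such as CDATA handling.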

If you are on a Linux machine with a bash shell, save the code as ReadFileFormat.java next to
the jar files listed above and run the following commands (the leading "." keeps the current directory on the classpath):

1. export CLASSPATH=.:FontBox-0.1.0-dev.jar:jdom.jar:log4j-1.2.15.jar:PDFBox-0.7.3-dev.jar:poi-2.5.1-final-20040804.jar:poi-contrib-2.5.1-final-20040804.jar:poi-scratchpad-2.5.1-final-20040804.jar:$CLASSPATH
2. javac ReadFileFormat.java
3. java ReadFileFormat

If you are on a Windows machine and javac is on your PATH, run the
following commands from the command line:

1. set classpath=%classpath%;.;FontBox-0.1.0-dev.jar;jdom.jar;log4j-1.2.15.jar;PDFBox-0.7.3-dev.jar;poi-2.5.1-final-20040804.jar;poi-contrib-2.5.1-final-20040804.jar;poi-scratchpad-2.5.1-final-20040804.jar
2. javac ReadFileFormat.java
3. java ReadFileFormat
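
Two of the methods, rtf2text and fileToStringNow, use only JDK classes (javax.swing.text.rtf and java.io), so the RTF path can be tried before any of the jars are on the classpath. Below is a minimal sketch of such a check. The class name RtfSmokeCheck and the sample RTF string are made up for illustration; the method body mirrors rtf2text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Illustrative only: runs the rtf2text logic on an in-memory RTF snippet.
final class RtfSmokeCheck {

    // Same calls as rtf2text: parse the RTF into a styled document, then
    // read the document back as plain text.
    static String rtfToText(InputStream in) throws Exception {
        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
        new RTFEditorKit().read(in, styledDoc, 0);
        return styledDoc.getText(0, styledDoc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // A tiny made-up document with one bold word.
        String rtf = "{\\rtf1\\ansi Hello from {\\b RTF}.}";
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println(rtfToText(in).trim());
    }
}
```

Compile and run it with plain javac RtfSmokeCheck.java and java RtfSmokeCheck; no classpath setting is needed for this one.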
