01/03/2019 Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java | Codezrule's Blog

Codezrule's Blog

Technology from the viewing point..

MARCH 24, 2010 by CODEZRULE (SHIVA)

Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java

The following code extracts text from .pdf, .doc, .ppt, .xls, .odt, .ods, .odp, .rtf and all
text/plain files. The program uses the following jar files:

FontBox-0.1.0-dev.jar
jdom.jar
log4j-1.2.15.jar
PDFBox-0.7.3-dev.jar
poi-2.5.1-final-20040804.jar
poi-contrib-2.5.1-final-20040804.jar
poi-scratchpad-2.5.1-final-20040804.jar

If you are unable to find these jars, leave a comment and I'll send them to you.

The code is as follows:

https://codezrule.wordpress.com/2010/03/24/extract-text-from-pdf-office-files-doc-ppt-xls-open-office-files-rtf-and-textplain-files-in-java/

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import org.apache.poi.hdf.extractor.WordDocument;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.util.LittleEndian;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;

public class ReadFileFormat {

    // Collects the text found by ppt2text().
    StringBuffer sb = new StringBuffer(8192);

    // Collects the text found by getOpenOfficeText().
    StringBuffer TextBuffer = new StringBuffer();

    // Extracts the text of a PDF file with PDFBox; returns null on failure.
    public String pdftotext(String fileName) {
        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File f = new File(fileName);
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
            cosDoc.close();
            pdDoc.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            // Release whatever was opened before the failure.
            try {
                if (cosDoc != null) {
                    cosDoc.close();
                }
                if (pdDoc != null) {
                    pdDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
            return null;
        }
        System.out.println("Done");
        return parsedText;
    }

    // Extracts the text of a Word (.doc) file with POI's HDF WordDocument.
    public String doc2text(String fileName) throws IOException {
        WordDocument wd = new WordDocument(fileName);
        StringWriter docTextWriter = new StringWriter();
        wd.writeAllText(new PrintWriter(docTextWriter));
        docTextWriter.close();
        return docTextWriter.toString();
    }

    // Extracts the text of an RTF stream with Swing's RTFEditorKit (JDK only).
    public String rtf2text(InputStream is) throws Exception {
        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
        new RTFEditorKit().read(is, styledDoc, 0);
        return styledDoc.getText(0, styledDoc.getLength());
    }

    // Extracts the text of a PowerPoint (.ppt) file; the listener below fills sb.
    public String ppt2text(String fileName) throws Exception {
        POIFSReader poifReader = new POIFSReader();
        poifReader.registerListener(new ReadFileFormat.MyPOIFSReaderListener());
        poifReader.read(new FileInputStream(fileName));
        return sb.toString();
    }

    class MyPOIFSReaderListener implements POIFSReaderListener {

        public void processPOIFSReaderEvent(POIFSReaderEvent event) {
            char ch0 = (char) 0;
            char ch11 = (char) 11;
            try {
                DocumentInputStream dis = null;
                dis = event.getStream();
                byte btoWrite[] = new byte[dis.available()];
                dis.read(btoWrite, 0, dis.available());
                // Scan the stream for record headers: type at offset 2, size at offset 4.
                for (int i = 0; i < btoWrite.length - 20; i++) {
                    long type = LittleEndian.getUShort(btoWrite, i + 2);
                    long size = LittleEndian.getUInt(btoWrite, i + 4);
                    // 4008 is the record type of a TextBytesAtom.
                    if (type == 4008) {
                        try {
                            // NOTE: the end of this line is cut off in the source listing.
                            // The length and the two replace() calls are a reconstruction
                            // based on the otherwise unused ch0 and ch11 above.
                            String s = new String(btoWrite, i + 4 + 1, (int) size + 3)
                                    .replace(ch0, ' ').replace(ch11, ' ');
                            // Skip the placeholder text of empty slide templates.
                            if (s.trim().startsWith("Click to edit") == false) {
                                sb.append(s);
                            }
                        } catch (Exception ee) {
                            System.out.println("error" + ee);
                        }
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
                return;
            }
        }
    }

    // Extracts the text of an Excel (.xls) stream, one line per row that has content.
    public String xls2text(InputStream in) throws Exception {
        HSSFWorkbook excelWb = new HSSFWorkbook(in);
        StringBuffer result = new StringBuffer(4096);
        int numberOfSheets = excelWb.getNumberOfSheets();
        for (int i = 0; i < numberOfSheets; i++) {
            HSSFSheet sheet = excelWb.getSheetAt(i);
            int numberOfRows = sheet.getPhysicalNumberOfRows();
            if (numberOfRows > 0) {
                if (excelWb.getSheetName(i) != null && excelWb.getSheetName(i).length() != 0) {
                    // append sheet name to content
                    if (i > 0) {
                        result.append("\n\n");
                    }
                    result.append(excelWb.getSheetName(i).trim());
                    result.append(":\n\n");
                }

                Iterator<HSSFRow> rowIt = sheet.rowIterator();
                while (rowIt.hasNext()) {
                    HSSFRow row = rowIt.next();
                    if (row != null) {
                        boolean hasContent = false;
                        Iterator<HSSFCell> it = row.cellIterator();
                        while (it.hasNext()) {
                            HSSFCell cell = it.next();
                            String text = null;
                            try {
                                switch (cell.getCellType()) {
                                    case HSSFCell.CELL_TYPE_BLANK:
                                    case HSSFCell.CELL_TYPE_ERROR:
                                        // ignore all blank or error cells
                                        break;
                                    case HSSFCell.CELL_TYPE_NUMERIC:
                                        text = Double.toString(cell.getNumericCellValue());
                                        break;
                                    case HSSFCell.CELL_TYPE_BOOLEAN:
                                        text = Boolean.toString(cell.getBooleanCellValue());
                                        break;
                                    case HSSFCell.CELL_TYPE_STRING:
                                    default:
                                        text = cell.getStringCellValue();
                                        break;
                                }
                            } catch (Exception e) {
                                // skip cells whose value cannot be read
                            }
                            if ((text != null) && (text.length() != 0)) {
                                result.append(text.trim());
                                result.append(' ');
                                hasContent = true;
                            }
                        }
                        if (hasContent) {
                            // append a newline at the end of each row that has content
                            result.append('\n');
                        }
                    }
                }
            }
        }
        return result.toString();
    }

    // Walks the content.xml tree and appends the text of text:* elements to TextBuffer.
    public void processElement(Object o) {
        if (o instanceof Element) {
            Element e = (Element) o;
            String elementName = e.getQualifiedName();
            if (elementName.startsWith("text")) {
                if (elementName.equals("text:tab")) { // add tab for text:tab
                    TextBuffer.append("\t");
                } else if (elementName.equals("text:s")) { // add space for text:s
                    TextBuffer.append(" ");
                } else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();
                    while (iterator.hasNext()) {
                        Object child = iterator.next();
                        // If the child is a Text node, append its text
                        if (child instanceof Text) {
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        } else {
                            processElement(child); // Recursively process the child element
                        }
                    }
                }
                // End every paragraph with a newline
                if (elementName.equals("text:p")) {
                    TextBuffer.append("\n");
                }
            } else {
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);
                }
            }
        }
    }

    // Extracts the text of an OpenOffice file, which is a zip archive containing content.xml.
    public String getOpenOfficeText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();
        // Unzip the OpenOffice document
        ZipFile zipFile = new ZipFile(fileName);
        try {
            Enumeration entries = zipFile.entries();
            ZipEntry entry;
            while (entries.hasMoreElements()) {
                entry = (ZipEntry) entries.nextElement();
                if (entry.getName().equals("content.xml")) {
                    TextBuffer = new StringBuffer();
                    SAXBuilder sax = new SAXBuilder();
                    Document doc = sax.build(zipFile.getInputStream(entry));
                    Element rootElement = doc.getRootElement();
                    processElement(rootElement);
                    break;
                }
            }
        } finally {
            // Close the archive even if parsing fails
            zipFile.close();
        }
        return TextBuffer.toString();
    }

    // Reads a plain text file line by line, using the platform default encoding.
    public String fileToStringNow(File f) throws Exception {
        BufferedReader br = new BufferedReader(new FileReader(f));
        try {
            String nextLine = "";
            StringBuffer sbuff = new StringBuffer();
            while ((nextLine = br.readLine()) != null) {
                sbuff.append(nextLine);
                sbuff.append(System.getProperty("line.separator"));
            }
            return sbuff.toString();
        } finally {
            // Close the reader even if reading fails
            br.close();
        }
    }

    public static void main(String[] args) throws Exception {
        ReadFileFormat rff = new ReadFileFormat();
        System.out.print("Enter File Name => ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String fileName = br.readLine();
        File f = new File(fileName);
        if (!f.exists()) {
            System.out.println("Sorry, the file does not exist!");
        } else {
            // Pick the extractor from the file extension; anything else is read as plain text.
            if (f.getName().endsWith(".pdf") || f.getName().endsWith(".PDF")) {
                System.out.println(rff.pdftotext(fileName));
            } else if (f.getName().endsWith(".doc") || f.getName().endsWith(".DOC")) {
                System.out.println(rff.doc2text(fileName));
            } else if (f.getName().endsWith(".rtf") || f.getName().endsWith(".RTF")) {
                System.out.println(rff.rtf2text(new FileInputStream(f)));
            } else if (f.getName().endsWith(".ppt") || f.getName().endsWith(".PPT")) {
                System.out.println(rff.ppt2text(fileName));
            } else if (f.getName().endsWith(".xls") || f.getName().endsWith(".XLS")) {
                System.out.println(rff.xls2text(new FileInputStream(f)));
            // The rest of this condition is cut off in the source listing; .ods and .odp
            // are included because the introduction lists them as supported.
            } else if (f.getName().endsWith(".odt") || f.getName().endsWith(".ODT")
                    || f.getName().endsWith(".ods") || f.getName().endsWith(".ODS")
                    || f.getName().endsWith(".odp") || f.getName().endsWith(".ODP")) {
                System.out.println(rff.getOpenOfficeText(fileName));
            } else {
                System.out.println(rff.fileToStringNow(f));
            }
        }
        br.close();
    }
}
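
The main method above picks an extractor by testing the file name against each extension in lower case and in upper case, so a mixed-case name such as Report.Pdf falls through to the plain text reader. One possible way to make the choice case-insensitive is sketched below. This is not part of the original program: the class ExtensionDispatchSketch, the Kind enum and the helpers extensionOf and kindOf are hypothetical names used only for illustration, and the sketch only decides which extractor would be called without calling it.

```java
import java.util.Locale;

public class ExtensionDispatchSketch {

    // Hypothetical labels, one per extractor method in ReadFileFormat.
    public enum Kind { PDF, DOC, RTF, PPT, XLS, OPEN_OFFICE, PLAIN_TEXT }

    // Returns the lower-cased extension without the dot, or "" if there is none.
    // Expects a bare file name (what File.getName() returns), not a full path.
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Maps a file name to the extractor that would handle it.
    // Unknown or missing extensions fall back to plain text, as in main().
    public static Kind kindOf(String fileName) {
        switch (extensionOf(fileName)) {
            case "pdf": return Kind.PDF;
            case "doc": return Kind.DOC;
            case "rtf": return Kind.RTF;
            case "ppt": return Kind.PPT;
            case "xls": return Kind.XLS;
            case "odt":
            case "ods":
            case "odp": return Kind.OPEN_OFFICE;
            default:    return Kind.PLAIN_TEXT;
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("Report.Pdf"));  // PDF
        System.out.println(kindOf("slides.ODP"));  // OPEN_OFFICE
        System.out.println(kindOf("notes"));       // PLAIN_TEXT
    }
}
```

Lower-casing with Locale.ROOT avoids locale-specific surprises, and the string switch needs Java 7 or later.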

If you are on a Linux machine with a bash shell, place the code in ReadFileFormat.java beside
the jar files listed above and run the following commands:

1. export CLASSPATH=FontBox-0.1.0-dev.jar:jdom.jar:log4j-1.2.15.jar:PDFBox-0.7.3-dev.jar:poi-2.5.1-final-20040804.jar:poi-contrib-2.5.1-final-20040804.jar:poi-scratchpad-2.5.1-final-20040804.jar:$CLASSPATH
2. javac ReadFileFormat.java
3. java ReadFileFormat

If you are on a Windows machine and javac is on your PATH, run the following commands
from the command line, in the folder that holds ReadFileFormat.java and the jar files:

1. set classpath=%classpath%;.;FontBox-0.1.0-dev.jar;jdom.jar;log4j-1.2.15.jar;PDFBox-0.7.3-dev.jar;poi-2.5.1-final-20040804.jar;poi-contrib-2.5.1-final-20040804.jar;poi-scratchpad-2.5.1-final-20040804.jar
2. javac ReadFileFormat.java
3. java ReadFileFormat
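
The OpenOffice branch of the program opens the document as a zip archive, finds content.xml and walks its elements, turning text:tab into a tab, text:s into a space and ending each text:p with a newline. A minimal sketch of that same traversal using only JDK classes follows, which may be handy for trying the idea before the jars are on the classpath. It is not the author's code: it swaps JDOM for the JDK's DOM parser and ZipFile for ZipInputStream, and the class name OdfTextSketch, the helper names and the tiny in-memory sample document are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class OdfTextSketch {

    // Scans the zip stream for content.xml and returns its text, or "" if it is absent.
    public static String extract(InputStream odf) {
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("content.xml")) {
                    // The default factory is not namespace-aware, so node names
                    // keep their prefix (for example "text:p").
                    Node root = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(zip).getDocumentElement();
                    StringBuilder out = new StringBuilder();
                    walk(root, false, out);
                    return out.toString();
                }
            }
            return "";
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not read content.xml", e);
        }
    }

    // Mirrors processElement: text nodes count only directly under a text:* element.
    static void walk(Node node, boolean parentIsText, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (parentIsText) {
                out.append(node.getNodeValue());
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        String name = node.getNodeName();
        if (name.equals("text:tab")) {
            out.append('\t');
            return;
        }
        if (name.equals("text:s")) {
            out.append(' ');
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), name.startsWith("text"), out);
        }
        if (name.equals("text:p")) {
            out.append('\n');
        }
    }

    // Builds a made-up, minimal archive in memory: a mimetype entry plus content.xml.
    public static byte[] sampleOdf() {
        String xml = "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
                + "<office:body><office:text>"
                + "<text:p>Hello<text:tab/>world</text:p>"
                + "<text:p>Second<text:s/>line</text:p>"
                + "</office:text></office:body></office:document-content>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("mimetype"));
            zip.write("application/vnd.oasis.opendocument.text".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Prints the two sample paragraphs, each on its own line.
        System.out.print(extract(new ByteArrayInputStream(sampleOdf())));
    }
}
```

For a real file, the same extract method could be given a FileInputStream for the .odt, .ods or .odp document instead of the in-memory sample.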
