I'm working on a project that converts Word documents to PDF. Right now I use Apache POI (the org.apache.poi.hwpf.extractor package) to get the text from the Word document,
but I can't get the other objects: hyperlinks, tables, images, and the formatting of the document. I also tried other libraries, such as jdoctopdf-0.9-beta.jar and tika-parsers-0.9-jdk14.jar, but they don't carry over all of the formatting from the Word document either. If anyone knows a way to do this, please reply. Thanks, all!
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class WordToPdf {
    public static void main(String[] args) throws Exception {
        // Assumption: the path to the .doc file is passed as the first argument
        String filename = args[0];

        // Open the binary Word document and extract its plain text
        String[] paragraphs;
        try (FileInputStream in = new FileInputStream(filename)) {
            HWPFDocument doc = new HWPFDocument(new POIFSFileSystem(in));
            WordExtractor we = new WordExtractor(doc);
            paragraphs = we.getParagraphText();
        }

        // Create the PDF and write one PDF paragraph per Word paragraph
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("/Program Files/NCMSCT/HopDong.pdf"));
        document.open();
        for (int i = 0; i < paragraphs.length; i++) {
            // Strip the line ending that the extractor leaves on each paragraph
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            System.out.println("Paragraph " + i + ": " + paragraphs[i]);
            System.out.println("Length:" + paragraphs[i].length());
            // Text only: tables, images, hyperlinks and formatting are not carried over
            document.add(new Paragraph(paragraphs[i]));
        }
        // Closing the document flushes the PDF to disk
        document.close();
    }
}
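
For anyone who wants to check the line-ending cleanup from the loop on its own, here is a minimal sketch that needs neither POI nor iText. The class name ParagraphCleaner and the method clean are made up for this illustration; only the regular expression comes from the code above. It assumes each paragraph string ends with an optional carriage return followed by a line feed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI or iText: isolates the replaceAll step
// from the loop above so it can be tried without any Word file.
class ParagraphCleaner {
    // Same expression as in the loop: optional Ctrl-M (carriage return),
    // optional carriage return, then a line feed.
    private static final Pattern LINE_END = Pattern.compile("\\cM?\r?\n");

    // Returns the paragraph text with every matching line ending removed.
    static String clean(String paragraph) {
        return LINE_END.matcher(paragraph).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = { "First paragraph\r\n", "Second paragraph\n", "No line ending" };
        for (String sample : samples) {
            String cleaned = clean(sample);
            System.out.println("[" + cleaned + "] length " + cleaned.length());
        }
    }
}
```

Note that a lone carriage return with no line feed after it is left untouched by this expression, so text taken from a different source may need a different pattern.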