When it comes to parsing PDF files in Java, two popular libraries stand out: Apache Tika and Apache PDFBox. Both libraries provide powerful features for working with PDF documents, but they have different approaches and trade-offs. In this article, we will explore how to parse a PDF using each library and compare their pros and cons.
Parsing PDF (and more) using Apache Tika
Apache Tika is a versatile content analysis toolkit that supports parsing various file formats, including PDF. It aims to provide a unified interface for content extraction and metadata retrieval. Here’s how you can parse a PDF using Apache Tika:
Add the necessary dependencies to your project’s pom.xml. In Tika 2.x the parser implementations, including the PDF parser, are packaged in the tika-parsers-standard-package artifact:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
And here is the code to parse the PDF in Java:
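import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
public class TikaPdfParser {
    public static void main(String[] args) {
        String filePath = "path/to/your/pdf/file.pdf";
        try (InputStream inputStream = new FileInputStream(new File(filePath))) {
            // Detects the file type and delegates to the matching parser
            AutoDetectParser parser = new AutoDetectParser();
            // -1 removes the handler's default 100,000-character write limit
            BodyContentHandler handler = new BodyContentHandler(-1);
            // The parser fills this object with metadata, so it must not be null
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(inputStream, handler, metadata, parseContext);
            String text = handler.toString();
            System.out.println(text);
        } catch (IOException | SAXException | TikaException e) {
            // parse() declares all three checked exceptions
            e.printStackTrace();
        }
    }
}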
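For simple cases, Tika also ships a facade class, org.apache.tika.Tika, that wraps type detection and text extraction in single calls. The sketch below assumes the same dependencies as above; the class name TikaFacadeExample, its two helper methods, and the file path are placeholders chosen for illustration.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

class TikaFacadeExample {

    // The facade wraps Tika's default detector and auto-detecting parser.
    private static final Tika TIKA = new Tika();

    /** Returns the detected media type, for example "application/pdf". */
    static String detectType(File file) throws IOException {
        return TIKA.detect(file);
    }

    /** Extracts plain text; the facade truncates very long output at its default limit. */
    static String extractText(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    public static void main(String[] args) throws IOException, TikaException {
        // Placeholder path, used only when no argument is given
        File file = new File(args.length > 0 ? args[0] : "path/to/your/pdf/file.pdf");
        System.out.println("Detected type: " + detectType(file));
        System.out.println(extractText(file));
    }
}
```

The explicit AutoDetectParser version shown earlier is still the better fit when you need the Metadata object or want control over the content handler.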
Parsing PDF using Tika and Quarkus
If you are using Quarkus to develop your applications, it’s even easier: there is a Quarkus extension for Apache Tika that simplifies parsing PDF or OpenOffice documents.
Just plug the following dependencies in your project:
<dependency>
    <groupId>io.quarkiverse.tika</groupId>
    <artifactId>quarkus-tika</artifactId>
    <version>${quarkus-tika.version}</version>
</dependency>
<!-- some of the operations done here require AWT -->
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-awt</artifactId>
</dependency>
And here is a sample QuarkusApplication which you can run from the command line:
import java.io.*;
// On Quarkus 3.x the package is jakarta.inject instead of javax.inject
import javax.inject.Inject;
import io.quarkus.runtime.Quarkus;
import io.quarkus.runtime.QuarkusApplication;
import io.quarkus.runtime.annotations.QuarkusMain;
import io.quarkus.tika.TikaParser;
@QuarkusMain
public class HelloWorldMain implements QuarkusApplication {
    @Inject
    TikaParser parser;
    public static void main(String... args) {
        Quarkus.run(HelloWorldMain.class, args);
    }
    @Override
    public int run(String... args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: quarkus dev <filename> ");
            // Return the exit code instead of calling System.exit() so Quarkus can shut down cleanly
            return 1;
        }
        try (InputStream inputStream = new FileInputStream(new File(args[0]))) {
            System.out.println("=============================");
            System.out.println(parser.getText(inputStream));
            System.out.println("=============================");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
For example, you can run it using “quarkus” CLI as follows:
quarkus dev /path/to/pdf
Pros of Apache Tika
- Unified Interface: Apache Tika provides a consistent API for parsing various file formats, making it easier to work with different document types.
- Metadata Extraction: Tika excels at extracting metadata from PDF files, such as author, title, creation date, and more.
- Support for Multiple Formats: Tika supports parsing not only PDF but also a wide range of other file formats, such as Microsoft Office documents, HTML, XML, and more.
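The metadata extraction mentioned in the list above can be sketched as follows. This is a minimal example, assuming tika-core and a parser package are on the classpath; the class name TikaMetadataExample and the helper readMetadata are illustrative names, not part of Tika. Because metadata key names vary between formats and Tika versions, the sketch lists every key instead of assuming specific ones.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class TikaMetadataExample {

    /** Parses the stream and returns whatever metadata the parser populated. */
    static Metadata readMetadata(InputStream in) throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        // -1 disables the handler's write limit; the extracted text is ignored here
        new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder path, used only when no argument is given
        String filePath = args.length > 0 ? args[0] : "path/to/your/pdf/file.pdf";
        try (InputStream in = Files.newInputStream(Paths.get(filePath))) {
            Metadata metadata = readMetadata(in);
            // Print every key/value pair the parser reported
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }
}
```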
Cons of Apache Tika
- Limited PDF-Specific Features: Tika delegates PDF parsing to PDFBox internally, but its API is geared toward content extraction and metadata retrieval, so it exposes far fewer PDF-specific operations than a dedicated PDF library used directly.
- Performance Overhead: Tika provides a generalized approach to handle various formats, which can introduce some performance overhead compared to specialized libraries.
Parsing PDF with PDFBox
Apache PDFBox is a robust Java library specifically designed for working with PDF files. It offers comprehensive functionality for creating, manipulating, and extracting data from PDF documents. Let’s see how to parse a PDF using Apache PDFBox:
As an example, let’s write a JBang script that extracts the text of a PDF from the command line:
//DEPS org.apache.pdfbox:pdfbox:2.0.28
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
public class PdfParser {
    public static void main(String[] args) {
        if (args.length < 1 || args[0] == null) {
            System.err.println("Please provide the path to the PDF file as the first command-line argument.");
            return;
        }
        String filePath = args[0];
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses org.apache.pdfbox.Loader.loadPDF instead
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            // Extracts the text of all pages in reading order
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
With JBang, save the script as PdfParser.java and run it with the path to the PDF as its argument:
jbang PdfParser.java /path/to/file.pdf
If you build with Maven instead, replace the //DEPS line with the equivalent dependency in your pom.xml.
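For reference, the //DEPS coordinates used in the script translate to this Maven dependency (same group, artifact, and version as the script):

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
```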
To learn how to run a Java main class from the command line, see this article: Run a Java Class from Maven made simple
Pros of Apache PDFBox
- Rich PDF Manipulation: Apache PDFBox provides extensive features for working with PDF files, including parsing, text extraction, metadata manipulation, merging documents, adding annotations, and more.
- PDF-Specific Capabilities: PDFBox offers fine-grained control over PDF elements, such as fonts, images, bookmarks, and annotations, making it suitable for advanced PDF processing tasks.
- Active Community: Apache PDFBox has an active community and frequent updates, ensuring ongoing support and bug fixes.
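To illustrate the finer-grained control described in the list above, here is a small sketch, assuming PDFBox 2.0.x as in the script earlier, that prints the document information and extracts text from a page range only. The class name PdfPageRangeExample and the helper extractPages are hypothetical names introduced for this example.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

class PdfPageRangeExample {

    /** Extracts text from startPage to endPage (1-based, inclusive). */
    static String extractPages(PDDocument document, int startPage, int endPage) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Order text by its position on the page rather than by content-stream order
        stripper.setSortByPosition(true);
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: PdfPageRangeExample <file.pdf>");
            return;
        }
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            // Document information fields may be null if the PDF does not set them
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println("Title:  " + info.getTitle());
            System.out.println("Author: " + info.getAuthor());
            System.out.println("Pages:  " + document.getNumberOfPages());
            // Print only the first page
            System.out.println(extractPages(document, 1, 1));
        }
    }
}
```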
Cons of Apache PDFBox
- Steeper Learning Curve: Due to its rich feature set and complex API, Apache PDFBox might have a steeper learning curve compared to simpler libraries like Tika.
- Lack of Format Support: PDFBox focuses on PDF and offers little or no support for other file formats, which can be a drawback for projects requiring multi-format parsing.
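Since PDFBox only handles PDF, a project that receives mixed input may want a cheap check before handing a file to it. The sketch below is one possible guard, not something either library requires: it tests whether a file starts with the "%PDF-" header that PDF files begin with. The class name PdfHeaderCheck and the method looksLikePdf are hypothetical, and the code assumes Java 11 or later for InputStream.readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class PdfHeaderCheck {

    // PDF files start with "%PDF-" followed by the version number
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the "%PDF-" header. */
    static boolean looksLikePdf(byte[] head) {
        return head.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads only the first few bytes of the file and checks them. */
    static boolean looksLikePdf(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return looksLikePdf(in.readNBytes(PDF_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: PdfHeaderCheck <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        System.out.println(looksLikePdf(path)
                ? "Header looks like a PDF; worth trying PDFBox."
                : "No PDF header found; consider a multi-format parser such as Tika.");
    }
}
```

This check is deliberately strict. Some readers tolerate a few bytes of junk before the header, so a file rejected here may still open in a lenient viewer.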
Conclusion
Both Apache Tika and Apache PDFBox offer powerful capabilities for parsing PDF files in Java, but they have different strengths and trade-offs. Apache Tika provides a unified interface for parsing various file formats, including PDF, with excellent metadata extraction capabilities. On the other hand, Apache PDFBox is a dedicated PDF library with advanced PDF manipulation features but a narrower focus.