Parsing PDF in Java made simple

When it comes to parsing PDF files in Java, two popular libraries stand out: Apache Tika and Apache PDFBox. Both libraries provide powerful features for working with PDF documents, but they have different approaches and trade-offs. In this article, we will explore how to parse a PDF using each library and compare their pros and cons.

Parsing PDF (and more) using Apache Tika

Apache Tika is a versatile content analysis toolkit that supports parsing various file formats, including PDF. It aims to provide a unified interface for content extraction and metadata retrieval. Here’s how you can parse a PDF using Apache Tika:

Add the following dependencies to your project’s pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>

And here is the code to parse the PDF in Java:

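import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TikaPdfParser {

    public static void main(String[] args) {
        String filePath = "path/to/your/pdf/file.pdf";
        try (InputStream inputStream = new FileInputStream(new File(filePath))) {
            // Detects the document type and delegates to the matching parser
            AutoDetectParser parser = new AutoDetectParser();
            // Collects the body text; -1 removes the default 100,000-character limit
            BodyContentHandler handler = new BodyContentHandler(-1);
            // parse() fills this object with the document metadata; it must not be null
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(inputStream, handler, metadata, parseContext);
            String text = handler.toString();
            System.out.println(text);
        } catch (IOException | SAXException | TikaException e) {
            // parse() declares all three checked exceptions
            e.printStackTrace();
        }
    }
}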
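If you only need the text, Tika's facade class offers a shorter route than wiring the parser, handler and metadata by hand. The following is a minimal sketch, not part of the original example: the class name TikaFacadeExample and the helper typeOf are made up for illustration, and the text extraction assumes the same dependencies as above are on the classpath.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.File;
import java.io.IOException;

// Hypothetical example class using Tika's facade instead of the parser classes.
public class TikaFacadeExample {

    // Returns the media type Tika detects for the given file.
    static String typeOf(File file) throws IOException {
        return new Tika().detect(file);
    }

    public static void main(String[] args) throws IOException, TikaException {
        if (args.length < 1) {
            System.err.println("Usage: TikaFacadeExample <file>");
            return;
        }
        File file = new File(args[0]);
        // Media type detected from the file content and name
        System.out.println("Detected type: " + typeOf(file));
        // Extracted text; the facade truncates it at its maximum string length
        System.out.println(new Tika().parseToString(file));
    }
}
```

The facade trades control for brevity: for custom handlers or access to the metadata, the AutoDetectParser approach shown above remains the better fit.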

Parsing PDF using Tika and Quarkus

If you are using Quarkus to develop your applications, it is even easier: there is a Quarkus extension for Apache Tika that simplifies parsing PDF or OpenOffice documents.

Just add the following dependencies to your project:

<dependency>
    <groupId>io.quarkiverse.tika</groupId>
    <artifactId>quarkus-tika</artifactId>
    <version>${quarkus-tika.version}</version>
</dependency>
<!-- some of the operations done here require AWT -->
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-awt</artifactId>
</dependency>

And here is a sample QuarkusApplication which you can run from the command line:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// On Quarkus 3 and later, import jakarta.inject.Inject instead
import javax.inject.Inject;

import io.quarkus.runtime.Quarkus;
import io.quarkus.runtime.QuarkusApplication;
import io.quarkus.runtime.annotations.QuarkusMain;
import io.quarkus.tika.TikaParser;

@QuarkusMain
public class HelloWorldMain implements QuarkusApplication {

    // Tika parser provided by the quarkus-tika extension
    @Inject
    TikaParser parser;

    public static void main(String... args) {
        Quarkus.run(HelloWorldMain.class, args);
    }

    @Override
    public int run(String... args) throws Exception {
        if (args.length < 1) {
            System.out.println("Usage: quarkus dev <filename>");
            // Return a non-zero exit code instead of calling System.exit()
            return 1;
        }

        try (InputStream inputStream = new FileInputStream(new File(args[0]))) {
            System.out.println("=============================");
            System.out.println(parser.getText(inputStream));
            System.out.println("=============================");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }
}

For example, you can run it with the quarkus CLI as follows:

quarkus dev /path/to/pdf


Pros of Apache Tika

Unified Interface: Apache Tika provides a consistent API for parsing various file formats, making it easier to work with different document types.
Metadata Extraction: Tika excels at extracting metadata from PDF files, such as author, title, creation date, and more.
Support for Multiple Formats: Tika supports parsing not only PDF but also a wide range of other file formats, such as Microsoft Office documents, HTML, XML, and more.
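To illustrate the metadata point above, here is a minimal sketch that parses a file with AutoDetectParser and prints whatever metadata Tika collected. The class name TikaMetadataDump and the helper describe are made up for this example, and which keys actually appear (author, title, creation date and so on) depends on the document being parsed.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical example class: prints the metadata Tika extracts from a file.
public class TikaMetadataDump {

    // Formats every metadata entry as "name = value", sorted by name.
    static String describe(Metadata metadata) {
        String[] names = metadata.names();
        Arrays.sort(names);
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            // get() returns the first value when a key has several
            out.append(name).append(" = ").append(metadata.get(name)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        if (args.length < 1) {
            System.err.println("Usage: TikaMetadataDump <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Metadata metadata = new Metadata();
            // The body text is not used here, but parse() still needs a handler
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
            System.out.print(describe(metadata));
        }
    }
}
```

Because the same call works for any format Tika recognizes, this sketch is not PDF-specific: pointing it at an Office document should list that document's metadata instead.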

Cons of Apache Tika

Limited PDF-Specific Features: Apache Tika focuses on content extraction and metadata retrieval, so its PDF parsing capabilities might be less advanced compared to dedicated PDF libraries.
Performance Overhead: Tika provides a generalized approach to handle various formats, which can introduce some performance overhead compared to specialized libraries.

Parsing PDF with PDFBox

Apache PDFBox is a robust Java library specifically designed for working with PDF files. It offers comprehensive functionality for creating, manipulating, and extracting data from PDF documents. Let’s see how to parse a PDF using Apache PDFBox:

As an example, let’s write a JBang script that extracts the text of a PDF from the command line:

//DEPS org.apache.pdfbox:pdfbox:2.0.28

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;

public class PdfParser {

    public static void main(String[] args) {
        if (args.length < 1 || args[0] == null) {
            System.err.println("Please provide the path to the PDF file as the first command-line argument.");
            return;
        }

        String filePath = args[0];
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Save the script as PdfParser.java, then run it with JBang, passing the path of the PDF as an argument:

jbang PdfParser.java /path/to/file.pdf

If you build the application with Maven instead, replace the //DEPS line with the equivalent dependency in your pom.xml.
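For reference, a sketch of what that pom.xml entry could look like, using the same coordinates and version as the //DEPS line in the script:

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
```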
To learn how to run a Java main class from the command line, check this article: Run a Java Class from Maven made simple

Pros of Apache PDFBox

Rich PDF Manipulation: Apache PDFBox provides extensive features for working with PDF files, including parsing, text extraction, metadata manipulation, merging documents, adding annotations, and more.
PDF-Specific Capabilities: PDFBox offers fine-grained control over PDF elements, such as fonts, images, bookmarks, and annotations, making it suitable for advanced PDF processing tasks.
Active Community: Apache PDFBox has an active community and frequent updates, ensuring ongoing support and bug fixes.
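As a small taste of that finer control, the sketch below reads the document information and extracts text from a page range only. It is written against the PDFBox 2.0.x API used earlier in this article; the class name PdfPageRange and the helper extractPages are made up for illustration, and the fields of the document information may be null when the PDF does not set them.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

// Hypothetical example class, written against the PDFBox 2.0.x API.
public class PdfPageRange {

    // Extracts the text of pages first..last (1-based, inclusive),
    // clamping the range to the pages that actually exist.
    static String extractPages(PDDocument document, int first, int last) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(Math.max(1, first));
        stripper.setEndPage(Math.min(last, document.getNumberOfPages()));
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: PdfPageRange <file.pdf>");
            return;
        }
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            // Title and author come from the PDF's document information dictionary
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println("Title:  " + info.getTitle());
            System.out.println("Author: " + info.getAuthor());
            System.out.println("Pages:  " + document.getNumberOfPages());
            // Only the first two pages are extracted
            System.out.println(extractPages(document, 1, 2));
        }
    }
}
```

Restricting the page range is useful for large documents where only a cover page or summary is needed, since the stripper then skips the remaining pages.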

Cons of Apache PDFBox

Steeper Learning Curve: Due to its rich feature set and complex API, Apache PDFBox might have a steeper learning curve compared to simpler libraries like Tika.
Limited Format Support: PDFBox focuses on PDF and offers little support for other file formats, which can be a drawback for projects requiring multi-format parsing.

Conclusion

Both Apache Tika and Apache PDFBox offer powerful capabilities for parsing PDF files in Java, but they have different strengths and trade-offs. Apache Tika provides a unified interface for parsing various file formats, including PDF, with excellent metadata extraction capabilities. On the other hand, Apache PDFBox is a dedicated PDF library with advanced PDF manipulation features but a narrower focus.
