Easily Convert PDF Files to Word Documents in R

Originally published at blog.devgenius.io

Convert to Docx R package

In this post, I will show you how to convert your PDF files into Word documents in the R programming language using the Convert2Docx R package.

Convert2Docx is a lightweight R package that R developers can add to their workflow to convert PDF files to Word documents.

I wrote the package as an R wrapper for the pdf2docx Python module, which converts PDF files to Word documents in Python.

The package contains just three functions for converting your PDF file to Word or DOCX.

Now, let us see this package in action.

Install Package

At the moment, the Convert2Docx R package is only available on GitHub, so to install it you will first need to install devtools. In your R console, run the code below to install devtools:

# install devtools
install.packages("devtools")

Having installed devtools, we can now install the Convert2Docx package from GitHub. Run the code below to install it:

# install Convert2Docx from GitHub
devtools::install_github("Ifeanyi55/Convert2Docx")
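If you rerun your setup script often, it can help to check which of the two packages are already installed before installing anything. Below is a minimal sketch in base R; the helper name is_installed is made up for illustration and is not part of Convert2Docx or devtools.

```r
# Hypothetical helper (not part of Convert2Docx): TRUE if a package is installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The two packages this post relies on
needed <- c("devtools", "Convert2Docx")

# Keep only the ones that are not installed yet
missing_pkgs <- needed[!vapply(needed, is_installed, logical(1))]

if (length(missing_pkgs) == 0) {
  message("Everything is already installed.")
} else {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
}
```

The sketch only reports what is missing; the install commands themselves stay exactly as shown above.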

Awesome! Before we can do any conversion, we need to install the conversion engine. Load the package and run the code below to install it:

# load the package
library(Convert2Docx)

# install engine
install_engine()

Please note that you only need to run this code once. Should you encounter any problem installing the engine, go to your terminal and run:

pip install pdf2docx

This should take care of any dependency issues. Before running it, though, make sure you have the latest version of Python installed on your machine.

After doing that, try installing the conversion engine again, and if all goes well, you should be able to access the full functionality of the package.
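Since the engine depends on Python, a quick way to see whether a Python executable is visible from R is sketched below. It uses only base R and checks the system PATH; this is an assumption about where Python lives, and the interpreter the package ends up using may be a different one.

```r
# Look for a Python executable on the PATH.
# Sys.which() returns an empty string for commands it cannot find.
python_paths <- Sys.which(c("python", "python3"))

# Keep only the commands that were actually found
found <- python_paths[nzchar(python_paths)]

if (length(found) > 0) {
  message("Python found at: ", found[[1]])
} else {
  message("No Python executable found on the PATH.")
}
```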

Let us now start converting files!

Convert Entire PDF File

It is worth mentioning at this point that you do not have to read the PDF file into your R environment before converting it. Just specify the path to the file relative to your current working directory and run the converter:

# convert entire pdf file
pdf_file <- "myFile.pdf"

Converter(pdf_file = pdf_file,
          docx_filename = "myFile.docx")

Now, if you check your current working directory, you should see the converted file there. Let us explore this package further by converting from one page to another.
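If you would rather not type the output name by hand, the sketch below derives it from the PDF name and skips the conversion when the file or the package is missing. The helper docx_name_for is hypothetical and not part of Convert2Docx; only Converter and its two arguments come from the example above.

```r
# Hypothetical helper: derive "myFile.docx" from "myFile.pdf".
# basename() drops any folder, so the output lands in the working directory.
docx_name_for <- function(pdf_file) {
  paste0(tools::file_path_sans_ext(basename(pdf_file)), ".docx")
}

pdf_file <- "myFile.pdf"

# Convert only if the PDF exists and Convert2Docx is installed
if (file.exists(pdf_file) && requireNamespace("Convert2Docx", quietly = TRUE)) {
  Convert2Docx::Converter(pdf_file = pdf_file,
                          docx_filename = docx_name_for(pdf_file))
} else {
  message("Skipping conversion: ", pdf_file,
          " or the Convert2Docx package was not found.")
}
```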

Convert From One Page to the Other

Here, we will use another function in the package to convert pages 3 to 5 of the PDF file to Word:

# convert from one page to the other
pdf_file <- "myFile.pdf"

startANDend(pdf_file = pdf_file,
            docx_filename = "threetofive.docx",
            start = 2,
            end = 5)

It is good to mention here that the pages of some PDF files might not be correctly numbered.

Therefore, when the conversion to Word is done, especially when converting from one page to another, you could find that the page numbering is slightly different from what you were expecting.

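In the example above, page 3 of the PDF file is page 2 by page count, which is why you see start = 2 in the code instead of 3. One possible reason is that the underlying pdf2docx module counts pages from zero by default, so it is worth checking the offset on your own file.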

However, this “problem” does not occur when you convert the entire document to Word as demonstrated earlier.
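If you run into this offset often, a small helper can keep the arithmetic in one place. This is only a sketch: the function name to_page_count and the default offset of 1 are assumptions taken from the single example above (page 3 became start = 2), so adjust the offset to whatever your own PDF needs.

```r
# Hypothetical helper: shift a printed page number by a fixed offset.
# offset = 1 reproduces the example above (page 3 -> start = 2).
to_page_count <- function(printed_page, offset = 1) {
  stopifnot(is.numeric(printed_page), all(printed_page - offset >= 0))
  printed_page - offset
}

to_page_count(3)  # 2
```

You could then pass start = to_page_count(3) to startANDend instead of hard-coding the number.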

Now, let us convert selected pages in the PDF file.

Convert Selected Pages

To do this, you will need to pass a numeric vector representing the pages of the PDF file you want to convert:

# convert selected pages
pdf_file <- "myFile.pdf"

selectPages(pdf_file = pdf_file,
            docx_filename = "mySelectedPages.docx",
            pages = c(2,4,6))

In the above code, I selected pages 2, 4, and 6 from the PDF file to be converted to Word, and it did a good job.
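When the page list comes from user input or another script, it may be worth tidying it up before passing it on. The sketch below is an optional pre-check in base R; the helper clean_pages is hypothetical, and whether selectPages itself needs sorted or de-duplicated input is not something this post confirms.

```r
# Hypothetical helper: validate a page vector, then drop duplicates and sort it
clean_pages <- function(pages) {
  stopifnot(is.numeric(pages),
            !anyNA(pages),
            all(pages == round(pages)),  # whole numbers only
            all(pages >= 0))             # no negative pages
  sort(unique(pages))
}

pages <- clean_pages(c(6, 2, 4, 2))
print(pages)  # 2 4 6
```

The cleaned vector can then be supplied as the pages argument of selectPages.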

As you convert, don’t forget to check your current working directory for the converted files.

So, there you have it! Now you can easily convert your PDF files to Word documents in the R programming language thanks to the Convert2Docx R package.

Please do not forget to give the Convert2Docx R package’s GitHub repository a ⭐️ if you find it helpful.

I hope you enjoyed reading this post.

You can follow me on GitHub: Ifeanyi55 and on X: @Ifeanyidiaye.


Awesome guide! Didn’t know PDF to Word in R was this easy. Quick question—does Convert2Docx work with scanned PDFs too, or just text-based ones? Appreciate the effort!

I appreciate the kind words. For now, the package does not support scanned PDFs, sorry!
