Batch convert images to PDF with Python by using Pillow or img2pdf
Posted on June 12, 2019 by Paul
In this article I will show you how to batch convert a folder with images to a PDF file. This is a problem that I encountered recently when I had to process a bunch of scanned images and save the processed files as a single PDF.
While there are many libraries that could be used for the above, in this post I will consider only two libraries Pillow, formerly named PIL and img2pdf.
If you need to install the two libraries, you could use pip, e.g.:
Let’s start simple, say that you have a folder with some images that are named for example 0.png, 1.png, … Every file name has an optional prefix, an order number and a suffix that corresponds to the image type. The prefix is optional and could include the path to the folder if you want to keep your Python program outside of the image folder.
As mentioned earlier, we’ll start with a simple code, make it work, and abstract it later to some utility functions.
Next, we’ll need to loop over the folder content:
Let’s add the code that loads the image. For simplicity, in this post I won’t do any image processing, I will leave the image unchanged, but you can add your own image processing code once the image is loaded.
Observation, the Pillow library can’t save to PDFRGBA images. If your input images have four channels, you’ll need to strip the last one before saving to PDF.
Please note that after the optional processing step we’ve stored the images in the images list. Next, let’s save this list of images to PDF:
If you run the above code, it should create a PDF image with four pages, assuming that you have four images in the current folder.
We can use the argparse module from the Python standard library to let the user pass parameters to the program. This will abstract a bit the hard coded values from my initial example:
Let’s also move the code that loops over the image folder in a separate function:
Here is an example of running the above code:
If you have the input images in a folder named my_images also stored in the current directly, this is how you change the above command:
If you prefer to use a dedicated library for PDF output, like img2pdf and you don’t need to do any processing on the original images we can modify the above code to not use Pillow:
Please note that in this case we open the output file in main and pass a file handle to the process_image function. Another notable difference is that we don’t store the images in a list, but rather the image names, which makes the code a bit faster than the Pillow version. The code is faster and uses less memory, but it is less flexible in the sense that it doesn’t let you do any processing on the input images.
Also, img2pdf doesn’t work with input images that have a transparency channel. So I’ve changed the default suffix parameter from the example to .jpg for this case.
You can find the complete source code on the GitHub repository for this article.
If you want to learn more about Python, I recommend reading Python Crash Course by Eric Matthes, the book is intended for beginners: