Solarian Programmer

My programming ramblings

Batch convert images to PDF with Python by using Pillow or img2pdf

Posted on June 12, 2019 by Paul

In this article I will show you how to batch convert a folder with images to a PDF file. This is a problem that I encountered recently when I had to process a bunch of scanned images and save the processed files as a single PDF.

While there are many libraries that could be used for this task, in this post I will consider only two: Pillow (the maintained fork of PIL) and img2pdf.

If you need to install the two libraries, you could use pip, e.g.:

python -m pip install Pillow
python -m pip install img2pdf

Let’s start simple. Say that you have a folder with some images named, for example, 0.png, 1.png, … Every file name consists of an optional prefix, an order number, and a suffix that corresponds to the image type. The prefix can include the path to the folder, if you want to keep your Python program outside of the image folder.

from PIL import Image

prefix = ""
min_range = 0
max_range = 3
suffix = ".png"
out_fname = "out.pdf"

# ...

As mentioned earlier, we’ll start with simple code, make it work, and later move it into some utility functions.

Next, we’ll need to loop over the folder content:

# ...

for i in range(min_range, max_range + 1):
    fname = prefix + str(i) + suffix
    print(fname)
    # Load the current image and store it

# Save the output
# ...
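
The loop above assumes that every numbered file actually exists. As a minimal sketch, the names can be generated up front and checked before any image is opened. The helper names `build_file_names` and `find_missing` are my own and are not part of the original program:

```python
from pathlib import Path


def build_file_names(prefix, min_range, max_range, suffix):
    """Return the expected file names for the inclusive range min_range..max_range."""
    return [f"{prefix}{i}{suffix}" for i in range(min_range, max_range + 1)]


def find_missing(file_names):
    """Return the names that do not point to an existing file."""
    return [name for name in file_names if not Path(name).is_file()]


if __name__ == "__main__":
    names = build_file_names("", 0, 3, ".png")
    print(names)
    missing = find_missing(names)
    if missing:
        print("Missing input files:", ", ".join(missing))
```

Reporting all missing files at once is usually friendlier than failing on the first one halfway through the loop.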

Let’s add the code that loads the image. For simplicity, I won’t do any image processing in this post and will leave the image unchanged, but you can add your own image processing code once the image is loaded.

Note that, at the time of writing, the Pillow library can’t save RGBA images to PDF. If your input images have four channels, you’ll need to drop the last one (the alpha channel) before saving to PDF.

# ...

images = []
for i in range(min_range, max_range + 1):
    fname = prefix + str(i) + suffix
    print(fname)
    # Load the current image and store it
    im = Image.open(fname)
    # (Optional) Process the image if necessary ...
    # Pillow can't save RGBA images to PDF,
    # make sure the image is RGB
    if im.mode == "RGBA":
        im = im.convert("RGB")
    # Add the image to the images list
    images.append(im)

# Save the output
# ...
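
Keep in mind that `convert("RGB")` simply discards the alpha channel, so transparent pixels keep whatever colour is stored underneath them. If that is not what you want, one option is to composite the image over a solid background first. This is only a sketch, and `flatten_to_rgb` is a hypothetical helper name, not something from the original code:

```python
from PIL import Image


def flatten_to_rgb(im, background=(255, 255, 255)):
    """Return an RGB copy of im, with transparent areas filled with background."""
    if im.mode != "RGBA":
        return im.convert("RGB")
    canvas = Image.new("RGB", im.size, background)
    # Use the alpha channel as the paste mask
    canvas.paste(im, mask=im.getchannel("A"))
    return canvas


if __name__ == "__main__":
    # A fully transparent red image is replaced by the background colour
    transparent = Image.new("RGBA", (2, 2), (255, 0, 0, 0))
    print(flatten_to_rgb(transparent).getpixel((0, 0)))
```

In the loop above, this helper could stand in for the `if im.mode == "RGBA"` check.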

Please note that after the optional processing step we’ve stored the images in the images list. Next, let’s save this list of images to PDF:

# ...

images = []
for i in range(min_range, max_range + 1):
    # ...

# Convert the images list to PDF
images[0].save(out_fname, save_all=True, quality=100, append_images=images[1:])
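
One thing to watch for: `images[0]` raises an `IndexError` when no images were loaded. A small wrapper can make that failure explicit. The sketch below uses a hypothetical `save_images_as_pdf` helper and writes to an in-memory buffer only to keep the example self-contained:

```python
import io

from PIL import Image


def save_images_as_pdf(images, target):
    """Write a list of Pillow images to target (file name or binary file object), one page per image."""
    if not images:
        raise ValueError("No images to save")
    first, rest = images[0], images[1:]
    # format="PDF" is needed when target is a file object without a name
    first.save(target, format="PDF", save_all=True, append_images=rest)


if __name__ == "__main__":
    pages = [Image.new("RGB", (50, 50), color) for color in ("red", "green", "blue")]
    buffer = io.BytesIO()
    save_images_as_pdf(pages, buffer)
    print(len(buffer.getvalue()), "bytes written")
```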

If you run the above code, it should create a PDF file with four pages, assuming that the images 0.png through 3.png are in the current folder.

We can use the argparse module from the Python standard library to let the user pass parameters to the program. This replaces the hard-coded values from my initial example:

import argparse
from PIL import Image

if __name__ == "__main__":
    # Let the user pass parameters to the code; all parameters are optional and have default values
    parser = argparse.ArgumentParser()
    parser.add_argument("-min_range", type=int, default=0, help="Min range of input images")
    parser.add_argument("-max_range", type=int, default=0, help="Max range of input images")
    parser.add_argument("-prefix", default="", help="Image name prefix")
    parser.add_argument("-suffix", default=".png", help="Image termination, e.g. .png or .jpg")
    parser.add_argument("-output", default="out.pdf", help="Output file name")
    args = parser.parse_args()

    min_range = args.min_range
    max_range = args.max_range
    prefix = args.prefix
    suffix = args.suffix
    out_fname = args.output

    # Make sure the output file name ends with .pdf (case-insensitive)
    if not out_fname.lower().endswith(".pdf"):
        out_fname += ".pdf"

    # ...
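
The parser can be tried out without a real command line by passing an explicit list to `parse_args`. The sketch below also rejects a range where the minimum is larger than the maximum, which would otherwise produce an empty image list. The functions `build_parser` and `parse_and_validate` are hypothetical wrappers, not part of the original program:

```python
import argparse


def build_parser():
    """Create a parser with the same options as in the article."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-min_range", type=int, default=0, help="Min range of input images")
    parser.add_argument("-max_range", type=int, default=0, help="Max range of input images")
    parser.add_argument("-prefix", default="", help="Image name prefix")
    parser.add_argument("-suffix", default=".png", help="Image termination, e.g. .png or .jpg")
    parser.add_argument("-output", default="out.pdf", help="Output file name")
    return parser


def parse_and_validate(argv=None):
    """Parse argv (or sys.argv[1:] when argv is None) and reject an empty range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.min_range > args.max_range:
        # parser.error prints the message and exits with status 2
        parser.error("min_range must not be greater than max_range")
    return args


if __name__ == "__main__":
    args = parse_and_validate(["-max_range=3", "-output=mydoc.pdf"])
    print(args.min_range, args.max_range, args.output)
```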

Let’s also move the code that loops over the image folder into a separate function:

import argparse
from PIL import Image

def process_images(min_range, max_range, prefix, suffix, out_fname):
    images = []
    for i in range(min_range, max_range + 1):
        fname = prefix + str(i) + suffix
        # Load and process the image
        im = Image.open(fname)
        # Pillow can't save RGBA images to PDF,
        # make sure the image is RGB
        if im.mode == "RGBA":
            im = im.convert("RGB")
        # Add the (optionally) processed image to the images list
        images.append(im)

    # Convert the images list to PDF
    images[0].save(out_fname, save_all=True, quality=100, append_images=images[1:])

if __name__ == "__main__":
    # Let the user pass parameters to the code; all parameters are optional and have default values
    # ...

    # Make sure the output file name ends with .pdf
    # ...

    process_images(min_range, max_range, prefix, suffix, out_fname)

Here is an example of running the above code:

python convertor_1.py -max_range=3 -output=mydoc.pdf

If you have the input images in a folder named my_images, stored in the current directory, this is how you change the above command:

python convertor_1.py -max_range=3 -prefix="my_images/" -output=mydoc.pdf
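
If the files in the folder are not numbered contiguously, an alternative to the numeric range is to list the folder and sort by the number embedded in each name. This is a variation on the approach above, not part of the original program, and `numeric_key` and `list_images_sorted` are hypothetical names:

```python
import re
from pathlib import Path


def numeric_key(path):
    """Sort key: the last run of digits in the file stem, or -1 if there is none."""
    numbers = re.findall(r"\d+", path.stem)
    return (int(numbers[-1]) if numbers else -1, path.name)


def list_images_sorted(folder, suffix=".png"):
    """Return the files in folder whose names end with suffix, ordered numerically."""
    files = [p for p in Path(folder).iterdir() if p.is_file() and p.name.endswith(suffix)]
    return sorted(files, key=numeric_key)


if __name__ == "__main__":
    for path in list_images_sorted(".", ".png"):
        print(path)
```

A plain alphabetical sort would put 10.png before 2.png, which is why the key converts the digits to an integer.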

If you prefer to use a dedicated library for PDF output, like img2pdf, and you don’t need to do any processing on the original images, you can modify the above code to not use Pillow:

import img2pdf
import argparse

def process_images(min_range, max_range, prefix, suffix, out_file):
    images = []
    for i in range(min_range, max_range + 1):
        fname = prefix + str(i) + suffix
        images.append(fname)
    out_file.write(img2pdf.convert(images))

if __name__ == "__main__":
    # Let the user pass parameters to the code; all parameters are optional and have default values
    # ...

    # Make sure the output file name ends with .pdf
    # ...

    with open(out_fname, "wb") as out_file:
        process_images(min_range, max_range, prefix, suffix, out_file)

Please note that in this case we open the output file in the main block and pass a file handle to the process_images function. Another notable difference is that the list stores the image file names rather than the loaded images. This makes the code faster and less memory-hungry than the Pillow version, but also less flexible, in the sense that it doesn’t let you do any processing on the input images.

Also, img2pdf doesn’t work with input images that have a transparency channel, so I’ve changed the default suffix parameter to .jpg for this example.
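
If you are not sure whether your files have a transparency channel, you could check them with Pillow before handing the names to img2pdf. This is a rough sketch: `has_transparency` is a hypothetical helper, and it only looks at the image mode and at the `transparency` entry in `im.info`:

```python
import os
import tempfile

from PIL import Image


def has_transparency(path):
    """Return True if the image at path has an alpha channel or palette transparency."""
    with Image.open(path) as im:
        return im.mode in ("RGBA", "LA", "PA") or "transparency" in im.info


if __name__ == "__main__":
    # Create two small test images in a temporary folder and inspect them
    with tempfile.TemporaryDirectory() as folder:
        rgba_name = os.path.join(folder, "0.png")
        rgb_name = os.path.join(folder, "1.png")
        Image.new("RGBA", (4, 4), (0, 0, 0, 0)).save(rgba_name)
        Image.new("RGB", (4, 4), "white").save(rgb_name)
        for name in (rgba_name, rgb_name):
            print(os.path.basename(name), has_transparency(name))
```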

You can find the complete source code in the GitHub repository for this article.

If you want to learn more about Python, I recommend reading Python Crash Course by Eric Matthes. The book is intended for beginners.
