How to Automatically Split PDFs by HAWB Number Using Python -- A Complete Step-by-Step Guide

Mohammed Raza

🧭 Introduction

In logistics, shipping, and freight forwarding, large PDF documents often contain multiple House Air Waybill (HAWB) sets — each consisting of two or more pages. Manually splitting these into individual files can be tedious and error-prone.

This guide shows you how to automate PDF splitting by HAWB number with a short Python script. You’ll learn how to extract text, detect lines such as HAWB Number: 9221038752, and save each set as a separate PDF named after its HAWB number.

By the end of this tutorial, you’ll have a fully automated workflow for handling multi-page logistics documents efficiently.

🚀 Why Automate PDF Splitting by HAWB?

If you work in air cargo, freight forwarding, or documentation processing, you’ve probably faced this issue:

  • A single PDF contains multiple shipment sets.
  • Each set is two pages: an invoice and an air waybill.
  • Every set has a unique HAWB number.
  • Manually splitting and renaming 20 or more sets is slow and repetitive.

With Python, you can split, detect, and rename all sets in one go — accurately and automatically.

🧰 Tools and Requirements

To complete this tutorial, you’ll need:

Tool | Description
🐍 Python 3.x | Programming language used for automation
📄 PyPDF2 | Python library for reading and writing PDF files
🧮 Regular Expressions (re) | Built-in Python module used to find HAWB numbers in text
💻 A Searchable PDF | PDF with selectable text (not scanned)

⚙️ Step-by-Step Guide

Step 1: Install Python and PyPDF2

If you don’t already have Python, download it from python.org/downloads and install it. Then open your terminal or Command Prompt and install the required library:

pip install PyPDF2

Step 2: Prepare Your PDF File

Place your PDF in a known folder. Example path:

C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf

Each HAWB set contains 2 pages, and each set includes a line such as:

HAWB Number: 9221038752

Step 3: Create the Python Script

Open Notepad, VS Code, or any code editor and paste this script. Save it as split_hawb_sets.py in the same folder as your PDF, and change input_pdf to point to your own file:

import os
import re
from PyPDF2 import PdfReader, PdfWriter

# --- CONFIG ---
input_pdf = r"C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf"
pages_per_set = 2  # each HAWB set = 2 pages
output_folder = os.path.join(os.path.dirname(input_pdf), "output_sets")

# --- Create output folder ---
os.makedirs(output_folder, exist_ok=True)

# --- Read the PDF ---
reader = PdfReader(input_pdf)
total_pages = len(reader.pages)
print(f"Total pages: {total_pages}")

# --- Process each 2-page set ---
for i in range(0, total_pages, pages_per_set):
    writer = PdfWriter()
    set_pages = reader.pages[i:i + pages_per_set]
    text = ""

    # Extract text from both pages
    for page in set_pages:
        text += page.extract_text() or ""

    # Find HAWB number using regex
    match = re.search(r"HAWB\s*Number:\s*(\d+)", text)
    if match:
        hawb_number = match.group(1)
        filename = f"HAWB_{hawb_number}.pdf"
    else:
        set_num = (i // pages_per_set) + 1
        filename = f"Set_{set_num:02d}.pdf"

    # Write the set
    for page in set_pages:
        writer.add_page(page)

    output_path = os.path.join(output_folder, filename)
    with open(output_path, "wb") as f:
        writer.write(f)

    print(f"✅ Saved: {output_path}")

print("\nAll sets split successfully!")
print(f"Output folder: {output_folder}")
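
One caveat with the script above: if two sets yield the same HAWB number, both get the same filename and the second file overwrites the first. Below is a minimal sketch of one way to avoid that, assuming a numeric suffix such as _2 is acceptable in your naming scheme. The helper name unique_path is hypothetical and not part of PyPDF2.

```python
import os


def unique_path(folder, filename):
    """Return a path inside folder that does not exist yet.

    If filename is already taken, append _2, _3, ... before the extension.
    """
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```

To use it, replace the os.path.join(output_folder, filename) line in the script with output_path = unique_path(output_folder, filename).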

Step 4: Run the Script

Open your Command Prompt and run:

python "C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\split_hawb_sets.py"

When the script finishes, you’ll find a new folder named:

output_sets

containing your separated and renamed files, such as:

  • HAWB_9221038752.pdf
  • HAWB_9221038753.pdf
  • HAWB_9221038754.pdf
  • ...

🧩 How It Works — Step-by-Step Explanation

  1. Read and Count PDF Pages: The script uses PdfReader to open your PDF and determine how many pages it contains.
  2. Divide into 2-Page Sets: Using a simple loop, it processes every two pages as one shipment set.
  3. Extract Text: It extracts text from those two pages using extract_text().
  4. Find HAWB Number: The line HAWB Number: 9221038752 is detected using a regular expression (re.search()), which looks for a number pattern following the phrase “HAWB Number”.
  5. Rename and Save: Each set is then saved as a new PDF file named after the detected HAWB number. If no number is found, it assigns a generic name like Set_01.pdf.
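
The naming logic in steps 2 to 5 can be tried without a PDF at all. The sketch below runs the same regular expression and fallback naming over a plain list of strings standing in for the text of each page. The function name name_sets and the sample page texts are made up for illustration; in the real script the text comes from extract_text().

```python
import re

# Same pattern as in the script: the phrase, a colon, then digits
HAWB_PATTERN = re.compile(r"HAWB\s*Number:\s*(\d+)")


def name_sets(page_texts, pages_per_set=2):
    """Return one output filename per fixed-size set of pages."""
    filenames = []
    for start in range(0, len(page_texts), pages_per_set):
        # Join the text of all pages in this set before searching
        set_text = "\n".join(page_texts[start:start + pages_per_set])
        match = HAWB_PATTERN.search(set_text)
        if match:
            filenames.append(f"HAWB_{match.group(1)}.pdf")
        else:
            # No number found: fall back to a numbered generic name
            set_num = start // pages_per_set + 1
            filenames.append(f"Set_{set_num:02d}.pdf")
    return filenames


# Hypothetical page texts: the second set has no detectable number
sample_pages = [
    "Commercial Invoice\nHAWB Number: 9221038752",
    "Air Waybill",
    "Commercial Invoice",
    "Air Waybill",
]
print(name_sets(sample_pages))
# → ['HAWB_9221038752.pdf', 'Set_02.pdf']
```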

🎯 Benefits of This Automation

  • Saves Time: Split and rename a 20-page or 200-page PDF in one run instead of page by page.
  • Fewer Errors: No manual typing or page selection, so mistakes are limited to sets where the HAWB number cannot be read.
  • Consistent Naming: All PDFs are named in a clean, uniform format.
  • Reusable: You can modify it for other patterns (like Invoice No, Shipment ID, etc.).
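
As a rough illustration of the reusability point, the regular expression can be built from a label instead of being hard-coded. This is a sketch under assumptions: the field is written as "Label: value" on the page, and the value contains only letters, digits, and hyphens. The names build_pattern and extract_field and the sample invoice number are hypothetical.

```python
import re


def build_pattern(label):
    """Compile a regex that matches '<label>: <value>'.

    Assumes the value is made of letters, digits, and hyphens only.
    """
    # Escape each word and allow flexible whitespace between words
    words = [re.escape(word) for word in label.split()]
    return re.compile(r"\s*".join(words) + r"\s*:\s*([A-Za-z0-9-]+)")


def extract_field(text, label):
    """Return the value that follows the label, or None if it is absent."""
    match = build_pattern(label).search(text)
    return match.group(1) if match else None


print(extract_field("HAWB Number: 9221038752", "HAWB Number"))
print(extract_field("Invoice No: INV-2024-001\nTotal: 10", "Invoice No"))
```

If your documents write the field differently (for example "Invoice No." with a period and no colon), the pattern needs adjusting to match.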

🧠 Pro Tips

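  • If every set in your PDF has the same number of pages but that number is not 2, change pages_per_set = 2 to the right value. If the number of pages varies from set to set, a fixed count will not work; start a new set each time a new HAWB number appears instead.
  • If you want a CSV report of extracted HAWB numbers and filenames, you can add one with a few lines using Python’s built-in csv module.
  • Works best on text-based PDFs (not scanned images). For scanned PDFs, the pages must first be converted to images and run through OCR, for example with pytesseract, which requires the Tesseract engine to be installed.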
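
The first two tips can be sketched together using only the standard library. The example assumes the HAWB number appears on the first page of each set, so pages without a number are attached to the set before them. The function names group_pages_by_hawb and write_report, the report columns, and the sample page texts are all hypothetical; in practice the list of page texts would come from extract_text().

```python
import csv
import re

HAWB_PATTERN = re.compile(r"HAWB\s*Number:\s*(\d+)")


def group_pages_by_hawb(page_texts):
    """Group 0-based page indexes into sets of varying length.

    A new set starts whenever a page shows a HAWB number that differs
    from the current set's number.
    """
    groups = []  # list of (hawb_number or None, [page indexes])
    for index, text in enumerate(page_texts):
        match = HAWB_PATTERN.search(text or "")
        number = match.group(1) if match else None
        starts_new_set = not groups or (
            number is not None and number != groups[-1][0]
        )
        if starts_new_set:
            groups.append((number, [index]))
        else:
            groups[-1][1].append(index)
    return groups


def write_report(groups, report_path):
    """Write one CSV row per set: HAWB number, filename, page numbers."""
    with open(report_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["hawb_number", "filename", "pages"])
        for set_num, (number, pages) in enumerate(groups, start=1):
            filename = f"HAWB_{number}.pdf" if number else f"Set_{set_num:02d}.pdf"
            # Report 1-based page numbers, as a PDF viewer shows them
            page_list = " ".join(str(p + 1) for p in pages)
            writer.writerow([number or "", filename, page_list])


# Hypothetical page texts: a 3-page set followed by a 2-page set
pages = [
    "HAWB Number: 111",
    "continued",
    "details",
    "HAWB Number: 222",
    "HAWB Number: 222",
]
print(group_pages_by_hawb(pages))
# → [('111', [0, 1, 2]), ('222', [3, 4])]
```

Each group's page indexes can then be passed to writer.add_page(reader.pages[index]) in place of the fixed two-page slice.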

🏁 Conclusion

Automating document splitting can save hours of manual work in logistics and data processing. Using a simple Python script and the PyPDF2 library, you can extract, split, and rename multi-page PDFs by HAWB Number in one automated workflow.

Whether you handle 10 or 1,000 sets, the same script applies. Spot-check the output for any files that fell back to a generic Set_NN name, since those are the sets where no HAWB number was detected.
