How to Automatically Split PDFs by HAWB Number Using Python — A Complete Step-by-Step Guide
🧭 Introduction
In logistics, shipping, and freight forwarding, large PDF documents often contain multiple House Air Waybill (HAWB) sets — each consisting of two or more pages. Manually splitting these into individual files can be tedious and error-prone.
This guide will show you how to automate PDF splitting by HAWB Number using a simple Python script. You’ll learn how to extract text, detect HAWB numbers like HAWB Number: 9221038752, and save each set as a separate, correctly named PDF — all in seconds.
By the end of this tutorial, you’ll have a fully automated workflow for handling multi-page logistics documents efficiently.
🚀 Why Automate PDF Splitting by HAWB?
If you work in air cargo, freight forwarding, or documentation processing, you’ve probably faced this issue:
- A single PDF contains multiple shipment sets.
- Each set is two pages — invoice + airwaybill.
- Every set has a unique HAWB number.
- Manually splitting and renaming 20 or more sets is slow and repetitive.
With Python, you can split, detect, and rename all sets in one go — accurately and automatically.
🧰 Tools and Requirements
To complete this tutorial, you’ll need:
| Tool | Description |
|---|---|
| 🐍 Python 3.x | Programming language used for automation |
| 📄 PyPDF2 | Python library for reading and writing PDF files |
| 🧮 Regular Expressions (re) | Used to find HAWB numbers in text |
| 💻 A Searchable PDF | PDF with selectable text (not scanned) |
⚙️ Step-by-Step Guide
Step 1: Install Python and PyPDF2
If you don’t already have Python, download it from python.org/downloads and install it. Then open your terminal or Command Prompt and install the required library:
pip install PyPDF2
Step 2: Prepare Your PDF File
Place your PDF in a known folder. Example path:
C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf
Each HAWB set contains 2 pages, and each set includes a line such as:
HAWB Number: 9221038752
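Before building the full script, it can help to confirm that the pattern matches your label format. The sketch below uses only the standard `re` module and the same pattern as the script in Step 3. The helper name `find_hawb` and the sample strings are illustrative assumptions, not part of any library.

```python
import re

# Same pattern as the Step 3 script: the label, optional whitespace, then digits.
HAWB_PATTERN = re.compile(r"HAWB\s*Number:\s*(\d+)")


def find_hawb(text):
    """Return the first HAWB number found in text, or None if there is none."""
    match = HAWB_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up sample strings standing in for extracted page text.
    print(find_hawb("Shipper: ACME\nHAWB Number: 9221038752\nPieces: 3"))
    print(find_hawb("HAWB Number:9221038752"))
    print(find_hawb("No waybill label on this page"))
```

The pattern is case-sensitive, so a label printed as "hawb number:" would not match. If your documents vary in case, passing `re.IGNORECASE` to `re.compile` is one option.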
Step 3: Create the Python Script
Open Notepad, VS Code, or any code editor and paste this script. Save it as split_hawb_sets.py:
import os
import re
from PyPDF2 import PdfReader, PdfWriter

# --- CONFIG ---
input_pdf = r"C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf"
pages_per_set = 2  # each HAWB set = 2 pages
output_folder = os.path.join(os.path.dirname(input_pdf), "output_sets")

# --- Create output folder ---
os.makedirs(output_folder, exist_ok=True)

# --- Read the PDF ---
reader = PdfReader(input_pdf)
total_pages = len(reader.pages)
print(f"Total pages: {total_pages}")

# --- Process each 2-page set ---
for i in range(0, total_pages, pages_per_set):
    writer = PdfWriter()
    set_pages = reader.pages[i:i + pages_per_set]
    text = ""

    # Extract text from both pages
    for page in set_pages:
        text += page.extract_text() or ""

    # Find HAWB number using regex
    match = re.search(r"HAWB\s*Number:\s*(\d+)", text)
    if match:
        hawb_number = match.group(1)
        filename = f"HAWB_{hawb_number}.pdf"
    else:
        set_num = (i // pages_per_set) + 1
        filename = f"Set_{set_num:02d}.pdf"

    # Write the set
    for page in set_pages:
        writer.add_page(page)
    output_path = os.path.join(output_folder, filename)
    with open(output_path, "wb") as f:
        writer.write(f)
    print(f"✅ Saved: {output_path}")

print("\nAll sets split successfully!")
print(f"Output folder: {output_folder}")
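One caveat with the script above: because each file is opened in `"wb"` mode, two sets that yield the same HAWB number would produce the same filename, and the second would silently overwrite the first. A minimal sketch of one way to guard against that follows. The helper name `unique_path` and the `_2`, `_3` suffix scheme are assumptions made for illustration.

```python
import os


def unique_path(folder, filename):
    """Return a path in folder that does not exist yet.

    If filename is already taken, try name_2.ext, name_3.ext, and so on.
    """
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```

To use it, the line `output_path = os.path.join(output_folder, filename)` in the script could become `output_path = unique_path(output_folder, filename)`.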
Step 4: Run the Script
Open your Command Prompt and run:
python "C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\split_hawb_sets.py"
When the script finishes, you’ll find a new folder named:
output_sets
containing your separated and renamed files, such as:
- HAWB_9221038752.pdf
- HAWB_9221038753.pdf
- HAWB_9221038754.pdf
- ...
🧩 How It Works — Step-by-Step Explanation
- Read and Count PDF Pages: The script uses `PdfReader` to open your PDF and determine how many pages it contains.
- Divide into 2-Page Sets: Using a simple loop, it processes every two pages as one shipment set.
- Extract Text: It extracts text from those two pages using `extract_text()`.
- Find HAWB Number: The line `HAWB Number: 9221038752` is detected using a regular expression (`re.search()`), which looks for a run of digits following the phrase “HAWB Number:”.
- Rename and Save: Each set is then saved as a new PDF file named after the detected HAWB number. If no number is found, it assigns a generic name like `Set_01.pdf`.
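The "Divide into 2-Page Sets" step can be illustrated without a PDF at all. This is a minimal sketch that mirrors the `range(0, total_pages, pages_per_set)` loop in the script. `set_ranges` is a hypothetical helper name, not part of PyPDF2.

```python
def set_ranges(total_pages, pages_per_set=2):
    """Yield (start, end) page-index pairs, end exclusive, one pair per set."""
    for start in range(0, total_pages, pages_per_set):
        # min() keeps the final set inside the document when pages run out early.
        yield start, min(start + pages_per_set, total_pages)


if __name__ == "__main__":
    print(list(set_ranges(6)))  # [(0, 2), (2, 4), (4, 6)]
    print(list(set_ranges(5)))  # [(0, 2), (2, 4), (4, 5)]
```

Note the second case: when the page count is not a multiple of the set size, the last set is shorter than the rest. The script still saves it, so an odd total page count is worth checking before trusting the output.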
🎯 Benefits of This Automation
- ✅ Saves Time: Split and rename 20 or 200 pages in seconds.
- ✅ Fewer Errors: Removes manual typing and page-selection mistakes.
- ✅ Consistent Naming: All PDFs are named in a clean, uniform format.
- ✅ Reusable: You can modify it for other patterns (like Invoice No, Shipment ID, etc.).
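As a rough illustration of that reuse, the sketch below keeps several label patterns in one dictionary and builds the filename from whichever one is requested. The invoice and shipment label formats shown here are assumptions, since the guide only specifies the HAWB format, and `build_filename` is a hypothetical helper name.

```python
import re

# Only the HAWB pattern comes from this guide; the other two label formats
# are assumed for illustration and should be adjusted to your documents.
PATTERNS = {
    "HAWB": re.compile(r"HAWB\s*Number:\s*(\d+)"),
    "INVOICE": re.compile(r"Invoice\s*No\.?\s*:\s*([A-Za-z0-9-]+)"),
    "SHIPMENT": re.compile(r"Shipment\s*ID:\s*([A-Za-z0-9-]+)"),
}


def build_filename(text, kind, fallback):
    """Return '<KIND>_<value>.pdf' when the label is found, else the fallback name."""
    match = PATTERNS[kind].search(text)
    if match:
        return f"{kind}_{match.group(1)}.pdf"
    return fallback


if __name__ == "__main__":
    print(build_filename("Invoice No: INV-2024-001", "INVOICE", "Set_01.pdf"))
    print(build_filename("no label here", "HAWB", "Set_02.pdf"))
```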
🧠 Pro Tips
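- If every set in your PDF has the same page count but it is not two, change `pages_per_set = 2` to that number. If set lengths vary within one file, a fixed count will not work; detect where each new HAWB number begins instead.
- If you want a CSV report of extracted HAWB numbers and filenames, you can add one with a few lines using Python’s `csv` module.
- Works best on text-based PDFs (not scanned images). For scanned PDFs, add OCR using `pytesseract`.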
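The first two tips can be sketched together using only the standard library. This is an illustration under a stated assumption: the HAWB number appears on the first page of every set, so a change in number marks a new set. If your documents carry the number only on a later page, this grouping would assign pages to the wrong set. The function names `group_pages_by_hawb` and `write_report` and the CSV column names are hypothetical.

```python
import csv
import re

HAWB_PATTERN = re.compile(r"HAWB\s*Number:\s*(\d+)")


def group_pages_by_hawb(page_texts):
    """Group consecutive page indexes into sets of any length.

    Assumes the first page of each set shows its HAWB number. Pages with no
    detectable number, or with the same number, join the current set.
    Returns a list of (hawb_number_or_None, [page_indexes]) tuples.
    """
    groups = []
    for index, text in enumerate(page_texts):
        match = HAWB_PATTERN.search(text or "")
        hawb = match.group(1) if match else None
        if groups and (hawb is None or hawb == groups[-1][0]):
            groups[-1][1].append(index)
        else:
            groups.append((hawb, [index]))
    return groups


def write_report(groups, csv_path):
    """Write one CSV row per set: HAWB number, planned filename, 1-based pages."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["hawb_number", "filename", "pages"])
        for set_num, (hawb, pages) in enumerate(groups, start=1):
            filename = f"HAWB_{hawb}.pdf" if hawb else f"Set_{set_num:02d}.pdf"
            page_list = " ".join(str(p + 1) for p in pages)
            writer.writerow([hawb or "", filename, page_list])
```

In the script, the input list could be built with `[page.extract_text() or "" for page in reader.pages]`, and each group's page indexes could then be added to a fresh `PdfWriter` in place of the fixed two-page slice.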
🏁 Conclusion
Automating document splitting can save hours of manual work in logistics and data processing. Using a simple Python script and the PyPDF2 library, you can extract, split, and rename multi-page PDFs by HAWB Number in one automated workflow.
Whether you handle 10 or 1,000 sets, this method ensures speed, accuracy, and professional output every time.
