Parse PDF data to convert to UBL
Parsing PDF files reliably can be a challenge. I tried two hosted solutions and want to share my findings.
One of my customers wanted to automate importing supplier invoices into their accounting software. Until then they had to print each PDF and manually type all the invoice data into the accounting software, which was a lot of work (roughly 20 invoices per week with an average of 10 lines per invoice).
Their accounting software can import UBL files directly. If only everyone sent UBL files alongside their human-readable PDF invoices, the world of bookkeeping would be a lot easier.
Unfortunately this isn’t the case yet. UBL has been around for about 10 years, but it is taking a long time for businesses to change their workflows; most businesses only recently started sending PDF invoices.
go2ubl
First we tried go2ubl for about 6 months. They state that they convert PDF invoices to UBL within a few hours, but in practice we had to wait around 24 hours for most invoices (and sometimes longer). My customer receives a lot of foreign invoices, so perhaps go2ubl had trouble parsing them automatically; this led me to believe that most of my customer’s invoices were being converted to UBL manually.
With the manual conversion we noticed some typos in supplier names, addresses, etc. That’s not what we want: handling supplier invoices correctly is a critical part of a business.
Although we needed to be patient when retrieving the data, we would always receive a UBL file eventually, so if you don’t mind waiting, it might be a good fit.
The bigger downside for my customer was that all invoice lines were grouped into a single line holding the invoice total. If an invoice had 4 lines with different amounts (e.g. product A, B, C and shipping), the UBL would contain only 1 line item with the total amount.
Since different products need to be booked under different categories in the accounting software, this wasn’t an option: they still had to split the lines manually (to separate and categorize transport costs, fees and different product types).
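To make the problem concrete, here is a small sketch of what gets lost when the lines are collapsed. The amounts and the category names (`goods`, `packaging`, `transport`, `unknown`) are made up for illustration; they are not taken from my customer’s bookkeeping or from go2ubl’s output.

```python
from collections import defaultdict
from decimal import Decimal


def totals_per_category(lines):
    """Sum the line amounts per bookkeeping category."""
    totals = defaultdict(Decimal)  # Decimal() starts every category at 0
    for line in lines:
        totals[line["category"]] += Decimal(line["amount"])
    return dict(totals)


# Hypothetical invoice with 4 separate lines, as printed on the PDF.
detailed_lines = [
    {"description": "Product A", "category": "goods", "amount": "40.00"},
    {"description": "Product B", "category": "goods", "amount": "25.00"},
    {"description": "Product C", "category": "packaging", "amount": "10.00"},
    {"description": "Shipping", "category": "transport", "amount": "15.00"},
]

# The same invoice when every line is merged into one total line:
# the amount is still correct, but the category split cannot be recovered.
collapsed_lines = [
    {"description": "Invoice total", "category": "unknown", "amount": "90.00"},
]

print(totals_per_category(detailed_lines))
print(totals_per_category(collapsed_lines))
```

With the detailed lines every amount lands in its own category; with the collapsed line someone still has to open the PDF and split the 90.00 by hand.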
Pros
- Works out of the box
Create an account, mail your PDF invoices and receive UBL files.
- No need for programming skills
Cons
- Receiving the converted UBL files can take some time (sometimes a few days)
- UBL files contain only one invoice line, which means there is still some manual work
- Only converts to UBL (hence the name go2ubl 🙂), no other data structures selectable
- Received some files with typos
Docparser
After go2ubl we switched to Docparser, which allows you to create parsers for converting PDF files to usable data structures.
For this to work you need to create a parser that handles your recurring PDF files. At first this seems a bit challenging, but after a few tries you’ll get the hang of it and can create one within 30 minutes.
Just upload a few test PDFs from your supplier and start creating parsing rules.
You can choose from many different parsing rule presets; these are the basic ones (there are a lot more):
In the line items preset you mark the columns in your PDF, so that Docparser knows where to split the line item data (in this file you see quantity, price, total):
The next step is to define rules to group the data into usable content:
To get the best results it’s advisable to create 1 parser per supplier, so that each parser is tailored to that specific supplier’s PDF structure.
I have set up the different parsers so that I always receive the same data structure posted to our webhook. I receive the data as JSON (other data structures are selectable) and have a script convert it to UBL XML files, which are automatically imported into the accounting software.
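The conversion step could look roughly like the sketch below. It is not my production script. The JSON field names (`invoice_number`, `invoice_date`, `supplier_name`, `currency`, `line_items` and the keys inside each line) are assumptions for illustration, since the real keys depend on how you name the fields in your own parsers. The output is also only a skeleton: a UBL invoice that validates against the schema needs more elements (customer party, tax totals and so on), and your accounting software may require additional fields.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Namespaces used by UBL 2.x invoices.
NS_INVOICE = "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
NS_CAC = "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
NS_CBC = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"

ET.register_namespace("", NS_INVOICE)
ET.register_namespace("cac", NS_CAC)
ET.register_namespace("cbc", NS_CBC)


def _cbc(parent, tag, text, **attrs):
    """Append a basic (leaf) element such as cbc:ID or cbc:Name."""
    element = ET.SubElement(parent, f"{{{NS_CBC}}}{tag}", attrs)
    element.text = str(text)
    return element


def _cac(parent, tag):
    """Append an aggregate (container) element such as cac:InvoiceLine."""
    return ET.SubElement(parent, f"{{{NS_CAC}}}{tag}")


def parsed_invoice_to_ubl(payload):
    """Turn one parsed invoice (a dict with assumed keys) into a UBL string."""
    currency = payload["currency"]
    root = ET.Element(f"{{{NS_INVOICE}}}Invoice")
    _cbc(root, "ID", payload["invoice_number"])
    _cbc(root, "IssueDate", payload["invoice_date"])
    _cbc(root, "DocumentCurrencyCode", currency)

    party = _cac(_cac(root, "AccountingSupplierParty"), "Party")
    _cbc(_cac(party, "PartyName"), "Name", payload["supplier_name"])

    # Invoice total is derived from the lines; Decimal avoids float rounding.
    total = sum(Decimal(str(line["total"])) for line in payload["line_items"])
    _cbc(_cac(root, "LegalMonetaryTotal"), "PayableAmount",
         f"{total:.2f}", currencyID=currency)

    # One cac:InvoiceLine per parsed line, so nothing is merged together.
    for number, line in enumerate(payload["line_items"], start=1):
        ubl_line = _cac(root, "InvoiceLine")
        _cbc(ubl_line, "ID", number)
        _cbc(ubl_line, "InvoicedQuantity", line["quantity"])
        _cbc(ubl_line, "LineExtensionAmount",
             f"{Decimal(str(line['total'])):.2f}", currencyID=currency)
        _cbc(_cac(ubl_line, "Item"), "Name", line["description"])
        _cbc(_cac(ubl_line, "Price"), "PriceAmount",
             f"{Decimal(str(line['price'])):.2f}", currencyID=currency)

    return ET.tostring(root, encoding="unicode")


# Hypothetical webhook payload, already decoded from JSON.
example_payload = {
    "invoice_number": "2024-0001",
    "invoice_date": "2024-01-15",
    "supplier_name": "Example Supplier",
    "currency": "EUR",
    "line_items": [
        {"description": "Product A", "quantity": "2", "price": "10.00", "total": "20.00"},
        {"description": "Shipping", "quantity": "1", "price": "12.50", "total": "12.50"},
    ],
}

print(parsed_invoice_to_ubl(example_payload))
```

Because every parser posts the same structure, one function like this can serve all suppliers; only the parsers themselves differ per supplier.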
Pros
- Different data structures selectable (XML, JSON, form data)
- Lots of integrations (Zapier, Salesforce, Google Spreadsheets etc.)
- Flexible and customizable
You can change parser outputs yourself, giving you great flexibility in what data to parse
Cons
- Developer knowledge required
I think it’s easier to set up parsers if you have some development or computer skills (as a developer or system administrator). You will probably be able to create a parser without them, but more complex PDF files need some work to set up correctly.
- Requires a few days to set up
If you need to create parsers for 10+ suppliers it will take a few days to set everything up and test, but after this you will save a lot of time.
Fascinating. Have you evaluated any text parsers that could process every email that enters a specified mailbox?