Scenario: You are given a PDF with all the data you need… except PDFs are not a usable format for collating data. Ideally it should have been a spreadsheet, a text list or something similar. Worse still… the PDF is mostly images, with the data you need contained within the images. You could type out all the data into a spreadsheet, which could take ages… or you could get clever and extract the data inside 10 minutes.
There is software on the market which will do everything you need in a single step. I haven’t bothered testing any such software because there are excellent free online tools which do the same job. Therefore, the method I explain here is a completely no-cost method of data extraction. There are a few extra steps involved, but the extra time they take is close to negligible.
I live (by choice) in a regional country town. The local newspaper is an A3 print, folded with two staples in the middle. When it’s delivered, it’s A4 size. On the newspaper website, the whole newspaper is uploaded as a PDF download. (I did mention it’s a small town??) At the rear of the newspaper is a local business directory with small business-card-size ads. Most of them contain an email address, website and phone number. As a marketing guy… I want to contact each of these businesses and I need the data as text to feed into an emailing program.
So here’s what I do.
First you need to scan the PDF document for all text. Once upon a time one would print the document and use the OCR (Optical Character Recognition) software that came with the scanner. But few people own a scanner these days. Instead we use online OCR scanners to scan the digital copy rather than the hard paper copy.
So we need an online PDF scanner to scan the PDF document and lift any text. The link for this is provided below. I don’t know how they do this… but it has amazing accuracy. There may be other online OCR scanners… but I don’t know of any that have this level of accuracy.
All we need to do is upload the PDF document, and then it’s sent back as plain text.
After the document has been scanned, it will be saved for you to download.
The text file will contain the text you need… but it will be unusable. Everything the OCR didn’t recognize will be replaced with a space character, tab or something similar. Therefore, it will be just as useless as the PDF… except this time we have some text for software to recognize the ‘@’ sign in email addresses. Yay.
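If the stray spaces and tabs bother you, the text can be tidied up before going any further. This is only a minimal sketch in Python and not part of the online tools: the function name tidy_ocr_text and the sample text are assumptions made for illustration, and it does nothing more than collapse runs of spaces and tabs and drop blank lines.

```python
import re

def tidy_ocr_text(raw):
    """Collapse runs of spaces/tabs on each line and drop blank lines."""
    lines = []
    for line in raw.splitlines():
        # Replace any run of spaces, tabs or non-breaking spaces with one space.
        cleaned = re.sub(r"[ \t\u00a0]+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)

# Made-up sample resembling messy OCR output (not real data).
sample = "Joe's  Plumbing \t\t joe@example.com\n\n   \n  www.example.com  "
print(tidy_ocr_text(sample))
```

The extraction steps that follow work on the untidied text as well, so this step is optional.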
Now we need a tool to scan the text and lift all the email addresses, which are the pieces of text that contain “@”.
The link for a free tool to lift email addresses is provided below. Again, it has an amazing level of accuracy.
Copy all the text and paste it into the text box as shown.
Press the extract email address button and the email addresses will be lifted and collated into a neat list. Copy all these addresses into a spreadsheet for safe keeping. How good is that!
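For anyone who prefers to do this step offline, the same idea can be sketched in a few lines of Python. This is not how the online tool works internally (I have no idea what it does under the hood): the pattern, the function name extract_emails and the sample text are all assumptions made for illustration. The pattern is deliberately simple and will not catch every valid address.

```python
import re

# Simple pattern: local part, "@", then a domain containing at least one dot.
# It is a rough filter, not a full validator of email syntax.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(text):
    """Return unique email-like strings, lower-cased, in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result

# Made-up sample text (not real data).
sample = "Joe's Plumbing   joe@example.com \t 5550 1234\nSALES@Example.org   joe@example.com."
print(extract_emails(sample))
```

Lower-casing before de-duplicating means the same address printed in two different ads only lands in the list once.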
Some of the ads in the PDF document didn’t contain an email address… but they contained a website address. Often websites will provide the email address on the contact page. So… let’s lift the URLs as well.
A free online tool to lift URLs from text is provided below. In the same way, we paste in the text lifted from the PDF document to scan for URLs.
Click the “Parse” button and all URLs will be lifted from the text.
Save to a safe place. Later I will demonstrate how to automatically open each of these websites and check for email addresses.
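The URL step can also be sketched offline in Python. Again, this is a rough sketch under stated assumptions rather than what the online parser does: the pattern only looks for text starting with http://, https:// or www., and the function name extract_urls and the sample text are hypothetical. Adding http:// in front of bare www. addresses is also my assumption, made so the results can be opened directly later.

```python
import re

# Anything starting with http://, https:// or www. up to the next
# whitespace, quote or angle bracket.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)

def extract_urls(text):
    """Return unique URL-like strings in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.findall(text):
        # Trim punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)")
        # Assumption: bare "www." addresses are given an http:// scheme.
        if url.lower().startswith("www."):
            url = "http://" + url
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Made-up sample text (not real data).
sample = "Visit www.example.com.au, or http://example.org/contact. Email joe@example.com"
print(extract_urls(sample))
```

Ads that print a bare domain with no www. in front (for example "example.com") will be missed by this pattern.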
Some of the ads didn’t contain an email address or a website address… but the ads supplied a phone number.
So let’s grab the phone numbers too, and I’ll show you a way to SMS these people or make automated phone calls to contact them in a non-intrusive way.
An excellent online tool to extract phone numbers from text is provided in the links below.
Now… to be honest… the website looks like it was designed by a 12-year-old in the 1990s. But what it lacks in design is made up for by accuracy in extracting phone numbers from text. It really is superb.
In the same way, we paste in the text and the phone numbers are extracted as shown below.
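Phone numbers are the messiest of the three because the formats vary so much, but the idea can still be sketched offline in Python. This is only an illustration and not the method the online tool uses: the pattern, the 8 to 15 digit range, the function name extract_phone_numbers and the sample numbers are all assumptions. It looks for runs of digits separated by at most two spaces, dots, hyphens or brackets, then strips everything except the digits and a leading "+".

```python
import re

# A digit followed by 7 to 14 more digits, each preceded by at most two
# separator characters. That gives 8 to 15 digits in total (an assumed range).
PHONE_RE = re.compile(r"\+?\(?\d(?:[ ().-]{0,2}\d){7,14}")

def extract_phone_numbers(text):
    """Return unique phone-like numbers as digit strings, keeping a leading '+'."""
    numbers = []
    for match in PHONE_RE.finditer(text):
        raw = match.group()
        digits = re.sub(r"\D", "", raw)  # drop spaces, brackets, hyphens, dots
        normalized = ("+" if raw.startswith("+") else "") + digits
        if normalized not in numbers:
            numbers.append(normalized)
    return numbers

# Made-up sample text with fictional numbers (not real data).
sample = "Joe's Plumbing (02) 5550 1234   Mobile: 0400-555-123  est. 1987"
print(extract_phone_numbers(sample))
```

Short runs such as years are skipped because they have too few digits, but any other long number in the text (an account number, for example) will be picked up too, so the list is worth a quick look before use.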
Data mining a PDF document doesn’t need to be difficult. All it needs is the right tools and a few smarts, and the whole process can be accomplished inside a few minutes. In the example above, I managed to scan 3 PDF documents and extract everything I needed. Inside 10 minutes, I lifted around 500 contact details. In some circles this might not be a lot… but converting even a few of those 500 contacts into sales is well worth the 10 minutes I spent extracting the data.
There are some things that need further explaining such as contacting people via SMS, automated phone calling that isn’t intrusive and automatically lifting email addresses from websites. I’ll cover this in a future post. Make sure you subscribe to the blog so that you grab my methods when I have the posts ready.
Don’t forget to grab your links. If you got something out of this post, please leave your thoughts below.
Online Scanner PDF to TXT extractor: http://www.onlineocr.net
Online TXT to Email extractor: http://www.procato.com/mailextract/
Online TXT to URL extractor: http://www.noteparse.com/archive
Online TXT to Phone Number extractor: http://phonenumberextractor.com/