This is a GUI for scraping PDFs with the help of optical character recognition, making it easier than ever to scrape PDFs.

Overview

pdf-scraper-with-ocr

With this tool I aim to make work easier for those who need to scrape PDFs either by hand or with tools that don't implement any kind of character recognition.

How it works

When you run the program, a GUI opens with four buttons. Only two of them are available at the beginning: "Choose a PDF" and "Extract Information". We start by choosing our PDF. Clicking the button opens a new window where we can navigate through our folders and select the PDF we want.

Once we have selected the PDF, the "Delete Pages" button will activate. Here we can select the pages we want to delete from our PDF because they do not contain information we want to scrape. Do not worry: the program creates a copy of your PDF and modifies the copy; the original is only read to create that copy. If you do not want to delete any pages, just leave the field blank. However, if your PDF contains a cover, an index or other one-off pages, you can delete them by listing each page separated by a semicolon: 1;2;10; will delete pages 1, 2 and 10. To delete a range of pages, give the first and last page separated by a hyphen: 5-10 will delete pages 5, 6, 7, 8, 9 and 10. See below for other commands.

Now that we have deleted the pages we did not need, the "PDF to images" button will activate. Pressing it opens a window asking us to select the folder where the pages of the PDF will be saved as images. If the PDF has over 100 pages this might take a while (around 25 minutes for 456 pages in my case). The window might look frozen, but do not worry, the program is still running.

Finally, once all the pages have been converted to images, we can start scraping the PDF. Clicking on "Extract Information" changes the window and presents four new buttons: "Load images", "Undo", "Show image" and "Extract text". Clicking on "Load images" opens a window where we can select the folder where our images were saved. Once we have selected the folder, we will be asked whether our PDF follows any pattern. A pattern is used whenever the information we want to obtain is spread across several pages. Maybe the phone number of a client is on one page and the email on the next one; in that case we must be sure that every client follows this pattern and has the phone number and email in the same place. If our information is not split across different pages we can write 1, as the pattern repeats every page. We also need to choose whether we want to see random images or not. We will select not randomized for now; see below for more information.
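To make the idea of a pattern more concrete, here is a minimal sketch of how an ordered list of page images could be split into groups of `pattern` pages, one group per record. The function name `group_by_pattern` and the file names are hypothetical and only serve as illustration; this is not necessarily how the program does it internally.

```python
from typing import List


def group_by_pattern(image_paths: List[str], pattern: int) -> List[List[str]]:
    """Split an ordered list of page images into consecutive groups of `pattern` pages."""
    if pattern < 1:
        raise ValueError("pattern must be 1 or greater")
    ordered = list(image_paths)
    # Each slice holds the pages that together describe one record (e.g. one client).
    return [ordered[i:i + pattern] for i in range(0, len(ordered), pattern)]


# Hypothetical file names, assuming one image per PDF page.
pages = ["page_1.png", "page_2.png", "page_3.png", "page_4.png"]
groups = group_by_pattern(pages, 2)
# groups[0] holds the two pages of the first record, groups[1] those of the second.
```

With a pattern of 1 every group contains a single page, which matches the case where the information is not split across pages.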

When we click on "ok", the program loads a series of preview images where we can select, by clicking and dragging, the information we want to keep. Every time we start clicking, a red rectangle follows the mouse until the click is released. After releasing the mouse we are asked for the name of the field we just selected. This name will be the name of the column where this information is stored. After creating as many selections as we want, we can click on "Extract text". Go grab a coffee: this might take a long time, but when it finishes a new file will appear in the folder where you are running this script: an Excel file with all the information you wanted.
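As a rough illustration of the two ideas above (a dragged rectangle becoming a region to read, and field names becoming columns), the following sketch may help. The names `drag_to_box` and `build_table` are hypothetical, the OCR step itself is left out, and the real program may store its selections differently.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


def drag_to_box(start: Tuple[int, int], end: Tuple[int, int]) -> Box:
    """Turn the press and release points of a drag into a normalized box.

    The drag can go in any direction, so the corners are sorted.
    """
    x0, y0 = start
    x1, y1 = end
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def build_table(records: List[Dict[str, str]]) -> List[List[str]]:
    """Return a header row of field names followed by one row per record."""
    columns: List[str] = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)  # keep the order in which fields first appear
    rows = [[record.get(name, "") for name in columns] for record in records]
    return [columns] + rows


box = drag_to_box((200, 150), (40, 60))  # dragged from bottom-right to top-left
table = build_table([
    {"phone": "555-0100", "email": "a@example.com"},  # made-up sample values
    {"phone": "555-0101", "email": "b@example.com"},
])
```

A table shaped like this (header plus rows) is what would end up in the spreadsheet, with one column per named selection.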

Here you have a demo of the process of selecting the area with a project that has a pattern of 2: https://i.imgur.com/Pt9unky.mp4

Deleting pages

Every PDF is different. They can be organized in many different ways, which makes automating the deletion of pages kind of a pain. Currently these are the commands supported for deleting pages:

Single page deletion

This will delete the pages that correspond to the written indexes: 1;2;10; will delete pages 1, 2 and 10.

Delete page in range

This will delete the pages between the first and last index separated by a hyphen: 5-10 will delete pages 5, 6, 7, 8, 9 and 10.

Delete every Nx pages:

If every third page in our PDF has no interesting information, we can use Nx to delete every page whose index is a multiple of N. 3x will delete pages 3, 6, 9, 12, 15...

Delete every Nx + C pages:

Maybe our PDF follows a pattern like this: page 1 (useful), page 2 (useless), page 3 (useful), (the pattern begins again here) page 4 (useful)... We need to delete pages 2, 5, 8, 11... In that case, 3x+1 will delete, for every three pages, the next page.

Delete everything after or before N:

To delete all pages after page N, use N-. In the same way, -N will delete all pages before page N.

Combinations

You can combine different methods by separating them with semicolons: 4x; 100-; 45; will delete every fourth page, all pages after index 100 and page 45.
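The commands above can be sketched as a small parser. This is only an illustration under stated assumptions, not the program's actual code: the function name `pages_to_delete` is hypothetical, pages are 1-based, N- and -N are taken to exclude page N itself (as "after" and "before" suggest), and the Nx + C form is deliberately left out because its exact offset convention is not spelled out here.

```python
from typing import Set


def pages_to_delete(spec: str, total_pages: int) -> Set[int]:
    """Return the 1-based page numbers selected by a semicolon-separated spec."""
    selected: Set[int] = set()
    for raw in spec.split(";"):
        token = raw.strip()
        if not token:
            continue  # blank field or trailing semicolon
        if token.endswith("x"):
            # Nx: every page whose index is a multiple of N
            step = int(token[:-1])
            selected.update(range(step, total_pages + 1, step))
        elif token.endswith("-"):
            # N-: everything after page N (assumed exclusive)
            selected.update(range(int(token[:-1]) + 1, total_pages + 1))
        elif token.startswith("-"):
            # -N: everything before page N (assumed exclusive)
            selected.update(range(1, int(token[1:])))
        elif "-" in token:
            # A-B: inclusive range
            first, last = (int(part) for part in token.split("-"))
            selected.update(range(first, last + 1))
        else:
            # single page
            selected.add(int(token))
    # Ignore page numbers that do not exist in the document.
    return {page for page in selected if 1 <= page <= total_pages}


combined = pages_to_delete("4x; 100-; 45;", total_pages=104)
```

Unsupported tokens such as 3x+1 make `int()` raise a `ValueError` in this sketch instead of being silently guessed at.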

The Show image button

It is important to make sure your selections capture all the information on all pages. To help you create better selections, you can click on the "Show image" button to navigate across different pages. If you have a pattern of 1, you will see that every time you click on the button the image changes but the rectangles stay in place. In case you want to delete any of them you can use the "Undo" button (explanation below). If you have a pattern greater than 1, clicking on "Show image" makes your selections disappear. This is because the program keeps track of which selections you have made on which page of the pattern. You can also create selections here; they will be analyzed together with the ones on the previous page.

Randomized preview

Randomizing the preview images can be quite helpful. Often every section in a PDF seems to follow the same pattern and fill the same space, but every now and then some fields might not be where they should, or some piece of text might be bigger than the rectangle you created before. This is where the randomized preview can save your output file. Keep in mind that the random preview still shows images in order according to the pattern you selected; you will just see different pattern groups instead of the first three that the non-randomized option offers.
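One way to picture the difference between the two preview modes is the sketch below. It is an assumption-laden illustration rather than the program's code: the function name `preview_pages`, the `seed` parameter and the default of three groups are hypothetical choices, and pages are numbered from 1.

```python
import random
from typing import List, Optional


def preview_pages(total_pages: int, pattern: int, count: int = 3,
                  randomize: bool = False, seed: Optional[int] = None) -> List[List[int]]:
    """Pick `count` pattern groups to preview and return their 1-based page numbers."""
    total_groups = total_pages // pattern
    count = min(count, total_groups)
    if randomize:
        # Random groups, but each group is still shown whole and in page order.
        chosen = random.Random(seed).sample(range(total_groups), count)
    else:
        # Non-randomized: simply the first groups of the document.
        chosen = list(range(count))
    return [[group * pattern + offset + 1 for offset in range(pattern)]
            for group in chosen]


first_groups = preview_pages(total_pages=12, pattern=2)
random_groups = preview_pages(total_pages=12, pattern=2, randomize=True, seed=0)
```

In both modes the pages inside a group stay together and in order; only the choice of groups changes.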

The Undo button

In case you clicked something by mistake, misspelled the name you wanted for a field, or created a rectangle that you later discovered will not capture all the info you wanted, there is an Undo button. The Undo button removes the last rectangle created. If your PDF follows a pattern greater than 1, the Undo button deletes the last rectangle created on the page you are on. For example, suppose your PDF has a pattern of 3 and you have created two rectangles on page 1, then clicked on "Show image" to see the next image in your pattern (page 2), created a rectangle there and gone back to page 1 (by clicking twice on "Show image"). Clicking the Undo button will not delete the selection from page 2; it will delete the last selection created on the page you are on at the moment of clicking.
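The behaviour of "Show image" and "Undo" described above can be modelled with one list of selections per page of the pattern. The class below is a hypothetical sketch (the name `SelectionStore` and its methods are not taken from the program) that reproduces the pattern-of-3 example.

```python
from typing import Dict, List, Optional, Tuple

Selection = Tuple[str, Tuple[int, int, int, int]]  # (field name, box)


class SelectionStore:
    """Keeps the selections of each page of the pattern in its own list."""

    def __init__(self, pattern: int) -> None:
        self.pattern = pattern
        self.current = 0  # 0-based position inside the pattern
        self._by_page: Dict[int, List[Selection]] = {i: [] for i in range(pattern)}

    def show_next(self) -> int:
        """Move to the next page of the pattern, wrapping around at the end."""
        self.current = (self.current + 1) % self.pattern
        return self.current

    def add(self, name: str, box: Tuple[int, int, int, int]) -> None:
        self._by_page[self.current].append((name, box))

    def undo(self) -> Optional[Selection]:
        """Remove the last selection made on the page currently shown, if any."""
        page = self._by_page[self.current]
        return page.pop() if page else None

    def visible(self) -> List[Selection]:
        """Selections that should be drawn on the page currently shown."""
        return list(self._by_page[self.current])


store = SelectionStore(pattern=3)
store.add("name", (10, 10, 100, 30))    # page 1 of the pattern
store.add("phone", (10, 40, 100, 60))   # page 1 of the pattern
store.show_next()                       # now on page 2
store.add("email", (10, 10, 120, 30))
store.show_next()                       # page 3
store.show_next()                       # back on page 1
removed = store.undo()                  # removes "phone", not "email"
```

With a pattern of 1 there is a single list, so the rectangles stay visible on every image and Undo always removes the most recent one.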

Owner
Jacobo José Guijarro Villalba