Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that does a first-pass OCR of these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly because I'm not very familiar with either library.
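For anyone who wants to experiment with the cleanup step, one common pre-OCR technique is binarization with Otsu's method, which picks a black/white cutoff from the image's own histogram. The sketch below is not taken from pdf_to_image.py. It is a dependency-free illustration that works on a flat list of 8-bit grayscale values, and the function names `otsu_threshold` and `binarize` are made up for this example. In practice cv2 offers the same idea through `cv2.threshold` with the `cv2.THRESH_OTSU` flag, and whether it helps on photos of screens is something to test rather than assume.

```python
def otsu_threshold(pixels):
    """Return the cutoff (0-255) that maximizes between-class variance.

    `pixels` is a flat iterable of 8-bit grayscale values (an assumption
    made for this sketch; a real image would be flattened first).
    Values <= cutoff form one class, values > cutoff form the other.
    """
    # Build a 256-bin histogram of intensities.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate cutoff
    sum0 = 0  # sum of intensities at or below the candidate cutoff
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A cutoff chosen this way adapts to each page, which may matter here because the exposure of the source photos varies from document to document.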

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
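As a rough illustration of that keyword pass, here is a minimal sketch that scans a folder of OCR'd text files and reports each hit with a little surrounding context. The folder layout (one .txt file per document), the function name `search_keywords`, and the example keywords are all assumptions made for this example, not part of the repo.

```python
import re
from pathlib import Path


def search_keywords(text_dir, keywords, context=40):
    """Return (filename, keyword, snippet) tuples for every match.

    Assumes `text_dir` holds one .txt file per OCR'd document and that
    `keywords` is a non-empty list of plain strings.
    """
    # One case-insensitive pattern covering all keywords, escaped so that
    # punctuation in a keyword is matched literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        # OCR output can contain odd bytes; replace them rather than crash.
        text = path.read_text(encoding="utf-8", errors="replace")
        for m in pattern.finditer(text):
            start = max(m.start() - context, 0)
            end = min(m.end() + context, len(text))
            # Collapse line breaks and runs of spaces so the snippet is one line.
            snippet = " ".join(text[start:end].split())
            hits.append((path.name, m.group(0).lower(), snippet))
    return hits
```

Calling `search_keywords("ocr_output", ["integrity", "ranking"])` (with a hypothetical `ocr_output` folder) returns a list that can be printed or written to CSV. Because OCR cruft can split or mangle words, exact matching like this will miss some hits, so treat the results as a starting point.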

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

That said, other and better options exist. For a comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald
