Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

Overview

OCR Ground Truth for Historical Commentaries

DOI License: CC BY 4.0

The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of the OCR quality on printed materials that contain a mix of Latin and polytonic Greek scripts. It consists of five 19C commentaries written in German, English, and Latin, for a total of 3,356 GT lines.

Data

GT4HistComment are contained in data/, where each sub-folder corresponds to a different publication (i.e. commentary). For each each commentary we provide the following data:

  • <commentary_id>/GT-pairs: pairs of image/text files for each GT line
  • <commentary_id>/imgs: original images on which the OCR was performed
  • <commentary_id>/<commentary_id>_olr.tsv: OLR annotations with image region coordinates and layout type ground truth label

The OCR output produced by the Kraken + Ciaconna pipeline was manually corrected by a pool of annotators using the Lace platform. In order to ensure the quality of the ground truth datasets, an additional verification of all transcriptions made in Lace was carried out by an annotator on line-by-line pairs of image and corresponding text.

Commentary overview

ID Commentator Year Languages Image source Line example
bsb10234118 Lobeck [1] 1835 Greek, Latin BSB
sophokle1v3soph Schneidewin [2] 1853 Greek, German Internet Archive
cu31924087948174 Campbell [3] 1881 Greek, English Internet Archive
sophoclesplaysa05campgoog Jebb [4] 1896 Greek, English Internet Archive
Wecklein1894 Wecklein [5] 1894 [5] Greek. German internal

Stats

Line, word and char counts for each commentary are indicated in the following table. Detailled counts for each region can be found here.

ID Commentator Type lines words all chars greek chars
bsb10234118 Lobeck training 574 2943 16081 5344
bsb10234118 Lobeck groundtruth 202 1491 7917 2786
sophokle1v3soph Schneidewin training 583 2970 16112 3269
sophokle1v3soph Schneidewin groundtruth 382 1599 8436 2191
cu31924087948174 Campbell groundtruth 464 2987 14291 3566
sophoclesplaysa05campgoog Jebb training 561 4102 19141 5314
sophoclesplaysa05campgoog Jebb groundtruth 324 2418 10986 2805
Wecklein1894 Wecklein groundtruth 211 1912 9556 3268

Commentary editions used:

  • [1] Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann.
  • [2] Sophokles. 1853. Sophokles Erklaert von F. W. Schneidewin. Erstes Baendchen: Aias. Philoktetes. Edited by Friedrich Wilhelm Schneidewin. Leipzig: Weidmann.
  • [3] Lewis Campbell. 1881. Sophocles. Oxford : Clarendon Press.
  • [4] Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer.
  • [5] Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press.

Citation

If you use this dataset in your research, please cite the following publication:

@inproceedings{romanello_optical_2021,
  title = {Optical {{Character Recognition}} of 19th {{Century Classical Commentaries}}: The {{Current State}} of {{Affairs}}},
  booktitle = {The 6th {{International Workshop}} on {{Historical Document Imaging}} and {{Processing}} ({{HIP}} '21)},
  author = {Romanello, Matteo and Sven, Najem-Meyer and Robertson, Bruce},
  year = {2021},
  publisher = {{Association for Computing Machinery}},
  address = {{Lausanne}},
  doi = {10.1145/3476887.3476911}
}

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL), Bruce Robertson (Mount Allison University).

You might also like...
Official Repo for Ground-aware Monocular 3D Object Detection for Autonomous Driving

Visual 3D Detection Package: This repo aims to provide flexible and reproducible visual 3D detection on KITTI dataset. We expect scripts starting from

[WACV 2020] Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints

Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints Official implementation for Reducing Footskate in Human Motion Recon

PointCloud Annotation Tools, support to label object bound box, ground, lane and kerb
PointCloud Annotation Tools, support to label object bound box, ground, lane and kerb

PointCloud Annotation Tools, support to label object bound box, ground, lane and kerb

GndNet: Fast ground plane estimation and point cloud segmentation for autonomous vehicles using deep neural networks.
GndNet: Fast ground plane estimation and point cloud segmentation for autonomous vehicles using deep neural networks.

GndNet: Fast Ground plane Estimation and Point Cloud Segmentation for Autonomous Vehicles. Authors: Anshul Paigwar, Ozgur Erkent, David Sierra Gonzale

Autonomous Ground Vehicle Navigation and Control Simulation Examples in Python

Autonomous Ground Vehicle Navigation and Control Simulation Examples in Python THIS PROJECT IS CURRENTLY A WORK IN PROGRESS AND THUS THIS REPOSITORY I

Using LSTM to detect spoofing attacks in an Air-Ground network
Using LSTM to detect spoofing attacks in an Air-Ground network

Using LSTM to detect spoofing attacks in an Air-Ground network Specifications IDE: Spider Packages: Tensorflow 2.1.0 Keras NumPy Scikit-learn Matplotl

ObjectDrawer-ToolBox: a graphical image annotation tool to generate ground plane masks for a 3D object reconstruction system
ObjectDrawer-ToolBox: a graphical image annotation tool to generate ground plane masks for a 3D object reconstruction system

ObjectDrawer-ToolBox is a graphical image annotation tool to generate ground plane masks for a 3D object reconstruction system, Object Drawer.

Implementation of
Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch

PyGAS: Auto-Scaling GNNs in PyG PyGAS is the practical realization of our G NN A uto S cale (GAS) framework, which scales arbitrary message-passing GN

A two-stage U-Net for high-fidelity denoising of historical recordings
A two-stage U-Net for high-fidelity denoising of historical recordings

A two-stage U-Net for high-fidelity denoising of historical recordings Official repository of the paper (not submitted yet): E. Moliner and V. Välimäk

Comments
  • adds line-, word- and char-counts to README.md

    adds line-, word- and char-counts to README.md

    Adds a table to README.md as suggested by reviewer 1. The table also link to a more complete table, itself a public version of spreadsheet OCR evaluation and stats!detailed_counts. Note that the publishable version is an external reference to our private version, meaning that actualising the latter will also update the former.

    opened by sven-nm 0
  • Pages à exclure - OCR

    Pages à exclure - OCR

    La page contient les schémas métriques des passages. De ce fait l'OCR ne les reconnaît pas, de plus la correction de l'OCR n'a pas été achevée.

    Voici les pages à exclure : sophoclesplaysa05campgoog_0072.png (Jebb, p. 72)

    opened by camaya28 0
Releases(v1.0)
Owner
Ajax Multi-Commentary
How does a classical hero die in the digital age? Using Sophocles’ Ajax to create a commentary on commentaries.
Ajax Multi-Commentary
Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021 [WIP] The code for CVPR 2021 paper 'Disentangled Cycle Consistency for H

ChongjianGE 94 Dec 11, 2022
Explore extreme compression for pre-trained language models

Code for paper "Exploring extreme parameter compression for pre-trained language models ICLR2022"

twinkle 16 Nov 14, 2022
Prototype-based Incremental Few-Shot Semantic Segmentation

Prototype-based Incremental Few-Shot Semantic Segmentation Fabio Cermelli, Massimiliano Mancini, Yongqin Xian, Zeynep Akata, Barbara Caputo -- BMVC 20

Fabio Cermelli 21 Dec 29, 2022
A Domain-Agnostic Benchmark for Self-Supervised Learning

DABS: A Domain Agnostic Benchmark for Self-Supervised Learning This repository contains the code for DABS, a benchmark for domain-agnostic self-superv

Alex Tamkin 81 Dec 09, 2022
This is an official implementation of our CVPR 2021 paper "Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression" (https://arxiv.org/abs/2104.02300)

Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression Introduction In this paper, we are interested in the bottom-up paradigm of estima

HRNet 367 Dec 27, 2022
A simple and useful implementation of LPIPS.

lpips-pytorch Description Developing perceptual distance metrics is a major topic in recent image processing problems. LPIPS[1] is a state-of-the-art

So Uchida 121 Dec 24, 2022
The pytorch implementation of SOKD (BMVC2021).

Semi-Online Knowledge Distillation Implementations of SOKD. Requirements This repo was tested with Python 3.8, PyTorch 1.5.1, torchvision 0.6.1, CUDA

4 Dec 19, 2021
A Learning-based Camera Calibration Toolbox

Learning-based Camera Calibration A Learning-based Camera Calibration Toolbox Paper The pdf file can be found here. @misc{zhang2022learningbased,

Eason 14 Dec 21, 2022
Implementations of LSTM: A Search Space Odyssey variants and their training results on the PTB dataset.

An LSTM Odyssey Code for training variants of "LSTM: A Search Space Odyssey" on Fomoro. Check out the blog post. Training Install TensorFlow. Clone th

Fomoro AI 95 Apr 13, 2022
TensorFlow-based implementation of "ICNet for Real-Time Semantic Segmentation on High-Resolution Images".

ICNet_tensorflow This repo provides a TensorFlow-based implementation of paper "ICNet for Real-Time Semantic Segmentation on High-Resolution Images,"

HsuanKung Yang 406 Nov 27, 2022
Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

DocFormer - PyTorch Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for t

171 Jan 06, 2023
[CVPR 2022] Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels

Using Unreliable Pseudo Labels Official PyTorch implementation of Semi-Supervised Semantic Segmentation Using Unreliable Pseudo Labels, CVPR 2022. Ple

Haochen Wang 268 Dec 24, 2022
Second Order Optimization and Curvature Estimation with K-FAC in JAX.

KFAC-JAX - Second Order Optimization with Approximate Curvature in JAX Installation | Quickstart | Documentation | Examples | Citing KFAC-JAX KFAC-JAX

DeepMind 90 Dec 22, 2022
Implementation of "With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, BMVC, 2021" in PyTorch

Multimodal Temporal Context Network (MTCN) This repository implements the model proposed in the paper: Evangelos Kazakos, Jaesung Huh, Arsha Nagrani,

Evangelos Kazakos 13 Nov 24, 2022
Implementation of TransGanFormer, an all-attention GAN that combines the finding from the recent GanFormer and TransGan paper

TransGanFormer (wip) Implementation of TransGanFormer, an all-attention GAN that combines the finding from the recent GansFormer and TransGan paper. I

Phil Wang 146 Dec 06, 2022
Cluttered MNIST Dataset

Cluttered MNIST Dataset A setup script will download MNIST and produce mnist/*.t7 files: luajit download_mnist.lua Example usage: local mnist_clutter

DeepMind 50 Jul 12, 2022
Learn the Deep Learning for Computer Vision in three steps: theory from base to SotA, code in PyTorch, and space-repetition with Anki

DeepCourse: Deep Learning for Computer Vision arthurdouillard.com/deepcourse/ This is a course I'm giving to the French engineering school EPITA each

Arthur Douillard 113 Nov 29, 2022
Single cell current best practices tutorial case study for the paper:Luecken and Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial"

Scripts for "Current best-practices in single-cell RNA-seq: a tutorial" This repository is complementary to the publication: M.D. Luecken, F.J. Theis,

Theis Lab 968 Dec 28, 2022
Face Recognition Attendance Project

Face-Recognition-Attendance-Project In This Project You will learn how to mark attendance using face recognition, Hello Guys This is Gautam Kumar, Thi

Gautam Kumar 1 Dec 03, 2022
Jupyter notebooks showing best practices for using cx_Oracle, the Python DB API for Oracle Database

Python cx_Oracle Notebooks, 2022 The repository contains Jupyter notebooks showing best practices for using cx_Oracle, the Python DB API for Oracle Da

Christopher Jones 13 Dec 15, 2022