Arxiv harvester - Poor man's simple harvester for arXiv resources

Overview

Poor man's simple harvester for arXiv resources

This modest Python script takes advantage of arXiv resources hosted by Kaggle to harvest arXiv metadata and PDF, without using the AWS requester paid buckets.

The harvester performs the following tasks:

  • parse the full JSON arXiv metadata file available at Kaggle

  • parallel download PDF located at the public access bucket gs://arxiv-dataset and store them (also in parallel) on a cloud storage, AWS S3 and Swift OpenStack supported, or on the local file system

  • store the metadata of the uploaded article along with the PDF in JSON format

To save storage space, only the most recent available version of the PDF for an article is harvested, not every available versions.

Resuming interrupted and incremental update are automatically supported.

In case an article is only available in postcript, it will be converted into PDF too - but it is extremely rare (and usually when it happens the conversion fails because the PostSript is corrupted...).

Install

The tool is supposed to work on a POSIX environment. External call to the following command lines are used: gzip, gunzip and ps2pdf.

First, download the full arXiv metadata JSON file available at https://www.kaggle.com/Cornell-University/arxiv (1GB compressed). It's actually a JSONL file (one JSON document per line), currently named arxiv-metadata-oai-snapshot.json.zip. You can also generate yourself this file with arxiv-public-dataset OAI harvester using the arXiv OAI-PMH service.

Get this github repo:

git clone https://github.com/kermitt2/arxiv_harvester
cd arxiv_harvester

Setup a virtual environment:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

Finally install the project in editable state:

pip3 install -e .

Usage

First check the configuration file:

  • set the parameters according to your selected storage (AWS S3, SWIFT OepnStack or local storage), see below for more details,
  • the default batch_size for parallel download/upload is 10, change it as you wish and dare,
  • by default gzip compression of files on the target storage is selected.
arXiv harvester

optional arguments:
  -h, --help           show this help message and exit
  --config CONFIG      path to the config file, default is ./config.json
  --reset              ignore previous processing states and re-init the harvesting process from
                       the beginning
  --metadata METADATA  arXiv metadata json file
  --diagnostic         produce a summary of the harvesting

For example, to harvest articles from a metadata snapshot file:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json

To reset an existing harvesting and starts the harvesting again from scratch, add the --reset argument:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json --reset

Note that with --reset, no actual stored PDF file is removed - only the harvesting process is reinitialized.

Interrupted harvesting / Incremental update

Launching the harvesting command on an interrupted harvesting will resume the harvesting automatically where it stopped.

If the arXiv metadata file has been updated to a newer version (downloaded from https://www.kaggle.com/Cornell-University/arxiv or generated with arxiv-public-dataset OAI harvester), launching the harvesting command on the updated metadata file will harvest only the new and updated articles (new most recent PDF version).

Resource file organization

The organization of harvested files permits a direct access to the PDF based on the arxiv identifier. More particularly, the Open Access link given for an arXiv resource by Unpaywall is enough to create a direct access path. It also avoids storing too many files in the same directory for performance reasons.

The stored PDF is always the most recent version. There is no need to know what is the exact latest version (an information that we don't have with the Unpaywall arXiv full text links for example). The local metadata file for the article gives the version number of the stored PDF.

For example, to get access path from the identifiers or Unpaywall OA url:

  • post-2007 arXiv identifiers (pattern arXiv:YYMM.numbervV or commonly YYMM.numbervV):

    • 1501.00001v1 -> $root/arXiv/1501/1501.00001/1501.00001.pdf (most recent version of the PDF), $root/arXiv/1501/1501.00001/1501.00001.json (arXiv metadata for the article)
    • Unpaywall link http://arxiv.org/pdf/1501.00001 -> $root/arXiv/1501/1501.00001/1501.00001.pdf, $root/arXiv/1501/1501.00001/1501.00001.json
  • pre-2007 arXiv identifiers (pattern archive.subject_call/YYMMnumber):

    • quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf (most recent version of the PDF), $root/quant-ph/0602/0602109/0602109.json (arXiv metadata for the article)

    • Unpaywall link https://arxiv.org/pdf/quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf, $root/quant-ph/0602/0602109/0602109.json

If the compression option is set to True in the configuration file config.json, all the resources have an additional .gz extension.

$root in the above examples should be adapted to the storage of choice, as configured in the configuration file config.json. For instance with AWS S3: https://bucket_name.s3.amazonaws.com/arXiv/1501/1501.00001/1501.00001.pdf (if access rights are appropriate). The same applies to a SWIFT object storage based on the container name indicated in the config file.

AWS S3 and SWIFT configuration

For a local storage, just indicate the path where to store the PDF with the parameter data_path in the configuration file config.json.

The configuration for a S3 storage uses the following parameters:

{
    "aws_access_key_id": "",
    "aws_secret_access_key": "",
    "bucket_name": "",
    "region": ""
}

If you are not using a S3 storage, remove these keys or leave these values empty.

The configuration for a SWIFT object storage uses the following parameters:

{
    "swift": {},
    "swift_container": ""
}

If you are not using a SWIFT storage, remove these keys or leave these above values empty.

The "swift" key will contain the account and authentication information, typically via Keystone, something like this:

{
    "swift": {
        "auth_version": "3",
        "auth_url": "https://auth......./v3",
        "os_username": "user-007",
        "os_password": "1234",
        "os_user_domain_name": "Default",
        "os_project_domain_name": "Default",
        "os_project_name": "myProjectName",
        "os_project_id": "myProjectID",
        "os_region_name": "NorthPole",
        "os_auth_url": "https://auth......./v3"
    },
    "swift_container": "my_arxiv_harvesting"
}

Limitations

Source files (LaTeX sources) are not available via the Kaggle dataset and thus via this modest harvester. The LaTeX source files are available via AWS S3 Bulk Source File Access.

There are 44 articles only available in HTML format. These articles will not be harvested.

Acknowledgements

Kaggle arXiv dataset relies on arxiv-public-datasets:

Clement, C. B., Bierbaum, M., O'Keeffe, K. P., & Alemi, A. A. (2019). On the Use of ArXiv as a Dataset. arXiv preprint arXiv:1905.00075.

License and contact

This modest tool is distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

If you contribute to this Open Source project, you agree to share your contribution following this license.

Kaggle dataset arXiv Metadata is distributed under CC0 1.0 license. Note that most articles on arXiv are submitted with the default arXiv license, which does usually not allow redistribution. See here about the possible usage of the harvested PDF.

Main author and contact: Patrice Lopez ([email protected])

Owner
Patrice Lopez
Patrice Lopez
Code for the paper "Graph Attention Tracking". (CVPR2021)

SiamGAT 1. Environment setup This code has been tested on Ubuntu 16.04, Python 3.5, Pytorch 1.2.0, CUDA 9.0. Please install related libraries before r

122 Dec 24, 2022
RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering

RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering Authors: Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou and

Salesforce 72 Dec 05, 2022
Keras community contributions

keras-contrib : Keras community contributions Keras-contrib is deprecated. Use TensorFlow Addons. The future of Keras-contrib: We're migrating to tens

Keras 1.6k Dec 21, 2022
The official codes for the ICCV2021 presentation "Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting"

UEPNet (ICCV2021 Poster Presentation) This repository contains codes for the official implementation in PyTorch of UEPNet as described in Uniformity i

Tencent YouTu Research 15 Dec 14, 2022
Mixed Neural Likelihood Estimation for models of decision-making

Mixed neural likelihood estimation for models of decision-making Mixed neural likelihood estimation (MNLE) enables Bayesian parameter inference for mo

mackelab 9 Dec 22, 2022
This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

TransFG: A Transformer Architecture for Fine-grained Recognition Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-gra

Ju He 307 Jan 03, 2023
DSL for matching Python ASTs

py-ast-rule-engine This library provides a DSL (domain-specific language) to match a pattern inside a Python AST (abstract syntax tree). The library i

1 Dec 18, 2021
Learning Features with Parameter-Free Layers (ICLR 2022)

Learning Features with Parameter-Free Layers (ICLR 2022) Dongyoon Han, YoungJoon Yoo, Beomyoung Kim, Byeongho Heo | Paper NAVER AI Lab, NAVER CLOVA Up

NAVER AI 65 Dec 07, 2022
Message Passing on Cell Complexes

CW Networks This repository contains the code used for the papers Weisfeiler and Lehman Go Cellular: CW Networks (Under review) and Weisfeiler and Leh

Twitter Research 108 Jan 05, 2023
Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

OCR Ground Truth for Historical Commentaries The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public dom

Ajax Multi-Commentary 3 Sep 08, 2022
A Strong Baseline for Image Semantic Segmentation

A Strong Baseline for Image Semantic Segmentation Introduction This project is an open source semantic segmentation toolbox based on PyTorch. It is ba

Clark He 49 Sep 20, 2022
Check out the StyleGAN repo and place it in the same directory hierarchy as the present repo

Variational Model Inversion Attacks Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, Alireza Makhzani Most commands are in run_scripts. W

Jackson Wang 15 Dec 26, 2022
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset This repository contains links to data and code to fetch and reproduce

Daniel Varab 19 Dec 16, 2022
Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works

GDAP Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works Environment Python (verified: v3.8) CUDA

45 Oct 29, 2022
Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL)

Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL) This repository contains all source code used to generate the results in the article "

Charlotte Loh 3 Jul 23, 2022
PyDEns is a framework for solving Ordinary and Partial Differential Equations (ODEs & PDEs) using neural networks

PyDEns PyDEns is a framework for solving Ordinary and Partial Differential Equations (ODEs & PDEs) using neural networks. With PyDEns one can solve PD

Data Analysis Center 220 Dec 26, 2022
A nutritional label for food for thought.

Lexiscore As a first effort in tackling the theme of information overload in content consumption, I've been working on the lexiscore: a nutritional la

Paul Bricman 34 Nov 08, 2022
Official implementation of our CVPR2021 paper "OTA: Optimal Transport Assignment for Object Detection" in Pytorch.

OTA: Optimal Transport Assignment for Object Detection This project provides an implementation for our CVPR2021 paper "OTA: Optimal Transport Assignme

217 Jan 03, 2023
Software & Hardware to do multi color printing with Sharpies

3D Print Colorizer is a combination of 3D printed parts and a Cura plugin which allows anyone with an Ender 3 like 3D printer to produce multi colored

343 Jan 06, 2023
Implementation of Kaneko et al.'s MaskCycleGAN-VC model for non-parallel voice conversion.

MaskCycleGAN-VC Unofficial PyTorch implementation of Kaneko et al.'s MaskCycleGAN-VC (2021) for non-parallel voice conversion. MaskCycleGAN-VC is the

86 Dec 25, 2022