Dataset and Source code of paper 'Enhancing Keyphrase Extraction from Academic Articles with their Reference Information'.

Overview

Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and code for paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

The research content of this project is to analyze the impact of the introduction of reference title in scientific literature on the effect of keyword extraction. This project uses three datasets: SemEval-2010, PubMed and LIS-2000, which are located in the dataset folder. At the same time, we use two unsupervised methods: TF-IDF and TextRank, and three supervised learning methods: NaiveBayes, CRF and BiLSTM-CRF. The first four are traditional keywords extraction methods, located in the folder ML, and the last one is deep learning method, located in the folder DL.

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run deep learning model
│  ml.bat:                            Batch commands to run traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of CRF algorithm implementation(Use pytorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation.
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation.
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of CRF algorithm implementation(Use CRF++ Toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of naivebayes algorithm implementation(Use KEA-3.0 Toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit

Dataset Description

The dataset includes the following three json files:

  • SemEval-2010: SemEval-2010 Task 5 dataset, it contains 244 scientific papers and can be visited at: https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

    Each line of the json file includes:

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): references list and only the title of each reference is provided.
  • keywords (K): the keywords of the paper and these keywords were annotated manually.

    Quick Start

    In order to facilitate the reproduction of the experimental results, the project uses bat batch command to run the program uniformly (only in Windows Environment). The dl.bat file is the batch command to run the deep learning model, and the ml.bat file is the batch command to run the traditional algorithm.

    How does it work?

    In the Windows environment, use the key combination Win + R and enter cmd to open the DOS command box, and switch to the project's root directory (Keyphrase_Extraction). Then input dl.bat, that is, run deep learning model to get the result of keyword extraction; Enter ml.bat to run traditional algorithm to get keywords Extract the results.

    Experimental results

    The following figures show that the influence of reference information on keyphrase extraction results of TF*IDF, TextRank, NB, CRF and BiLSTM-CRF.

    Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of SemEval-2010 Table1

    Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of PubMed Table2

    Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of LIS-2000 Table3

    Note: The yellow, green and blue bold fonts in the table represent the largest of the P, R and F1 value obtained from different corpora using the same model, respectively.

    Dependency packages

    Before running this project, check that the following Python packages are included in your runtime environment.

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2

    Citation

    Please cite the following paper if you use this codes and dataset in your work.

    Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]

  • Owner
    Professor at iSchool of Nanjing University of Science and Technology
    Learnable Boundary Guided Adversarial Training (ICCV2021)

    Learnable Boundary Guided Adversarial Training This repository contains the implementation code for the ICCV2021 paper: Learnable Boundary Guided Adve

    DV Lab 27 Sep 25, 2022
    Code repository for the paper: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild (ICCV 2021)

    Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild Akash Sengupta, Ignas Budvytis, Robert

    Akash Sengupta 149 Dec 14, 2022
    Reproducible research and reusable acyclic workflows in Python. Execute code on HPC systems as if you executed them on your personal computer!

    Reproducible research and reusable acyclic workflows in Python. Execute code on HPC systems as if you executed them on your machine! Motivation Would

    Joeri Hermans 15 Sep 11, 2022
    TransPrompt - Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification

    TransPrompt This code is implement for our EMNLP 2021's paper 《TransPrompt:Towards an Automatic Transferable Prompting Framework for Few-shot Text Cla

    WangJianing 23 Dec 21, 2022
    Supplemental learning materials for "Fourier Feature Networks and Neural Volume Rendering"

    Fourier Feature Networks and Neural Volume Rendering This repository is a companion to a lecture given at the University of Cambridge Engineering Depa

    Matthew A Johnson 133 Dec 26, 2022
    Scaling and Benchmarking Self-Supervised Visual Representation Learning

    FAIR Self-Supervision Benchmark is deprecated. Please see VISSL, a ground-up rewrite of benchmark in PyTorch. FAIR Self-Supervision Benchmark This cod

    Meta Research 584 Dec 31, 2022
    Python Environment for Bayesian Learning

    Pebl is a python library and command line application for learning the structure of a Bayesian network given prior knowledge and observations. Pebl in

    Abhik Shah 103 Jul 14, 2022
    NR-GAN: Noise Robust Generative Adversarial Networks

    Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter Code and checkpoints for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling

    Takuhiro Kaneko 59 Dec 11, 2022
    Implementation of our paper "Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning".

    PRP Introduction This is the implementation of our paper "Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning".

    yuanyao366 39 Dec 29, 2022
    YolactEdge: Real-time Instance Segmentation on the Edge

    YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7

    Haotian Liu 1.1k Jan 06, 2023
    Semantic Edge Detection with Diverse Deep Supervision

    Semantic Edge Detection with Diverse Deep Supervision This repository contains the code for our IJCV paper: "Semantic Edge Detection with Diverse Deep

    Yun Liu 12 Dec 31, 2022
    Code for NeurIPS2021 submission "A Surrogate Objective Framework for Prediction+Programming with Soft Constraints"

    This repository is the code for NeurIPS 2021 submission "A Surrogate Objective Framework for Prediction+Programming with Soft Constraints". Edit 2021/

    10 Dec 20, 2022
    Implementation of algorithms for continuous control (DDPG and NAF).

    DEPRECATION This repository is deprecated and is no longer maintaned. Please see a more recent implementation of RL for continuous control at jax-sac.

    Ilya Kostrikov 288 Dec 31, 2022
    A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

    A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

    195 Dec 07, 2022
    AutoVideo: An Automated Video Action Recognition System

    AutoVideo is a system for automated video analysis. It is developed based on D3M infrastructure, which describes machine learning with generic pipeline languages. Currently, it focuses on video actio

    Data Analytics Lab at Texas A&M University 267 Dec 17, 2022
    TEDSummary is a speech summary corpus. It includes TED talks subtitle (Document), Title-Detail (Summary), speaker name (Meta info), MP4 URL, and utterance id

    TEDSummary is a speech summary corpus. It includes TED talks subtitle (Document), Title-Detail (Summary), speaker name (Meta info), MP4 URL

    3 Dec 26, 2022
    Lane follower: Lane-detector (OpenCV) + Object-detector (YOLO5) + CAN-bus

    Lane Follower This code is for the lane follower, including perception and control, as shown below. Environment Hardware Industrial Camera Intel-NUC(1

    Siqi Fan 3 Jul 07, 2022
    Instance-based label smoothing for improving deep neural networks generalization and calibration

    Instance-based Label Smoothing for Neural Networks Pytorch Implementation of the algorithm. This repository includes a new proposed method for instanc

    Mohamed Maher 1 Aug 13, 2022
    blind SQLIpy sebuah alat injeksi sql yang menggunakan waktu sql untuk mendapatkan sebuah server database.

    blind SQLIpy Alat blind SQLIpy ini merupakan alat injeksi sql yang menggunakan metode time based blind sql injection metode tersebut membutuhkan waktu

    Galih Anggoro Prasetya 4 Feb 24, 2022
    Semi-supervised Adversarial Learning to Generate Photorealistic Face Images of New Identities from 3D Morphable Model

    Semi-supervised Adversarial Learning to Generate Photorealistic Face Images of New Identities from 3D Morphable Model Baris Gecer 1, Binod Bhattarai 1

    Baris Gecer 190 Dec 29, 2022