Python package for machine learning for healthcare using a OMOP common data model

Overview

omop-learn

What is omop-learn?

This library was developed in order to facilitate rapid prototyping in Python of predictive machine-learning models using longitudinal medical data from an OMOP CDM-standard database. omop-learn supports the easy definition of predictive clinical tasks, featurizations of OMOP data, and cohorts of relevance. We further provide methods using sparse tensor implementations to rapidly manipulate the collected features in the rawest form possible, allowing for dynamic transformations of the data.

Two machine-learning models are included with the library. First, a windowed linear model, which uses various backwards-facing windows to aggregate features over different timescales, then feeds these features into a regularized logistic regression model. This model was inspired by the work of Razavian et. al. '15, and despite its simplicity is often competitive with state-of-the-art algorithms. We also include SARD (Self-Attention with Reverse Distillation), a novel deep-learning algorithm that uses self-attention to allow medical events to contextualize themselves using other events in a patient's timeline. SARD also makes use of reverse distillation, a training technique we introduce that effectively initializes a deep model using a high-performing linear proxy, in this case the windowed linear model described above -- for the details of this method and the SARD architecture, please see our paper Kodialam et al. AAAI '21.

Documentation

For a more detailed summary of omop-learn's data collection pipeline, and for documentation of functions, please see the full documentation for this repo, which also describes the process of creating one's own cohorts, predictive tasks, and features.

Dependencies

The following libraries are necessary to run omop-learn:

  • numpy
  • sqlalchemy
  • pandas
  • torch
  • sklearn
  • matplotlib
  • ipywidgets
  • IPython.display
  • gensim.models
  • scipy.sparse
  • sparse

Note that sparse is the PyData Sparse library, documented here

Running omop-learn

We provide several example notebooks, which all use an example task of predicting mortality over a six-month window for patients over the age of 70.

  • End of Life Linear Model Example.ipynb and End of Life Deep Model Example.ipynb run the windowed linear and deep SARD models respectively -- note that your machine must be able to access a GPU in order to run the deep models.
  • End of Life Linear Model Example (With Nontemporal Features).ipynb demonstrates how to add nontemporal features.
  • End of Life Linear Model Ancestors Example.ipynb demonstrates how to add feature ancestors.
  • End of Life Linear Model Example More Prediction Times.ipynb uses a larger dataset with predictions from any date within a time range.

To run the models, first set up the file config.py with connection information for your Postgres server containing an OMOP CDM database. Then, simply run through the cells of the notebook in order. Further documentation of the exact steps taken to define a task, collect data, and run a predictive model are embedded within the notebooks.

Contributors and Acknowledgements

Omop-learn was written by Rohan Kodialam and Jake Marcus, with additional contributions by Rebecca Boiarsky, Ike Lage, and Shannon Hwang.

This package was developed as part of a collaboration with Independence Blue Cross and would not have been possible without the advice and support of Aaron Smith-McLallen, Ravi Chawla, Kyle Armstrong, Luogang Wei, and Jim Denyer.

Owner
Sontag Lab
Machine learning algorithms and applications to health care.
Sontag Lab
Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

The Learning Machines 1 Jan 16, 2022
This jupyter notebook project was completed by me and my friend using the dataset from Kaggle

ARM This jupyter notebook project was completed by me and my friend using the dataset from Kaggle. The world Happiness 2017, which ranks 155 countries

1 Jan 23, 2022
Price forecasting of SGB and IRFC Bonds and comparing there returns

Project_Bonds Project Title : Price forecasting of SGB and IRFC Bonds and comparing there returns. Introduction of the Project The 2008-09 global fina

Tishya S 1 Oct 28, 2021
Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

implementation of machine learning Algorithms such as decision tree and random forest and xgboost on darasets then compare results for each and implement ant colony and genetic algorithms on tsp map,

Mohamadreza Rezaei 1 Jan 19, 2022
A machine learning model for Covid case prediction

CovidcasePrediction A machine learning model for Covid case prediction Problem Statement Using regression algorithms we can able to track the active c

VijayAadhithya2019rit 1 Feb 02, 2022
Pyomo is an object-oriented algebraic modeling language in Python for structured optimization problems.

Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating and analyzing optimization models. Pyomo can be used to define symbolic p

Pyomo 1.4k Dec 28, 2022
Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

Call of Duty World League: Search & Destroy Outcome Predictions Growing up as an avid Call of Duty player, I was always curious about what factors led

Brett Vogelsang 2 Jan 18, 2022
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 07, 2022
Skforecast is a python library that eases using scikit-learn regressors as multi-step forecasters

Skforecast is a python library that eases using scikit-learn regressors as multi-step forecasters. It also works with any regressor compatible with the scikit-learn API (pipelines, CatBoost, LightGBM

JoaquĆ­n Amat Rodrigo 297 Jan 09, 2023
cleanlab is the data-centric ML ops package for machine learning with noisy labels.

cleanlab is the data-centric ML ops package for machine learning with noisy labels. cleanlab cleans labels and supports finding, quantifying, and lear

Cleanlab 51 Nov 28, 2022
Uses WiFi signals :signal_strength: and machine learning to predict where you are

Uses WiFi signals and machine learning (sklearn's RandomForest) to predict where you are. Even works for small distances like 2-10 meters.

Pascal van Kooten 5k Jan 09, 2023
Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Regularized Greedy Forest Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better r

RGF-team 363 Dec 14, 2022
Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis.

Kats, a kit to analyze time series data, a lightweight, easy-to-use, generalizable, and extendable framework to perform time series analysis, from understanding the key statistics and characteristics

Facebook Research 4.1k Dec 29, 2022
Software Engineer Salary Prediction

Based on 2021 stack overflow data, this machine learning web application helps one predict the salary based on years of experience, level of education and the country they work in.

Jhanvi Mimani 1 Jan 08, 2022
Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational model)

Sum-Square_Error-Business-Analytical-Tool- Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational m

om Podey 1 Dec 03, 2021
onelearn: Online learning in Python

onelearn: Online learning in Python Documentation | Reproduce experiments | onelearn stands for ONE-shot LEARNning. It is a small python package for o

15 Nov 06, 2022
Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

Mert Sezer Ardal 1 Jan 31, 2022
This is the code repository for Interpretable Machine Learning with Python, published by Packt.

Interpretable Machine Learning with Python, published by Packt

Packt 299 Jan 02, 2023
MaD GUI is a basis for graphical annotation and computational analysis of time series data.

MaD GUI Machine Learning and Data Analytics Graphical User Interface MaD GUI is a basis for graphical annotation and computational analysis of time se

Machine Learning and Data Analytics Lab FAU 10 Dec 19, 2022
Sleep stages are classified with the help of ML. We have used 4 different ML algorithms (SVM, KNN, RF, NN) to demonstrate them

Sleep stages are classified with the help of ML. We have used 4 different ML algorithms (SVM, KNN, RF, NN) to demonstrate them.

Anirudh Edpuganti 3 Apr 03, 2022