Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models

Overview

Highly interpretable, sklearn-compatible classifier based on decision rules

This is a scikit-learn compatible wrapper for the Bayesian Rule List classifier developed by Letham et al., 2015 (see Letham's original code), extended by a minimum description length-based discretizer (Fayyad & Irani, 1993) for continuous data, and by an approach to subsample large datasets for better performance.

It produces rule lists, which makes trained classifiers easily interpretable to human experts, and is competitive with state of the art classifiers such as random forests or SVMs.

For example, an easily understood Rule List model of the well-known Titanic dataset:

IF male AND adult THEN survival probability: 21% (19% - 23%)
ELSE IF 3rd class THEN survival probability: 44% (38% - 51%)
ELSE IF 1st class THEN survival probability: 96% (92% - 99%)
ELSE survival probability: 88% (82% - 94%)

Letham et al.'s approach only works on discrete data. However, this approach can still be used on continuous data after discretization. The RuleListClassifier class also includes a discretizer that can deal with continuous data (using Fayyad & Irani's minimum description length principle criterion, based on an implementation by navicto).

The inference procedure is slow on large datasets. If you have more than a few thousand data points, and only numeric data, try the included BigDataRuleListClassifier(training_subset=0.1), which first determines a small subset of the training data that is most critical in defining a decision boundary (the data points that are hardest to classify) and learns a rule list only on this subset (you can specify which estimator to use for judging which subset is hardest to classify by passing any sklearn-compatible estimator in the subset_estimator parameter - see examples/diabetes_bigdata_demo.py).

Usage

The project requires pyFIM, scikit-learn, and pandas to run.

The included RuleListClassifier works as a scikit-learn estimator, with a model.fit(X,y) method which takes training data X (numpy array or pandas DataFrame; continuous, categorical or mixed data) and labels y.

The learned rules of a trained model can be displayed simply by casting the object as a string, e.g. print model, or by using the model.tostring(decimals=1) method and optionally specifying the rounding precision.

Numerical data in X is automatically discretized. To prevent discretization (e.g. to protect columns containing categorical data represented as integers), pass the list of protected column names in the fit method, e.g. model.fit(X,y,undiscretized_features=['CAT_COLUMN_NAME']) (entries in undiscretized columns will be converted to strings and used as categorical values - see examples/hepatitis_mixeddata_demo.py).

Usage example:

from RuleListClassifier import *
from sklearn.datasets.mldata import fetch_mldata
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

feature_labels = ["#Pregnant","Glucose concentration test","Blood pressure(mmHg)","Triceps skin fold thickness(mm)","2-Hour serum insulin (mu U/ml)","Body mass index","Diabetes pedigree function","Age (years)"]
    
data = fetch_mldata("diabetes") # get dataset
y = (data.target+1)/2 # target labels (0 or 1)
Xtrain, Xtest, ytrain, ytest = train_test_split(data.data, y) # split

# train classifier (allow more iterations for better accuracy; use BigDataRuleListClassifier for large datasets)
model = RuleListClassifier(max_iter=10000, class1label="diabetes", verbose=False)
model.fit(Xtrain, ytrain, feature_labels=feature_labels)

print "RuleListClassifier Accuracy:", model.score(Xtest, ytest), "Learned interpretable model:\n", model
print "RandomForestClassifier Accuracy:", RandomForestClassifier().fit(Xtrain, ytrain).score(Xtest, ytest)
"""
**Output:**
RuleListClassifier Accuracy: 0.776041666667 Learned interpretable model:
Trained RuleListClassifier for detecting diabetes
==================================================
IF Glucose concentration test : 157.5_to_inf THEN probability of diabetes: 81.1% (72.5%-72.5%)
ELSE IF Body mass index : -inf_to_26.3499995 THEN probability of diabetes: 5.2% (1.9%-1.9%)
ELSE IF Glucose concentration test : -inf_to_103.5 THEN probability of diabetes: 14.4% (8.8%-8.8%)
ELSE IF Age (years) : 27.5_to_inf THEN probability of diabetes: 59.6% (51.8%-51.8%)
ELSE IF Glucose concentration test : 103.5_to_127.5 THEN probability of diabetes: 15.9% (8.0%-8.0%)
ELSE probability of diabetes: 44.7% (29.5%-29.5%)
=================================================

RandomForestClassifier Accuracy: 0.729166666667
"""
Owner
Tamas Madl
Tamas Madl
Data Efficient Decision Making

Data Efficient Decision Making

Microsoft 197 Jan 06, 2023
Responsible AI Workshop: a series of tutorials & walkthroughs to illustrate how put responsible AI into practice

Responsible AI Workshop Responsible innovation is top of mind. As such, the tech industry as well as a growing number of organizations of all kinds in

Microsoft 9 Sep 14, 2022
List of Data Science Cheatsheets to rule the world

Data Science Cheatsheets List of Data Science Cheatsheets to rule the world. Table of Contents Business Science Business Science Problem Framework Dat

Favio André Vázquez 11.7k Dec 30, 2022
Falken provides developers with a service that allows them to train AI that can play their games

Falken provides developers with a service that allows them to train AI that can play their games. Unlike traditional RL frameworks that learn through rewards or batches of offline training, Falken is

Google Research 223 Jan 03, 2023
Tribuo - A Java machine learning library

Tribuo - A Java prediction library (v4.1) Tribuo is a machine learning library in Java that provides multi-class classification, regression, clusterin

Oracle 1.1k Dec 28, 2022
Python bindings for MPI

MPI for Python Overview Welcome to MPI for Python. This package provides Python bindings for the Message Passing Interface (MPI) standard. It is imple

MPI for Python 604 Dec 29, 2022
XManager: A framework for managing machine learning experiments 🧑‍🔬

XManager is a platform for packaging, running and keeping track of machine learning experiments. It currently enables one to launch experiments locally or on Google Cloud Platform (GCP). Interaction

DeepMind 620 Dec 27, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processi

Salesforce 2.8k Jan 05, 2023
A simple python program which predicts the success of a movie based on it's type, actor, actress and director

Movie-Success-Prediction A simple python program which predicts the success of a movie based on it's type, actor, actress and director. The program us

Mahalinga Prasad R N 1 Dec 17, 2021
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning.

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported ha

Microsoft 1.1k Jan 04, 2023
A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and the A* Search (using the Manhattan Distance Heuristic)

17 Aug 14, 2022
Book Item Based Collaborative Filtering

Book-Item-Based-Collaborative-Filtering Collaborative filtering methods are used

Şebnem 3 Jan 06, 2022
PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

Lensa 1.1k Jan 04, 2023
Python module for data science and machine learning users.

dsnk-distributions package dsnk distribution is a Python module for data science and machine learning that was created with the goal of reducing calcu

Emmanuel ASIFIWE 1 Nov 23, 2021
Fourier-Bayesian estimation of stochastic volatility models

fourier-bayesian-sv-estimation Fourier-Bayesian estimation of stochastic volatility models Code used to run the numerical examples of "Bayesian Approa

15 Jun 20, 2022
MasTrade is a trading bot in baselines3,pytorch,gym

mastrade MasTrade is a trading bot in baselines3,pytorch,gym idea we have for example 1 btc and we buy a crypto with it with market option to trade in

Masoud Azizi 18 May 24, 2022
A data preprocessing and feature engineering script for a machine learning pipeline is prepared.

FEATURE ENGINEERING Business Problem: A data preprocessing and feature engineering script for a machine learning pipeline needs to be prepared. It is

Pinar Oner 7 Dec 18, 2021
Bayesian Modeling and Computation in Python

Bayesian Modeling and Computation in Python Open access and Code This repository contains the open access version of the text and the code examples in

Bayesian Modeling and Computation in Python 339 Jan 02, 2023
A Python package for time series classification

pyts: a Python package for time series classification pyts is a Python package for time series classification. It aims to make time series classificat

Johann Faouzi 1.4k Jan 01, 2023
Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Criteo 419 Jan 01, 2023