Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Last update: May 15, 2022

Overview

Amplo - AutoML (for Machine Data)

Welcome to the Automated Machine Learning package Amplo. Amplo's AutoML is designed specifically for machine data and works very well with tabular time series data (especially unbalanced classification!).

Though this is a standalone Python package, Amplo's AutoML is also available on Amplo's Smart Maintenance Platform. With a graphical user interface and various data connectors, it is the ideal place for service engineers to get started on Predictive Maintenance.

Amplo's AutoML Pipeline covers the entire Machine Learning development cycle, including exploratory data analysis, data cleaning, feature extraction, feature selection, model selection, hyperparameter optimization, stacking, version control, production-ready models and documentation. It comes with additional tools such as interval analysers, drift detectors and data quality checks.

1. Downloading Amplo

The easiest way is to install our Python package through PyPI:

pip install Amplo

2. Usage

Fitting and predicting with Amplo's AutoML Pipeline takes only a few lines:

from Amplo import Pipeline
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression

# Classification: fit on synthetic data, then predict class probabilities
x, y = make_classification()
pipeline = Pipeline()
pipeline.fit(x, y)
yp = pipeline.predict_proba(x)

# Regression: same workflow, predicting continuous targets
x, y = make_regression()
pipeline = Pipeline()
pipeline.fit(x, y)
yp = pipeline.predict(x)

3. Amplo AutoML Features

Interval Analyser

from Amplo.AutoML import IntervalAnalyser

Interval Analyser for log file classification. When log files have to be classified and there is not enough data for time series methods (such as LSTMs, ROCKET, WEASEL or BOSS), one needs to fall back on classical machine learning models, which work better with fewer samples. This raises the question of which samples to classify. Simply classifying every sample and accumulating the results may greatly degrade classification performance. Therefore, we introduce this interval analyser. Using an approximate K-Nearest Neighbors algorithm, it estimates the strength of correlation for every sample inside a log, which allows for better interval selection for classical machine learning models.
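
The neighbour-based scoring idea can be sketched in plain NumPy. This is a minimal illustration and not Amplo's implementation: the function name `neighbour_label_agreement`, the value of `k`, and the use of an exact rather than approximate search are all assumptions. Each sample is scored by the fraction of its nearest neighbours that share its class label, so low-scoring samples are the ones that look like another class.

```python
import numpy as np


def neighbour_label_agreement(x, labels, k=5):
    """Score each sample by the share of its k nearest neighbours with the same label.

    Hypothetical helper for illustration; uses an exact search, not an approximate one.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (fine for small data sets).
    dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    # Exclude each sample from its own neighbourhood.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest other samples.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (labels[nearest] == labels[:, None]).mean(axis=1)


rng = np.random.default_rng(0)
# Two well-separated classes plus one class-1 sample placed inside class 0.
x = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2)), [[0.0, 0.0]]])
labels = np.array([0] * 20 + [1] * 20 + [1])
scores = neighbour_label_agreement(x, labels)
```

In this toy data the misplaced sample receives the lowest possible score, which is the kind of signal that could be used to exclude it from the intervals used for training.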

To use this interval analyser, make sure that your logs are located in a folder of their class, with one parent folder with all classes, e.g.:

+-- Parent Folder
|   +-- Class_1
|       +-- Log_1.*
|       +-- Log_2.*
|   +-- Class_2
|       +-- Log_3.*

Exploratory Data Analysis

from Amplo.AutoML import DataExplorer

Automated Exploratory Data Analysis. Covers binary classification and regression. It generates:

Missing Values Plot
Line Plots of all features
Box plots of all features
Co-linearity Plot
SHAP Values
Random Forest Feature Importance
Predictive Power Score

Additional plots for Regression:

Seasonality Plots
Differentiated Variance Plot
Auto Correlation Function Plot
Partial Auto Correlation Function Plot
Cross Correlation Function Plot
Scatter Plots
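
As an illustration of what an Auto Correlation Function plot displays, the quantity behind it can be computed with a few lines of NumPy. This is a generic sketch using the biased sample estimator, not the code DataExplorer runs; the function name `autocorrelation` is hypothetical.

```python
import numpy as np


def autocorrelation(series, n_lags):
    """Sample autocorrelation for lags 0..n_lags (biased estimator)."""
    series = np.asarray(series, dtype=float)
    centred = series - series.mean()
    variance = np.dot(centred, centred)
    return np.array([
        np.dot(centred[: len(centred) - lag], centred[lag:]) / variance
        for lag in range(n_lags + 1)
    ])


# A sine wave with a period of 20 samples.
t = np.arange(200)
acf = autocorrelation(np.sin(2 * np.pi * t / 20), n_lags=20)
```

For this periodic signal the autocorrelation is strongly negative at half a period (lag 10) and strongly positive again at a full period (lag 20), which is the pattern a seasonality or ACF plot makes visible.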

Data Processing

from Amplo.AutoML import DataProcesser

Automated Data Cleaning:

Infers & converts data types (integer, floats, categorical, datetime)
Reformats column names
Removes duplicate columns and rows
Handles missing values by:
- Removing columns
- Removing rows
- Interpolating
- Filling with zeros
Removes outliers using:
- Clipping
- Z-score
- Quantiles
Removes constant columns
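
Several of the cleaning steps listed above can be sketched with pandas. This is a simplified illustration under stated assumptions, not DataProcesser's actual logic: the function name `clean`, the order of the steps, the choice of interpolation for missing values, and the 5%/95% quantile bounds are all picked for the example.

```python
import numpy as np
import pandas as pd


def clean(df, lower=0.05, upper=0.95):
    """Drop duplicates and constant columns, interpolate gaps, clip outliers."""
    df = df.drop_duplicates()                    # duplicate rows
    df = df.loc[:, ~df.T.duplicated()]           # duplicate columns
    df = df.loc[:, df.nunique() > 1]             # constant columns
    df = df.interpolate(limit_direction="both")  # missing values
    # Clip every column to its own quantile range.
    return df.clip(df.quantile(lower), df.quantile(upper), axis=1)


raw = pd.DataFrame({
    "temp": [20.0, 21.0, 22.0, 23.0, 500.0, 21.0],
    "temp_copy": [20.0, 21.0, 22.0, 23.0, 500.0, 21.0],
    "unit": [1.0] * 6,
    "rpm": [100.0, 110.0, np.nan, 130.0, 140.0, 110.0],
})
cleaned = clean(raw)
```

The example frame loses its repeated last row, the copied column and the constant column; the gap in `rpm` is interpolated and the extreme `temp` reading is pulled back to the upper quantile.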

Data Sampler

from Amplo.AutoML import DataSampler

This pipeline is designed to handle unbalanced classification problems. Aside from weighted loss functions, undersampling the majority class or oversampling the minority class helps. Various algorithms are analysed:

SMOTE
Borderline SMOTE
Random Over Sampler
Tomek Links
One Sided Selection
Random Under Sampler
Edited Nearest Neighbours
SMOTE Tomek
SMOTE Edited Nearest Neighbours
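
The simplest entry in this list, random over-sampling, can be sketched in NumPy to show what balancing does to the class counts. This is an illustrative re-implementation, not the code DataSampler uses; the function name `random_over_sample` and the fixed seed are assumptions.

```python
import numpy as np


def random_over_sample(x, y, seed=0):
    """Resample every class with replacement up to the majority class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Keep all originals and draw the shortfall with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return x[keep], y[keep]


x = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
x_bal, y_bal = random_over_sample(x, y)
```

Resampling like this is normally applied to the training split only, so that duplicated minority samples do not leak into validation data.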

Feature Processing

from Amplo.AutoML import FeatureProcesser

Automatically extracts and selects features, and removes collinear features. Included feature extraction algorithms:

Multiplicative Features
Dividing Features
Additive Features
Subtractive Features
Trigonometric Features
K-Means Features
Lagged Features
Differencing Features
Inverse Features
Datetime Features
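
A few of these feature types can be illustrated with pandas. The sketch below is not FeatureProcesser's implementation: the function name `extract_features`, the column naming scheme and the choice of which features to build are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def extract_features(df, lag=1):
    """Add a few arithmetic, trigonometric, lagged and differenced columns."""
    out = df.copy()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            out[f"{a}__x__{b}"] = df[a] * df[b]      # multiplicative
            out[f"{a}__add__{b}"] = df[a] + df[b]    # additive
        out[f"sin__{a}"] = np.sin(df[a])             # trigonometric
        out[f"{a}__lag_{lag}"] = df[a].shift(lag)    # lagged
        out[f"{a}__diff"] = df[a].diff()             # differencing
    return out


df = pd.DataFrame({"pressure": [1.0, 2.0, 4.0], "flow": [3.0, 3.0, 5.0]})
features = extract_features(df)
```

Lagged and differenced columns start with a missing value because the first row has no predecessor, so a real pipeline has to drop or fill those rows before modelling.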

Included Feature Selection algorithms:

Random Forest Feature Importance (Threshold and Increment)
Predictive Power Score
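
Threshold-based selection on Random Forest importances can be sketched with scikit-learn. The interpretation used here, keeping the most important features until a fixed share of the total importance is covered, is an assumption about what "Threshold" means; `select_by_importance` and the 0.85 cut-off are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def select_by_importance(x, y, threshold=0.85, seed=0):
    """Keep the top features until `threshold` of total importance is covered."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(x, y)
    importance = forest.feature_importances_
    order = np.argsort(importance)[::-1]        # most important first
    cumulative = np.cumsum(importance[order])
    # First position at which the running total reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, threshold) + 1)
    return order[:n_keep], importance


x, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
selected, importance = select_by_importance(x, y)
```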

Sequencing

from Amplo.AutoML import Sequencer

For time series regression problems, it is often useful to include multiple previous samples instead of just the latest. This class sequences the data based on which time steps you want included in the input and output. This is also very useful when working with tensors, as a tensor can be returned that fits directly into a Recurrent Neural Network.
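
The windowing idea can be sketched in NumPy. This is not the Sequencer API: the function name `sequence` and the parameters `back` and `forward` are hypothetical, chosen to show how past steps are stacked into a 3-D array and paired with a future target.

```python
import numpy as np


def sequence(x, y, back=3, forward=1):
    """Stack `back` past steps per sample and pair them with y `forward` steps ahead."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x) - back - forward + 1
    # Shape (n, back, n_features): the layout recurrent networks usually expect.
    x_seq = np.stack([x[i:i + back] for i in range(n)])
    # Target for window i sits `forward` steps after the window's last row.
    first_target = back + forward - 1
    y_seq = y[first_target:first_target + n]
    return x_seq, y_seq


x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
x_seq, y_seq = sequence(x, y, back=3, forward=1)
```

With ten rows, three past steps and a one-step horizon, the first window holds rows 0 to 2 and is paired with the target at row 3; seven windows fit in total.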

Modelling

from Amplo.AutoML import Modeller

Runs various regression or classification models. Includes:

Scikit's Linear Model
Scikit's Random Forest
Scikit's Bagging
Scikit's GradientBoosting
Scikit's HistGradientBoosting
DMLC's XGBoost
Yandex's CatBoost
Microsoft's LightGBM
Stacking Models
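
Comparing several candidate models on the same data can be sketched with scikit-learn's cross-validation utilities. This is a generic illustration, not how Modeller is implemented; the two candidates, the 3-fold split and the accuracy metric are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

x, y = make_classification(n_samples=200, random_state=0)
candidates = {
    "linear": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
# Mean 3-fold cross-validated accuracy per candidate.
scores = {
    name: cross_val_score(model, x, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
```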

Grid Search

from Amplo.GridSearch import *

Contains three hyperparameter optimizers with extended predefined model parameters:

Grid Search
Halving Random Search
Optuna's Tree-structured Parzen Estimator
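
For readers unfamiliar with the first strategy, an exhaustive grid search looks like this with scikit-learn's `GridSearchCV`. This sketch does not use Amplo's own optimizers or its predefined parameter sets; the parameter grid here is a small hypothetical one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

x, y = make_classification(n_samples=200, random_state=0)
# Every combination is evaluated: 2 x 2 = 4 candidates, each with 3-fold CV.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(x, y)
best_params = search.best_params_
```

The cost grows multiplicatively with every added parameter, which is the usual motivation for halving or model-based searches on larger grids.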

Automatic Documentation

from Amplo.AutoML import Documenter

Contains a documenter for classification (binary and multiclass problems) as well as for regression. Creates a PDF report for a Pipeline, including metrics, data processing steps, and everything else needed to recreate the result.




Releases(v0.10.2)

v0.10.2(Jun 2, 2022)

v0.10.1(May 26, 2022)

v0.9.0(May 2, 2022)

v0.8.27(Apr 6, 2022)

v0.8.26(Apr 1, 2022)

v0.8.25(Apr 1, 2022)

v0.8.24(Mar 31, 2022)

v0.8.23(Mar 30, 2022)

v0.8.22(Mar 24, 2022)

v0.8.21(Mar 23, 2022)

v0.8.20(Mar 22, 2022)

v0.8.19(Mar 4, 2022)

v0.8.18(Mar 3, 2022)

v0.8.17(Mar 3, 2022)

v0.8.16(Mar 2, 2022)

v0.8.15(Feb 2, 2022)

v0.8.14(Jan 27, 2022)

v0.8.13(Jan 25, 2022)

v0.8.12(Jan 3, 2022)

v0.8.11(Dec 24, 2021)

v0.8.10(Dec 23, 2021)

v0.8.9(Dec 23, 2021)

v0.8.8(Dec 23, 2021)

v0.8.7(Dec 23, 2021)

v0.8.6(Dec 23, 2021)

v0.8.5(Dec 21, 2021)

v0.8.4(Dec 21, 2021)

v0.8.3(Dec 21, 2021)

v0.8.2(Dec 21, 2021)

v0.8.1(Dec 21, 2021)

Owner

Amplo

🎛 Distributed machine learning made simple.
