Python Automated Machine Learning library for tabular data.

Overview


Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Science tasks.
📚 Explore the docs »

🐞 Report Bug · 🆕 Request Feature

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About the project

Disclaimer

This library is an open-source research project and is not part of any official SAP products.

What's this?

This is a simple but accurate Automated Machine Learning library. Built on SAP HANA's powerful in-memory algorithms, it aims for high accuracy across multiple machine learning tasks. The library also applies numerous data preprocessing functions to automate routine data cleaning. In short, hana_automl goes through all AutoML steps and makes Data Science work easier.

What is SAP HANA?

From www.sap.com: SAP HANA is a high-performance in-memory database that speeds data-driven, real-time decisions and actions.

Web app

https://share.streamlit.io/dan0nchik/sap-hana-automl/main/web.py

Documentation

https://sap-hana-automl.readthedocs.io/en/latest/index.html

Benchmarks

https://github.com/dan0nchik/SAP-HANA-AutoML/blob/main/comparison_openml.ipynb

ML tasks:

  • Binary classification
  • Regression
  • Multiclass classification
  • Forecasting

Steps automated:

  • Data exploration
  • Data preparation
  • Feature engineering
  • Model selection
  • Model training
  • Hyperparameter tuning

👇 By the end of summer 2021, the blue part will be fully automated by our library.

Clients

Streamlit client

Built With

Getting Started

To get the package up and running, follow these simple steps.

Prerequisites

Make sure you have the following:

  1. Set up SAP HANA (skip this step if you have an instance with PAL enabled). There are two ways to do that.
    In HANA Cloud:

    • Create a free trial account
    • Set up an instance
    • Enable PAL (Predictive Analysis Library). This is vital because the library relies on PAL algorithms.

    In a virtual machine:

    • Rent a virtual machine in Azure, AWS, Google Cloud, etc.
    • Install a HANA instance there or on your PC (if you have more than 32 GB of RAM).
    • Enable PAL (Predictive Analysis Library). This is vital because the library relies on PAL algorithms.
  2. Installed software

  • Python > 3.6
    Skip this step if python --version returns > 3.6
  • Cython
    pip3 install Cython
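
The Python requirement above can also be checked from inside the interpreter. The sketch below is only an illustration: the helper name `python_is_supported` is hypothetical, and it assumes that "Python > 3.6" means any 3.6.x release or newer, which is how a `python --version` comparison would read.

```python
import sys

# Assumption: "Python > 3.6" is read as "3.6.x or newer".
MINIMUM = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python version looks fine")
else:
    print("Please upgrade Python before installing the package")
```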

Installation

There are two ways to install the library:

  • Stable: from PyPI
    pip3 install hana_automl
  • Latest: from the repository
    pip3 install https://github.com/dan0nchik/SAP-HANA-AutoML/archive/dev.zip
    Note: the latest version may contain bugs, so be careful!

After installation

Check that PAL (Predictive Analysis Library) is installed and the required roles are granted.

  • Read the docs section about that.
  • If you don't want to read the docs, run this code:
    from hana_automl.utils.scripts import setup_user
    from hana_ml.dataframe import ConnectionContext
    
    cc = ConnectionContext(address='address', user='user', password='password', port=39015)
    
    # replace with credentials of user that will be created or granted a role to run PAL.
    setup_user(connection_context=cc, username='user', password="password")

Usage

From code

Our library in a few lines of code

Connect to database.

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='address',
                     user='username',
                     password='password',
                     port=1234)
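
To keep credentials out of source code, the connection arguments can be read from the environment. This is a minimal sketch, not part of the library: the helper `connection_kwargs` and the `HANA_*` variable names are hypothetical, and the default port 39015 is simply the one used in the examples in this README.

```python
import os


def connection_kwargs(env=os.environ):
    """Build keyword arguments for ConnectionContext from environment variables.

    The HANA_* variable names are an assumption made for this sketch.
    """
    required = ("HANA_ADDRESS", "HANA_USER", "HANA_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["HANA_ADDRESS"],
        "user": env["HANA_USER"],
        "password": env["HANA_PASSWORD"],
        # Fall back to the port used elsewhere in this README.
        "port": int(env.get("HANA_PORT", "39015")),
    }


# Usage (requires hana_ml and a reachable instance):
# from hana_ml.dataframe import ConnectionContext
# cc = ConnectionContext(**connection_kwargs())
```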

Create AutoML model and fit it.

from hana_automl.automl import AutoML

model = AutoML(cc)
model.fit(
  file_path='path to training dataset', # it may be HANA table/view, or pandas DataFrame
  steps=10, # number of iterations
  target='target', # column to predict
  time_limit=120 # time limit in seconds
)

Predict.

model.predict(
  file_path='path to test dataset',
  id_column='ID',
  verbose=1
)
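
The fit and predict calls above refer to dataset paths, a target column, and an ID column. The sketch below writes two tiny CSV files with that layout so there is something concrete to point `file_path` at. It assumes `file_path` accepts a path to a CSV file; the column name `feature`, the file names, and the helper `write_dataset` are hypothetical, and the rows are toy values rather than real data.

```python
import csv
import os
import tempfile


def write_dataset(path, rows, columns):
    """Write a list of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return path


workdir = tempfile.mkdtemp()

# Training data: an ID column, one feature, and the column to predict.
train_rows = [
    {"ID": 1, "feature": 0.5, "target": 0},
    {"ID": 2, "feature": 1.5, "target": 1},
    {"ID": 3, "feature": 2.5, "target": 1},
]
# Test data: same layout without the target column.
test_rows = [{"ID": 4, "feature": 0.7}]

train_path = write_dataset(
    os.path.join(workdir, "train.csv"), train_rows, ["ID", "feature", "target"]
)
test_path = write_dataset(
    os.path.join(workdir, "test.csv"), test_rows, ["ID", "feature"]
)
print(train_path, test_path)
```

With files like these, `train_path` would be passed as `file_path` to `fit` together with `target='target'`, and `test_path` as `file_path` to `predict` together with `id_column='ID'`.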

For more examples, please refer to the documentation.

How to run Streamlit client

  1. Clone repository: git clone https://github.com/dan0nchik/SAP-HANA-AutoML.git
  2. Install dependencies: pip3 install -r requirements.txt
  3. Run GUI: streamlit run ./web.py

Roadmap

See the open issues for a list of proposed features (and known issues). Feel free to report any bugs :)

Contributing

Any contributions you make are greatly appreciated 👏 !

  1. Fork the Project

  2. Create your Feature Branch (git checkout -b feature/NewFeature)

  3. Install dependencies

    pip3 install Cython
    pip3 install -r requirements.txt
  4. Create a credentials.py file in the tests directory. Your files should look like this:

    SAP-HANA-AutoML
    │   README.md
    │   all other files   
    │   .....
    │
    └───tests
        │   test files...
        │   credentials.py
    

    Copy and paste this piece of code there and replace the placeholder values with your credentials:

    host = "host"
    user = "username"
    password = "password"
    port = 39015 # or any port you need
    schema = "your schema"

    Don't worry, this file is in .gitignore, so your credentials won't be seen by anyone.

  5. Make some changes

  6. Write tests that cover your code in tests directory

  7. Run tests (under SAP-HANA-AutoML directory)

    pytest
  8. Commit your changes (git commit -m 'Add some amazing features')

  9. Push to the branch (git push origin feature/NewFeature)

  10. Open a Pull Request
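
Before running pytest, the credentials.py file from step 4 can be sanity-checked. The sketch below is a hypothetical helper, not part of the repository: it only verifies that the five names shown in step 4 exist and have the expected types, and it works on any object with attributes, such as an imported module.

```python
from types import SimpleNamespace

# Field names and types taken from the credentials.py template in step 4.
REQUIRED_FIELDS = {
    "host": str,
    "user": str,
    "password": str,
    "port": int,
    "schema": str,
}


def check_credentials(credentials):
    """Return a list of problems found in a credentials object (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if not hasattr(credentials, name):
            problems.append(f"missing field: {name}")
        elif not isinstance(getattr(credentials, name), expected_type):
            problems.append(f"{name} should be of type {expected_type.__name__}")
    return problems


# Stand-in for `import credentials`, using the placeholder values from step 4.
example = SimpleNamespace(
    host="host", user="username", password="password", port=39015, schema="your schema"
)
print(check_credentials(example))
```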

License

Distributed under the MIT License. See LICENSE for more information.
Don't really understand the license? Check out the MIT license summary.

Contact

Authors: @While-true-codeanything, @DbusAI, @dan0nchik

Project Link: https://github.com/dan0nchik/SAP-HANA-AutoML
