PipeChain is a utility library for creating functional pipelines.

Overview

PipeChain

Motivation

PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Australian phone numbers from our users. We need to clean this data before we insert it into the database. With PipeChain, you can do this whole process in one neat pipeline:

from pipechain import PipeChain, PLACEHOLDER as _

nums = [
    "493225813",
    "0491 570 156",
    "55505488",
    "Barry",
    "02 5550 7491",
    "491570156",
    "",
    "1800 975 707"
]

PipeChain(
    nums
).pipe(
    # Remove spaces
    map, lambda x: x.replace(" ", ""), _
).pipe(
    # Remove non-numeric entries
    filter, lambda x: x.isnumeric(), _
).pipe(
    # Add the mobile code to the start of 8-digit numbers
    map, lambda x: "04" + x if len(x) == 8 else x, _
).pipe(
    # Add the 0 to the start of 9-digit numbers
    map, lambda x: "0" + x if len(x) == 9 else x, _
).pipe(
    # Convert to a set to remove duplicates
    set
).eval()
{'0255507491', '0455505488', '0491570156', '0493225813', '1800975707'}

Without PipeChain, we would have to horrifically nest our code, or else use a lot of temporary variables:

set(
    map(
        lambda x: "0" + x if len(x) == 9 else x,
        map(
            lambda x: "04" + x if len(x) == 8 else x,
            filter(
                lambda x: x.isnumeric(),
                map(
                    lambda x: x.replace(" ", ""),
                    nums
                )
            )
        )
    )
)
{'0255507491', '0455505488', '0491570156', '0493225813', '1800975707'}

Installation

pip install pipechain

Usage

Basic Usage

PipeChain has only two exports: PipeChain, and PLACEHOLDER.

PipeChain is a class that defines a pipeline. You create an instance of the class, and then call .pipe() to add another function onto the pipeline:

from pipechain import PipeChain, PLACEHOLDER
PipeChain(1).pipe(str)
PipeChain(arg=1, pipes=[functools.partial(
   
    )])

   

Finally, you call .eval() to run the pipeline and return the result:

PipeChain(1).pipe(str).eval()
'1'

You can "feed" the pipe at either end, either during construction (PipeChain("foo")), or during evaluation .eval("foo"):

PipeChain().pipe(str).eval(1)
'1'

Each call to .pipe() takes a function, and any additional arguments you provide, both positional and keyword, will be forwarded to the function:

PipeChain(["b", "a", "c"]).pipe(sorted, reverse=True).eval()
['c', 'b', 'a']

Argument Position

By default, the previous value is passed as the first positional argument to the function:

PipeChain(2).pipe(pow, 3).eval()
8

The only magic here is that if you use the PLACEHOLDER variable as an argument to .pipe(), then the pipeline will replace it with the output of the previous pipe at runtime:

PipeChain(2).pipe(pow, 3, PLACEHOLDER).eval()
9

Note that you can rename PLACEHOLDER to something more usable using Python's import statement, e.g.

from pipechain import PLACEHOLDER as _
PipeChain(2).pipe(pow, 3, _).eval()
9

Methods

It might not see like methods will play that well with this pipe convention, but after all, they are just functions. You should be able to access any object's method as a function by accessing it on that object's parent class. In the below example, str is the parent class of "":

"".join(["a", "b", "c"])
'abc'
PipeChain(["a", "b", "c"]).pipe(str.join, "", _).eval()
'abc'

Operators

The same goes for operators, such as +, *, [] etc. We just have to use the operator module in the standard library:

from operator import add, mul, getitem

PipeChain(5).pipe(mul, 3).eval()
15
PipeChain(5).pipe(add, 3).eval()
8
PipeChain(["a", "b", "c"]).pipe(getitem, 1).eval()
'b'

Test Suite

Note, you will need poetry installed.

To run the test suite, use:

git clone https://github.com/multimeric/PipeChain.git
cd PipeChain
poetry install
poetry run pytest test/test.py
Owner
Michael Milton
Michael Milton
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather

Tuplex 791 Jan 04, 2023
ASOUL直播间弹幕抓取&&数据分析

ASOUL直播间弹幕抓取&&数据分析(更新中) 这些文件用于爬取ASOUL直播间的弹幕(其他直播间也可以)和其他信息,以及简单的数据分析生成。

159 Dec 10, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Thomas 2 May 26, 2022
Building house price data pipelines with Apache Beam and Spark on GCP

This project contains the process from building a web crawler to extract the raw data of house price to create ETL pipelines using Google Could Platform services.

1 Nov 22, 2021
Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment

Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment Brief explanation of PT Bukalapak.com Tbk Bukalapak was found

Najibulloh Asror 2 Feb 10, 2022
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data.

Hatchet Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing

Lawrence Livermore National Laboratory 14 Aug 19, 2022
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
A Python package for the mathematical modeling of infectious diseases via compartmental models

A Python package for the mathematical modeling of infectious diseases via compartmental models. Originally designed for epidemiologists, epispot can be adapted for almost any type of modeling scenari

epispot 12 Dec 28, 2022
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab, at MIT Website: https://sdv.dev Documentation: https://sdv.dev/SDV User Guides Developer Guides Github

The Synthetic Data Vault Project 1.2k Jan 07, 2023
The Dash Enterprise App Gallery "Oil & Gas Wells" example

This app is based on the Dash Enterprise App Gallery "Oil & Gas Wells" example. For more information and more apps see: Dash App Gallery See the Dash

Austin Caudill 1 Nov 08, 2021
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

9 Nov 16, 2022
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Amazon Web Services - Labs 3.3k Jan 04, 2023
A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and lo

Coiled 102 Nov 10, 2022
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021
Single-Cell Analysis in Python. Scales to >1M cells.

Scanpy – Single-Cell Analysis in Python Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It inc

Theis Lab 1.4k Jan 05, 2023
MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.

MetPy MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data. MetPy follows semantic versioni

Unidata 971 Dec 25, 2022