Tidy interface to polars

Overview

tidypolars

PyPI Latest Release

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Installation

$ pip3 install tidypolars

General syntax

tidypolars methods are designed to work like tidyverse functions:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ xyzdouble_xx_plus_y │
│ ---------------      │
│ i64i64stri64i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 25b47        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 03a03        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 14a25        │
└─────┴─────┴─────┴──────────┴──────────┘

The key difference from R is that column names must be wrapped in col() in the following methods:

  • .filter()
  • .mutate()
  • .summarize()

The general idea - when doing calculations on a column you need to wrap it in col(). When doing simple column selections (like in .select()) you can pass the column names as strings.

Group by syntax

Methods operate by group by calling the by arg.

  • A single column can be passed with by = 'z'
  • Multiple columns can be passed with by = ['y', 'z']
(
    df
    .summarize(avg_x = tp.mean(col('x')),
               by = 'z')
)
┌─────┬───────┐
│ zavg_x │
│ ------   │
│ strf64   │
╞═════╪═══════╡
│ a0.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b2     │
└─────┴───────┘

Selecting/dropping columns

tidyselect functions can be mixed with normal selection when selecting columns:

df = tp.Tibble(x1 = range(3), x2 = range(3), y = range(3), z = range(3))

df.select(tp.starts_with('x'), 'z')
┌─────┬─────┬─────┐
│ x1x2z   │
│ --------- │
│ i64i64i64 │
╞═════╪═════╪═════╡
│ 000   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 111   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 222   │
└─────┴─────┴─────┘

To drop columns use the .drop() method:

df.drop(tp.starts_with('x'), 'z')
┌─────┐
│ y   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
└─────┘

Converting to/from pandas data frames

If you need to use a package that requires pandas data frames, you can convert from a tidypolars Tibble to a pandas DataFrame.

To do this you'll first need to install pyarrow:

pip3 install pyarrow

To convert to a pandas DataFrame:

df = df.to_pandas()

To convert from a pandas DataFrame to a tidypolars Tibble:

df = tp.from_pandas(df)

Speed Comparisons

A few notes:

  • Comparing times from separate functions typically isn't very useful. For example - the .summarize() tests were performed on a different dataset from .pivot_wider().
  • All tests are run 5 times. The times shown are the median of those 5 runs.
  • All timings are in milliseconds.
  • All tests can be found in the source code here.
  • FAQ - Why are some tidypolars functions faster than their polars counterpart?
    • Short answer - they're not! After all they're just using polars in the background.
    • Long answer - All python functions have some slight natural variation in their execution time. By chance the tidypolars runs were slightly shorter on those specific functions on this iteration of the tests. However one goal of these tests is to show that the "time cost" of translating syntax to polars is very negligible to the user (especially on medium-to-large datasets).
  • Lastly I'd like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
┌─────────────┬────────────┬─────────┬──────────┐
│ func_testedtidypolarspolarspandas   │
│ ------------      │
│ strf64f64f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange190.345169.478500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when87.34879.427152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct16.88816.28228.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter29.78929.91231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join236.784231.2831042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join49.7147.563630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join113.7921151100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate7.9797.408117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider42.76439.93949.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize59.43458.011453.707  │
└─────────────┴────────────┴─────────┴──────────┘

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Comments
  • `drop` with error `RuntimeError: Any(NotFound(

    `drop` with error `RuntimeError: Any(NotFound("^x.*$"))`

    import sys
    import tidypolars as tp
    sys.version
    # '3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
    tp.__version__
    # '0.2.1'
    ## error
    df = tp.Tibble(x1 = range(3), x2 = range(3), y=range(3), z = range(3))
    df.drop([tp.starts_with('x'), 'z'])
    df.drop()
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_12815/866601321.py in <module>
    ----> 1 df.drop(tp.starts_with('x'))
    
    ~/miniconda3/envs/py39/lib/python3.9/site-packages/polars/eager/frame.py in drop(self, name)
       2253             return df
       2254 
    -> 2255         return wrap_df(self._df.drop(name))
       2256 
       2257     def drop_in_place(self, name: str) -> "pl.Series":
    
    RuntimeError: Any(NotFound("^x.*$"))
    `
    
    
    
    opened by ztsweet 9
  • `AttributeError: arrange not found`

    `AttributeError: arrange not found`

    import tidypolars as tp
    from tidypolars import col, desc
    import sys
    sys.version
    # '3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0]'
    tp.__version__
    # '0.2.1'
    df = tp.Tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
    df.arrange('x', 'y')
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        882         try:
    --> 883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    
    RuntimeError: Any(NotFound("arrange"))
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_21110/1194586334.py in <module>
    ----> 1 df.arrange('x', 'y')
    
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    --> 885             raise AttributeError(f"{item} not found")
        886 
        887     def __iter__(self) -> Iterator[Any]:
    
    AttributeError: arrange not found
    `
    
    bug 
    opened by ztsweet 6
  • Missing attributes when chaining

    Missing attributes when chaining

    Hi Mark, thanks for putting this package together. It looks very cool.

    I'm having a tough time getting the motivating examples to work, though. For example, the following triggers an error:

    import tidypolars as tp
    from tidypolars import col, desc
    
    df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
    
    df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Untitled-1 in <cell line: 1>()
    ----> <a href='untitled:Untitled-1?line=7'>8</a> df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    AttributeError: 'DataFrame' object has no attribute 'arrange'
    

    What's genuinely odd about the above is that arrange works on its own and when it comes before filter.

    # All of these work as expected
    df.filter(col('x') < 2)
    df.arrange(desc('z'), 'x')
    df.arrange(desc('z'), 'x').filter(col('x') < 2)
    

    A seemingly related issue is that I can't pass two arguments to filter when it follows arrange (or other verbs likeselect for that matter).

    df.filter(col('x') < 2, col('y')>3) ## works
    df.arrange(desc('z'), 'x').filter(col('x') < 2, col('y')>3) ## errors with "filter() takes 2 positional arguments but 3 were given"
    

    Any ideas?

    I'm on Python 3.9.2 installed via Homebrew on a 2019 Macbook (so regular Intel chip) and running the latest version of tidypolars (0.2.15).

    bug 
    opened by grantmcdermott 5
  • Is it possible to have dplyr's `group_by` + `mutate` behavior?

    Is it possible to have dplyr's `group_by` + `mutate` behavior?

    First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

    In R, we can do something like the following

    library(dplyr)
    data(iris)
    
    iris %>%
      group_by(Species) %>%
      mutate(
        result = Petal.Width - mean(Petal.Width)
      )
    

    Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).

    As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

    • Is it possible to have this behavior in tidypolars now?
      • If yes, how?
      • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?

    Again, thanks for the fantastic library!

    opened by tomicapretto 5
  • idiomatic way to add list as column

    idiomatic way to add list as column

    Forgive what's probably a dumb question, but is there a way to get .mutate to return the same object as the .bind_cols line?

    import tidypolars as tp
    
    tb = tp.Tibble({'a': [1, 2, 3]})
    x = [4, 5, 6]
    # gives desired output 
    tb.bind_cols(tp.Tibble({'b': x}))
    # gives error: ValueError: could not convert value '[4, 5, 6]' as a Literal
    tb.mutate(b = x)
    
    feature 
    opened by eutwt 4
  • purrr functions!?

    purrr functions!?

    I noticed that in tidytable, you have purrr functions like map.(), but not in tidypolars.

    Using for loops + lambda functions are just not desirable for collaborative coding / code readability/comprehension. In Python, even if there is a bit of sacrifice in performance, if it allows better code readability, it would be really nice to have.

    Would something like map.() be in the scope of this repo?

    feature 
    opened by exsell-jc 3
  • ```ValueError``` with ```filter```

    ```ValueError``` with ```filter```

    When I chain filter expressions with | (error message said to use | and not or), I receive a ValueError message:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. 
    Hint: use '&' or '|' to chain Expr together, not and/or.
    

    It works fine if I do one or the other:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(# col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    # chr_col
    #   --
    #   str
    # "this is a test 2"
    
    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' # |
                # col('chr_col') == 'this is a test 2'
                )
    
    # chr_col
    #   --
    #   str
    # "this is a test 1"
    
    opened by alexandro-ag 3
  • ```as_date``` with ```RuntimeError: please define a fmt```

    ```as_date``` with ```RuntimeError: please define a fmt```

    Good afternoon,

    I think I found an issue with the as_date method. In the example per the documentation, the following succeeds:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['2021-12-31']) # Year-Month-Day (%Y-%m-%d)
    date_df.mutate(date_parsed = tp.as_date(col('date'))) # Success
    

    However when parsing different formats (using the fmt argument), the date fails to parse:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['12/31/2021']) # Month/Day/Year (%m/%d/%Y)
    date_df.mutate(date_parsed = tp.as_date(col('date'), fmt='%m/%d/%Y')) # RuntimeError
    

    I also extend my appreciation for all the work on this package. I've been searching for a tidyverse implementation in python and this one knocks my expectations out of the park. Thank you.

    opened by alexandro-ag 3
  • Revisit `.rename()` syntax

    Revisit `.rename()` syntax

    Should the syntax be the same as pl.DataFrame.rename? Currently polars mimics pandas syntax. Or should it be something that attempts to mimic tidyverse syntax?

    Note: polars also has a .rename_col() with syntax df.rename_col('old', 'new').

    opened by markfairbanks 3
  • Compatibility with polars v0.14.0

    Compatibility with polars v0.14.0

    PR that caused the break: https://github.com/pola-rs/polars/pull/4309

    Old behavior that tidypolars relied on: https://github.com/pola-rs/polars/pull/2862

    feature 
    opened by markfairbanks 2
  • Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Really new to the library, but looking at the documentation did not really help with understanding.

    Problem 1

    import polars as pl
    import tidypolars as tp
    import csv
    import requests
    
    url = f'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-24/Scrumqueens-data-2022-05-23.csv'
    
    df = tp.read_csv(file = url) # does not work
    df = pl.read_csv(file = url) # works??
    

    Problem 2

    df = df.drop('...1', 'Notes') # does not work
    df = df.drop('...1') # works separately
    df = df.drop('Notes') # works separately
    

    Problem 3

    df.colnames
    df.names
    df.colnames()
    df.names()
    # None of these work
    

    What am I missing, exactly?

    opened by exsell-jc 2
  • plans for adding type hints

    plans for adding type hints

    Hi, it seems that the codebase is not annotated making the discoverability of methods difficult and static code analysis not working. Any plans on adding type hints?

    feature 
    opened by mr-majkel 1
  • `write_csv()` returns `'super' object has no attribute 'to_csv'`

    `write_csv()` returns `'super' object has no attribute 'to_csv'`

    Hi, There seems to be a problem with write_csv(). I can import tidypolars and the data just fine:

    import tidypolars as tp
    
    rents = tp.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv")
    

    But when I try to export the data frame as a csv file:

    rents.write_csv("rents.csv")
    

    I get an error stating 'super' object has no attribute 'to_csv'.

    The data come from the Tidytuesday repo. Python version is 3.10.8 and tidypolars is 0.2.19. I'm on macOS 13.

    bug 
    opened by alesvomacka 1
  • Calculating time

    Calculating time

    In R with lubridate, it would look like this:

    one_year_before = some_date - years(1)
    one_year_before = some_date - months(12)
    

    But in tidypolars functions list, there doesn't seem to be a years or months function: https://tidypolars.readthedocs.io/en/latest/reference.html

    feature 
    opened by exsell-jc 4
Releases(v0.2.19)
This is an official pytorch implementation of Fast Fourier Convolution.

Fast Fourier Convolution (FFC) for Image Classification This is the official code of Fast Fourier Convolution for image classification on ImageNet. Ma

pkumi 199 Jan 03, 2023
A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation This repository contains the source code of the paper A Differentiable

Bernardo Aceituno 2 May 05, 2022
Code for CPM-2 Pre-Train

CPM-2 Pre-Train Pre-train CPM-2 此分支为110亿非 MoE 模型的预训练代码,MoE 模型的预训练代码请切换到 moe 分支 CPM-2技术报告请参考link。 0 模型下载 请在智源资源下载页面进行申请,文件介绍如下: 文件名 描述 参数大小 100000.tar

Tsinghua AI 136 Dec 28, 2022
Real Time Object Detection and Classification using Yolo Algorithm.

Real time Object detection & Classification using YOLO algorithm. Real Time Object Detection and Classification using Yolo Algorithm. What is Object D

Ketan Chawla 1 Apr 17, 2022
1st place solution in CCF BDCI 2021 ULSEG challenge

1st place solution in CCF BDCI 2021 ULSEG challenge This is the source code of the 1st place solution for ultrasound image angioma segmentation task (

Chenxu Peng 30 Nov 22, 2022
Tutel MoE: An Optimized Mixture-of-Experts Implementation

Project Tutel Tutel MoE: An Optimized Mixture-of-Experts Implementation. Supported Framework: Pytorch Supported GPUs: CUDA(fp32 + fp16), ROCm(fp32) Ho

Microsoft 344 Dec 29, 2022
ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

This is the project page for the paper: ISTR: End-to-End Instance Segmentation via Transformers, Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wa

Jie Hu 182 Dec 19, 2022
YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with ONNX, TensorRT, ncnn, and OpenVINO supported.

Introduction YOLOX is an anchor-free version of YOLO, with a simpler design but better performance! It aims to bridge the gap between research and ind

7.7k Jan 03, 2023
Multi-Anchor Active Domain Adaptation for Semantic Segmentation (ICCV 2021 Oral)

Multi-Anchor Active Domain Adaptation for Semantic Segmentation Munan Ning*, Donghuan Lu*, Dong Wei†, Cheng Bian, Chenglang Yuan, Shuang Yu, Kai Ma, Y

Munan Ning 36 Dec 07, 2022
For the paper entitled ''A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining''

Summary This is the source code for the paper "A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining", which was accepted as fu

1 Nov 10, 2021
2021 Artificial Intelligence Diabetes Datathon

A.I.D.D. 2021 2021 Artificial Intelligence Diabetes Datathon A.I.D.D. 2021은 ‘2021 인공지능 학습용 데이터 구축사업’을 통해 만들어진 학습용 데이터를 활용하여 당뇨병을 효과적으로 예측할 수 있는가에 대한 A

2 Dec 27, 2021
Converts geometry node attributes to built-in attributes

Attribute Converter Simplifies converting attributes created by geometry nodes to built-in attributes like UVs or vertex colors, as a single click ope

Ivan Notaros 12 Dec 22, 2022
Lorien: A Unified Infrastructure for Efficient Deep Learning Workloads Delivery

Lorien: A Unified Infrastructure for Efficient Deep Learning Workloads Delivery Lorien is an infrastructure to massively explore/benchmark the best sc

Amazon Web Services - Labs 45 Dec 12, 2022
UnsupervisedR&R: Unsupervised Pointcloud Registration via Differentiable Rendering

UnsupervisedR&R: Unsupervised Pointcloud Registration via Differentiable Rendering This repository holds all the code and data for our recent work on

Mohamed El Banani 118 Dec 06, 2022
UIUCTF 2021 Public Challenge Repository

UIUCTF-2021-Public UIUCTF 2021 Public Challenge Repository Notes: every challenge folder contains a challenge.yml file in the format for ctfcli, CTFd'

SIGPwny 15 Nov 03, 2022
The description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts.

FMFCC-A This project is the description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts. The FMFCC-A dataset is shared through BaiduCl

18 Dec 24, 2022
A map update dataset and benchmark

MUNO21 MUNO21 is a dataset and benchmark for machine learning methods that automatically update and maintain digital street map datasets. Previous dat

16 Nov 30, 2022
This was initially the repo for the project of [email protected] of Asaf Mazar, Millad Kassaie and Georgios Chochlakis named "Powered by the Will? Exploring Lay Theories of Behavior Change through Social Media"

Subreddit Analysis This repo includes tools for Subreddit analysis, originally developed for our class project of PSYC 626 in USC, titled "Powered by

Georgios Chochlakis 1 Dec 17, 2021
This is the codebase for Diffusion Models Beat GANS on Image Synthesis.

This is the codebase for Diffusion Models Beat GANS on Image Synthesis.

OpenAI 3k Dec 26, 2022
A Python package to process & model ChEMBL data.

insilico: A Python package to process & model ChEMBL data. ChEMBL is a manually curated chemical database of bioactive molecules with drug-like proper

Steven Newton 0 Dec 09, 2021