Inequality Joins in Pandas with Pyjanitor

Short Talk at 3:36 pm

Joins in Pandas are equality based. However, there are other scenarios that require joins on inequality or a combination of equi and non-equi joins. Examples include a manufacturer wishing to minimise the cost of storage while maximising profits (increasing the inventory of the more profitable product, while decreasing the storage for the less profitable product), or a tax audit to find out which employers earn more, but pay less tax. Usually in Pandas this is executed via a Cartesian join which can be inefficient as well as memory intensive. Another option is via an Interval index; however efficient execution depends on non overlapping intervals. There is also pandas.merge_asof - however it is limited to just first,last or nearest matches.

This talk shows an efficient way to handle inequality joins in Pandas. It also shows benchmarks that shows significant performance benefits as well as lower memory usage compared to Cartesian joins.

Presented by

Samuel Oranyeli