Scale your pandas workflows

Dask

Dask is simply the most revolutionary data-processing tool I have encountered. If you love pandas and NumPy but have struggled with data that does not fit into RAM, Dask is what you need. It provides DataFrame and array collections that mirror the pandas DataFrame and NumPy array interfaces, and it can either run on your local computer or scale up to a cluster. Essentially, you write your code once and then choose whether to run it locally or deploy it to a multi-node cluster, all using normal Pythonic syntax.

Documentation

https://docs.dask.org/en/latest/

https://docs.dask.org/en/latest/dataframe.html

Modin

To use Modin, you do not need to know how many cores your system has, and you do not need to specify how to distribute the data. In fact, you can continue using your previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Once you’ve changed your import statement from import pandas as pd to import modin.pandas as pd, you’re ready to use Modin just like you would pandas.

Documentation

https://modin.readthedocs.io/en/latest/

Git

https://github.com/modin-project/modin

Ray

Ray is a fast and simple framework for building and running distributed applications.

Documentation

https://ray.readthedocs.io/en/latest/

Git

https://github.com/ray-project/ray
