How to Speed Up Pandas with Modin

GuDron

dumpz.ws
Admin
Регистрация
28 Янв 2020
Сообщения
7,752
Реакции
1,448
Credits
25,205
How to Speed Up Pandas with Modin

1*BNW8U6pKOHbb2Y0tq6L5lw.png

A goal of Modin is to allow data scientists to use the same code for small (kilobytes) and large datasets (terabytes). Image by Для просмотра ссылки Войди или Зарегистрируйся.
The pandas library provides easy-to-use data structures like pandas DataFrames as well as tools for data analysis. One issue with pandas is that it can be slow with large amounts of data. ItДля просмотра ссылки Войди или Зарегистрируйся. Fortunately, there is the Для просмотра ссылки Войди или Зарегистрируйся library which has benefits like the ability to scale your pandas workflows by changing one line of code and integration with the Python ecosystem and Ray clusters. This tutorial goes over how to get started with Modin and how it can speed up your pandas workflows.

How to get started with Modin​


0*et_aT2HZxtCSbzAj

To determine which Pandas methods to implement in Modin first, the developers of Modin scraped 1800 of the most upvoted Python Kaggle Kernels (Для просмотра ссылки Войди или Зарегистрируйся).
Modin’s coverage of the pandas API is over 90% with a focus on the most commonly used pandas methods like pd.read_csv, pd.DataFrame, df.fillna, and df.groupby. This means if you have a lot of data, you can perform most of the same operations as the pandas library faster. This section highlights some commonly used operations.
To get started, you need to install modin.
pip install “modin[all]” # Install Modin dependencies and modin’s execution engines

0*oOpK75j3HFip-yzg

Don’t forgot the “” when pip installing

Import Modin​

A major advantage of Modin is that it doesn’t require you to learn a new API. You only need to change your import statement.
import modin.pandas as pd

0*QUSmxvGoE23A81d0

You only need to change your import statement to use Modin.

Load data (read_csv)​


0*C1abtAtVNnHfiPgL

Modin really shines with larger datasets (Для просмотра ссылки Войди или Зарегистрируйся)
The dataset used in this tutorial is from the Для просмотра ссылки Войди или Зарегистрируйся dataset which is around 2GB .The code below reads the data into a Modin DataFrame.
modin_df = pd.read_csv("Rate.csv”)

0*u0N55ux1XOzCTCrL

In this case, Modin is faster due to it taking work off the main thread to be asynchronous. The file was read in-parallel. A large portion of the improvement was from building the DataFrame components asynchronously.
head
The code below utilizes the head command.
# Select top N number of records (default = 5)
modin_df.head()

0*2hzXz3YzBbJCbLKV

In this case, Modin is slower as it requires collecting the data together. However, users should not be able to perceive this difference in their interactive workflow.
groupby
Similar to pandas, modin has a groupby operation.
df.groupby(['StateCode’]).count()

0*NJA7GkjuDrLxj0CF

Note that there are plans to further optimize the performance of groupby operations in Modin.
fillna
Filling in missing values with the fillna method can be much faster with Modin.
modin_df.fillna({‘IndividualTobaccoRate’: ‘Unknown’})

0*Dz3P2RIFmOBrDj7K
 

GuDron

dumpz.ws
Admin
Регистрация
28 Янв 2020
Сообщения
7,752
Реакции
1,448
Credits
25,205

Default to pandas implementation​

As mentioned earlier, Modin’s API covers about 90% of the Pandas API. For methods not covered yet, Modin will default to a pandas implementation like in the code below.
modin_df.corr(method = ‘kendall’)

0*oAJ1VDwJ6nK2GYcg

When Modin defaults to pandas, you will see a warning.
While there is a performance penalty for defaulting to pandas, Modin will complete all operations whether or not the command is currently implemented in Modin.

0*JV2C2pZs9VIo6K6T

If a method is not implemented, it will default to pandas.
Для просмотра ссылки Войди или Зарегистрируйся explains how this process works.
We first convert to a pandas DataFrame, then perform the operation. There is a performance penalty for going from a partitioned Modin DataFrame to pandas because of the communication cost and single-threaded nature of pandas. Once the pandas operation has completed, we convert the DataFrame back into a partitioned Modin DataFrame. This way, operations performed after something defaults to pandas will be optimized with Modin.

How Modin can Speed up your Pandas Workflows​

The three main ways modin makes pandas workflows faster are through it’s multicore/multinode support, system architecture, and ease of use.

Multicore/Multinode Support​


0*rNa2RTy5Mkvvz2oc

Pandas can only utilize a single core. Modin is able to efficiently make use of all of the hardware available to it. The image shows resources (dark blue) that Modin can utilize with multiple cores (B) and multiple nodes available (C).
The pandas library can only utilize a single core. As virtually all computers today have multiple cores, there is a lot of opportunity to speed up your pandas workflow by having modin utilize all the cores on your computer.

0*jQzrcWdwSaBlu6xH

For the purpose of this blog, you can think of the MacBook above as a single node with 4 cores.
If you would like to scale your code to more than 1 node, Для просмотра ссылки Войди или Зарегистрируйся.

System Architecture​

Another way Modin can be faster than pandas is due to how pandas itself was implemented. Wes McKinney, the creator of pandas, gave a famous talk “Для просмотра ссылки Войди или Зарегистрируйся” where he went over some pandas’ lack of flexibility and performance issues.

0*nZ2kOvGyS-6eujXl

Some of Wes McKinney’s issues with pandas are performance related.
Modin endeavors to solve some of these issues. To understand how, it’s important to understand some of itsДля просмотра ссылки Войди или Зарегистрируйся. The diagram below outlines the general layered view to the components of Modin with a short description of each major section.

0*IgM_DHkkRCb0ToLF

Modin’s System Architecture
Для просмотра ссылки Войди или Зарегистрируйся: This is the user facing layer which primarily is Modin’s coverage of the pandas API. The SQLite API is experimental and the Modin API is something still being designed.
Modin Query Compiler:Для просмотра ссылки Войди или Зарегистрируйся, the Query Compiler layer closely follows the pandas API, but cuts out a large majority of the repetition.
Для просмотра ссылки Войди или Зарегистрируйся: This is where Modin’s optimized dataframe algebra takes place.
Execution: While Modin also supports other execution engines like Dask, the most commonly used execution engine is Для просмотра ссылки Войди или Зарегистрируйся which you can learn about in the next section.

What is Ray​


0*uPR3iQLxVAdVVjRS

Ray makes parallel and distributed processing work more like you would hope (Для просмотра ссылки Войди или Зарегистрируйся).
Ray is the default execution engine for Modin. This section briefly goes over what Ray is and how it can be used as more than just a execution engine.

0*0pA3rVk3QdLyP7Ay

The diagram above shows that at a high level, the Ray ecosystem consists of the core Ray system and scalable libraries for data science like Для просмотра ссылки Войди или Зарегистрируйся. It is a library forДля просмотра ссылки Войди или Зарегистрируйся across multiple cores or machines. It has a couple major advantages including:
  • Simplicity: you can scale your Python applications without rewriting them, and the same code can run on one machine or multiple machines.
  • Robustness: applications gracefully handle machine failures and preemption.
  • Для просмотра ссылки Войди или Зарегистрируйся: tasks run with millisecond latencies, scale to tens of thousands of cores, and handle numerical data with minimal serialization overhead.
Because Ray is a general-purpose framework, the community has built many libraries and frameworks on top of it to accomplish different tasks like Для просмотра ссылки Войди или Зарегистрируйся for hyperparameter tuning at any scale,Для просмотра ссылки Войди или Зарегистрируйся for easy-to-use scalable model serving, andДля просмотра ссылки Войди или Зарегистрируйся for reinforcement learning. It also has Для просмотра ссылки Войди или Зарегистрируйся as well as support for data processing libraries Для просмотра ссылки Войди или Зарегистрируйся.
While you don’t need to learn how to use Ray to use Modin, the image below shows that it generally only requires adding a couple lines of code to turn a simple Python program into a distributed one running across a compute cluster.

0*kM_SR_4opjEVrXGr

Example of how to turn a simple program into a distributed one using Ray (Для просмотра ссылки Войди или Зарегистрируйся).

Conclusion​


0*sKa6NMNvl89GWZzU

A goal of Modin is to allow data scientists to use the same code for small (kilobytes) and large datasets (terabytes). Image from Для просмотра ссылки Войди или Зарегистрируйся.
Modin allows you to use the same Pandas script for a 10KB dataset on a laptop as well as a 10TB dataset on a cluster. This is possible due to Modin’s easy to use API and system architecture. This architecture can utilize Ray as an execution engine to make scaling Modin easier. If you have any questions or thoughts about Ray, please feel free to join our community throughДля просмотра ссылки Войди или Зарегистрируйся orДля просмотра ссылки Войди или Зарегистрируйся. You can also check out the Для просмотра ссылки Войди или Зарегистрируйся page to see how Для просмотра ссылки Войди или Зарегистрируйся is being used throughout industry!