Make Pandas Three Times Faster with PyPolars

This article shows how to speed up a Pandas workflow with the PyPolars library.

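[51CTO.com Quick Translation] Pandas is one of the most important Python packages data scientists use to work with data. It is used mainly for data exploration and visualization, and it ships with a large number of built-in functions. Pandas struggles with large datasets, however, because it does not scale or distribute its work across all of the CPU's cores.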

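To speed up computation, you can put all of the CPU's cores to work. Several open-source libraries, including Dask, Vaex, Modin, Pandarallel and PyPolars, parallelize computation across multiple cores. In this article we discuss the implementation and usage of the PyPolars library and compare its performance with Pandas.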

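What is PyPolars?

PyPolars is an open-source Python DataFrame library similar to Pandas. It uses all available CPU cores, so it processes computations faster than Pandas. Its API resembles that of Pandas. It is written in Rust with a Python wrapper.

Ideally, you reach for PyPolars when the data is too big for Pandas and too small for Spark.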

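How does PyPolars work?

PyPolars has two APIs: an Eager API and a Lazy API. The Eager API closely resembles Pandas: each operation runs immediately and returns its result as soon as it finishes. The Lazy API closely resembles Spark: issuing a query first builds a plan (a map of the work to be done), and that plan is then executed in parallel across all of the CPU's cores.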
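
To make the eager/lazy distinction concrete, here is a toy sketch in plain Python. It is not the PyPolars API: the ToyLazyFrame class, its method names and the three sample rows are all made up for illustration, and the sketch runs on a single core. It only shows the idea that a lazy query records its steps as a plan and does no work until the result is requested.

```python
from typing import Callable, Iterable, List, Tuple


class ToyLazyFrame:
    """Hypothetical stand-in for a lazy frame: records steps, runs them on collect()."""

    def __init__(self, rows: Iterable[dict]):
        self._rows = list(rows)
        self._plan: List[Tuple[str, object]] = []  # the recorded query plan

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyLazyFrame":
        # Nothing is computed here; the step is only added to the plan.
        self._plan.append(("filter", predicate))
        return self

    def select(self, *columns: str) -> "ToyLazyFrame":
        self._plan.append(("select", columns))
        return self

    def describe_plan(self) -> List[str]:
        return [step for step, _ in self._plan]

    def collect(self) -> List[dict]:
        # The whole plan is executed only now.
        rows = self._rows
        for step, arg in self._plan:
            if step == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


# Made-up sample rows; the column names are borrowed from the article's dataset.
rows = [
    {"PAID_AMT": 120.0, "MARKET_SEGMENT": "A"},
    {"PAID_AMT": 750.0, "MARKET_SEGMENT": "B"},
    {"PAID_AMT": 980.0, "MARKET_SEGMENT": "A"},
]

# Eager style: the filter runs immediately and the result is available at once.
eager_result = [r for r in rows if r["PAID_AMT"] > 500]

# Lazy style: build the plan first, execute it later.
query = ToyLazyFrame(rows).filter(lambda r: r["PAID_AMT"] > 500).select("MARKET_SEGMENT")
planned_steps = query.describe_plan()  # plan exists, no rows processed yet
lazy_result = query.collect()          # work happens here

print(planned_steps)
print(lazy_result)
```

Because the lazy version sees the whole plan before running anything, a real engine is free to reorder or parallelize the steps; this toy version simply runs them in order.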

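Figure 1. PyPolars API

PyPolars is essentially the Python binding for the Polars library. A strength of PyPolars is that its API is similar to Pandas, which makes it easier for developers to pick up.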

Installation:

PyPolars can be installed from PyPI with the following command:

pip install py-polars

Then import the library with:

import pypolars as pl
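
As the import warning shown in the Usage section says, py-polars was later renamed to polars. Below is a minimal sketch of an import that tolerates either package name. The helper name load_polars is made up for illustration, and the code assumes nothing about which package, if any, is installed.

```python
import importlib


def load_polars():
    """Return (module, name) for the first importable package, or (None, None).

    load_polars is a hypothetical helper, not part of either library.
    """
    for name in ("polars", "pypolars"):  # prefer the current package name
        try:
            return importlib.import_module(name), name
        except ImportError:
            continue
    return None, None


pl, backend_name = load_polars()
print("DataFrame backend:", backend_name)
```

Newer polars releases may not accept every call shown in this article unchanged, so treat the fallback as a convenience for experimenting, not as a guarantee of compatibility.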

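Benchmark timings:

For the demonstration, I used a large dataset (~6.4 GB) with about 25 million rows.

Figure 2. Benchmark timings for basic operations in Pandas and Py-Polars

From these benchmark timings for some basic operations, we can see that PyPolars is roughly 2 to 3 times faster than Pandas for loading, group-by and apply. Filtering takes about the same time in both libraries, and the describe workaround through to_pandas() is slower than native Pandas.

PyPolars has an API very similar to Pandas, but it does not yet cover every Pandas function. For example, PyPolars has no .describe() function; instead we can use df_pypolars.to_pandas().describe().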

Usage:

import pandas as pd
import numpy as np
import pypolars as pl
import time

WARNING!
py-polars was renamed to polars, please install polars!
https://pypi.org/project/polars/

path = "data.csv"

Reading the data:

s = time.time()
df_pandas = pd.read_csv(path)
e = time.time()
pd_time = e - s
print("Pandas Loading Time = {}".format(pd_time))

C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3071: DtypeWarning: Columns (2,7,14) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
Pandas Loading Time = 217.1734380722046

s = time.time()
df_pypolars = pl.read_csv(path)
e = time.time()
pl_time = e - s
print("PyPolars Loading Time = {}".format(pl_time))

PyPolars Loading Time = 114.0408570766449
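
The start/stop pattern used for these measurements repeats in every section, so it can be wrapped in a small helper. The sketch below is one possible way to do that with the standard library only. The name timed and the sum-of-squares workload are made up for illustration. It uses time.perf_counter() instead of time.time() because perf_counter is intended for measuring short durations, whereas time.time() can have coarse resolution on some platforms (which may be why a very fast operation is reported as 0.0 seconds).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the timed block raises.
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the article this would be a call such as pd.read_csv(path).
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

print("{} took {:.6f} s".format("sum of squares", timings["sum of squares"]))
```

Collecting the results in a dictionary keeps the Pandas and PyPolars numbers side by side and avoids reusing one variable name for both libraries.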

Shape:

s = time.time()
print(df_pandas.shape)
e = time.time()
pd_time = e - s
print("Pandas Shape Time = {}".format(pd_time))

(25366521, 19)
Pandas Shape Time = 0.0

s = time.time()
print(df_pypolars.shape)
e = time.time()
pl_time = e - s
print("PyPolars Shape Time = {}".format(pl_time))

(25366521, 19)
PyPolars Shape Time = 0.0010192394256591797

Filter:

s = time.time()
temp = df_pandas[df_pandas['PAID_AMT'] > 500]
e = time.time()
pd_time = e - s
print("Pandas Filter Time = {}".format(pd_time))

Pandas Filter Time = 0.8010377883911133

s = time.time()
temp = df_pypolars[df_pypolars['PAID_AMT'] > 500]
e = time.time()
pl_time = e - s
print("PyPolars Filter Time = {}".format(pl_time))

PyPolars Filter Time = 0.7790462970733643

Group by:

s = time.time()
temp = df_pandas.groupby(by="MARKET_SEGMENT").agg({'PAID_AMT': np.sum, 'QTY_DISPENSED': np.mean})
e = time.time()
pd_time = e - s
print("Pandas GroupBy Time = {}".format(pd_time))

Pandas GroupBy Time = 3.5932095050811768

s = time.time()
temp = df_pypolars.groupby(by="MARKET_SEGMENT").agg({'PAID_AMT': np.sum, 'QTY_DISPENSED': np.mean})
e = time.time()
pl_time = e - s
print("PyPolars GroupBy Time = {}".format(pl_time))

PyPolars GroupBy Time = 1.2332513110957213

Applying a function:

%%time
s = time.time()
temp = df_pandas['PAID_AMT'].apply(round)
e = time.time()
pd_time = e - s
print("Pandas Apply Time = {}".format(pd_time))

Pandas Apply Time = 13.081078290939331
Wall time: 13.1 s

s = time.time()
temp = df_pypolars['PAID_AMT'].apply(round)
e = time.time()
pl_time = e - s
print("PyPolars Apply Time = {}".format(pl_time))

PyPolars Apply Time = 6.03610580444336

Value counts:

%%time
s = time.time()
temp = df_pandas['MARKET_SEGMENT'].value_counts()
e = time.time()
pd_time = e - s
print("Pandas ValueCounts Time = {}".format(pd_time))

Pandas ValueCounts Time = 2.8194501399993896
Wall time: 2.82 s

%%time
s = time.time()
temp = df_pypolars['MARKET_SEGMENT'].value_counts()
e = time.time()
pl_time = e - s
print("PyPolars ValueCounts Time = {}".format(pl_time))

PyPolars ValueCounts Time = 1.7622406482696533
Wall time: 1.76 s

Describe:

%%time
s = time.time()
temp = df_pandas.describe()
e = time.time()
pd_time = e - s
print("Pandas Describe Time = {}".format(pd_time))

Pandas Describe Time = 15.48347520828247
Wall time: 15.5 s

%%time
s = time.time()
# temp_cols is not defined in the original listing; it is presumably the list of columns to describe.
temp = df_pypolars[temp_cols].to_pandas().describe()
e = time.time()
pl_time = e - s
print("PyPolars Describe Time = {}".format(pl_time))

PyPolars Describe Time = 44.31892013549805
Wall time: 44.3 s
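
The timings printed above can be condensed into speedup ratios to check the "2 to 3 times faster" claim. The sketch below copies the numbers from the outputs in this article and divides the Pandas time by the PyPolars time for each operation. The dictionary keys are informal labels chosen here. The shape measurement is left out because Pandas reported 0.0 seconds for it, which makes a ratio meaningless.

```python
# (pandas_seconds, pypolars_seconds), copied from the outputs in this article.
timings = {
    "read_csv": (217.1734380722046, 114.0408570766449),
    "filter": (0.8010377883911133, 0.7790462970733643),
    "groupby": (3.5932095050811768, 1.2332513110957213),
    "apply": (13.081078290939331, 6.03610580444336),
    "value_counts": (2.8194501399993896, 1.7622406482696533),
    "describe": (15.48347520828247, 44.31892013549805),
}

# A ratio above 1 means PyPolars was faster; below 1 means Pandas was faster.
speedups = {
    op: pandas_s / pypolars_s for op, (pandas_s, pypolars_s) in timings.items()
}

for op, ratio in speedups.items():
    print("{:<13} {:.2f}x".format(op, ratio))
```

On these numbers, group-by shows the largest gain (close to 3x), loading and apply come out near 2x, filtering is roughly even, and describe is slower because the data is first converted back to Pandas. These are single runs on one machine, so the ratios are indicative only.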