pandas-multiprocess: a Python package for processing Pandas DataFrames with multiprocessing

15 files in total

py: 6

md: 1

makefile: 1


Uploaded 2021-05-14 13:05:20, 13 KB ZIP


pandas-multiprocess-master.zip (15 files)

pandas-multiprocess-master
    MANIFEST.in 177B
    .travis.yml 174B
    Pipfile 228B
    pandas_multiprocess
        multiprocess.py 7KB
        __init__.py 59B
    tests
        test_multiprocess.py 3KB
        context.py 128B
    Pipfile.lock 8KB
    setup.cfg 22B
    examples
        example.py 1KB
    setup.py 858B
    .gitignore 110B
    Makefile 551B
    README.md 2KB
    LICENSE.txt 1KB

# pandas-multiprocess

[![Build Status](https://travis-ci.org/xieqihui/pandas-multiprocess.svg?branch=master)](https://travis-ci.org/xieqihui/pandas-multiprocess)

A Python package to process a Pandas DataFrame using multiprocessing.

## Install

```
pip install pandas-multiprocess
```

## Example

### Import the package

```python
from pandas_multiprocess import multi_process
```

### Define a function which will process each row in a Pandas DataFrame

The function must take a `pandas.Series` as its first positional argument and return either a `pandas.Series` or a list of `pandas.Series`. It has one positional argument, `data_row`; additional arguments can be defined, and their values are passed through `multi_process()`. Here `**args` stands for the additional arguments.

```python
def func(data_row, **args):
    # data_row (pd.Series): a row of a pandas DataFrame
    # args: a dict of additional arguments
    data_row['sum'] = data_row['col_1'] + data_row['col_2']
    return data_row
```

### Initiate a DataFrame

```python
import pandas as pd
import numpy as np

df_len = 1000
df = pd.DataFrame({'col_1': np.random.normal(size=df_len),
                   'col_2': np.random.normal(size=df_len)
                   })
```

### Process it using multiprocess

```python
# The `args` will be passed to the additional arguments of `func()`
args = {}
result = multi_process(func=func, data=df, num_process=8, **args)
```

### The above operation is equivalent to the following, but much more efficient

```
result = df.apply(func, axis=1, **args)
```

The result of the [example](examples/example.py) demonstrates the efficiency of `pandas-multiprocess` when each row of a DataFrame needs a computationally expensive operation: in the run below, 8 processes take about 2.2 seconds and 16 processes about 1.4 seconds, against about 11.2 seconds for `apply()`.

```
Running examples...
100%|████| 100/100 [00:01<00:00, 68.65it/s]
8 processes run time 2.189883 seconds.
100%|████| 100/100 [00:00<00:00, 140.90it/s]
16 processes run time 1.440812 seconds.
Pandas apply() run time 11.165841 seconds.
```
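
### Sketch: how a row-wise parallel apply can be built

To make the equivalence with `df.apply(func, axis=1)` concrete, here is a minimal sketch of the general idea using only the standard-library `multiprocessing.Pool`. It is not the package's actual implementation: the helpers `add_sum`, `apply_chunk`, `split_rows` and `parallel_apply` are hypothetical names introduced for illustration, and the sketch assumes a non-empty DataFrame and a module-level (picklable) row function that returns a single `pandas.Series`.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def add_sum(data_row):
    # Same per-row logic as func() in the README; copy to avoid mutating the input row.
    data_row = data_row.copy()
    data_row['sum'] = data_row['col_1'] + data_row['col_2']
    return data_row


def apply_chunk(task):
    # Worker entry point (hypothetical helper): run func on every row of one chunk.
    func, chunk = task
    return chunk.apply(func, axis=1)


def split_rows(df, n_chunks):
    # Split the DataFrame into at most n_chunks contiguous, non-empty row blocks.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[start:stop]
            for start, stop in zip(bounds[:-1], bounds[1:]) if stop > start]


def parallel_apply(df, func, num_process=4):
    # Assumption: func is defined at module level so it can be pickled for the workers.
    tasks = [(func, chunk) for chunk in split_rows(df, num_process)]
    with mp.Pool(processes=num_process) as pool:
        parts = pool.map(apply_chunk, tasks)  # map() preserves chunk order
    return pd.concat(parts)


if __name__ == '__main__':
    df = pd.DataFrame({'col_1': np.random.normal(size=1000),
                       'col_2': np.random.normal(size=1000)})
    result = parallel_apply(df, add_sum, num_process=4)
    # The parallel result should match a plain row-wise apply.
    assert result.equals(df.apply(add_sum, axis=1))
```

Splitting into one chunk per process keeps inter-process traffic low, since each worker receives and returns a single block of rows. The speed-up only pays off when the per-row function is expensive enough to outweigh the cost of pickling the chunks and starting the workers.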