# pandas-parallel-apply
<div align="center">
<a href="https://gitlab.com/meehai/pandas-parallel-apply/-/blob/main/LICENSE">
<img src="https://img.shields.io/gitlab/license/meehai/pandas-parallel-apply" alt="License"/>
</a>
<a href="https://pypi.org/project/pandas-parallel-apply/">
<img src="https://img.shields.io/pypi/v/pandas-parallel-apply" alt="PyPi Latest Release"/>
</a>
</div>
Parallel wrappers for `df.apply(fn)`, `df[col].apply(fn)`, `series.apply(fn)`, and `df.groupby([cols]).apply(fn)`, with tqdm progress bars included.
## Installation
`pip install pandas-parallel-apply`
Import with:
```python
from pandas_parallel_apply import DataFrameParallel, SeriesParallel
```
## Examples
See `examples/` for usage on some dummy dataframes and series.
## Usage
```python
# Apply on each row of a dataframe (row-wise, so axis=1)
df.apply(fn, axis=1)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).apply(fn, axis=1)
# Apply on a column of a dataframe (returns a Series)
df[col].apply(fn, axis=1)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True)[col].apply(fn, axis=1)
# Apply on a series
series.apply(fn)
# ->
SeriesParallel(series, n_cores: int = None, pbar: bool = True).apply(fn)
# GroupBy apply
df.groupby([cols]).apply(fn)
# ->
DataFrameParallel(df, n_cores: int = None, pbar: bool = True).groupby([cols]).apply(fn)
```
## How it works
It takes the length N of your dataframe (or series, or grouper) and the `n_cores` value K passed to the constructor.
It then splits the dataframe into K chunks of roughly N/K rows each and spawns K new processes, one per chunk.
Only row-wise (perfectly parallelizable) operations are supported, so `df.apply(fn, axis=1)` is okay, but
`df.apply(fn, axis=0)` is not, because it may require rows that live on other workers.
It is assumed that each row takes a similar time to process, so the K chunks finish at more or less the same time.
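The splitting described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: `chunk_bounds`, `_apply_chunk` and `chunked_apply` are hypothetical names, and the sketch assumes `fn` is a module-level (picklable) function.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def chunk_bounds(n_rows: int, n_chunks: int) -> list[tuple[int, int]]:
    """Return (start, stop) row offsets for n_chunks contiguous, near-equal chunks."""
    base, extra = divmod(n_rows, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def _apply_chunk(args):
    """Run a plain row-wise pandas apply on one chunk (executed in a worker process)."""
    chunk, fn = args
    return chunk.apply(fn, axis=1)


def chunked_apply(df: pd.DataFrame, fn, n_workers: int):
    """Hypothetical sketch: split df into n_workers chunks, apply fn per row, merge."""
    chunks = [df.iloc[a:b] for a, b in chunk_bounds(len(df), n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so the merged result keeps row order.
        parts = list(pool.map(_apply_chunk, [(chunk, fn) for chunk in chunks]))
    return pd.concat(parts)
```

Calling `chunked_apply(df, fn, 4)` from a script guarded by `if __name__ == "__main__":` should give the same result as `df.apply(fn, axis=1)`, since each row is processed independently of the others.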
### Future Improvement
Not supported yet, but may be interesting: also define a number of chunks C > K, so that the df is split into C chunks of N/C rows,
and these are passed to the K processes in a round-robin fashion. Right now C = K, so a process that finishes its chunk early
is not assigned any more work.
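To illustrate why C > K could help, here is a small pure-Python sketch of the round-robin idea. It is not part of the library: all function names are hypothetical, and the per-row costs are made-up numbers chosen only to show uneven work.

```python
def chunk_bounds(n_rows: int, n_chunks: int) -> list[tuple[int, int]]:
    """Return (start, stop) row offsets for n_chunks contiguous, near-equal chunks."""
    base, extra = divmod(n_rows, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def round_robin_assignment(n_chunks: int, n_workers: int) -> list[list[int]]:
    """Chunk i goes to worker i % n_workers."""
    return [list(range(w, n_chunks, n_workers)) for w in range(n_workers)]


def worker_loads(row_costs: list[int], n_chunks: int, n_workers: int) -> list[int]:
    """Total cost handled by each worker when C chunks are dealt out round-robin."""
    bounds = chunk_bounds(len(row_costs), n_chunks)
    chunk_costs = [sum(row_costs[a:b]) for a, b in bounds]
    assignment = round_robin_assignment(n_chunks, n_workers)
    return [sum(chunk_costs[i] for i in idxs) for idxs in assignment]


# Illustrative costs: the first two rows are much slower than the rest.
row_costs = [5, 5, 1, 1, 1, 1, 1, 1]
print(worker_loads(row_costs, n_chunks=2, n_workers=2))  # C = K: [12, 4]
print(worker_loads(row_costs, n_chunks=8, n_workers=2))  # C > K: [8, 8]
```

With C = K the slowest worker handles 12 units of work. With C = 8 the expensive rows are spread across both workers and each handles 8.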
## n_cores semantics
- `n_cores < -1` -> throws an error
- `n_cores == -1` -> uses `cpu_count()` - 1 cores
- `n_cores == 0` -> uses serial/standard pandas functions
- `n_cores == 1` -> spawns a single process alongside the main one
- `n_cores > 1` -> spawns `n_cores` processes and chunks the df
- `n_cores > cpu_count()` -> emits a warning
- `n_cores > len(df)` -> limits to `len(df)`
On CPU-bound tasks (calculations), `n_cores = -1` is likely to be fastest. On network-bound operations (e.g., where threads may invoke network calls),
using a very high `n_cores` value may be beneficial.
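The rules listed above could be expressed roughly as the helper below. `resolve_n_cores` is a hypothetical name used for illustration and the library's own code may differ. The behaviour of the default `n_cores=None` is not described here, so the sketch does not cover it.

```python
import os
import warnings


def resolve_n_cores(n_cores: int, n_items: int) -> int:
    """Hypothetical helper: map a requested n_cores to a worker count (0 means serial)."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    if n_cores < -1:
        raise ValueError(f"n_cores must be >= -1, got {n_cores}")
    if n_cores == -1:
        # Leave one core free; on a single-core machine this yields 0 (serial).
        n_cores = available - 1
    if n_cores > available:
        warnings.warn(f"n_cores={n_cores} exceeds the {available} available cores")
    # Never use more workers than there are items to process.
    return min(n_cores, n_items)
```

A return value of 0 would mean falling back to the standard pandas call, and any positive value is the number of processes to spawn.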
## Disclaimers
- This is an experimental repository. It may lead to unexpected behaviour.
- Not all the merging semantics of pandas are supported. Pandas has weird and complex methods of converting an apply return. For example, a series apply function may return a dataframe, a series, a dict, a list, etc. All of these are converted in some specific way. Some cases may not be supported.
- Groupby apply functions are currently **much** slower than their serial variant. Still experimenting with how to make them faster. The results look correct, just 10-100x slower on some small examples. This may improve as dataframes get bigger.
- Using `n_cores = 1` will create a multiprocessing pool with just 1 worker, so the code runs in parallel (that is, not on the main process). It may not yield much speed improvement, but it avoids blocking the main process, which may be useful in some GUI apps.
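As an illustration of the merging semantics mentioned above, the snippet below shows how plain pandas (not this library) converts a few different return types from an apply function. The dataframe and column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# A scalar per row is collected into a Series.
scalar_result = df.apply(lambda row: row["a"] + row["b"], axis=1)

# A list per row is kept as-is: the result is a Series whose values are lists.
list_result = df.apply(lambda row: [row["a"], row["b"], row["a"] * row["b"]], axis=1)

# A Series per row is expanded into a DataFrame, one column per label.
series_result = df.apply(
    lambda row: pd.Series({"total": row["a"] + row["b"], "diff": row["b"] - row["a"]}),
    axis=1,
)

# The same expansion happens for Series.apply when the function returns a Series.
expanded = df["a"].apply(lambda x: pd.Series({"x": x, "squared": x * x}))

for name, result in [
    ("scalar", scalar_result),
    ("list", list_result),
    ("series", series_result),
    ("series.apply", expanded),
]:
    print(name, "->", type(result).__name__)
```

A parallel wrapper has to reproduce each of these conversions when it merges per-chunk results, which is why some return types may not be supported.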
That's all.