[![Python package](https://github.com/nipype/pydra-ml/workflows/Python%20package/badge.svg?branch=master)](https://github.com/nipype/pydra-ml/actions?query=workflow%3A%22Python+package%22)
# pydra-ml
Pydra-ML is a demo application that leverages [Pydra](https://github.com/nipype/pydra)
together with [scikit-learn](https://scikit-learn.org) to perform model comparison
across a set of classifiers. The intent is to use this as an application to make
Pydra more robust while allowing users to generate classification reports more
easily. This application leverages Pydra's powerful splitters and combiners to
scale across a set of classifiers and metrics. It also uses Pydra's caching
to avoid redoing model training and evaluation when new metrics are added or
when the number of iterations (`n_splits`) is increased.
1. Output report contains [SHAP](https://github.com/slundberg/shap)
feature analysis.
2. Allows for comparing *some* scikit-learn pipelines in addition to base
classifiers.
## Installation
pydra-ml requires Python 3.7+.
```
pip install pydra-ml
```
## CLI usage
This repo installs `pydraml`, a CLI that allows usage without any programming.
To test the CLI on a classification example, copy `pydra_ml/tests/data/breast_cancer.csv` and
`short-spec.json.sample` to a folder and run:
```
$ pydraml -s short-spec.json.sample
```
To test a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and
`diabetes_spec.json` to a folder and run:
```
$ pydraml -s diabetes_spec.json
```
For each case, pydra-ml will generate a results folder named after the spec file,
containing a `test-{metric}-{timestamp}.png` file for each metric together with a
pickled results file containing all the scores from the model evaluations.
```
$ pydraml --help
Usage: pydraml [OPTIONS]
Options:
-s, --specfile PATH Specification file to use [required]
-p, --plugin TEXT... Pydra plugin to use [default: cf, n_procs=1]
  -c, --cache TEXT      Cache dir [default: <current working directory>/cache-wf]
--help Show this message and exit.
```
With the plugin option you can use local multiprocessing:
```
$ pydraml -s ../short-spec.json.sample -p cf "n_procs=8"
```
or execute via Dask:
```
$ pydraml -s ../short-spec.json.sample -p dask "address=tcp://192.168.1.154:8786"
```
## Current specification
The current specification is a JSON file as shown in the example below. It needs
to contain all the fields described here. For datasets with many features, you
will want to generate `x_indices` programmatically (see the sketch after this list).
- *filename*: Absolute path to the CSV file containing data. The file can contain
a column named `group` to support `GroupShuffleSplit`; otherwise each sample is
treated as its own group.
- *x_indices*: Numeric (0-based) or string list of columns to use as input features
- *target_vars*: String list of target variables (at present only one is supported)
- *group_var*: String to indicate column to use for grouping
- *n_splits*: Number of shuffle split iterations to use
- *test_size*: Fraction of data to use for test set in each iteration
- *clf_info*: List of scikit-learn classifiers to use.
- *permute*: List of booleans to indicate whether to generate a null model or not
- *gen_shap*: Boolean indicating whether shap values are generated
- *nsamples*: Number of samples to use for shap estimation
- *l1_reg*: Type of regularizer to use for shap estimation
- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16
or 0.1 for the top 10%). Set to 1.0 (float) to plot all features or 1 (int) to
plot only the top feature.
- *metrics*: List of scikit-learn metrics to use
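As noted above, `x_indices` can be generated programmatically. Here is a minimal
sketch, assuming a hypothetical `breast_cancer.csv` with a `target` column as in
the example below:
```python
import pandas as pd

# Hypothetical file name; adjust for your own dataset.
df = pd.read_csv("breast_cancer.csv")

# Use every column except the target (and the optional group column) as a feature.
x_indices = [col for col in df.columns if col not in ("target", "group")]
print(x_indices)  # paste into the spec's "x_indices" field, or insert via json
```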
## `clf_info` specification
This is a list of classifiers from scikit-learn, where each entry is an array encoding:
```
- module
- classifier
- (optional) classifier parameters
- (optional) gridsearch param grid
```
When a param grid is provided and the default classifier parameters are not
changed, an empty dictionary **MUST** be provided as the third element.
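For instance, the `KNeighborsClassifier` entry from the example spec below keeps
the default classifier parameters while providing a grid, so the third element is
the required empty dictionary:
```
["sklearn.neighbors", "KNeighborsClassifier", {},
 [{"n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19],
   "weights": ["uniform", "distance"]}]]
```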
This can also be embedded as a list indicating a scikit-learn Pipeline. For
example:
```
[ ["sklearn.impute", "SimpleImputer"],
["sklearn.preprocessing", "StandardScaler"],
["sklearn.tree", "DecisionTreeClassifier", {"max_depth": 5}]
]
```
## Example specification:
```
{"filename": "breast_cancer.csv",
"x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
"target_vars": ["target"],
"group_var": null,
"n_splits": 100,
"test_size": 0.2,
"clf_info": [
["sklearn.ensemble", "AdaBoostClassifier"],
["sklearn.tree", "DecisionTreeClassifier", {"max_depth": 5}],
["sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}],
["sklearn.svm", "SVC", {"probability": true},
[{"kernel": ["rbf", "linear"], "C": [1, 10, 100, 1000]}]],
["sklearn.neighbors", "KNeighborsClassifier", {},
[{"n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19],
"weights": ["uniform", "distance"]}]],
[ ["sklearn.impute", "SimpleImputer"],
["sklearn.preprocessing", "StandardScaler"],
["sklearn.tree", "DecisionTreeClassifier", {"max_depth": 5}]
]
],
"permute": [true, false],
"gen_shap": true,
"nsamples": 100,
"l1_reg": "aic",
"plot_top_n_shap": 16,
"metrics": ["roc_auc_score"]
}
```
## Output:
The workflow will output:
- `results-{timestamp}.pkl` containing one list per model used. For example, if
assigned to the variable `results`, the models are accessed through `results[0]`
to `results[N]` (e.g., with `permute: [true, false]`, the model trained on
permuted labels comes first as `results[0]`, followed by the model trained on
the true labels as `results[1]`; an additional model would be accessed through
`results[2]` and `results[3]`).
Each model contains:
- `dict` accessed through `results[0][0]` with model information:
`{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier',
{'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
- pydra `Result` object accessed through `results[0][1]` with attribute `output`,
which itself has attributes:
- `feature_names`: from the columns of the data csv.
And the following attributes, organized as N lists for the N bootstrapping samples:
- `output`: N lists, each with two lists containing the true and predicted labels.
- `score`: N lists, each containing M different metric scores.
- `shaps`: N lists, each containing an array of shape (P, F), where P is the
  number of predictions and F the SHAP values for each feature. `shaps` is
  empty if `gen_shap` is set to `false` or if `permute` is set to `true`.
- `model`: A pickled version of the model trained on all the input data.
  One can use this model to test on new data that has exactly the same input
  shape and features as the training data. For example:
```python
import pickle as pk
import numpy as np

# Load the pickled results (the file name will match your run's timestamp).
with open("results-20201208T010313.229190.pkl", "rb") as fp:
    data = pk.load(fp)

# results[0][1] is the pydra Result object; its output holds the trained model.
trained_model = data[0][1].output.model
# Predict on new data with the same number of features (here, 30).
trained_model.predict(np.random.rand(1, 30))
```
Please check the value of `data[N][0]` to ensure that you are not using
a permuted model.
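Similarly, here is a minimal sketch (assuming `data` is loaded as above) for
checking the model information and averaging the first metric's score across splits:
```python
model_info = data[0][0]
print(model_info)  # e.g., {'ml_wf.clf_info': [...], 'ml_wf.permute': False}

# Only summarize performance for models not trained on permuted labels.
if not model_info["ml_wf.permute"]:
    # output.score holds N lists (one per split), each with M metric scores.
    scores = [split_scores[0] for split_scores in data[0][1].output.score]
    print(sum(scores) / len(scores))
```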
- One figure per metric with performance distribution across splits (with or
without null distribution trained on permuted labels)
- One figure for each metric with the word `score` in its name, reporting the
results of a Wilcoxon signed-rank test. The figure reports one-sided statistics
as the color of each cell and the corresponding `-log10(pvalue)` as the annotation.
Higher numbers indicate a stronger effect (color) and lower p-values (annotation).
The actual numeric values are stored in a correspondingly named pkl file.
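As an illustration only (the exact statistic pydra-ml computes may differ), such a
pairwise comparison could be reproduced from the pickled scores with
`scipy.stats.wilcoxon`, assuming `data` is loaded as above and indices 1 and 3
hold two non-permuted models:
```python
import numpy as np
from scipy.stats import wilcoxon

# Per-split scores for the first metric of two non-permuted models.
scores_a = [s[0] for s in data[1][1].output.score]
scores_b = [s[0] for s in data[3][1].output.score]

# One-sided test: does model A outperform model B across paired splits?
stat, p = wilcoxon(scores_a, scores_b, alternative="greater")
print(stat, -np.log10(p))
```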
- `shap-{timestamp}` dir
  - SHAP values are computed for each prediction in each split's test set
    (e.g., 30 bootstrapping splits with 100 predictions each will create a (30, 100) array).
The m