pandas-ml-utils-0.0.11.tar.gz resource - CSDN Library

Uploaded 2024-03-07 12:45:07, 549KB, GZ

54 files in total

py: 44

png: 2

txt: 1

pandas-ml-utils-0.0.11.tar.gz (54 files)

pandas-ml-utils-0.0.11

setup.py 1KB

LICENSE 1KB

PKG-INFO 227B

pandas_ml_utils

utils.py 2KB

__init__.py 2KB

train_test_data.py 6KB

multi_model.py 5KB

reinforcement

__init__.py 0B

agent.py 2KB

summary.py 733B

gym.py 2KB

pandas_utils_extension.py 1KB

error

__init__.py 0B

functions.py 126B

analysis

__init__.py 0B

correlation_analysis.py 1KB

datafetching

__init__.py 0B

fetch_yahoo.py 2KB

regression

__init__.py 0B

regressor.py 2KB

summary.py 598B

model

fitter.py 6KB

__init__.py 0B

fit.py 2KB

models.py 8KB

summary.py 122B

features_and_Labels.py 3KB

selection.py 5KB

classification

__init__.py 0B

classification_plots.py 673B

summary.py 7KB

classifier.py 2KB

wrappers

__init__.py 0B

lazy_dataframe.py 2KB

hashable_dataframe.py 532B

extern

__init__.py 0B

loss_functions.py 3KB

pyproject.toml 734B

requirements.txt 201B

deploy.sh 267B

test

test__features_and_labels.py 660B

test__utils.py 1KB

test__feature_selection.py 567B

test__classification_summary.py 2KB

component_test.py 10KB

test__hashable_dataframe.py 506B

test__model.py 2KB

test__lazy_dataframe.py 827B

component_test.csv 463KB

test__make_train_test_data.py 9KB

.gitignore 64B

images

simple-fit.png 49KB

fit-with-loss.png 322KB

README.md 11KB

![PyPI - Downloads](https://img.shields.io/pypi/dw/pandas-ml-utils)

# pandas-ml-utils

**A note of caution**: this is a one-man-show hobby project in a pre-alpha state, mainly serving my own needs. Be my guest and use it or extend it.

I was really sick of converting data frames to numpy arrays back and forth just to try out a simple logistic regression. So I have started a pandas ml utilities library where everything should be reachable from the data frame itself. Something along the lines of `model = df.fit(my_super_model)`.

Provided utils include:

* basic feature analysis / selection
* fit various kinds of models directly from data frames
* fit binary classifiers
* fit regression models
* fit reinforcement agents
* develop, save, load and deploy models

Check the [component tests](https://github.com/KIC/pandas_ml_utils/blob/master/test/component_test.py) for some more concrete examples.

## Basic Feature Analysis / Selection
TODO ... write this stuff

## Fitting Models Directly from DataFrames

### Binary Classification
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# load the breast cancer data set into a data frame and attach the label column
bc = load_breast_cancer()

df = pd.DataFrame(bc.data, columns = bc.feature_names)
df["label"] = bc.target

# fit a logistic regression on six features, keeping 40% of the rows as test data
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs', max_iter=300),
                                      pmu.FeaturesAndLabels(features=['mean radius', 'mean texture', 'mean perimeter', 'mean area',
                                                                      'worst concave points', 'worst fractal dimension'],
                                                            labels=['label'])),
                        test_size=0.4)
```

As a result you get a Fit object which holds the fitted model and two ClassificationSummary objects: one for the training data and one for the test data.
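To make the shape of that result a bit more concrete, here is a minimal pure-Python sketch of the idea: a fitted model bundled with one confusion-matrix summary for the training rows and one for the held-out rows. Everything in it is hypothetical and only for illustration (`FitSketch`, `SummarySketch`, `fit_threshold_classifier`, the toy threshold model, and the assumption that the last `test_size` fraction of rows is held out). It is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SummarySketch:
    """Hypothetical stand-in for a classification summary: a 2x2 confusion matrix."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int


@dataclass
class FitSketch:
    """Hypothetical stand-in for a fit result: the model plus two summaries."""
    model: Callable[[float], bool]
    training_summary: SummarySketch
    test_summary: SummarySketch


def summarize(model: Callable[[float], bool],
              xs: Sequence[float], ys: Sequence[bool]) -> SummarySketch:
    # Count each (prediction, truth) combination.
    tp = fp = fn = tn = 0
    for x, y in zip(xs, ys):
        pred = model(x)
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    return SummarySketch(tp, fp, fn, tn)


def fit_threshold_classifier(xs: Sequence[float], ys: Sequence[bool],
                             test_size: float) -> FitSketch:
    # Assumption: hold out the last `test_size` fraction of the rows as test data.
    split = len(xs) - int(len(xs) * test_size)
    train_x, test_x = xs[:split], xs[split:]
    train_y, test_y = ys[:split], ys[split:]

    # Toy "model": a threshold halfway between the two class means of the training rows.
    pos = [x for x, y in zip(train_x, train_y) if y]
    neg = [x for x, y in zip(train_x, train_y) if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def model(x: float) -> bool:
        return x > threshold

    return FitSketch(model,
                     summarize(model, train_x, train_y),
                     summarize(model, test_x, test_y))


xs = [1.0, 2.0, 8.0, 9.0, 1.5, 8.5, 2.5, 7.5, 3.0, 6.0]
ys = [False, False, True, True, False, True, False, True, False, True]

fit = fit_threshold_classifier(xs, ys, test_size=0.4)
print(fit.training_summary)
print(fit.test_summary)
```

Keeping the two summaries side by side makes it easy to compare in-sample and out-of-sample behaviour at a glance.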
In case the classification was executed in a notebook you get a nice table:

![Fit](https://raw.githubusercontent.com/KIC/pandas_ml_utils/master/images/simple-fit.png)

### Binary Classification with Loss
As you can see in the above example, there are two confusion matrices: the regular well-known one and a "loss" matrix. The intent of the loss matrix is to tell you whether a misclassification has a cost, i.e. a loss in dollars.

```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression

# fetch_yahoo is not part of pandas itself; presumably it becomes available via pandas_ml_utils
df = pd.fetch_yahoo(spy='SPY')
df["label"] = df["spy_Close"] > df["spy_Open"]
df["loss"] = (df["spy_Open"] / df["spy_Close"] - 1) * 100

# the loss column is passed along with the features and labels
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                                      pmu.FeaturesAndLabels(features=['spy_Open', 'spy_Low'],
                                                            labels=['label'],
                                                            loss_column='loss')),
                        test_size=0.4)
```

![Fit with loss](https://raw.githubusercontent.com/KIC/pandas_ml_utils/master/images/fit-with-loss.png)

Now you can see the loss in % of dollars of your misclassification. The classification probabilities are plotted on the very top of the plot.

### Autoregressive Models and RNN Shape
It is also possible to use the FeaturesAndLabels object to generate autoregressive features. By default, lagging features results in an RNN-shaped 3D array (in the format Keras likes it). However, we can also use SkitModels; the features will then be implicitly transformed back into a 2D array (by using the `reshape_rnn_as_ar` function).

```python
import pandas_ml_utils as pmu

# use lags 0 to 9 of the feature column
pmu.FeaturesAndLabels(features=['feature'], labels=['label'], feature_lags=range(0, 10))
```

One may like to use very long lags, e.g. to catch seasonal effects. Since very long lags are a bit fuzzy, I usually like to smooth them a bit by using simple averages.
```python
import pandas_ml_utils as pmu

pmu.FeaturesAndLabels(features=['feature'], labels=['label'],
                      target_columns=['strike'],
                      loss_column='put_loss',
                      feature_lags=[0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233],
                      lag_smoothing={
                          6: lambda df: df.SMA(3, price=df.columns[0]),   # from lag 6 onwards
                          35: lambda df: df.SMA(5, price=df.columns[0])   # from lag 35 onwards
                      })
```

Every lag from 6 onwards will be smoothed by a 3-period moving average, every lag from 35 onwards by a 5-period moving average.

## Cross Validation
It is possible to apply a cross validation algorithm to the training data (after the train test split). In case you only want cross validation, pass `test_size=0`.

Note that the current implementation just fits the model on all folds one after the other, without any averaging of the validation loss. However, the folds can be looped many times, which essentially means we invented something like fold epochs. Therefore your fitter epochs can be reduced by dividing them by the number of fold epochs.

```python
from sklearn.model_selection import KFold

cv = KFold(n_splits = 10)
fit = df.fit_classifier(...,
                        SomeModel(epochs=100 // 10),       # fitter epochs divided by the 10 fold epochs
                        test_size=0.1,                     # keep 10% very unseen
                        cross_validation=(10, cv.split),   # 10 fold epochs
                        ...)
```

## Back-Testing a Model
todo ... `df.backtest_classifier(...)`

## Save, Load and Reuse a Model
To save a model you simply call the save method on the model inside of the fit.
```
fit.model.save('/tmp/foo.model')
```

Loading is as simple as calling load on the Model object. You can immediately apply the model on the dataframe to get back the features along with the classification (which is just another data frame).
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)

# load the previously saved model and classify the data frame with it
df.classify(pmu.Model.load('/tmp/foo.model')).tail()
```

NOTE: If you have a target level for your binary classifier, like all houses cheaper than 50k, then you can pass this target level to the FeaturesAndLabels object like so: `FeaturesAndLabels(target_columns=['House Price'])`. This target column is simply fed through to the classified dataframe as target columns.

### Fitting other models than classifiers
#### Regression Models
For non-classification tasks use the regressor functions the same way as the classifier functions.

* df.fit_regressor(...)
* df.backtest_regressor(...)
* df.regress(...)

#### Reinforcement Learning
For reinforcement learning there is a keras-rl backend implemented. The API is the same as for the others like classification or regression.

* df.fit_agent(...)
* df.backtest_agent(...)
* df.agent_take_action(...)

However, the model is a bit more complicated than the regular SkitModel; you might take a look at the [component tests](https://github.com/KIC/pandas_ml_utils/blob/master/test/component_test.py).

### Other utility objects
#### LazyDataFrame
Very often I need to do a lot of feature engineering. And very often I do not want to treat averages or other engineering methods as part of the data(frame). For this use case I have added a LazyDataFrame object wrapping around a regular DataFrame where some columns will always be calculated on the fly.

Here is an example:
```python
import pandas_ml_utils as pmu
import pandas as pd
import talib

df = pd.fetch_yahoo(spy='SPY')
ldf = pmu.LazyDataFrame(df,
                        rolling_stddev=lambda x: talib.STDDEV(x['spy_Close'], timeperiod=30) / 100)

ldf["rolling_stddev"].tail()  # Will always be calculated on the fly
```

#### HashableDataFrame
The hashable dataframe is not meant to be used directly.
It is just a hack to allow caching of feature matrices. With heavy usage o
