![PyPI - Downloads](https://img.shields.io/pypi/dw/pandas-ml-utils)
# pandas-ml-utils
**A note of caution**: this is a one-man hobby project in a pre-alpha state, mainly
serving my own needs. Be my guest and use it or extend it.

I was really sick of converting data frames to numpy arrays and back just to try out a
simple logistic regression. So I started a pandas ML utilities library where
everything should be reachable from the data frame itself, something along the lines of
`model = df.fit(my_super_model)`.
Provided utils include:
* basic feature analysis / selection
* fit various kinds of models directly from data frames
  * fit binary classifiers
  * fit regression models
  * fit reinforcement agents
* develop, save, load and deploy models
Check the [component tests](https://github.com/KIC/pandas_ml_utils/blob/master/test/component_test.py) for some more
concrete examples.
## Basic Feature Analysis / Selection
TODO ... write this stuff
## Fitting Models Directly from DataFrames
### Binary Classification
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)
df["label"] = bc.target
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs', max_iter=300),
                                      pmu.FeaturesAndLabels(features=['mean radius', 'mean texture', 'mean perimeter', 'mean area',
                                                                      'worst concave points', 'worst fractal dimension'],
                                                            labels=['label'])),
                        test_size=0.4)
```
As a result you get a `Fit` object which holds the fitted model and two `ClassificationSummary` objects,
one for the training data and one for the test data. If the classification was
executed in a notebook, you get a nice table:
![Fit](https://raw.githubusercontent.com/KIC/pandas_ml_utils/master/images/simple-fit.png)
### Binary Classification with Loss
As you can see in the example above, there are two confusion matrices: the regular, well-known one
and a "loss" matrix. The intent of the loss matrix is to tell you whether a misclassification has a cost,
i.e. a loss in dollars.
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.linear_model import LogisticRegression
df = pd.fetch_yahoo(spy='SPY')
df["label"] = df["spy_Close"] > df["spy_Open"]
df["loss"] = (df["spy_Open"] / df["spy_Close"] - 1) * 100
fit = df.fit_classifier(pmu.SkitModel(LogisticRegression(solver='lbfgs'),
                                      pmu.FeaturesAndLabels(features=['spy_Open', 'spy_Low'],
                                                            labels=['label'],
                                                            loss_column='loss')),
                        test_size=0.4)
```
![Fit with loss](https://raw.githubusercontent.com/KIC/pandas_ml_utils/master/images/fit-with-loss.png)
Now you can see the loss (in % of dollars) caused by your misclassifications. The classification
probabilities are plotted at the very top of the plot.
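
To make the idea of the loss matrix more concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `loss_matrix` and the sample numbers are made up for illustration. It simply sums the per-row loss into the same four cells a confusion matrix uses.

```python
from collections import defaultdict


def loss_matrix(y_true, y_pred, loss):
    """Sum the loss of every sample into its (true, predicted) cell."""
    cells = defaultdict(float)
    for t, p, l in zip(y_true, y_pred, loss):
        cells[(bool(t), bool(p))] += l
    return dict(cells)


# Hypothetical sample: true label, predicted label and loss in % per row
y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
loss = [-1.0, -0.5, 2.0, 0.25, -0.25]

matrix = loss_matrix(y_true, y_pred, loss)
print(matrix[(False, True)])  # summed loss of the false positives
```

The key `(False, True)` holds the false positives, which is usually the cell you care about most when a wrong "buy" signal costs money.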
### Autoregressive Models and RNN Shape
It is also possible to use the `FeaturesAndLabels` object to generate autoregressive
features. By default, lagging features results in an RNN-shaped 3D array (in the format
Keras expects). However, we can also use `SkitModel`s; in that case the features are implicitly
transformed back into a 2D array (by using the `reshape_rnn_as_ar` function).
```python
import pandas_ml_utils as pmu
pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      feature_lags=range(0, 10))
```
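
The two shapes can be illustrated with a minimal sketch in plain Python. This is not the library's `reshape_rnn_as_ar`; the helper names `lagged_windows` and `flatten_windows`, the ordering of the lags and the dropping of the first rows are assumptions made for illustration only.

```python
def lagged_windows(rows, lags):
    """Build one window per sample: shape (samples, len(lags), features)."""
    max_lag = max(lags)
    windows = []
    for i in range(max_lag, len(rows)):
        # one time step per lag, each holding all features of that row
        windows.append([rows[i - lag] for lag in lags])
    return windows


def flatten_windows(windows):
    """Flatten (samples, time steps, features) to (samples, time steps * features)."""
    return [[value for step in window for value in step] for window in windows]


# Hypothetical data: 5 rows with a single feature each
rows = [[1.0], [2.0], [3.0], [4.0], [5.0]]
rnn_shaped = lagged_windows(rows, lags=range(0, 3))
flat = flatten_windows(rnn_shaped)

print(rnn_shaped[0])  # [[3.0], [2.0], [1.0]]
print(flat[0])        # [3.0, 2.0, 1.0]
```

The 3D variant keeps one entry per time step, which is what a recurrent layer consumes, while the flattened variant gives one plain feature vector per sample.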
One may like to use very long lags, e.g. to catch seasonal effects. Since very long lags
are a bit fuzzy, I usually like to smooth them a bit by using simple averages.
```python
import pandas_ml_utils as pmu
pmu.FeaturesAndLabels(features=['feature'],
                      labels=['label'],
                      target_columns=['strike'],
                      loss_column='put_loss',
                      feature_lags=[0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233],
                      lag_smoothing={
                          6: lambda df: df.SMA(3, price=df.columns[0]),
                          35: lambda df: df.SMA(5, price=df.columns[0])
                      })
```
Every lag from 6 onwards will be smoothed by a 3-period moving average, and every lag from 35 onwards
by a 5-period moving average.
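
The rule above can be sketched as a small lookup in plain Python. The helper `smoothing_window` is hypothetical and not part of the library. It also assumes the thresholds are inclusive, which is my reading of "from 6 onwards".

```python
def smoothing_window(lag, thresholds):
    """Return the window of the largest threshold that is <= lag, or None."""
    applicable = [start for start in thresholds if start <= lag]
    return thresholds[max(applicable)] if applicable else None


# Thresholds from the example above: lag 6+ uses 3 periods, lag 35+ uses 5 periods
thresholds = {6: 3, 35: 5}
lags = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]

for lag in lags:
    print(lag, smoothing_window(lag, thresholds))
```

With the lags from the example, lags 0 to 5 stay unsmoothed, lags 8 to 34 get the 3-period average and lags 55 and above get the 5-period average.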
## Cross Validation
It is possible to apply a cross validation algorithm to the training data (after the train/test
split). If you only want cross validation, pass `test_size=0`.

Note that the current implementation just fits the model on all folds one after the
other, without any averaging of the validation loss. However, the folds can be looped many
times, which essentially means we invented something like fold epochs. Therefore the epochs of your
fitter can be reduced by dividing them by the number of fold epochs.
```python
from sklearn.model_selection import KFold
cv = KFold(n_splits = 10)
fit = df.fit_classifier(...,
                        SomeModel(epochs=100 // 10),  # 100 epochs divided by 10 fold epochs
                        test_size=0.1,  # keep 10% very unseen
                        cross_validation=(10, cv.split),
                        ...)
```
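
The fold epoch arithmetic can be sketched in plain Python as follows. This is only an illustration of the looping scheme described above, not the library's code: `k_fold_indices` is a hypothetical, unshuffled stand-in for a splitter such as `KFold.split`, and the sample sizes are made up.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


total_epochs = 100  # epochs you would use without cross validation
fold_epochs = 10    # how often the whole set of folds is looped
fitter_epochs = total_epochs // fold_epochs

fit_calls = 0
for _ in range(fold_epochs):
    for train, validation in k_fold_indices(n_samples=20, n_splits=5):
        # a real model would be fitted on `train` for `fitter_epochs` epochs here
        fit_calls += 1

print(fitter_epochs, fit_calls)  # 10 50
```

Each fold is visited `fold_epochs` times, so every fit call only needs a fraction of the epochs you would otherwise use.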
## Back-Testing a Model
todo ... `df.backtest_classifier(...)`
## Save, Load and Reuse a Model
To save a model you simply call the save method on the model inside of the fit.
```python
fit.model.save('/tmp/foo.model')
```
Loading is as simple as calling `load` on the `Model` object. You can immediately apply
the model to the data frame to get back the features along with the classification
(which is just another data frame).
```python
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns = bc.feature_names)
df.classify(pmu.Model.load('/tmp/foo.model')).tail()
```
NOTE: If you have a target level for your binary classifier, like all houses cheaper than
50k, then you can pass this target level to the `FeaturesAndLabels` object like so:
`FeaturesAndLabels(target_columns=['House Price'])`. This target column is simply passed
through to the classified data frame as target columns.
### Fitting other models than classifiers
#### Regression Models
For non-classification tasks, use the regressor functions the same way as the classifier
functions.
* `df.fit_regressor(...)`
* `df.backtest_regressor(...)`
* `df.regress(...)`
#### Reinforcement Learning
For reinforcement learning there is a keras-rl backend implemented. The API is the same
as for the others, like classification or regression.
* `df.fit_agent(...)`
* `df.backtest_agent(...)`
* `df.agent_take_action(...)`

However, the model is a bit more complicated than the regular `SkitModel`, so you might want to take a look
at the [component tests](https://github.com/KIC/pandas_ml_utils/blob/master/test/component_test.py).
### Other utility objects
#### LazyDataFrame
Very often I need to do a lot of feature engineering. And very often I do not want to
treat averages or other engineering methods as part of the data(frame). For this use
case I have added a LazyDataFrame object wrapping around a regular DataFrame where
some columns will always be calculated on the fly.
Here is an example:
```python
import pandas_ml_utils as pmu
import pandas as pd
import talib
df = pd.fetch_yahoo(spy='SPY')
ldf = pmu.LazyDataFrame(df,
                        rolling_stddev=lambda x: talib.STDDEV(x['spy_Close'], timeperiod=30) / 100)
ldf["rolling_stddev"].tail()  # will always be calculated on the fly
```
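
The idea behind such a wrapper can be sketched in a few lines of plain Python. This is not how `LazyDataFrame` is implemented: the class `LazyColumns` is a hypothetical stand-in that wraps a dict of lists instead of a real data frame, just to show that lazy columns are recomputed on every access and never stored.

```python
class LazyColumns:
    """Hypothetical stand-in: a dict of columns plus columns computed on access."""

    def __init__(self, data, **lazy_columns):
        self.data = data                  # column name -> list of values
        self.lazy_columns = lazy_columns  # column name -> callable(data)

    def __getitem__(self, name):
        if name in self.lazy_columns:
            # recomputed on every access, never stored in `data`
            return self.lazy_columns[name](self.data)
        return self.data[name]


data = {"close": [10.0, 11.0, 12.0, 14.0]}
lazy = LazyColumns(
    data,
    change=lambda d: [b - a for a, b in zip(d["close"], d["close"][1:])],
)

print(lazy["change"])  # [1.0, 1.0, 2.0]
data["close"].append(15.0)
print(lazy["change"])  # [1.0, 1.0, 2.0, 1.0]
```

Because the lazy column is derived on access, appending new rows to the underlying data is immediately reflected without any recalculation step.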
#### HashableDataFrame
The hashable dataframe is not meant to be used directly. It is just a
hack to allow caching of feature matrices. With heavy usage o