<div align="center">
<img src="images/logo.png" alt="prince_logo"/>
</div>
<br/>
<div align="center">
<!-- Python version -->
<a href="https://pypi.python.org/pypi/prince">
<img src="https://img.shields.io/badge/python-3.x-blue.svg?style=flat-square" alt="PyPI version"/>
</a>
<!-- PyPi -->
<a href="https://pypi.org/project/prince/">
<img src="https://badge.fury.io/py/prince.svg" alt="PyPI"/>
</a>
<!-- Build status -->
<a href="https://travis-ci.org/MaxHalford/Prince?branch=master">
<img src="https://img.shields.io/travis/MaxHalford/prince/master.svg?style=flat-square" alt="Build Status"/>
</a>
<!-- Test coverage -->
<a href="https://coveralls.io/github/MaxHalford/prince?branch=master">
<img src="https://coveralls.io/repos/github/MaxHalford/prince/badge.svg?branch=master&style=flat-square" alt="Coverage Status"/>
</a>
<!-- License -->
<a href="https://opensource.org/licenses/MIT">
<img src="http://img.shields.io/:license-mit-ff69b4.svg?style=flat-square" alt="license"/>
</a>
</div>
<br/>
## Introduction
Prince is a library for doing [factor analysis](https://www.wikiwand.com/en/Factor_analysis). It provides a variety of methods, including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [correspondence analysis (CA)](https://www.wikiwand.com/en/Correspondence_analysis). The goal is to provide an efficient implementation of each algorithm along with a nice API.
## Installation
:warning: Prince is only compatible with **Python 3**.
:snake: Although it isn't a requirement, using [Anaconda](https://www.continuum.io/downloads) is highly recommended.
**Via PyPI**
```sh
pip install prince
```
**Via GitHub for the latest development version**
```sh
pip install git+https://github.com/MaxHalford/Prince
```
Prince doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `matplotlib`), all of which are included with Anaconda.
## Usage
### Guidelines
Each estimator provided by `prince` extends scikit-learn's `TransformerMixin`. This means that each estimator implements a `fit` and a `transform` method, which makes them usable in a transformation pipeline. The `transform` method is an alias for the `row_coordinates` method, which returns the row principal coordinates. Most estimators also expose the column principal coordinates, for instance through a `column_coordinates` method.
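Because each estimator is a scikit-learn transformer, it can be dropped straight into a scikit-learn pipeline. Here is a minimal sketch, reusing the Iris data from the PCA example further down:

```python
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets, pipeline

>>> X, _ = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(X)

>>> # Any prince estimator can be used as a pipeline step
>>> pipe = pipeline.make_pipeline(prince.PCA(n_components=2, random_state=42))
>>> row_coords = pipe.fit_transform(X)  # fits, then returns the row coordinates
```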
Under the hood Prince uses a [randomised version of SVD](https://research.fb.com/fast-randomized-svd/). This is much faster than the more common full approach. The trade-off is that the results contain a small amount of inherent randomness. For most applications this doesn't matter and you shouldn't have to worry about it. If you want reproducible results, set the `random_state` parameter.
The randomised version of SVD is an iterative method. Because each of Prince's algorithms uses SVD, they all possess an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (the default behaviour) is recommended.
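For instance, fixing `random_state` makes two independent fits agree exactly; a minimal sketch:

```python
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets

>>> X, _ = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(X)

>>> # Two fits with the same seed and number of iterations produce identical results
>>> a = prince.PCA(n_components=2, n_iter=3, random_state=42).fit(X).transform(X)
>>> b = prince.PCA(n_components=2, n_iter=3, random_state=42).fit(X).transform(X)
>>> a.equals(b)
True
```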
Which method you should use depends on your situation (a small dispatch sketch follows this list):
- All your variables are numeric: use principal component analysis (`prince.PCA`)
- You have a contingency table: use correspondence analysis (`prince.CA`)
- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`prince.MCA`)
- You have groups of categorical **or** numerical variables: use multiple factor analysis (`prince.MFA`)
- You have both categorical and numerical variables: use factor analysis of mixed data (`prince.FAMD`)
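As an illustration, here is a hypothetical helper (not part of `prince`) that picks an estimator from a dataframe's dtypes, following the first, third and fifth rules above:

```python
>>> import pandas as pd
>>> import prince

>>> def pick_estimator(df):
...     """Hypothetical helper: choose an estimator from the column dtypes."""
...     n_numeric = df.select_dtypes('number').shape[1]
...     n_categorical = df.shape[1] - n_numeric
...     if n_categorical == 0:
...         return prince.PCA()   # all numeric
...     if n_numeric == 0:
...         return prince.MCA()   # all categorical
...     return prince.FAMD()      # a mix of both
```

Contingency tables (`prince.CA`) and variable groups (`prince.MFA`) can't be detected from dtypes alone, which is why they are left out of this sketch.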
The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)
- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)
- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)
- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)
- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)
### Principal component analysis (PCA)
PCA assumes you have a dataframe consisting of continuous numerical variables. In this example we're going to be using the [Iris flower dataset](https://www.wikiwand.com/en/Iris_flower_data_set).
```python
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(data=X, columns=['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
>>> y = pd.Series(y).map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})
>>> X.head()
   Sepal length  Sepal width  Petal length  Petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
```
The `PCA` class implements scikit-learn's `fit`/`transform` API. Its parameters have to be passed at initialisation, before calling the `fit` method.
```python
>>> pca = prince.PCA(
... n_components=2,
... n_iter=3,
... rescale_with_mean=True,
... rescale_with_std=True,
... copy=True,
... engine='auto',
... random_state=42
... )
>>> pca = pca.fit(X)
```
The available parameters are:
- `n_components`: the number of components that are computed. You only need two if your intention is to make a chart.
- `n_iter`: the number of iterations used for computing the SVD
- `rescale_with_mean`: whether to subtract each column's mean
- `rescale_with_std`: whether to divide each column by its standard deviation (a sketch showing when to turn these two flags off follows this list)
- `copy`: if `False` then the computations will be done in place, which can have side-effects on the input data
- `engine`: what SVD engine to use (should be one of `['auto', 'fbpca', 'sklearn']`)
- `random_state`: controls the randomness of the SVD results.
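For example, if your columns are already standardised you can switch the rescaling off. A minimal sketch, using only the parameters listed above:

```python
>>> import prince

>>> pca_raw = prince.PCA(
...     n_components=2,
...     rescale_with_mean=False,  # assumes the columns are already centred
...     rescale_with_std=False,   # assumes the columns already have unit variance
...     random_state=42
... )
```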
Once the `PCA` has been fitted, it can be used to extract the row principal coordinates like so:
```python
>>> pca.transform(X).head() # Same as pca.row_coordinates(X).head()
0 1
0 -2.264542 0.505704
1 -2.086426 -0.655405
2 -2.367950 -0.318477
3 -2.304197 -0.575368
4 -2.388777 0.674767
```
Each column stands for a principal component, whilst each row corresponds to a row of the original dataset. You can display these projections with the `plot_row_coordinates` method:
```python
>>> ax = pca.plot_row_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... labels=None,
... color_labels=y,
... ellipse_outline=False,
... ellipse_fill=True,
... show_points=True
... )
>>> ax.get_figure().savefig('images/pca_row_coordinates.png')
```
<div align="center">
<img src="images/pca_row_coordinates.png" />
</div>
Each principal component explains part of the underlying variance of the data. You can see by how much by accessing the `explained_inertia_` property:
```python
>>> pca.explained_inertia_ # doctest: +ELLIPSIS
[0.727704..., 0.230305...]
```
The explained inertia represents the percentage of the inertia each principal component contributes. It sums up to 1 if the `n_components` property is equal to the number of columns in the original dataset.
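As a sanity check, fitting with as many components as there are columns should give an explained inertia that sums to (approximately) 1; a minimal sketch, bearing in mind the randomised SVD may leave the total marginally below 1:

```python
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets

>>> X, _ = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(X)

>>> # With as many components as columns, the explained inertia sums to roughly 1
>>> pca_full = prince.PCA(n_components=4, random_state=42).fit(X)
>>> total = sum(pca_full.explained_inertia_)
```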