# c-lasso: a Python package for constrained sparse regression and classification
c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality
constraints on the model parameters. The forward model is assumed to be:
<img src="https://latex.codecogs.com/gif.latex?y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad&space;C\beta=0" title="y=X\beta+\sigma\epsilon\qquad\text{s.t.}\qquad C\beta=0" />
Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown model coefficients, and σ is an unknown scale parameter.
The package handles several different estimators for inferring β (and σ), including
the constrained Lasso, the constrained scaled Lasso, and sparse Huber M-estimation with linear equality constraints.
Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve
the underlying convex optimization problems.
We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection.
This package is intended to fill the gap between popular Python tools such as [scikit-learn](https://scikit-learn.org/stable/), which cannot solve sparse constrained problems, and general-purpose optimization solvers, which do not scale well to the problems considered here.
Below we show several use cases of the package, including an application of sparse *log-contrast*
regression to *compositional* microbiome data.
The code builds on results from several papers which can be found in the [References](#references).
## Table of Contents
* [Installation](#installation)
* [Regression and classification problems](#regression-and-classification-problems)
* [Getting started](#getting-started)
* [Log-contrast regression for microbiome data](#log-contrast-regression-for-microbiome-data)
* [Optimization schemes](#optimization-schemes)
* [Structure of the code](#structure-of-the-code)
* [References](#references)
## Installation
c-lasso is available on PyPI. You can install the package
from the shell using
```shell
pip install c_lasso
```
To use the c-lasso package in Python, type
```python
from classo import *
```
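Alternatively, to keep the namespace explicit, you can import just the routines used in the examples below:
```python
from classo import classo_problem, random_data
```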
The c-lasso package depends on the following standard Python packages, which are installed automatically alongside it:
* `numpy`
* `matplotlib`
* `scipy`
* `pandas`
* `h5py`
## Regression and classification problems
The c-lasso package can solve six different types of estimation problems:
four regression-type and two classification-type formulations.
#### [R1] Standard constrained Lasso regression:
<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;||&space;X\beta-y&space;||^2&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" />
This is the standard Lasso problem with linear equality constraints on the β vector.
The objective function combines a least-squares loss for model fitting with an l1 penalty for sparsity.
#### [R2] Constrained sparse Huber regression:
<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;h_{\rho}(X\beta-y&space;)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" />
This regression problem uses the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) as the objective function
for robust model fitting, combined with an l1 penalty and linear equality constraints on the β vector. The Huber parameter is set to ρ=1.345.
#### [R3] Constrained scaled Lasso regression:
<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\frac{||&space;X\beta&space;-&space;y||^2}{\sigma}&space;+&space;\frac{n}{2}&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \frac{|| X\beta - y||^2}{\sigma} + \frac{n}{2} \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" />
This formulation is similar to [R1] but allows for joint estimation of the (constrained) β vector and
the standard deviation σ in a concomitant fashion (see [References](#references) [4,5] for further info).
This is the default problem formulation in c-lasso.
#### [R4] Constrained sparse Huber regression with concomitant scale estimation:
<img src="https://latex.codecogs.com/gif.latex?\arg&space;\min_{\beta&space;\in&space;\mathbb{R}^d,&space;\sigma&space;>&space;0}&space;\left(&space;h_{\rho}&space;\left(&space;\frac{&space;X\beta&space;-&space;y}{\sigma}&space;\right)+&space;n&space;\right)&space;\sigma+&space;\lambda&space;||\beta||_1&space;\qquad&space;\mbox{s.t.}&space;\qquad&space;C\beta&space;=&space;0" title="\arg \min_{\beta \in \mathbb{R}^d, \sigma > 0} \left( h_{\rho} \left( \frac{ X\beta - y}{\sigma} \right)+ n \right) \sigma+ \lambda ||\beta||_1 \qquad \mbox{s.t.} \qquad C\beta = 0" />
This formulation combines [R2] and [R3] to allow robust joint estimation of the (constrained) β vector and
the scale σ in a concomitant fashion (see [References](#references) [4,5] for further info).
#### [C1] Constrained sparse classification with Square Hinge loss:
<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;l(y^TX\beta)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" />
where l is defined as:
<img src="https://latex.codecogs.com/gif.latex?l(r)=\max(1-r,0)^2" />
This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss
with (constrained) sparse β vector estimation.
#### [C2] Constrained sparse classification with Huberized Square Hinge loss:
<img src="https://latex.codecogs.com/gif.latex?\arg\min_{\beta\in&space;R^d}&space;l_{\rho}(y^TX\beta)&space;+&space;\lambda&space;||\beta||_1&space;\qquad\mbox{s.t.}\qquad&space;C\beta=0" />
where lρ is defined as:
<img src="https://latex.codecogs.com/gif.latex?l_{\rho}(r)&space;=&space;\begin{cases}&space;(1-r)^2&space;&\mbox{if&space;}&space;\rho&space;\leq&space;r&space;\leq&space;1&space;\\&space;(1-\rho)(1+\rho-2r)&space;&\mbox{if&space;}&space;r&space;\leq&space;\rho&space;\\&space;0&space;&\mbox{if&space;}&space;r&space;\geq&space;1&space;\end{cases}" title="l_{\rho}(r) = \begin{cases} (1-r)^2 &\mbox{if } \rho \leq r \leq 1 \\ (1-\rho)(1+\rho-2r) &\mbox{if } r \leq \rho \\ 0 &\mbox{if } r \geq 1 \end{cases}" />
This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification
with (constrained) sparse β vector estimation.
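All six formulations can be selected on the same problem instance. The sketch below illustrates one way to switch between them; the boolean flags `huber`, `concomitant`, and `classification` and the `rho` attribute on `problem.formulation` are assumed names here, so check the package documentation for the exact interface:
```python
import numpy as np
from classo import classo_problem

# Toy data: 10 samples, 5 features, one zero-sum constraint row.
X = np.random.randn(10, 5)
C = np.ones((1, 5))
y = np.random.randn(10)

problem = classo_problem(X, y, C)

# [R3], the constrained scaled Lasso, is the default formulation.
problem.formulation.concomitant = False   # switch to [R1]
problem.formulation.huber = True          # together with the line above: [R2]
# Setting huber=True while keeping concomitant=True gives [R4] instead.
# problem.formulation.classification = True  # [C1]; add huber=True for [C2]
# problem.formulation.rho = 1.345            # Huber parameter (assumed attribute)
```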
## Getting started
#### Basic example
We begin with a basic example that shows how to run c-lasso on synthetic data. The c-lasso package includes
the routine ```random_data``` that allows you to generate problem instances using normally distributed data.
```python
# n samples, d features, d_nonzero active coefficients, k constraints, noise level sigma
n, d, d_nonzero, k, sigma = 100, 100, 5, 1, 0.5
(X, C, y), sol = random_data(n, d, d_nonzero, k, sigma, zerosum=True)
```
This code snippet generates a problem instance with a sparse β in dimension
d=100 (sparsity d_nonzero=5). The design matrix X comprises n=100 samples drawn i.i.d. from a standard normal
distribution. The constraint matrix C has dimension k x d. The noise level is σ=0.5.
The input ```zerosum=True``` sets C to the all-ones row vector, so that Cβ=0 enforces a zero-sum constraint on β. The n-dimensional outcome vector y
and the ground-truth regression vector β (returned as ```sol```) are then generated to satisfy the given constraints.
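As a quick sanity check (not part of the package API), you can inspect the shapes of the generated arrays and verify that the ground-truth vector ```sol``` satisfies the constraint:
```python
import numpy as np

print(X.shape, C.shape, y.shape)  # expected: (100, 100) (1, 100) (100,)
print(np.allclose(C @ sol, 0))    # ground-truth beta satisfies C @ beta = 0
```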
Next we can define a default c-lasso problem instance with the generated data:
```python
problem = classo_problem(X,y,C)
```
You can look at the generated problem instance by typing:
```python
problem
```
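Typing the instance name displays a summary of the problem configuration. From there, the problem can be solved and the results inspected; a minimal sketch, assuming the `solve()` method and `solution` attribute of `classo_problem`:
```python
problem.solve()
print(problem.solution)
```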