SCIENCE CHINA

Mathematics

September 2018 Vol. 61 No. 9: 1617–1636

https://doi.org/10.1007/s11425-016-9116-6

© Science China Press and Springer-Verlag GmbH Germany 2018 math.scichina.com

ARTICLES

Model-free feature screening for high-dimensional survival data

Yuanyuan Lin¹, Xianhui Liu² & Meiling Hao³,∗

¹ Department of Statistics, The Chinese University of Hong Kong, Hong Kong 999077, China;

² School of Statistics and Research Center of Applied Statistics, Jiangxi University of Finance and Economics, Nanchang 330013, China;

³ The Princess Margaret Cancer Center, University Health Network, Toronto M5G 2M9, Canada

Email: ylin@sta.cuhk.edu.hk, liuxh@mail.ustc.edu.cn, Meiling.Hao@uhnresearch.ca

Received November 14, 2016; accepted May 30, 2017; published online March 28, 2018

Abstract With the rapid growth in the size of scientific data across disciplines, feature screening plays an important role in reducing high dimensionality to a moderate scale in many scientific fields. In this paper, we introduce a unified and robust model-free feature screening approach for high-dimensional survival data with censoring, which has several advantages: it is a model-free approach under a general model framework, and hence avoids the complication of specifying an actual model form with a huge number of candidate variables; under mild conditions without requiring the existence of any moment of the response, it enjoys the ranking consistency and sure screening properties in ultra-high dimension. In particular, we impose a conditional independence assumption of the response and the censoring variable given each covariate, instead of assuming that the censoring variable is independent of the response and the covariates. Moreover, we also propose a more robust variant of the new procedure, which possesses desirable theoretical properties without any finite moment condition on the predictors and the response. The computation of the newly proposed methods does not require any complicated numerical optimization, and it is fast and easy to implement. Extensive numerical studies demonstrate that the proposed methods perform competitively for various configurations. Application is illustrated with an analysis of a genetic data set.

Keywords feature screening, random censoring, robustness, sure independence screening, ultra-high dimension

MSC(2010) 35J60, 35J70

Citation: Lin Y Y, Liu X H, Hao M L. Model-free feature screening for high-dimensional survival data. Sci China Math, 2018, 61: 1617–1636, https://doi.org/10.1007/s11425-016-9116-6

1 Introduction

High-dimensional data, where the number of candidate predictors or parameters p may be much larger than the sample size n, arise in many fields of modern science and pose unprecedented challenges for statistical analysis. With or without censoring, numerous state-of-the-art variable selection methods based on penalization have been successfully applied to high-dimensional data, including Lasso (see [21]), smoothly clipped absolute deviation (SCAD) (see [6]), the Dantzig selector (see [3]), group Lasso (see [26]), adaptive Lasso (see [32]) and their variants. Variable selection methods for censored outcomes are thoroughly studied by [7, 14, 22, 27]. For moderate or large p, the optimization problems associated with the penalized approaches can be solved effectively and quickly. However, when p grows exponentially fast with n, the aforementioned penalized methods become computationally burdensome for such ultra-high-dimensional data. Feature screening methods are specifically designed to reduce the high dimensionality to a moderate scale. Fan and Lv [8] proposed the sure independence screening (SIS) and the iterated sure independence screening (ISIS) for linear regression by ranking the marginal correlations of each predictor with the response variable. These methods are clearly motivated and are extended to the generalized linear model by [9, 10]. However, it is known that correlation may not be a robust measure of association, so their performance might be influenced by outliers in the responses and predictors.

* Corresponding author

Recently, several important findings regarding model-free feature screening were reported in the literature (see [4, 17, 18, 20, 29–31] among many others). In particular, Zhu et al. [31] proposed a novel model-free feature screening method under a unified model framework. Because it does not require the specification of a particular model structure, the proposal of [31] is theoretically and practically appealing for feature screening, especially when there is a huge number of candidate variables. He et al. [12] introduced an intriguing quantile-adaptive model-free variable screening framework for high-dimensional heterogeneous data. This framework can be extended to handle survival data under a conditional independence assumption of the response and censoring variable given the covariates. Song et al. [20] studied a model-free rank independence screening based on an inverse probability weighted Kendall's τ rank correlation for high-dimensional survival data. Feature screening methods for high-dimensional survival data based on Cox's partial likelihood can be found in, for example, [11, 23, 28].

In this paper, partly motivated by the interesting work of Zhu et al. [31] and He et al. [12], we introduce a unified and robust model-free feature screening approach to handle high-dimensional survival data. There are several advantages of this method. First, it is a model-free screening approach based on an inverse-probability-weighted correlation measure, and hence avoids the complication of specifying a working model with a huge number of candidate variables. Second, the new method does not involve any nonparametric estimation except the estimation of the conditional survival function of the censoring variable given a predictor, where a local Kaplan-Meier estimator is used. Third, under very mild conditions without requiring the existence of any moment of the response variable, we prove that the proposed method enjoys the ranking consistency and sure screening properties in ultra-high dimension. In particular, we impose a conditional independence assumption of the response and the censoring variable given the covariates, instead of the full independence of the censoring variable from the response and the covariates that is commonly assumed in the literature. Moreover, we also propose a more robust variant of the new procedure, which is proved to possess desirable theoretical properties without any finite moment condition on the predictors and the response. Hence, the proposed methods are robust to outliers in the response and predictors.

The rest of the paper is organized as follows. In Section 2, we introduce the proposed censored model-free feature screening procedure and its more robust variant. Their theoretical properties are also discussed in Section 2. Extensive simulation studies are carried out to verify the finite sample performance of the new methods in Section 3. In Section 4, we demonstrate an application to a genetic data set. Section 5 contains discussions and a few concluding remarks. All the technical proofs are deferred to Appendixes A and B.

2 Methodology and main results

Let $T$ be the time to the event of interest, $C$ be the censoring time, and $x = (X_1, \ldots, X_p)^{\mathrm{T}}$ be the $p$-dimensional predictor vector. We assume that $T$ is subject to random right censoring. The observed data are $(x_i, Y_i, \Delta_i)$ for $i = 1, \ldots, n$, independently and identically distributed copies of $(x, Y, \Delta)$, where $Y = \min(T, C)$, $\Delta = I(T \le C)$ and $I(\cdot)$ is the indicator function. Throughout this article, it is assumed that the censoring time $C$ is conditionally independent of the failure time $T$ given $X_j$, $j = 1, 2, \ldots, p$. Let $F(t \mid x) = P(T < t \mid x)$ be the conditional distribution function of $T$ given $x$. We next define the set of active predictors and the set of inactive predictors, respectively, by

$$\mathcal{A} = \{k : F(t \mid x) \text{ functionally depends on } X_k,\ k = 1, \ldots, p, \text{ for some } t \in [0, \tau]\} \quad \text{and} \quad \mathcal{A}^c,$$

where $\mathcal{A} \cup \mathcal{A}^c = \{1, \ldots, p\}$ and $\tau$ is the end of the study. In this paper, we consider the general model framework given by Zhu et al. [31], i.e., $F(t \mid x)$ depends on $x$ only through a subset of $x$. In other words, it is assumed that

$$F(t \mid x) = F(t \mid \beta^{\mathrm{T}} x) = F(t \mid \beta_{\mathcal{A}}^{\mathrm{T}} x_{\mathcal{A}}).$$

Our goal is to select a subset of active variables from $\{X_k, k = 1, \ldots, p\}$.
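To make the data structure concrete, the following is a minimal simulation sketch of right-censored observations $(x_i, Y_i, \Delta_i)$. The function name, the exponential models for $T$ and $C$, and the choice of active predictors are all illustrative assumptions and do not come from the paper. The predictors are generated independently, so that $T$ and $C$ are conditionally independent given any single $X_j$ even though the censoring depends on a covariate.

```python
import numpy as np

def simulate_right_censored(n=200, p=10, seed=0):
    """Toy generator for (x_i, Y_i, Delta_i); all model choices are hypothetical."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # independent predictors
    # Toy failure-time model: T depends on x only through X_0 and X_1.
    T = rng.exponential(size=n) * np.exp(-(X[:, 0] + X[:, 1]))
    # Covariate-dependent censoring through the inactive predictor X_2.
    C = rng.exponential(size=n) * 3.0 * np.exp(0.5 * X[:, 2])
    Y = np.minimum(T, C)                             # observed time Y = min(T, C)
    delta = (T <= C).astype(int)                     # event indicator I(T <= C)
    return X, Y, delta

X, Y, delta = simulate_right_censored()
print("censoring rate:", 1.0 - delta.mean())
```

Only `X`, `Y` and `delta` are returned, mirroring the fact that $T$ and $C$ are never observed jointly in practice.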

2.1 A model-free sure screening procedure with censored data

Throughout this subsection, we assume that $E(X_k) = 0$ and $\mathrm{Var}(X_k) = 1$ for $k = 1, \ldots, p$ for simplicity. Define $r_k(y) = E\{X_k I(T \le y)\}$. Then, it follows by the law of iterated expectations that

$$r_k(y) = E\{X_k I(T \le y)\} = E\{X_k F(y \mid X_k)\}.$$

The model framework suggests that the independence of $T$ and $X_k$ is a necessary and sufficient condition for $r_k(y) = 0$ for any $y \in \mathbb{R}$. In the presence of censoring, a direct calculation yields that

$$r_k(y) = \mathrm{Cov}\left\{X_k,\ I(Y \le y)\,\frac{\Delta}{G(Y \mid X_k)}\right\},$$

where $G(t \mid X_k) \equiv P(C > t \mid X_k)$ is the conditional survival function of $C$ given $X_k$. In other words, we allow covariate-dependent censoring in this article. Let $\hat{G}(t \mid X_k)$ be the local Kaplan-Meier estimator of $G(t \mid X_k)$ (see [1, 12]). To be specific,

$$\hat{G}(s \mid x) = \prod_{j=1,\,Y_j \le s}^{n} \left\{1 - \frac{B_j(x)(1 - \Delta_j)}{\sum_{l=1}^{n} I(Y_l \ge Y_j)\, B_l(x)}\right\},$$

where $B_l(x) = K\{(x - x_l)/h\} / [\sum_{i=1}^{n} K\{(x - x_i)/h\}]$, $l = 1, \ldots, n$, are the Nadaraya-Watson weights, $K(\cdot)$ is a density function and $h$ is the bandwidth. Writing $X_{ik}$ for the $k$th predictor of the $i$th subject, a natural and consistent estimator of $r_k(y)$ is

$$\hat{r}_k(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{X_{ik}\,\Delta_i}{\hat{G}(Y_i \mid X_{ik})}\, I(Y_i \le y). \tag{2.1}$$
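The identity behind the inverse-probability-weighted form of $r_k(y)$ can be checked in a few lines. The sketch below is an illustrative verification under two added assumptions: $C$ is continuous, so that $P(C \ge t \mid X_k) = G(t \mid X_k)$, and $G(T \mid X_k) > 0$ almost surely. On the event $\Delta = 1$ we have $Y = T$, hence

```latex
\begin{aligned}
E\left\{ X_k \, \frac{I(Y \le y)\,\Delta}{G(Y \mid X_k)} \right\}
  &= E\left\{ X_k \, \frac{I(T \le y)\, I(C \ge T)}{G(T \mid X_k)} \right\} \\
  &= E\left[ X_k \, \frac{I(T \le y)}{G(T \mid X_k)} \, P(C \ge T \mid T, X_k) \right] \\
  &= E\{ X_k \, I(T \le y) \} = r_k(y).
\end{aligned}
```

The second equality conditions on $(T, X_k)$, and the third uses the conditional independence of $C$ and $T$ given $X_k$, which gives $P(C \ge T \mid T, X_k) = G(T \mid X_k)$. Since $E(X_k) = 0$, the expectation on the left coincides with the covariance form of $r_k(y)$, and replacing the expectation by a sample average and $G$ by its local Kaplan-Meier estimate leads to (2.1).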

We propose to rank all the candidate variables $X_k$, $k = 1, \ldots, p$, and select features according to the magnitude of $\hat{w}_k = (1/n) \sum_{j=1}^{n} |\hat{r}_k(Y_j)|$, an estimator of $w_k \equiv E\{|r_k(Y)|\}$. With a preset threshold $\gamma$, we will select the subset of variables

$$\hat{\mathcal{A}} = \{k : \hat{w}_k \ge \gamma\}.$$
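The screening procedure described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian kernel, the default bandwidth `h=0.5`, the sample standardization of the predictors, the function names and the toy data are all assumptions made here for the example.

```python
import numpy as np

def local_km_at_own_times(xk, Y, delta, h):
    """Local Kaplan-Meier estimate of G(Y_i | X_k = xk_i) for every subject i."""
    n = len(Y)
    risk = (Y[None, :] >= Y[:, None]).astype(float)   # risk[j, l] = I(Y_l >= Y_j)
    G = np.ones(n)
    for i in range(n):
        B = np.exp(-0.5 * ((xk[i] - xk) / h) ** 2)    # Gaussian kernel (assumed choice of K)
        B /= B.sum()                                   # Nadaraya-Watson weights B_l(xk_i)
        denom = risk @ B                               # sum_l I(Y_l >= Y_j) B_l, for each j
        hazard = np.divide(B * (1 - delta), denom,
                           out=np.zeros(n), where=denom > 0)
        G[i] = np.prod(1.0 - hazard[Y <= Y[i]])        # product over j with Y_j <= Y_i
    return G

def screening_utilities(X, Y, delta, h=0.5):
    """Return w_hat[k] = (1/n) sum_j |r_hat_k(Y_j)| for every predictor k."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # enforce mean 0, variance 1 in-sample
    below = (Y[:, None] <= Y[None, :]).astype(float)   # below[i, j] = I(Y_i <= Y_j)
    uncensored = delta == 1
    w_hat = np.empty(p)
    for k in range(p):
        G = local_km_at_own_times(Xs[:, k], Y, delta, h)
        ipw = np.zeros(n)
        ipw[uncensored] = 1.0 / G[uncensored]          # Delta_i / G_hat(Y_i | X_ik)
        r_hat = (Xs[:, k] * ipw) @ below / n           # r_hat_k(Y_j) for j = 1, ..., n
        w_hat[k] = np.abs(r_hat).mean()
    return w_hat

def select(w_hat, gamma):
    """Indices k whose utility w_hat[k] reaches the threshold gamma."""
    return np.flatnonzero(w_hat >= gamma)

# Toy data (hypothetical model): only the first predictor affects T.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.standard_normal((n, p))
T = np.exp(-2.0 * X[:, 0]) * rng.exponential(size=n)
C = rng.exponential(scale=20.0, size=n)                # light random censoring
Y, delta = np.minimum(T, C), (T <= C).astype(int)
w_hat = screening_utilities(X, Y, delta)
print(np.argsort(-w_hat))                              # predictors ranked by utility
```

The cost is dominated by the local Kaplan-Meier step, which is recomputed for every predictor because the censoring distribution is conditioned on each $X_k$ separately. No numerical optimization is involved.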

Denote $G(t \mid x) = G(t \mid X_k = x)$ and $P(t \le T \le C \mid x) = P(t \le T \le C \mid X_k = x)$, $k = 1, 2, \ldots, p$. The following conditions are needed to establish the sure screening and ranking consistency properties:

(C1) $\inf_x P(t \le T \le C \mid x) \ge \mu_0 > 0$ for some positive constant $\mu_0$ and any $t \in [0, \tau]$. $G(t \mid x)$ has a first derivative with respect to $t$, which is uniformly bounded away from $\infty$. Furthermore, $G(t \mid x)$ has bounded (uniformly in $t$) second-order partial derivatives with respect to $x$. Besides, there exist some positive constants $\mu_1$ and $\mu_2$ satisfying $\mu_1 \le \sup\{t : G(t \mid x) > 0\} \le \mu_2$ uniformly in $x$, and $\sup\{t : P(T > t \mid x) > 0\} \ge \sup\{t : G(t \mid x) > 0\}$ almost surely for $x$.

(C2) The predictor vector $x = (X_1, \ldots, X_p)^{\mathrm{T}}$ satisfies $\max_{1 \le k \le p} E\{\exp(t|X_k|)\} \le C_0 < \infty$ for any $0 < t \le t_0$, where $t_0$ and $C_0$ are some positive constants.

(C3) The linearity condition: $E(x \mid \beta^{\mathrm{T}} x) = \mathrm{Cov}(x, \beta^{\mathrm{T}} x)\{\mathrm{Cov}(\beta^{\mathrm{T}} x)\}^{-1} \beta^{\mathrm{T}} x$.

(C4) $\{\mathrm{Cov}(\beta^{\mathrm{T}} x)\}^{-1} E|E\{\beta^{\mathrm{T}} x\, I(T \le \tilde{Y}) \mid \tilde{Y}\}|$ is bounded away from zero, where $\tilde{Y}$ is an independent copy of $Y$.

(C5) $\min_{k \in \mathcal{A}} |\mathrm{Cov}(X_k, \beta^{\mathrm{T}} x)| \ge 2\hat{c}_0 n^{-\eta}$ for some constants $\hat{c}_0 > 0$ and $\eta \in [0, 1/2)$.

(C6) $\liminf_{p \to \infty} \{\min_{k \in \mathcal{A}} |\mathrm{Cov}(X_k, \beta^{\mathrm{T}} x)| - \max_{k \in \mathcal{A}^c} |\mathrm{Cov}(X_k, \beta^{\mathrm{T}} x)|\} \ge \hat{m}_0$ for a $\hat{m}_0 > 0$.

Condition (C1) is similar to [12, Condition (C6′)], which is common in the survival analysis literature to ensure that the local Kaplan-Meier estimator is well behaved. Condition (C2) is the sub-exponential moment condition on the predictors, which holds for various distributions, for example, the normal distribution and distributions with bounded support. Conditions (C3)–(C5) are crucial to ensure the sure screening property. In addition to Conditions (C1)–(C5), Condition (C6) is imposed to ensure the ranking consistency property. Note that we do not impose any finite moment condition on the response variable. We state the sure screening property as well as a bound on the size of selected variables by the proposed method in Theorem 2.1.

Theorem 2.1. Suppose Conditions (C1)–(C5) hold. Let M

be a sequence depending on n. If

log(n)/(n

1−2η

h) → 0, n

→ 0, nh

→ ∞, n

1−2η

→ ∞ and log(n)/M

→ 0 as n → ∞, then there

exist constants c

> 0, c

> 0 and θ

> 0 such that

(1) (Sure screening property)



max

16k6p

|ˆw

− w

| > c

−η



6 2p



exp(−c

1−2η

) + exp



−

1−2η



+ nC

exp(−M

)



for n suﬃciently large. In addition, letting γ = c

−η

, we have

P (A ⊆

A) > 1 − 2s



exp(−c

1−2η

) + exp



−

1−2η



+ nC

exp(−M

)



for n suﬃciently large, where s

is the size of A.

(2) (Controlling false discovery rate) Moreover,





A| 6 2c

−1



k=1



> 1 − 2p



exp(−c

1−2η

) + exp



−

1−2η



+ nC

exp(−M

)



for n suﬃciently large.

The above theorem tells us that if $M_n = O(n^{(1-2\eta)/2-\alpha})$ for any $0 < \alpha < (1 - 2\eta)/2$ and the dimensionality $p$ grows at an exponential rate of the sample size $n$, the sure screening property still holds. In other words, our method is able to handle ultra-high-dimensional data. In fact, the assumptions of Theorem 2.1 imply $\eta < 2/5$. Furthermore, under Condition (C2), the second part of Theorem 2.1 suggests that when the true model size $s_n$ is of the order $n^{\beta}$, the size of the selected active set is of polynomial order with high probability. In particular, with $\beta + \eta < 1$, the hard thresholding rule with threshold $[n/\log(n)]$ adopted in Sections 3–4 is able to select the true active predictors with high probability.
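As an illustration of the hard thresholding rule, the sketch below keeps the $d = [n/\log(n)]$ predictors with the largest utilities. It assumes that $[\cdot]$ denotes the integer part and that $\log$ is the natural logarithm; the function name is hypothetical.

```python
import math
import numpy as np

def hard_threshold_select(w_hat, n):
    """Keep the d = [n / log(n)] predictors with the largest utilities."""
    d = int(n / math.log(n))                  # integer part, natural log (assumed)
    order = np.argsort(-np.asarray(w_hat), kind="stable")
    return np.sort(order[:d])                 # selected indices in increasing order

# With n = 5 observations, d = [5 / log(5)] = 3 predictors are retained.
print(hard_threshold_select([0.1, 0.9, 0.5, 0.7, 0.2], n=5))
```

Unlike the threshold $\gamma$, this rule fixes the size of the selected set in advance, so it needs no knowledge of the scale of the utilities.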

The next theorem establishes the ranking consistency property of the proposed method.

Theorem 2.2 (Ranking consistency property). Assume Conditions (C1)–(C6) hold. If log(n)/(nh)

→

→ ∞

1−2η

→ ∞

and

log(

)

→

→ ∞

, then there exist constants

> 0, c

> 0 and θ

> 0 such that



min

k∈A

ˆw

− max

k∈A

ˆw



6 2p



exp(−c

n) + exp



−



+ nC

exp(−M

)



for n suﬃciently large.
