Stochastic Training of Graph Convolutional Networks with Variance Reduction
Jianfei Chen¹, Jun Zhu¹, Le Song²,³
Abstract
Graph convolutional networks (GCNs) are powerful deep neural networks for graph-structured data. However, GCN computes the representation of a node recursively from its neighbors, making the receptive field size grow exponentially with the number of layers. Previous attempts to reduce the receptive field size by subsampling neighbors do not have a convergence guarantee, and their receptive field size per node is still in the order of hundreds. In this paper, we develop control variate based algorithms which allow sampling an arbitrarily small neighbor size. Furthermore, we prove new theoretical guarantees for our algorithms to converge to a local optimum of GCN. Empirical results show that our algorithms enjoy similar convergence to the exact algorithm using only two neighbors per node. The runtime of our algorithms on a large Reddit dataset is only one seventh of that of previous neighbor sampling algorithms.
1. Introduction
Graph convolution networks (GCNs) (Kipf & Welling, 2017) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data. The “graph convolution” operation applies the same linear transformation to all the neighbors of a node, followed by mean pooling and a nonlinearity. By stacking multiple graph convolution layers, GCNs can learn node representations by utilizing information from distant neighbors. GCNs and their variants (Hamilton et al., 2017a; Veličković et al., 2017) have been applied to semi-supervised node classification (Kipf & Welling, 2017), inductive node embedding (Hamilton et al., 2017a), link prediction (Kipf & Welling, 2016; Berg et al., 2017) and knowledge graphs (Schlichtkrull et al., 2017), outperforming multi-layer perceptron (MLP) models that do not use the graph structure and graph embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) that do not use node features.

¹ Dept. of Comp. Sci. & Tech., TNList Lab, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, 100084, China. ² Georgia Institute of Technology. ³ Ant Financial. Correspondence to: Jun Zhu <dcszj@mail.tsinghua.edu.cn>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
However, the graph convolution operation makes GCNs difficult to train efficiently. The representation of a node at layer L is computed recursively from the representations of all its neighbors at layer L − 1. Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, as illustrated in Fig. 1(a). Due to the large receptive field size, Kipf & Welling (2017) propose to train GCN with a batch algorithm, which computes the representations of all the nodes altogether. However, batch algorithms cannot handle large-scale datasets because of their slow convergence and the requirement to fit the entire dataset in GPU memory.
Hamilton et al. (2017a) make an initial attempt to develop stochastic training algorithms for GCNs via a scheme of neighbor sampling (NS). Instead of considering all the neighbors, they randomly subsample D^(l) neighbors at the l-th layer. Therefore, they reduce the receptive field size to ∏_l D^(l), as shown in Fig. 1(b). They find that for two-layer GCNs, keeping D^(1) = 10 and D^(2) = 25 neighbors can achieve comparable performance with the original model. However, there is no theoretical guarantee on the convergence of the stochastic training algorithm with NS. Moreover, the time complexity of NS is still D^(1) D^(2) = 250 times larger than training an MLP, which is unsatisfactory.
In this paper, we develop novel control variate-based stochastic approximation algorithms for GCN. We utilize the historical activations of nodes as a control variate. We show that while the variance of the NS estimator depends on the magnitude of the activation, the variance of our algorithms only depends on the difference between the activation and its historical value. Furthermore, our algorithms bring new theoretical guarantees. At testing time, our algorithms give exact and zero-variance predictions, and at training time, our algorithms converge to a local optimum of GCN regardless of the neighbor sampling size D^(l). The theoretical results allow us to significantly reduce the time complexity by sampling only two neighbors per node, yet still retain the quality of the model.
We empirically test our algorithms on six graph datasets, and show that our techniques significantly reduce the bias and variance of the gradient from NS with the same receptive field size. Despite sampling only D^(l) = 2 neighbors, our algorithms achieve the same predictive performance as the exact algorithm in a comparable number of epochs on all the datasets, i.e., we reduce the time complexity while having almost no loss on the speed of convergence, which is the best we can expect. On the largest Reddit dataset, the training time of our algorithm is 7 times shorter than that of the best-performing competitor among the exact algorithm (Kipf & Welling, 2017), neighbor sampling (Hamilton et al., 2017a) and importance sampling (Chen et al., 2018) algorithms.

Figure 1. Two-layer graph convolutional networks, and the receptive field of a single vertex. Panels: (a) Exact; (b) Neighbour sampling; (c) Control variate, distinguishing latest and historical activations; (d) the CVD network (Input → GraphConv → Dropout → GraphConv → Dropout, with hidden activations H^(1) and H^(2)).

Table 1. Number of vertices, edges, and average number of 1-hop and 2-hop neighbors per node for each dataset. Undirected edges are counted twice and self-loops are counted once.

Dataset    V        E           Degree  Degree 2
Citeseer   3,327    12,431      4       15
Cora       2,708    13,264      5       37
PubMed     19,717   108,365     6       60
NELL       65,755   318,135     5       1,597
PPI        14,755   458,973     31      970
Reddit     232,965  23,446,803  101     10,858
2. Background
We now briefly review graph convolutional networks
(GCNs), stochastic training, and the neighbor sampling (NS)
and importance sampling (IS) algorithms.
2.1. Graph Convolutional Networks
We present our algorithm with a GCN for semi-supervised node classification (Kipf & Welling, 2017). However, the algorithm is not limited to this task or model. Our algorithm is applicable to other models (Hamilton et al., 2017a) and tasks (Kipf & Welling, 2016; Berg et al., 2017; Schlichtkrull et al., 2017; Hamilton et al., 2017b) that involve computing the average activation of neighbors.
In the node classification task, we have an undirected graph G = (V, E) with V = |V| vertices and E = |E| edges, where each vertex v consists of a feature vector x_v and a label y_v. We observe the labels of some vertices V_L. The goal is to predict the labels of the remaining vertices V_U := V \ V_L. The edges are represented as a symmetric V × V adjacency matrix A, where A_uv is the weight of the edge between u and v, and the propagation matrix P is a normalized version of A: Ã = A + I, D̃_uu = Σ_v Ã_uv, and P = D̃^{-1/2} Ã D̃^{-1/2}.
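As a concrete reference, here is a minimal SciPy sketch of how the propagation matrix P can be built from an adjacency matrix; the function name and the sparse-matrix representation are our choices, not the paper's.

```python
import numpy as np
import scipy.sparse as sp

def build_propagation_matrix(A: sp.spmatrix) -> sp.csr_matrix:
    """P = D~^{-1/2} A~ D~^{-1/2}, where A~ = A + I and D~ is the diagonal degree matrix of A~."""
    A_tilde = (A + sp.eye(A.shape[0])).tocsr()            # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()         # D~_uu = sum_v A~_uv (never zero, thanks to I)
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (d_inv_sqrt @ A_tilde @ d_inv_sqrt).tocsr()
```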
A graph convolution layer is defined as

Z^(l+1) = P H^(l) W^(l),    H^(l+1) = σ(Z^(l+1)),    (1)
where H^(l) is the activation matrix of the l-th layer, each row of which is the activation of a graph node, H^(0) = X is the input feature matrix, W^(l) is a trainable weight matrix, and σ(·) is an activation function. Denote by |·| the cardinality of a set. The training loss is defined as

L = (1/|V_L|) Σ_{v∈V_L} f(y_v, z^(L)_v),    (2)

where f(·, ·) is a loss function. A graph convolution layer propagates information to nodes from their neighbors by computing the neighbor averaging P H^(l). Let n(u) be the set of neighbors of node u; we overload n(u) to also denote its cardinality |n(u)|. The neighbor averaging of node u,

(P H^(l))_u = Σ_{v=1}^{V} P_uv h^(l)_v = Σ_{v∈n(u)} P_uv h^(l)_v,

is a weighted sum of the neighbors' activations. Then, a fully-connected layer is applied on all the nodes, with a weight matrix W^(l) shared across all the nodes.
We denote the receptive field of node u at layer l as all the activations h^(l)_v on layer l needed for computing z^(L)_u. If the layer is not explicitly mentioned, it means layer 0. The receptive field of node u is then all its L-hop neighbors, i.e., the nodes that are reachable from u within L hops, as illustrated in Fig. 1(a). When P = I, GCN reduces to a multi-layer perceptron (MLP) model which does not use the graph structure. For MLP, the receptive field of a node u is just the node itself.
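To make Eqs. (1)–(2) concrete, below is a minimal NumPy sketch of the exact forward pass of a two-layer GCN; the ReLU activation and the softmax cross-entropy loss are our illustrative choices, since the paper keeps σ(·) and f(·, ·) generic.

```python
import numpy as np
import scipy.sparse as sp

def gcn_forward(P: sp.csr_matrix, X: np.ndarray, W0: np.ndarray, W1: np.ndarray) -> np.ndarray:
    """Eq. (1): Z^(l+1) = P H^(l) W^(l), H^(l+1) = sigma(Z^(l+1)); returns the logits Z^(2)."""
    H1 = np.maximum(P @ X @ W0, 0.0)      # H^(1) = sigma(Z^(1)) with sigma = ReLU
    return P @ H1 @ W1                    # Z^(2), i.e. z_v^(L) for every node v

def training_loss(Z: np.ndarray, y: np.ndarray, labeled: np.ndarray) -> float:
    """Eq. (2): average of f(y_v, z_v^(L)) over V_L, with f = softmax cross-entropy."""
    logits = Z[labeled] - Z[labeled].max(axis=1, keepdims=True)   # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labeled)), y[labeled]].mean())
```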
2.2. Stochastic Training
It is generally expensive to compute the batch gradient ∇L = (1/|V_L|) Σ_{v∈V_L} ∇f(y_v, z^(L)_v), which involves iterating over the entire labeled set of nodes. A possible solution is to approximate the batch gradient by the stochastic gradient

(1/|V_B|) Σ_{v∈V_B} ∇f(y_v, z^(L)_v),    (3)

where V_B ⊂ V_L is a minibatch of labeled nodes. However, this gradient is still expensive to compute, due to the large receptive field size. For instance, as shown in Table 1, the average number of 2-hop neighbors on the NELL dataset is 1,597, which means that computing the gradient for a single node in a 2-layer GCN involves touching 1,597/65,755 ≈ 2.4% of the nodes of the entire graph.
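The 2-hop counts in Table 1 (and the 2.4% figure above) can be reproduced with a small sketch like the following, assuming the graph is available as an adjacency list; the names are ours.

```python
def two_hop_receptive_field(adj: dict, u) -> set:
    """All nodes whose input features are needed to compute z_u^(L) in a 2-layer GCN."""
    field = {u} | set(adj[u])             # u and its 1-hop neighbors
    for v in adj[u]:
        field |= set(adj[v])              # plus the neighbors' neighbors
    return field

# e.g. the fraction of the graph touched by one node's gradient:
# len(two_hop_receptive_field(adj, u)) / num_nodes   # averages to roughly 2.4% on NELL
```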
In subsequent sections, two other sources of stochasticity will be introduced besides the random selection of the minibatch: the random sampling of neighbors (Sec. 2.3) and the random dropout of features (Sec. 5).
2.3. Neighbor Sampling
To reduce the receptive field size, Hamilton et al. (2017a) propose a neighbor sampling (NS) algorithm. NS randomly chooses D^(l) neighbors for each node at layer l and develops an estimator NS^(l)_u of (P H^(l))_u based on Monte-Carlo approximation:

(P H^(l))_u ≈ NS^(l)_u := (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv h^(l)_v,

where n̂^(l)(u) ⊂ n(u) is a subset of D^(l) random neighbors. Therefore, NS reduces the receptive field size from all the L-hop neighbors to the number of sampled neighbors, ∏_{l=1}^{L} D^(l). We refer to NS^(l)_u as the NS estimator of (P H^(l))_u, and to (P H^(l))_u itself as the exact estimator.
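A minimal NumPy sketch of the NS estimator for a single node u; P_row holds the u-th row of P, H the activation matrix H^(l), and all names are ours.

```python
import numpy as np

def ns_estimate(neighbors: np.ndarray, P_row: np.ndarray, H: np.ndarray,
                D: int, rng: np.random.Generator) -> np.ndarray:
    """Monte-Carlo estimate of (P H^(l))_u from D neighbors sampled without replacement."""
    sampled = rng.choice(neighbors, size=D, replace=False)
    return (len(neighbors) / D) * (P_row[sampled] @ H[sampled])   # rescale by n(u)/D^(l)
```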
Neighbor sampling can also be written in a matrix form as

Z^(l+1) = P̂^(l) H^(l) W^(l),    H^(l+1) = σ(Z^(l+1)),    (4)

where the propagation matrix P is replaced by a sparser unbiased estimator P̂^(l), i.e., E P̂^(l) = P, with P̂^(l)_uv = (n(u)/D^(l)) P_uv if v ∈ n̂^(l)(u), and P̂^(l)_uv = 0 otherwise. Hamilton et al. (2017a) propose to perform the approximate forward propagation as in Eq. (4), and do stochastic gradient descent (SGD) with the auto-differentiation gradient. The approximated gradient has two sources of randomness: the random selection of the minibatch V_B ⊂ V_L, and the random selection of neighbors.
Though P̂^(l) is an unbiased estimator of P, σ(P̂^(l) H^(l) W^(l)) is not an unbiased estimator of σ(P H^(l) W^(l)), due to the non-linearity of σ(·). Consequently, both the prediction Z^(L) and the gradient ∇f(y_v, z^(L)_v) obtained by NS are biased, and the convergence of SGD is not guaranteed unless the sample size D^(l) goes to infinity. Because of the biased gradient, the sample size D^(l) needs to be large for NS to keep predictive performance comparable with the exact algorithm. Hamilton et al. (2017a) choose D^(1) = 10 and D^(2) = 25, and the receptive field size D^(1) × D^(2) = 250 is much larger than that of MLP, which is 1, so the training is still expensive.
2.4. Importance Sampling
FastGCN (Chen et al., 2018) is another sampling-based algorithm similar to NS. Instead of sampling neighbors for each node, FastGCN directly subsamples the receptive field for each layer altogether. Formally, it approximates (P H^(l))_u with S samples v_1, ..., v_S ∈ V as

(P H^(l))_u = V Σ_{v=1}^{V} (1/V) P_uv h^(l)_v ≈ (V/S) Σ_{v_s ∼ q(v)} P_{u v_s} h^(l)_{v_s} / q(v_s),

where the importance distribution q(v) ∝ Σ_{u=1}^{V} P²_uv = (1/n(v)) Σ_{(u,v)∈E} 1/n(u), according to the definition of P in Sec. 2.1. We refer to this estimator as importance sampling (IS). Chen et al. (2018) show that IS performs better than using a uniform sample distribution q(v) ∝ 1. NS can be viewed as an IS estimator with the importance distribution q(v) ∝ Σ_{(u,v)∈E} 1/n(u), because each node u has probability 1/n(u) of choosing the neighbor v. Though IS may have a smaller variance than NS, it still only guarantees convergence as the sample size S goes to infinity. Empirically, we find IS to work even worse than NS, because sometimes it can select many neighbors for one node and no neighbor for another, in which case the activation of the latter node is just a meaningless zero.
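For reference, a small sketch of FastGCN's importance distribution q(v) ∝ Σ_u P²_uv, computed directly from the propagation matrix; the helper name is ours.

```python
import numpy as np
import scipy.sparse as sp

def importance_distribution(P: sp.csr_matrix) -> np.ndarray:
    """q(v) proportional to the squared column norms of P."""
    q = np.asarray(P.multiply(P).sum(axis=0)).ravel()   # sum_u P_uv^2 for every column v
    return q / q.sum()

# S layer-wide samples v_1, ..., v_S drawn from q:
# samples = np.random.default_rng(0).choice(P.shape[1], size=S, p=importance_distribution(P))
```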
3. Control Variate Based Algorithm
We present a novel control variate based algorithm that utilizes historical activations to reduce the estimator variance.
3.1. Control Variate Based Estimator
While computing the neighbor average Σ_{v∈n(u)} P_uv h^(l)_v, we cannot afford to evaluate all the h^(l)_v terms, because they need to be computed recursively, i.e., we again need the activations h^(l−1)_w of all of v's neighbors w.
Our idea is to maintain the history h̄^(l)_v for each h^(l)_v as an affordable approximation. Each time h^(l)_v is computed, we update h̄^(l)_v with h^(l)_v. We expect h̄^(l)_v and h^(l)_v to be similar if the model weights do not change too fast during training. Formally, let Δh^(l)_v = h^(l)_v − h̄^(l)_v; we approximate

(P H^(l))_u = Σ_{v∈n(u)} P_uv Δh^(l)_v + Σ_{v∈n(u)} P_uv h̄^(l)_v
            ≈ CV^(l)_u := (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv Δh^(l)_v + Σ_{v∈n(u)} P_uv h̄^(l)_v,    (5)
where we represent h^(l)_v as the sum of Δh^(l)_v and h̄^(l)_v, and we only apply the Monte-Carlo approximation to the Δh^(l)_v term. Averaging over all the h̄^(l)_v's is still affordable because they do not need to be computed recursively. Since we expect h^(l)_v and h̄^(l)_v to be close, Δh^(l)_v will be small and CV^(l)_u should have a smaller variance than NS^(l)_u. In particular, if the model weights are kept fixed, h̄^(l)_v eventually equals h^(l)_v, so that CV^(l)_u = 0 + Σ_{v∈n(u)} P_uv h̄^(l)_v = Σ_{v∈n(u)} P_uv h^(l)_v = (P H^(l))_u, i.e., the estimator has zero variance. This estimator is referred to as CV. We compare the variance of the NS and CV estimators in Sec. 3.2, and show in Sec. 4 that the variance of CV eventually becomes zero during training. The term

CV^(l)_u − NS^(l)_u = Σ_{v∈n(u)} P_uv h̄^(l)_v − (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv h̄^(l)_v

is a control variate (Ripley, 2009, Chapter 5) added to the neighbor sampling estimator NS^(l)_u to reduce its variance.
In matrix form, let H̄^(l) be the matrix formed by stacking the h̄^(l)_v; then CV can be written as

Z^(l+1) = ( P̂^(l) (H^(l) − H̄^(l)) + P H̄^(l) ) W^(l).    (6)
3.2. Variance Analysis
We analyze the variance of the estimators assuming all the features are 1-dimensional. The analysis can be extended to multiple dimensions by treating each dimension separately. We further assume that n̂^(l)(u) is created by sampling D^(l) neighbors without replacement from n(u). The following proposition is proven in Appendix A:
Proposition 1. If n̂^(l)(u) contains D^(l) samples from n(u) drawn without replacement, then

Var_{n̂^(l)(u)} [ (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} x_v ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (x_{v1} − x_{v2})²,

where C^(l)_u = 1 − (D^(l) − 1)/(n(u) − 1).
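Proposition 1 can be checked numerically; the following short Monte-Carlo sketch (with arbitrary toy values chosen by us) compares the empirical variance of the subsampled sum against the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                       # 1-dimensional activations of n(u) = 8 neighbors
n, D = len(x), 2

# Empirical variance of (n/D) * sum of D values sampled without replacement.
est = np.array([(n / D) * rng.choice(x, size=D, replace=False).sum() for _ in range(100000)])

# Closed form: C_u/(2D) * sum_{v1} sum_{v2} (x_{v1} - x_{v2})^2 with C_u = 1 - (D-1)/(n-1).
C = 1.0 - (D - 1) / (n - 1)
closed_form = C / (2 * D) * sum((a - b) ** 2 for a in x for b in x)
print(est.var(), closed_form)                # the two values agree up to Monte-Carlo error
```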
By Proposition 1, we have

Var_{n̂^(l)(u)} [ NS^(l)_u ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (P_{u v1} h^(l)_{v1} − P_{u v2} h^(l)_{v2})²,

which is the total distance between the weighted activations of all pairs of neighbors, and is zero iff P_uv h^(l)_v is identical for all neighbors, in which case any single neighbor contains all the information of the entire neighborhood.
The variance of the CV estimator is

Var_{n̂^(l)(u)} [ CV^(l)_u ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (P_{u v1} Δh^(l)_{v1} − P_{u v2} Δh^(l)_{v2})²,

which replaces h^(l)_v by Δh^(l)_v. Since Δh^(l)_v is usually much smaller than h^(l)_v, the CV estimator enjoys a much smaller variance than the NS estimator. Furthermore, as we will show in Sec. 4.2, Δh^(l)_v converges to zero during training, so we achieve not only variance reduction but variance elimination, as the variance vanishes eventually.
3.3. Implementation Details and Time Complexity
Training with the CV estimator is similar to training with the NS estimator (Hamilton et al., 2017a). In particular, each iteration of the algorithm involves the following steps (a sketch of one iteration is given after the list):

1. Randomly select a minibatch V_B ⊂ V_L of nodes;
2. Build a computation graph that only contains the activations h^(l)_v and h̄^(l)_v needed for the current minibatch;
3. Get the predictions by forward propagation as in Eq. (6);
4. Get the gradients by backward propagation, and update the parameters by SGD;
5. Update the historical activations.
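Below is a minimal sketch of one such iteration for a two-layer GCN. It assumes the receptive field r^(1) and the sampled layer-1 matrix P̂^(1) for this minibatch have already been built (see the construction sketch further below), keeps P and the histories as dense arrays for brevity, and omits Step 4's gradient computation, which an autodiff framework would handle; all variable names are ours.

```python
import numpy as np

def cv_iteration_two_layer(P, X, H1_bar, W0, W1, batch, r1, P_hat1):
    """Steps 2, 3 and 5 for one minibatch (Step 4, backprop/SGD, is omitted).

    P       : (V, V) propagation matrix; X: (V, K) input features.
    H1_bar  : (V, A) stored histories h_bar_v^(1), updated in place.
    batch   : minibatch node ids (= r^(2)); r1: node ids in the receptive field r^(1).
    P_hat1  : (|batch|, V) sampled propagation matrix P_hat^(1) (nonzeros only on columns in r1).
    """
    # Layer 0: H^(0) = X never changes, so its history equals X and the CV estimator is exact:
    # Z^(1) = P H_bar^(0) W^(0) = P X W^(0); we only need the rows of P for nodes in r^(1).
    H1_new = np.maximum(P[r1] @ X @ W0, 0.0)                 # h_v^(1) for v in r^(1), sigma = ReLU

    # Layer 1, Eq. (6): Z^(2) = (P_hat^(1)(H^(1) - H_bar^(1)) + P H_bar^(1)) W^(1) on minibatch rows.
    H1 = H1_bar.copy()
    H1[r1] = H1_new                                          # latest activations where computed
    Z2 = (P_hat1 @ (H1 - H1_bar) + P[batch] @ H1_bar) @ W1

    # Step 5: refresh the histories of the nodes whose activations were just computed.
    H1_bar[r1] = H1_new
    return Z2                                                # predictions z_v^(2) for v in batch
```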
Steps 3 and 4 are handled automatically by frameworks such as TensorFlow (Abadi et al., 2016). The computational graph at Step 2 is defined by the receptive field r^(l) and the propagation matrices P̂^(l) at each layer. The receptive field r^(l) specifies the nodes whose activations h^(l)_v should be computed for the current minibatch, according to Eq. (6). We can construct r^(l) and P̂^(l) from top to bottom, by randomly adding D^(l) neighbors for each node in r^(l+1), starting with r^(L) = V_B. We assume h^(l)_v is always needed to compute h^(l+1)_v, i.e., v is always selected as a neighbor of itself. The receptive fields are illustrated in Fig. 1(c), where red nodes are in the receptive fields, whose activations h^(l)_v are needed, and the histories h̄^(l)_v of the blue nodes are also needed. Finally, in Step 5, we update h̄^(l)_v with h^(l)_v for each v ∈ r^(l).
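A small sketch of this top-down construction, assuming the graph is stored as an adjacency list with self-loops already added (so that a node is its own neighbor); filling the entries of P̂^(l) from the returned samples follows the rescaled definition in Sec. 2.3. The names are ours.

```python
import numpy as np

def sample_receptive_fields(adj, batch, D, L, rng):
    """Build r^(L) = V_B down to r^(0), sampling D neighbors per node plus the node itself."""
    r = {L: sorted(set(batch))}
    sampled = {}                                   # sampled[l][u] = n_hat^(l)(u), used to fill P_hat^(l)
    for l in range(L - 1, -1, -1):                 # layers L-1, ..., 0
        sampled[l] = {}
        lower = set()
        for u in r[l + 1]:
            nbrs = adj[u]
            pick = set(rng.choice(nbrs, size=min(D, len(nbrs)), replace=False).tolist())
            pick.add(u)                            # v is always selected as a neighbor of itself
            sampled[l][u] = pick
            lower |= pick
        r[l] = sorted(lower)
    return r, sampled
```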
The full pseudocode for training is given in Appendix D.
GCN has two main types of computation, namely, sparse-dense matrix multiplication (SPMM), such as P H^(l), and dense-dense matrix multiplication (GEMM), such as U W^(l). We assume that the node features are K-dimensional and the first hidden layer is A-dimensional.
For batch GCN, the time complexity is O(EK) for SPMM and O(V K A) for GEMM. For our stochastic training algorithm with control variates, the dominant SPMM computation is the average of neighbor histories P H̄^(0) for the nodes in r^(1), whose size is O(|V_B| ∏_{l=2}^{L} D^(l)). For example, in a 2-layer GCN where we sample D^(l) = 2 neighbors for each node, ∏_{l=2}^{L} D^(l) = 2. Therefore, the time complexity of SPMM is O(V D K ∏_{l=2}^{L} D^(l)) per epoch, where D is the average degree of the nodes in r^(1).¹ The dominant GEMM computation is the first fully-connected layer on all the nodes in r^(1), whose time complexity is O(V K A ∏_{l=2}^{L} D^(l)) per epoch.

¹ V D ≠ E because the probability of each node appearing in r^(1) is different: nodes of higher degree have a larger probability of appearing. We can also subsample the neighbors' histories if D is large.
4. Theoretical Results
Besides having smaller variance, CV also has stronger theoretical guarantees than NS. In this section, we present two theorems. One states that if the model parameters are fixed, e.g., during testing, CV produces exact predictions after L epochs; the other establishes convergence towards a local optimum regardless of the neighbor sampling size.
In this section, we assume that the algorithm is run by epochs. In each epoch, we randomly partition the vertex set V into I minibatches V_1, ..., V_I, and in the i-th iteration, we run a forward pass to compute the predictions for the nodes in V_i, an optional backward pass to compute the gradients, and an update of the history. Note that in each epoch we scan all the nodes instead of just the training nodes, to ensure that the history of each node is updated at least once per epoch.
We denote the model parameters in the i-th iteration as W_i. At training time, W_i is updated by SGD over time; at testing time, W_i is kept fixed. To distinguish, the activations produced by CV at iteration i are denoted as Z^(l)_{CV,i} and H^(l)_{CV,i}, and the activations produced by the exact algorithm (Eq. 1) are denoted as Z^(l)_i and H^(l)_i. At iteration i, the network computes the predictions and gradients for the minibatch V_i, where

g_{CV,i}(W_i) := (1/|V_i|) Σ_{v∈V_i} ∇f(y_v, z^(L)_{CV,i,v})   and   g_i(W_i) := (1/|V_i|) Σ_{v∈V_i} ∇f(y_v, z^(L)_{i,v})

are the stochastic gradients computed by CV and by the exact algorithm, and

∇L(W_i) = (1/|V_L|) Σ_{v∈V_L} ∇f(y_v, z^(L)_v)

is the deterministic batch gradient computed by the exact algorithm. The subscript i may be omitted for the exact algorithm if W_i is a constant sequence. We let [L] = {0, ..., L} and [L]⁺ = {1, ..., L}. The gradient g_{CV,i}(W_i) has two sources of randomness: the random selection of the minibatch V_i and the randomness of the neighbors P̂, so we may take the expectation of g_{CV,i}(W_i) w.r.t. either V_i or P̂, or both.
4.1. Exact Testing
The following theorem reveals the connection between the exact predictions and the approximate predictions made by CV.

Theorem 1. For a constant sequence W_i = W and any i > LI (i.e., after L epochs), the activations computed by CV are exact, i.e., Z^(l)_{CV,i} = Z^(l) for each l ∈ [L] and H^(l)_{CV,i} = H^(l) for each l ∈ [L − 1].
Theorem 1 shows that at testing time, we can run forward propagation with CV for L epochs and obtain exact predictions. This outperforms NS, which cannot recover the exact predictions unless the neighbor sample size goes to infinity. Compared with directly making exact predictions by an exact batch algorithm, CV is more scalable because it does not need to load the entire graph into memory. The proof can be found in Appendix B.
4.2. Convergence Guarantee
The following theorem shows that SGD training with the approximate gradients g_{CV,i}(W_i) still converges to a local optimum, regardless of the neighbor sampling size D^(l). Therefore, we can choose an arbitrarily small D^(l) without worrying about convergence.
Theorem 2. Assume that (1) the activation σ(·) is ρ-Lipschitz; (2) the gradient of the cost function ∇_z f(y, z) is ρ-Lipschitz and bounded; (3) ‖g_{CV,V}(W)‖_∞, ‖g(W)‖_∞, and ‖∇L(W)‖_∞ are all bounded by G > 0 for all P̂, V and W; and (4) the loss L(W) is ρ-smooth, i.e.,

|L(W_2) − L(W_1) − ⟨∇L(W_1), W_2 − W_1⟩| ≤ (ρ/2) ‖W_2 − W_1‖²_F,   ∀ W_1, W_2,

where ⟨A, B⟩ = tr(A⊤B) is the inner product of matrices A and B. Then, there exists K > 0, such that for all N > LI, if we run SGD for R ≤ N iterations, where R is chosen uniformly from [N]⁺, we have

E_R ‖∇L(W_R)‖²_F ≤ 2 (L(W_1) − L(W*) + K + ρK) / √N,

for the updates W_{i+1} = W_i − γ g_{CV,i}(W_i) and the step size γ = min{1/ρ, 1/√N}.
In particular, lim_{N→∞} E_R ‖∇L(W_R)‖² = 0. Therefore, our algorithm converges to a local optimum as the maximum number of iterations N goes to infinity. The full proof is in Appendix C. In short, we show that g_{CV,i}(W_i) is asymptotically unbiased as i → ∞, and then show that SGD with such asymptotically unbiased gradients converges to a local optimum.
5. Handling Dropout of Features
In this section, we consider introducing a third source of randomness, the random dropout of features (Srivastava et al., 2014). Let Dropout_p(X) = M ∘ X be the dropout operation, where the M_ij ∼ Bern(p) are i.i.d. Bernoulli random variables and ∘ is the element-wise product. Let E_M be the expectation over dropout masks.
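For completeness, a one-function sketch of this dropout operation exactly as defined above (note that in this notation p is the keep probability and no 1/p rescaling is applied, unlike the "inverted dropout" used by some frameworks).

```python
import numpy as np

def dropout_p(X: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Dropout_p(X) = M * X with i.i.d. entries M_ij ~ Bern(p)."""
    M = rng.binomial(1, p, size=X.shape)
    return M * X
```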
With dropout, all the activations h^(l)_v are random variables whose randomness comes from dropout, even in the exact algorithm of Eq. (1). We want to design a cheap estimator of the random variable (P H^(l))_u = Σ_{v∈n(u)} P_uv h^(l)_v, based on a stochastic neighborhood n̂^(l)(u). An ideal estimator should have the same distribution as (P H^(l))_u. However, such an estimator is difficult to design. Instead, we develop an estimator CVD^(l)_u that eventually has the same mean and variance as (P H^(l))_u, i.e., E_{n̂^(l)(u)} E_M CVD^(l)_u = E_M (P H^(l))_u and Var_{n̂^(l)(u),M} CVD^(l)_u = Var_M (P H^(l))_u.