0018-9545 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TVT.2015.2444791, IEEE Transactions on Vehicular Technology
2
policy or a mix of two deterministic policies, equivalent to
choosing independently one of two deterministic policies at
each epoch by the toss of a (biased) coin. To solve the un-
constraint MDP problem, we derived an equivalent Bellman’s
equation with reduced state space. We showed by simulations
that the optimal policy derived by the brute-force offline value
iteration algorithm based on the equivalent Bellman’s equation
achieves significant gain compared to various baselines such as
the conventional CSI-only control and the throughput optimal
control (MaxWeight algorithm).
It is worth noting that the complexity of the brute-force
offline value iteration algorithm based on the reduced state
equivalent Bellman’s equation still grows exponentially with
the number of users in the network, limiting its application
in practical scenarios. In fact, it is well-known that there is
no simple solution for the infinite horizon average reward
MDP problem that delay-aware resource control belongs to,
because the brute-force value iterations or policy iterations
could not lead to any viable solution due to the curse of
dimensionality [8]–[11]. Moreover, our problem for network
assisted D2D communications is further complicated due to
the unique issues listed above. For example, the channel state
transition probabilities, which are used to derive the condi-
tional expectations of cost function and queue state transition
probabilities in the equivalent Bellman’s equation, are very
difficult to obtain when more than two links are allowed to
reuse the same time-frequency resource.
In the Part II of the paper, we address the curse of dimen-
sionality problem in solving the CMDP formulated in Part I,
so that a practical algorithm with acceptable computational
complexity and signaling overhead can be derived. To reduce
the complexity, we obtain a delay-optimal solution using ap-
proximate dynamic programming and online stochastic learn-
ing. Specifically, we approximate the value function in the
equivalent Bellman’s equation by a sum of per-queue value
functions. The per-queue value functions are estimated and
learned using an online stochastic learning algorithm based
on the real-time observations of the CSI and QSI, eliminating
the need of deriving the channel state transition probabilities.
Moreover, the Lagrangian Multipliers (LMs) for the constraint
optimization problem are updated simultaneously with the
value functions over different time scales. The optimal dy-
namic mode selection and resource allocation actions can be
determined by an algorithm that has a similar structure with
the MaxWeight algorithm in Lyapunov stability approach, with
the weight determined by the per-queue value functions instead
of the queue lengths. We prove the almost-sure convergence
of the proposed algorithm. We also show by simulations that
our proposed scheme achieves significant gain compared to
various baselines such as the conventional CSI-only control
and the throughput optimal control (MaxWeight algorithm).
Together with Part I, this pair of works provide a general
framework for the dynamic constrained optimization of mode
selection and resource allocation in D2D communications
under bursty traffic model, where the general form of the
optimal policy and a practical algorithm with simple structure
and near-optimal performance are given.
The organization of the paper is as follows. We recall the
general network model for network assisted D2D communi-
cations as well as the MDP problem formulation for dynamic
mode selection and resource allocation in Section II. In Section
III, we derive a low complexity learning algorithm, which
updates the per-queue value functions based on real-time
observations of CSI and QSI, as well as a resource allocation
algorithm with similar structure as the MaxWeight algorithm.
In Section IV, we discuss the performance simulations. Finally,
we summarize the main results in Section V.
II. NETWORK MODEL AND PREVIOUS RESULTS
A. Network Model
Consider a Frequency Division Duplex (FDD) OFDMA
cellular network with D2D communications capability, where
there are D D2D UE pairs, C
u
cellular UEs (CUEs) with
uplink communications and C
d
CUEs with downlink com-
munications in a single cell. A D2D UE pair consists of
a source D2D UE (src. DUE) and a destination D2D UE
(dest. DUE) within direct over-the-air communications range
with each other, which is formed through the various neigh-
bor/peer/service discovery mechanisms proposed in literature.
The whole uplink or downlink spectrum is divided into N
F
equal size subchannels. A subchannel in the uplink (resp.
downlink) spectrum shall be referred to as uplink (resp.
downlink) subchannel in the rest of the paper. Moreover, we
assume that D2D links share uplink resources with cellular
uplinks. Time is slotted and each time slot has an equal length.
The above OFDMA cellular network with D2D commu-
nications can be formulated as a general network model
with a set N of nodes and a set L of transmission links.
Define N := {0, 1, . . . , N}, where node 0 represents the base
station (BS) and nodes 1, . . . , 2D represent the DUEs, nodes
2D + 1, . . . , 2D + C
u
represent the uplink CUEs, and nodes
2D + C
u
+ 1, . . . , N = 2D + C
u
+ C
d
represent the downlink
CUEs. We use i or j to denote the index of a node within N
(i.e., i, j ∈ N ) in the rest of the paper. Each transmission link
represents a communication channel for direct transmission
from a given node i to another node j, and is labeled by
(i, j) (where i, j ∈ N ). All data that enter the network
are associated with a particular connection which defines the
source and destination of the data. Let C
D
= {1, . . . , D},
C
Cu
= {D + 1, . . . , D + C
u
}, and C
Cd
= {D + C
u
+
1, . . . , D + C
u
+ C
d
} represent the set of D2D connections,
cellular uplink connections and cellular downlink connections,
respectively. Define C := {1, . . . , C} = C
D
C
Cu
C
Cd
(with
C = D+C
u
+C
d
) as the set of all connections in the network.
We use c to denote the index of a connection within C (i.e.,
c ∈ C) in the rest of the paper.
The data from connection c is transmitted hop by hop along
the route(s) of the connection to its destination node. Each
node i along the route(s) of connection c maintains a queue
q
(c)
i
for storing its data except for the destination node, since
the data is considered to exit the network once it reaches the
destination. Define Θ as the set of queues in the system. We
assume each queue has a finite capacity of N
Q
< ∞ (in
number of bits or packets). The set of queues can be divided
into two non-overlapping disjoint sets, i.e., uplink queues Θ
u