.. _gettingstarted:
===============
Getting Started
===============
These tutorials do not attempt to stand in for a graduate or undergraduate course
in machine learning, but we do give a quick overview of some important concepts
(and notation) to make sure that we're on the same page. You'll also need to
download the datasets mentioned in this chapter in order to run the example code of
the upcoming tutorials.
.. _download:
.. index:: Download
Download
========
On each learning algorithm page, you will be able to download the corresponding files. If you want to download all of them at the same time, you can clone the git repository of the tutorial::
    git clone https://github.com/lisa-lab/DeepLearningTutorials.git
.. _datasets:
.. index:: Datasets
Datasets
========
.. index:: MNIST Dataset
MNIST Dataset
+++++++++++++
(`mnist.pkl.gz <http://deeplearning.net/data/mnist/mnist.pkl.gz>`_)
The `MNIST <http://yann.lecun.com/exdb/mnist>`_ dataset consists of handwritten
digit images and is divided into 60,000 examples for the training set and
10,000 examples for testing. In many papers, as well as in this tutorial, the
official training set of 60,000 examples is divided into an actual training set
of 50,000 examples and 10,000 validation examples (for selecting hyper-parameters
such as the learning rate and the size of the model). All digit images have been
size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the
original dataset each pixel of the image is represented by a value between 0 and
255, where 0 is black, 255 is white and anything in between is a different shade of grey.
Here are some examples of MNIST digits:
|0| |1| |2| |3| |4| |5|
.. |0| image:: images/mnist_0.png
.. |1| image:: images/mnist_1.png
.. |2| image:: images/mnist_2.png
.. |3| image:: images/mnist_3.png
.. |4| image:: images/mnist_4.png
.. |5| image:: images/mnist_5.png
For convenience we pickled the dataset to make it easier to use in Python.
It is available for download `here <http://deeplearning.net/data/mnist/mnist.pkl.gz>`_.
The pickled file contains a tuple of three elements: the training set, the
validation set and the test set. Each of the three is a pair
formed from a list of images and a list of class labels, one for each of the
images. An image is represented as a numpy 1-dimensional array of 784 (28
x 28) float values between 0 and 1 (0 stands for black, 1 for white).
The labels are numbers between 0 and 9 indicating which digit the image
represents. The code block below shows how to load the dataset.
.. code-block:: python

    import cPickle, gzip, numpy

    # Load the dataset
    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()
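The snippet above targets Python 2. Under Python 3, a minimal equivalent
(assuming the same ``mnist.pkl.gz`` file) uses the standard ``pickle`` module
with a ``latin1`` encoding, since the file was pickled by Python 2 and contains
numpy arrays:

.. code-block:: python

    import gzip
    import pickle

    # The file was pickled by Python 2, so byte strings inside it must
    # be decoded with the 'latin1' codec when loading under Python 3.
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')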
When using the dataset, we usually divide it into minibatches (see
:ref:`opt_SGD`). We encourage you to store the dataset in shared
variables and to access it based on the minibatch index, given a fixed
and known batch size. The reason for shared variables is
related to using the GPU. There is a large overhead when copying data
into GPU memory. If you copied the data on request (each minibatch
individually, when needed), as the code would do without shared
variables, the GPU code would not be much faster
than the CPU code because of this overhead (it might even be slower). If you have your data in
Theano shared variables, though, you give Theano the possibility to copy
the entire data to the GPU in a single call when the shared variables are constructed.
Afterwards the GPU can access any minibatch by taking a slice from these
shared variables, without needing to copy any information from CPU
memory, thereby bypassing the overhead.
Because the datapoints and their labels are usually of a different nature
(labels are usually integers while datapoints are real numbers), we
suggest using different variables for the labels and the data. We also recommend
using different variables for the training set, validation set and
test set, to make the code more readable (resulting in 6 different
shared variables).
Since the data is now in one variable, and a minibatch is defined as a
slice of that variable, it becomes natural to define a minibatch by
indicating its index and its size. In our setup the batch size stays constant
throughout the execution of the code, so a function will actually
require only the index to identify on which datapoints to work.
The code below shows how to store your data and how to
access a minibatch:
.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    def shared_dataset(data_xy):
        """ Function that loads the dataset into shared variables

        The reason we store our dataset in shared variables is to allow
        Theano to copy it into the GPU memory (when code is run on a GPU).
        Since copying data into the GPU is slow, copying a minibatch every
        time it is needed (the default behaviour if the data is not in a
        shared variable) would lead to a large decrease in performance.
        """
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
        # When storing data on the GPU it has to be stored as floats,
        # therefore we store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use the labels as indices, and that
        # doesn't make sense with floats), so instead of returning
        # ``shared_y`` we cast it to int. This little hack lets us get
        # around the issue.
        return shared_x, T.cast(shared_y, 'int32')

    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)

    batch_size = 500    # size of the minibatch

    # accessing the third minibatch of the training set
    data = train_set_x[2 * batch_size: 3 * batch_size]
    label = train_set_y[2 * batch_size: 3 * batch_size]
The data has to be stored as floats on the GPU (the right
``dtype`` for storing on the GPU is given by ``theano.config.floatX``).
To get around this shortcoming for the labels, we store them as floats
and then cast them to ints.
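To make it concrete how the index alone identifies a minibatch, here is a
sketch of how a Theano function can be compiled with the ``givens`` mechanism.
The symbolic variables ``x`` and ``y`` and the expression ``cost`` are
hypothetical placeholders for whatever model the later tutorials build:

.. code-block:: python

    index = T.lscalar('index')  # symbolic index to a minibatch
    x = T.matrix('x')           # symbolic variable for the input images
    y = T.ivector('y')          # symbolic variable for the labels

    # ``cost`` would normally be built from a model; a dummy expression
    # is used here just so that the sketch compiles.
    cost = T.sum(x) + T.sum(y)

    # The compiled function takes only the minibatch index. ``givens``
    # substitutes the matching slice of the shared variables for ``x``
    # and ``y``, so no data is copied from CPU memory at call time.
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

Calling ``train_model(2)`` would then evaluate ``cost`` on the third minibatch
of the training set.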
.. note::

    If you are running your code on the GPU and the dataset you are using
    is too large to fit in memory, the code will crash. In such a case you
    should not store the entire dataset in one shared variable. You can,
    however, store a sufficiently small chunk of your data (several
    minibatches) in a shared variable and use that during training. Once
    you have gone through the chunk, update the values it stores. This way
    you minimize the number of data transfers between CPU memory and GPU
    memory.
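Below is a rough sketch of that chunk-swapping strategy. The ``load_chunk``
helper and ``n_chunks`` are hypothetical, standing in for whatever code reads
the next few minibatches of your data from disk:

.. code-block:: python

    import numpy
    import theano

    n_chunks = 10    # hypothetical: number of chunks the dataset is split into

    # Allocate the shared variable once, from the first chunk; the labels
    # would be handled the same way.
    chunk_x, chunk_y = load_chunk(0)    # hypothetical helper
    shared_x = theano.shared(numpy.asarray(chunk_x, dtype=theano.config.floatX))

    for chunk_index in range(n_chunks):
        if chunk_index > 0:
            chunk_x, chunk_y = load_chunk(chunk_index)
            # Overwrite the GPU copy in place: one CPU-to-GPU transfer
            # per chunk instead of one per minibatch.
            shared_x.set_value(numpy.asarray(chunk_x, dtype=theano.config.floatX))
        # ... train on every minibatch of the chunk now on the GPU ...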
.. index:: Notation
Notation
========
.. index:: Dataset notation
Dataset notation
++++++++++++++++
We label data sets as :math:`\mathcal{D}`. When the distinction is important, we
indicate train, validation, and test sets as: :math:`\mathcal{D}_{train}`,
:math:`\mathcal{D}_{valid}` and :math:`\mathcal{D}_{test}`. The validation set
is used to perform model selection and hyper-parameter selection, whereas
the test set is used to evaluate the final generalization error and
compare different algorithms in an unbiased way.
The tutorials mostly deal with classification problems, where each data set
:math:`\mathcal{D}` is an indexed set of pairs :math:`(x^{(i)},y^{(i)})`. We
use superscripts to distinguish training set examples: :math:`x^{(i)} \in
\mathcal{R}^D` is thus the i-th training example of dimensionality :math:`D`. Similarly,
:math:`y^{(i)} \in \{0, ..., L\}` is the i-th label assigned to input
:math:`x^{(i)}`. It is straightforward to extend these examples to
ones where :math:`y^{(i)}` has other types (e.g. Gaussian for regression,
or groups of multinomials for predicting multiple symbols).