构建简单数据管道，为什么tf.data要比feed_dict更好？

在大多数面向初学者的tensorflow教程里，作者通常会建议读者在会话中用feed_dict为模型导入数据——feed_dict是一个字典，能为占位符馈送数据。但是，其实tf提供了另一种更好的、更简单的方法：只需使用tf.dataapi，你就能用几行代码搞定高性能数据管道。
那么tf.data的优势具体在哪里呢？如下图所示，虽然feed_dict的灵活性大家有目共睹，但每当我们需要等待cpu把数据馈送进来时，gpu就一直处于闲置状态，也就是程序运行效率太低。
而tf.data管道没有这个问题，它能提前抓取下个batch的数据，降低总体闲置时间。在这个基础上，如果我们采用并行数据导入，或者事先进行数据预处理，整个过程就更快了。
在5分钟内实现小型图像管道
要构建一个简单数据管道，首先我们需要两个对象：一个用于存储数据集的tf.data.dataset，以及一个允许我们逐个从数据集中提取样本的tf.data.iterator。
对于tf.data.dataset，它在图像管道中是这样的：
[
[tensor(image), tensor(label)],
[tensor(image), tensor(label)],
...
]
之后我们就可以用tf.data.iterator逐个检索图像标签对。在实践中，多个图像标签对通常会组成元素序列，方便迭代器进行提取。
至于数据集，datasetapi有两种创建数据集的方法，其一是从源（如python中的文件名列表）创建数据集，其二是可以直接在现有数据集上应用转换，下面是一些示例：
dataset(list of image files) → dataset(actual images)
dataset(6400 images) → dataset(64 batches with 100 images each)
dataset(list of audio files) → dataset(shuffled list of audio files)
定义计算图
小型图像管道的大致情况如下图所示：
所有代码都和模型、损失、优化器等一起放在我们的计算图定义中。首先，我们要从文件列表中创建一个张量。
# define list of files
files = ['a.png', 'b.png', 'c.png', 'd.png']
# create a dataset from filenames
dataset = tf.data.dataset.from_tensor_slices(files)
之后是定义一个函数来从其路径加载图像（作为张量），并调用tf.data.dataset.map()把函数用于数据集中的所有元素（文件路径）。如果想并行调用函数，你也可以设置num_parallel_calls=n里的map()参数。
# source
def load_image(path):
image_string = tf.read_file(path)
# don't use tf.image.decode_image, or the output shape will be undefined
image = tf.image.decode_jpeg(image_string, channels=3)
# this will convert to float values in [0, 1]
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.resize_images(image, [image_size, image_size])
return image
# apply the function load_image to each filename in the dataset
dataset = dataset.map(load_image, num_parallel_calls=8)
然后是用tf.data.dataset.batch()创建batch：
# create batches of 64 images each
dataset = dataset.batch(64)
如果想减少gpu闲置时间，我们可以在管道末尾添加tf.data.dataset.prefetch(buffer_size)，其中buffer_size这个参数表示预抓取的batch数，我们一般设buffer_size=1，但在某些情况下，尤其是处理每个batch耗时不同时，我们也可以适当扩大一点。
dataset = dataset.prefetch(buffer_size=1)
最后，我们再创建一个迭代器遍历数据集。虽然迭代器的选择有很多，但对于大多数任务，我们还是建议选择可以初始化的迭代器。
iterator = dataset.make_initializable_iterator()
调用tf.data.iterator.get_next()创建占位符张量，每次评估时，tensorflow都会填充下一batch的图像。
batch_of_images = iterator.get_next()
如果写到这里，你突然想换回feed_dict的方法，你可以用batch_of_images把之前的占位符全都替换掉。
运行会话
现在，我们就可以向往常一样运行模型了。但在每个epoch前，记得先评估iterator.initializer的op和tf.errors.outofrangeerror有没有抛出异常。
with tf.session() as session:
for i in range(epochs):
session.run(iterator.initializer)
try:
# go through the entire dataset
whiletrue:
image_batch = session.run(batch_of_images)
except tf.errors.outofrangeerror:
print('end of epoch.')
nvidia-smi这个命令可以帮我们监控gpu利用率，找到数据管道中的瓶颈。正常情况下，gpu的平均利用率应该高于70%-80%。
更完整的数据管道
shuffle
在dataset里，tf.data.dataset.shuffle()是一个比较常用的方法，它可以用来打乱数据集中的数据顺序。它的参数buffer_size指定的是一次打乱的元素数量，一般情况下，我们建议把这个参数值设大一点，最好一次性就能把整个数据集洗牌，因为如果参数过小，它可能会造成意料之外的偏差。
dataset = tf.data.dataset.from_tensor_slices(files)
dataset = dataset.shuffle(len(files))
数据增强
数据增强是扩大数据集的一种常用方式，这方面常用的函数有tf.image.random_flip_left_right()、tf.image.random_brightness()和tf.image.random_saturation()：
# source
def train_preprocess(image):
image = tf.image.random_flip_left_right(image)
image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
# make sure the image is still in [0, 1]
image = tf.clip_by_value(image, 0.0, 1.0)
return image
标签
要想在图像上加载标签（或其他元数据），我们只需在创建初始数据集时就把它们包含在内：
# files is a python list of image filenames
# labels is a numpy array with label data for each image
dataset = tf.data.dataset.from_tensor_slices((files, labels))
确保应用于数据集的所有.map()函数都允许标签数据通过：
def load_image(path, label):
# load image
return image, label
dataset = dataset.map(load_image)