TensorFlow dataset pipeline

Useful resources for learning about and building a TensorFlow Dataset input pipeline. See also the code snippets below.

- Building a data pipeline: https://cs230.stanford.edu/blog/datapipeline/
- tf.data API, Build TensorFlow input pipelines: https://www.tensorflow.org/guide/data
- tf.data API, Consuming sets of files: https://www.tensorflow.org/guide/data#consuming_sets_of_files
- Better performance with the tf.data API: https://www.tensorflow.org/guide/data_performance
- Keras Sequence Generator: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
- Loading and Preprocessing Data with TensorFlow: https://canvas.education.lu.se/courses/3766/pages/chapter-13-loading-and-preprocessing-data-with-tensorflow?module_item_id=109789
- tf.data.Dataset generators with parallelization: the easy way: https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18

Feed numpy files (.npz) which contain features X and a label y:

    import tensorflow as tf
    import numpy as np

    def read_npy_file(item):
        # item arrives as bytes, so decode it to a str path first
        data = np.load(item.decode())
        # Assumes the archive stores the features under the key "x"
        # (same layout as in get_dataset below); only the features are returned
        return data["x"].astype(np.float32),

    file_list = ['/foo/bar.npz', '/foo/baz.npz']

    dataset = tf.data.Dataset.from_tensor_slices(file_list)

    # tf.numpy_function is the TF 2.x replacement for the TF 1.x tf.py_func
    dataset = dataset.map(
        lambda item: tuple(
            tf.numpy_function(func=read_npy_file, inp=[item], Tout=[tf.float32,])
        )
    )

    # Read numpy files (.npz), extract labels and return a new tf.data.Dataset
    def get_dataset(file_names_list, num_classes=2):
        """Creates a new TensorFlow Dataset
        ----------
        Parameters:
            file_names_list: list of file paths
            num_classes: int
        Returns:
            tf.data.Dataset yielding (features, one-hot label) pairs
        """

        # Load the numpy files
        def map_func(file_path):
            # file_path arrives as bytes inside tf.numpy_function
            np_data = np.load(file_path.decode())
            x_data = np_data["x"]
            y_label = np_data["y"]
            # One-hot encode in NumPy, since this function runs as plain Python;
            # assumes y_label has an integer dtype
            one_hot = np.eye(num_classes, dtype=np.float32)[y_label]
            return x_data.astype(np.float32), one_hot

        # Map function
        numpy_func = lambda item: tf.numpy_function(map_func, [item], [tf.float32, tf.float32])

        # Create a new tensorflow dataset from the list passed in
        dataset = tf.data.Dataset.from_tensor_slices(file_names_list)

        # Use map to load the numpy files in parallel
        dataset = dataset.map(numpy_func, num_parallel_calls=tf.data.AUTOTUNE)

        return dataset

The following code snippet has been taken from https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18 ...
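Independently of the snippet referenced above (this one is not taken from any of the linked articles), here is a minimal end-to-end sketch of the .npz loading approach that can be run without real data. It assumes each file holds a fixed-size feature vector under the key "x" and a single integer label under "y". The helper names (`write_example_files`, `load_npz`, `tf_load`, `build_dataset`) and the constants are hypothetical and exist only for illustration. It also restores the static shapes that `tf.numpy_function` drops and adds batching and prefetching, which the snippets above leave out.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

NUM_CLASSES = 2        # assumption: binary labels
FEATURE_SHAPE = (4,)   # assumption: one 4-element feature vector per file


def write_example_files(directory, num_files=6):
    """Write small .npz files with keys "x" (features) and "y" (integer label)."""
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"sample_{i}.npz")
        np.savez(
            path,
            x=np.full(FEATURE_SHAPE, i, dtype=np.float32),
            y=np.int64(i % NUM_CLASSES),
        )
        paths.append(path)
    return paths


def load_npz(file_path):
    """Plain-Python loader; tf.numpy_function passes the path as bytes."""
    with np.load(file_path.decode()) as data:
        x = data["x"].astype(np.float32)
        # One-hot encode the integer label in NumPy
        y = np.eye(NUM_CLASSES, dtype=np.float32)[int(data["y"])]
    return x, y


def tf_load(file_path):
    """Wrap the loader and restore the static shapes numpy_function drops."""
    x, y = tf.numpy_function(load_npz, [file_path], [tf.float32, tf.float32])
    x.set_shape(FEATURE_SHAPE)
    y.set_shape((NUM_CLASSES,))
    return x, y


def build_dataset(file_paths, batch_size=2):
    dataset = tf.data.Dataset.from_tensor_slices(file_paths)
    # Load files in parallel, then batch and overlap loading with consumption
    dataset = dataset.map(tf_load, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)


with tempfile.TemporaryDirectory() as tmp_dir:
    dataset = build_dataset(write_example_files(tmp_dir))
    # Materialize the batches while the temporary files still exist
    batches = [(x.numpy(), y.numpy()) for x, y in dataset]

print(len(batches), batches[0][0].shape, batches[0][1].shape)
```

Without the `set_shape` calls, downstream code that needs known input shapes (for example a Keras model) may complain about unknown shapes, because TensorFlow cannot infer what the Python function returns.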

September 2, 2022 · 3 min