TensorFlow dataset pipeline

Useful resources for learning about and creating a TensorFlow Dataset. See also the code snippets below.

- Building a data pipeline: https://cs230.stanford.edu/blog/datapipeline/
- tf.data API, Build TensorFlow input pipelines: https://www.tensorflow.org/guide/data
- tf.data API, Consuming sets of files: https://www.tensorflow.org/guide/data#consuming_sets_of_files
- Better performance with the tf.data API: https://www.tensorflow.org/guide/data_performance
- Keras Sequence Generator: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
- Loading and Preprocessing Data with TensorFlow: https://canvas.education.lu.se/courses/3766/pages/chapter-13-loading-and-preprocessing-data-with-tensorflow?module_item_id=109789
- tf.data.Dataset generators with parallelization: the easy way: https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18

Feed numpy files (.npz) which contain features X and a label y:

```python
import tensorflow as tf
import numpy as np

def read_npy_file(item):
    # An .npz archive loads as a dict-like NpzFile, not a tuple
    np_data = np.load(item.decode())
    return np_data["x"].astype(np.float32), np_data["y"].astype(np.float32)

file_list = ['/foo/bar.npz', '/foo/baz.npz']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
    lambda item: tuple(
        # tf.py_func is deprecated; tf.numpy_function is the TF2 replacement
        tf.numpy_function(func=read_npy_file, inp=[item], Tout=[tf.float32, tf.float32])
    )
)
```

Read numpy files (.npz), extract labels and return a new tf.data.Dataset:

```python
def get_dataset(file_names_list, num_classes=2):
    """Creates a new TensorFlow Dataset.

    Parameters
    ----------
    file_names_list: list of file paths
    num_classes: int

    Returns
    -------
    tf.data.Dataset
    """
    # Load the numpy files
    def map_func(file_path):
        # file_path arrives as bytes inside tf.numpy_function
        np_data = np.load(file_path.decode())
        x_data = np_data["x"]
        y_label = np_data["y"]
        # One-hot encode the label in numpy (equivalent to tf.one_hot)
        # so the function returns plain arrays
        return x_data.astype(np.float32), np.eye(num_classes, dtype=np.float32)[y_label]

    # Map function
    numpy_func = lambda item: tf.numpy_function(map_func, [item], [tf.float32, tf.float32])

    # Create a new tensorflow dataset from the list of file paths
    dataset = tf.data.Dataset.from_tensor_slices(file_names_list)
    # Use map to load the numpy files in parallel
    dataset = dataset.map(numpy_func, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset
```

The following code snippet has been taken from https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18 ...
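Separately from the elided article snippet above: the tf.data performance guide linked earlier recommends shuffling, batching, and prefetching after the map step. A minimal sketch with illustrative in-memory numpy arrays (the shapes and batch size here are assumptions for demonstration, not taken from the snippets above):

```python
import numpy as np
import tensorflow as tf

# Illustrative in-memory data: 10 samples with 4 features and a scalar label
x = np.arange(40, dtype=np.float32).reshape(10, 4)
y = np.arange(10, dtype=np.float32)

dataset = tf.data.Dataset.from_tensor_slices((x, y))
# Shuffle, batch, and prefetch; AUTOTUNE lets tf.data pick the buffer size
dataset = (dataset
           .shuffle(buffer_size=10)
           .batch(4)
           .prefetch(tf.data.AUTOTUNE))

for batch_x, batch_y in dataset.take(1):
    print(batch_x.shape, batch_y.shape)  # (4, 4) (4,)
```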

September 2, 2022 · 3 min

Elapsed time

Python script to measure the elapsed time in seconds:

```python
from timeit import default_timer as timer
from datetime import timedelta

start = timer()

# Simulate a long-lasting calculation/process
for i in range(0, 99999):
    k = 0
    for m in range(0, 999):
        k = (i - 1) + 10
# End of simulation

end = timer()
td = timedelta(seconds=end - start)
print("Elapsed time:", td)
# Output
# Elapsed time: 0:01:15.375000
print("Elapsed time:", td.seconds, "seconds")
# Output
# Elapsed time: 75 seconds
```
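The same start/stop pattern can be packaged as a reusable context manager so any block of code can be timed with a single `with` statement. A stdlib-only sketch (the `elapsed` helper and its label are my own naming, not from the script above):

```python
from contextlib import contextmanager
from datetime import timedelta
from timeit import default_timer as timer

@contextmanager
def elapsed(label="Elapsed time"):
    # Yields control to the with-block and prints the elapsed time on exit
    start = timer()
    try:
        yield
    finally:
        print(f"{label}: {timedelta(seconds=timer() - start)}")

with elapsed():
    sum(range(1_000_000))
```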

July 25, 2022 · 1 min

Pandas DataFrame resampling

Resampling a pandas DataFrame to calculate the mean/median for a selected window size.

```python
import pandas as pd

millisecond = "ms"
idx = pd.date_range('1/1/2022', periods=100, freq=millisecond)
series = pd.Series(list(range(100, 200)), index=idx)
df = pd.DataFrame({'s': series})

window_size = 10
df.resample(f"{window_size}{millisecond}").mean()
df.resample(f"{window_size}{millisecond}").median()
df.resample(f"{window_size}{millisecond}").last()
```
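For the 100-row millisecond series above, a 10 ms window aggregates ten consecutive rows, so the resampled frame has ten buckets. A quick check using the same construction (the printed values follow from the bucket boundaries):

```python
import pandas as pd

idx = pd.date_range("1/1/2022", periods=100, freq="ms")
df = pd.DataFrame({"s": range(100, 200)}, index=idx)

resampled = df.resample("10ms").mean()
# Ten 10 ms buckets; the first holds the values 100..109
print(len(resampled))          # 10
print(resampled["s"].iloc[0])  # 104.5
```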

July 17, 2022 · 1 min

Pandas Multi-Index

Create a pandas DataFrame MultiIndex from existing columns:

```python
import pandas as pd

df = pd.read_csv("./data/dataset.csv")
# Use the existing columns INSTANCES and TIMEPOINTS as a MultiIndex
df = df.set_index(["INSTANCES", "TIMEPOINTS"], inplace=False)
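Once the MultiIndex is in place, rows can be selected by level with `.loc`. A sketch with small illustrative data standing in for `./data/dataset.csv` (the column names match the snippet above; the values are made up):

```python
import pandas as pd

# Illustrative data standing in for ./data/dataset.csv
df = pd.DataFrame({
    "INSTANCES": ["a", "a", "b", "b"],
    "TIMEPOINTS": [0, 1, 0, 1],
    "value": [10, 11, 20, 21],
})
df = df.set_index(["INSTANCES", "TIMEPOINTS"])

# Select all rows of one instance (outer level)
print(df.loc["a"])
# Select a single (instance, timepoint) pair
print(df.loc[("b", 1), "value"])  # 21
```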

July 14, 2022 · 1 min

Read tempfile

Read the content of a file using the tempfile.NamedTemporaryFile class.

```python
import os
import tempfile

from sktime.datasets import load_from_tsfile_to_dataframe

# content holds the bytes of a .ts file obtained earlier
tmp_file = tempfile.NamedTemporaryFile(delete=False)
df_tmp = None
try:
    tmp_file.write(content)
    tmp_file.seek(0)
    print(tmp_file.name)
    df_tmp = load_from_tsfile_to_dataframe(
        tmp_file.name, return_separate_X_and_y=False
    )
finally:
    tmp_file.close()
    os.unlink(tmp_file.name)
```
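The try/finally cleanup can also be written with a `with` statement. A stdlib-only sketch using plain bytes instead of a .ts file, so sktime is not required (the `content` value here is made up for illustration):

```python
import os
import tempfile

content = b"hello tempfile"

# delete=False keeps the file after close() so it can be reopened by name
with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
    tmp_file.write(content)
    path = tmp_file.name

try:
    with open(path, "rb") as f:
        print(f.read())  # b'hello tempfile'
finally:
    os.unlink(path)
```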

July 8, 2022 · 1 min