Model training duration

Model training evaluation using a Matplotlib / seaborn scatter plot, with colors based on a condition and a custom color palette.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))

a = pd.Series(np.random.randint(60, 180, 25))
b = pd.Series(np.random.randint(55, 160, 25))

x_min = min(min(a), min(b))
y_max = max(max(a), max(b))

# Left plot: actual vs. user-defined runtime with an identity line
sns.scatterplot(x=a, y=b, ax=ax1)
ax1.plot([x_min, y_max], [x_min, y_max], ":", color="grey")
ax1.set_title("Model training runtime (Experiment #2)", size=16)
ax1.set_xlabel("User-defined runtime (sec.)", size=14)
ax1.set_ylabel("Actual runtime (sec.)", size=14)

# Right plot: runtime difference, colored by whether it is <= 0
data = pd.DataFrame({"a": a, "diff": (b - a), "cond": ((b - a) <= 0) * 1})
sns.scatterplot(x="a", y="diff", data=data, ax=ax2, hue="cond",
                palette={0: "tab:orange", 1: "tab:green"}, legend=False)
# Note: axhline's xmin/xmax are axis fractions (0-1), not data coordinates,
# so the default full-width line is what we want here.
ax2.axhline(y=0, linestyle=":", color="grey")
ax2.set_title("Runtime difference in seconds (lower is better)", size=16)
ax2.set_ylabel("Runtime difference (sec.)", size=14)
ax2.set_xlabel("User-defined runtime (sec.)", size=14)

plt.show()
```

Output: (scatter plot figure)

September 22, 2022 · 1 min

Mlflow artifacts

How to save mlflow runs and artifacts into an external folder.

Install the mlflow lib:

```
pip install mlflow
```

Create a folder for the artifacts:

```
mkdir mlflow-artifacts
```

Create a folder for the SQLite database(s):

```
mkdir mlflow-dbs
```

Start the mlflow UI in a terminal:

```
mlflow ui \
  --backend-store-uri sqlite:///mlflow-dbs/db-20220822.sqlite \
  --default-artifact-root mlflow-artifacts/
```

The mlflow UI will be served on http://127.0.0.1:5000

Create a project folder: `your-project`

Switch to the project directory: `cd your-project`

Connect to the UI and run experiments:

```python
# experiment.py
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("experiment-001")

with mlflow.start_run():
    mlflow.log_param("num_layers", 5)
    mlflow.log_metric("accuracy", 0.75)
```

Run the script:

```
python experiment.py
```

Autologging logs contents to an active fluent run, which may be user-created: ...

September 3, 2022 · 2 min

Tensorflow dataset pipeline

Useful resources for learning and creating a TensorFlow Dataset. See also the code snippets below.

- Building a data pipeline: https://cs230.stanford.edu/blog/datapipeline/
- tf.data API, Build TensorFlow input pipelines: https://www.tensorflow.org/guide/data
- tf.data API, Consuming sets of files: https://www.tensorflow.org/guide/data#consuming_sets_of_files
- Better performance with the tf.data API: https://www.tensorflow.org/guide/data_performance
- Keras Sequence Generator: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
- Loading and Preprocessing Data with TensorFlow: https://canvas.education.lu.se/courses/3766/pages/chapter-13-loading-and-preprocessing-data-with-tensorflow?module_item_id=109789
- tf.data.Dataset generators with parallelization: the easy way: https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18

Feed numpy files (.npz) which contain features X and a label y:

```python
import tensorflow as tf
import numpy as np

def read_npy_file(item):
    # np.load on an .npz archive returns an NpzFile; index it by key
    np_data = np.load(item.decode())
    return np_data["x"].astype(np.float32)

file_list = ['/foo/bar.npz', '/foo/baz.npz']

dataset = tf.data.Dataset.from_tensor_slices(file_list)
# tf.py_func is deprecated in TF2; tf.numpy_function is the replacement
dataset = dataset.map(
    lambda item: tuple(
        tf.numpy_function(func=read_npy_file, inp=[item], Tout=[tf.float32])
    )
)
```

Read numpy files (.npz), extract labels and return a new tf.data.Dataset:

```python
def get_dataset(file_names_list, num_classes=2):
    """Creates a new TensorFlow Dataset.

    Parameters
    ----------
    file_names_list: list of file paths
    num_classes: int

    Returns
    -------
    tf.data.Dataset yielding (features, one-hot label) tuples
    """
    # Load a single numpy file; expects keys "x" (features) and "y" (label)
    def map_func(file_path):
        np_data = np.load(file_path)
        x_data = np_data["x"]
        y_label = np_data["y"]
        return x_data.astype(np.float32), tf.one_hot(indices=y_label, depth=num_classes)

    # Wrap the numpy loader so it can run inside the tf.data graph
    numpy_func = lambda item: tf.numpy_function(map_func, [item], [tf.float32, tf.float32])

    # Create a new tensorflow dataset from the file paths
    dataset = tf.data.Dataset.from_tensor_slices(file_names_list)

    # Use map to load the numpy files in parallel
    dataset = dataset.map(numpy_func, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset
```

The following code snippet has been taken from https://medium.com/@acordier/tf-data-dataset-generators-with-parallelization-the-easy-way-b5c5f7d2a18 ...
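The loader functions above assume each `.npz` file stores the features under the key `x` and an integer label under `y`. A minimal sketch of producing files in that layout with NumPy (file names and the feature shape are made up for illustration):

```python
import numpy as np

# Write two sample .npz files in the layout map_func expects:
# features under "x", an integer class label under "y".
for i, label in enumerate([0, 1]):
    x = np.random.rand(16, 8).astype(np.float32)  # arbitrary feature shape
    np.savez(f"sample_{i}.npz", x=x, y=label)

# Reading one back mirrors the numpy half of map_func
np_data = np.load("sample_0.npz")
print(np_data["x"].shape, int(np_data["y"]))  # (16, 8) 0
```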

September 2, 2022 · 3 min

Elapsed time

Python script to measure the elapsed time in seconds.

```python
from timeit import default_timer as timer
from datetime import timedelta

start = timer()

# Simulate a long lasting calculation/process
for i in range(0, 99999):
    k = 0
    for m in range(0, 999):
        k = (i - 1) + 10
# End of simulation

end = timer()
td = timedelta(seconds=end - start)

print("Elapsed time:", td)
# Output
# Elapsed time: 0:01:15.375000

print("Elapsed time:", td.seconds, "seconds")
# Output
# Elapsed time: 75 seconds
```
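The same start/stop pattern can be wrapped in a context manager so the timing boilerplate is not repeated around every measured block; this is a sketch built on the same `default_timer`/`timedelta` pieces, not part of the original script:

```python
from timeit import default_timer as timer
from datetime import timedelta

class Elapsed:
    """Context manager that records the wall-clock time of a block."""
    def __enter__(self):
        self.start = timer()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Store the duration so it can be inspected after the block
        self.duration = timedelta(seconds=timer() - self.start)
        print("Elapsed time:", self.duration)

with Elapsed() as t:
    total = sum(range(1_000_000))
```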

July 25, 2022 · 1 min

Pandas Dataframe resampling

Resampling a pandas dataframe to calculate the mean/median for a selected window size.

```python
import pandas as pd

millisecond = "ms"
idx = pd.date_range('1/1/2022', periods=100, freq=millisecond)
series = pd.Series(list(range(100, 200)), index=idx)
df = pd.DataFrame({'s': series})

window_size = 10
df.resample(f"{window_size}{millisecond}").mean()
df.resample(f"{window_size}{millisecond}").median()
df.resample(f"{window_size}{millisecond}").last()
```
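As a quick sanity check of the window arithmetic: with 100 values at 1 ms spacing and a 10 ms window, `mean()` yields 10 buckets, and the first bucket averages the values 100 through 109:

```python
import pandas as pd

idx = pd.date_range('1/1/2022', periods=100, freq="ms")
df = pd.DataFrame({'s': list(range(100, 200))}, index=idx)

resampled = df.resample("10ms").mean()
print(len(resampled))          # 10 buckets
print(resampled['s'].iloc[0])  # mean of 100..109 -> 104.5
```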

July 17, 2022 · 1 min