Experiment

class skassist.library.Experiment(experiment_folder)

Manages the root folder of an experiment.

An experiment folder manages a dataset and the cross-validation splits associated with it. Evaluation and result calculation can be initiated for all models with the evaluate() method.

Attributes:

experiments (list): A list of experiments found in the library folder.

path (str): Path to the root directory.

Todo

  • Implement LibEstimator base class that all models must inherit from. The base class ensures that the needed properties are implemented.
  • Consider offering optional arguments in Experiment.add() for the LibEstimator properties, for users who do not want to inherit from the base class.
classmethod New(name, df, skf, features, lib_path, description='')

Factory method for creating a new Experiment instance given a name, dataset, cross-validation mask and a list of features. The path to the library in which the experiment is created must be given.

Args:
name (str):
A name for the experiment. Will be used together with the timestamp for storing the experiment.
df (pandas.DataFrame):
The dataset as a Pandas DataFrame.
skf (numpy.ndarray):
An array of indices, each being one cross-validation split.
features (list):
A list of column names that are to be used as features during training.
lib_path (str):
Path to the library in which the experiment is created.
description (str):
A description of the dataset, the experiment, or recent changes, to make the experiment easier to find later.
add(estimator)

Adds a model to the experiment.

Args:
estimator (BaseEstimator):
The model (estimator) to add to the experiment.
calc_results(scoring_function, name, max_workers=1, verbose=1, te_split_idx=1)

Calculates the results for all models in this experiment by calling calc_result() on each Model.

Args:
scoring_function (function()):
A python function that calculates results given a model, its predictions and the true labels. See scoring_function().
name (str):
A name for the result series. If a series with the given name already exists, only the missing results are computed. Existing results are not deleted.
max_workers (int):
The number of models for which to concurrently calculate the results.
max_workers=1 is usually faster because the overhead of the ProcessPoolExecutor is too large. A ThreadPoolExecutor might be worth trying instead.
verbose (int):
Level of print output. 0 is no output.
te_split_idx (int):
Index of split that the model is evaluated on.
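
The arguments above can be illustrated with a minimal, self-contained sketch. It does not use skassist itself: the scoring function follows the description given here (a model, its predictions and the true labels), but the returned dictionary, the FakeModel class and the calc_all() helper are hypothetical stand-ins. A ThreadPoolExecutor is used only because the note above suggests it could be tried.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class FakeModel:
    """Hypothetical stand-in for a stored model; not the skassist Model class."""
    name: str
    predictions: list
    results: dict = field(default_factory=dict)  # result series name -> result


def accuracy(model, y_pred, y_true):
    """Assumed scoring-function shape: the model, its predictions, the true labels."""
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return {"accuracy": correct / len(y_true)}


def calc_all(models, scoring_function, name, y_true, max_workers=1):
    """Score every model that has no result stored under `name` yet."""
    missing = [m for m in models if name not in m.results]

    def score(m):
        return scoring_function(m, m.predictions, y_true)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for m, result in zip(missing, pool.map(score, missing)):
            m.results[name] = result
    return len(missing)  # number of results that were newly computed


y_true = [1, 0, 1, 1]
models = [FakeModel("a", [1, 0, 1, 0]), FakeModel("b", [1, 0, 1, 1])]
first_run = calc_all(models, accuracy, "acc_v1", y_true, max_workers=2)
second_run = calc_all(models, accuracy, "acc_v1", y_true)  # nothing left to compute
```

In this sketch the second call finds a result for every model under the same series name and computes nothing, which mirrors the documented behaviour of leaving existing results in place.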

find(boolean_func)

Iterator function, yielding all models matching boolean_func().

Args:
boolean_func (boolean_func()):
A function that takes a Model and returns a boolean indicating a match.
findone(boolean_func)

Return the first model matching boolean_func().

Args:
boolean_func (boolean_func()):
A function that takes a Model and returns a boolean indicating a match.
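
As a rough illustration of how find() and findone() relate, the sketch below reimplements both as free functions over a plain list. The FakeModel class and its attributes are hypothetical, and returning None when nothing matches is an assumption, since the behaviour for no match is not stated here.

```python
from dataclasses import dataclass


@dataclass
class FakeModel:
    """Hypothetical stand-in; the attributes of the real Model may differ."""
    name: str
    n_features: int


def find(models, boolean_func):
    """Lazily yield every model for which boolean_func returns True."""
    for model in models:
        if boolean_func(model):
            yield model


def findone(models, boolean_func):
    """Return the first match, or None when nothing matches (an assumption)."""
    return next(find(models, boolean_func), None)


def is_forest(model):
    """Example predicate: takes a model, returns a boolean."""
    return model.name.startswith("forest")


models = [
    FakeModel("svm", 10),
    FakeModel("forest", 25),
    FakeModel("forest_small", 5),
]

forests = [m.name for m in find(models, is_forest)]
first_small = findone(models, lambda m: m.n_features <= 10)
no_match = findone(models, lambda m: m.n_features > 100)
```

Because find() is a generator in this sketch, findone() stops at the first match instead of scanning every model.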