Derived Features#

Objectives: what you will take away#

  • Definitions and an understanding of how derived features work and how they can be used to accomplish novel tasks in both time-series and non-time-series settings

  • How to use derived features inside and outside time-series workflows

Prerequisites: before you begin#

Data#

Our example dataset for this guide is the well-known Adult dataset, accessible via the pmlb package installed in the prerequisites using the fetch_data() function.

Concepts & Terminology#

This guide explains the concept of Derived Features, which include both derived action features and derived context features. Derived features give you extra control by letting you create additional features from existing ones, which adds flexibility to feature engineering. To follow along, you should be familiar with the following concepts:

Derived Feature Codes#

How each derived feature is computed is determined by its derived feature code. This is a snippet of Amalgam-style code that specifies how the derivation should be performed. One example of a derived feature code is:

(* #hours-per-week 0 52)

This would use the multiplication opcode (*) to derive a feature that is 52 times the "hours-per-week" feature for each case. For a full list of opcodes that are available for derived feature codes, refer to the Amalgam Language Documentation.
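
To make the effect of this code concrete, the sketch below reproduces the same arithmetic in plain Python. It does not call the engine or evaluate Amalgam; the function name derive_hours_per_year and the sample cases are made up for illustration.

```python
# Illustrative only: plain Python standing in for what the derived
# feature code (* #hours-per-week 0 52) computes for each case.
WEEKS_PER_YEAR = 52


def derive_hours_per_year(case):
    """Return a copy of the case with an 'hours-per-year' value added."""
    derived = dict(case)
    derived["hours-per-year"] = case["hours-per-week"] * WEEKS_PER_YEAR
    return derived


# Made-up sample cases, not rows from the Adult dataset.
cases = [
    {"age": 39, "hours-per-week": 40},
    {"age": 50, "hours-per-week": 13},
]
derived_cases = [derive_hours_per_year(c) for c in cases]
print(derived_cases[0]["hours-per-year"])  # 2080
```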

Note

When referencing a feature value in a derived feature code (e.g., #hours-per-week 0), the number that follows the feature name is an offset that selects which case the value is read from. When not using a time-series Trainee, this offset can only be 0, meaning the current case. With a time-series Trainee, the offset can be larger, in which case it refers to that many cases earlier in the series (e.g., #hours-per-week 1 would refer to the previous case, if the feature were part of a time-series).
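
The offset can be pictured as an index shift within a single ordered series. The following is a minimal sketch of that idea in plain Python, not the engine's implementation. The helper feature_at_offset is hypothetical, and returning None when the offset reaches before the start of the series is an assumption, since this guide does not say how the engine handles that case.

```python
def feature_at_offset(series, index, offset=0):
    """Return the value `offset` cases before position `index`.

    Returns None when the requested case lies before the start of the
    series (an assumption made for this sketch).
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    target = index - offset
    return series[target] if target >= 0 else None


# Made-up hours-per-week values for one series, ordered by time.
hours = [40, 38, 45, 20]

print(feature_at_offset(hours, 2, 0))  # 45, the current case
print(feature_at_offset(hours, 2, 1))  # 38, the previous case
print(feature_at_offset(hours, 0, 1))  # None, nothing precedes the first case
```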

How-To Guide#

For this guide, we will add a feature called "hours-per-year" to the Adult dataset. The dataset already contains a feature called "hours-per-week", so the feature we want is an exact mathematical function of a feature we have, and that relationship should be preserved when making predictions. Derived features can be added to a Trainee either before or after training.

Adding Derived Features Before Training#

To add a derived feature to a Trainee before training, simply modify the feature attributes:

features = infer_feature_attributes(df)
hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)"
}
features["hours-per-year"] = hpy_features
trainee.train(df, features=features)
trainee.analyze()

That’s quite a lot of code, so let’s break it down. After inferring feature attributes, we set up the feature attributes for our derived feature.

hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)"
}

First, we note that this feature is continuous. Second, we set this feature to be auto-derived on train. This means that the feature will be computed using its derived feature code as soon as cases are trained into the model. If we did not do this, we would have to manually specify it in the derived_features parameter to Trainee.train() to ensure that the feature is created by the Trainee. Finally, we set the derived feature code. This is a small piece of Amalgam-like code which determines how to derive the feature. In this case, we use the multiplication opcode (*) to multiply each case’s value of hours-per-week (#hours-per-week 0, where the 0 is an offset and means the current case) by 52, the number of weeks in a year.
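
If several features are derived in the same way, the attribute dictionary can be generated rather than written by hand. The sketch below is one way to do that. The helper scaled_feature_attributes is hypothetical and not part of any library; it only assembles the three keys shown above.

```python
def scaled_feature_attributes(source_feature, factor):
    """Build feature attributes for a feature derived as source_feature * factor.

    The offset in the derived feature code is fixed at 0 (the current case).
    """
    code = f"(* #{source_feature} 0 {factor})"
    return {
        "type": "continuous",
        "auto_derive_on_train": True,
        "derived_feature_code": code,
    }


hpy_features = scaled_feature_attributes("hours-per-week", 52)
print(hpy_features["derived_feature_code"])  # (* #hours-per-week 0 52)
```

The resulting dictionary has the same contents as the hand-written hpy_features above, so it can be assigned to features["hours-per-year"] in the same way.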

Adding Derived Features After Training#

The process of adding a derived feature to a Trainee that has already been trained is quite simple. It can be handled with a single call to Trainee.add_feature():

# ``trainee`` has already been trained and analyzed
hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)"
}
trainee.add_feature("hours-per-year", feature_attributes=hpy_features)

Using Derived Features in Reacts#

Once a model has one or more derived features, they can be used in reacts:

Using a derived feature as an action feature#

reaction = trainee.react(
    contexts=df[trainee.features.get_names(without=["hours-per-week", "hours-per-year"])],
    action_features=["hours-per-week", "hours-per-year"],
    derived_action_features=["hours-per-year"],
)
print(reaction["action"])

Using a derived feature as a context feature#

reaction = trainee.react(
    contexts=df[trainee.features.get_names(without=["target"])],
    derived_context_features=["hours-per-year"],
    action_features=["target"]
)
print(reaction["action"])

Note that derived_action_features must be a subset of action_features, and derived_context_features must be a subset of context_features. A derived context feature is derived from the contexts that are passed into a react(), while a derived action feature is derived from the actions that a react() outputs.
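
The subset requirement is easy to check before calling react(). The following is a minimal sketch of such a check in plain Python. The helper check_derived_subset is hypothetical and is not part of the Trainee API; the error the engine itself raises for a violation may look different.

```python
def check_derived_subset(derived, parent, label):
    """Raise ValueError if any derived feature name is missing from its parent list."""
    missing = [name for name in derived if name not in parent]
    if missing:
        raise ValueError(f"{label} not found in parent list: {missing}")


action_features = ["hours-per-week", "hours-per-year"]
derived_action_features = ["hours-per-year"]

# Passes silently: every derived action feature is also an action feature.
check_derived_subset(derived_action_features, action_features, "derived_action_features")

# Fails: "hours-per-year" is derived but was never listed as an action feature.
try:
    check_derived_subset(["hours-per-year"], ["hours-per-week"], "derived_action_features")
except ValueError as err:
    print(err)
```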

Derived Features for Time-Series#

Derived features are used in time-series Trainees and are automatically created by infer_feature_attributes() when the time_feature_name and id_feature_name parameters are supplied. When Trainee.react_series() is used, the lag features are used as derived context features and the delta/rate features are used as derived action features. Since Trainee.react() supports more explainability details than Trainee.react_series(), it can be useful to replicate the behavior of react_series() using react().
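
As a rough picture of what lag and delta features contain, the sketch below computes a one-step lag and a one-step difference for a single ordered series in plain Python. It illustrates the general idea only: the helper add_lag_and_delta and the output names ending in _lag_1 and _delta_1 are hypothetical, and the names and exact definitions generated by infer_feature_attributes() may differ.

```python
def add_lag_and_delta(rows, feature):
    """Add a one-step lag and delta of `feature` to each row.

    `rows` must be a list of dicts for ONE series, already sorted by time.
    The first row has no predecessor, so its lag and delta are None.
    """
    out = []
    previous = None
    for row in rows:
        new_row = dict(row)
        new_row[f"{feature}_lag_1"] = previous
        new_row[f"{feature}_delta_1"] = (
            None if previous is None else row[feature] - previous
        )
        out.append(new_row)
        previous = row[feature]
    return out


# Made-up series: one id, three time steps.
series = [
    {"t": 1, "value": 10},
    {"t": 2, "value": 12},
    {"t": 3, "value": 9},
]
for row in add_lag_and_delta(series, "value"):
    print(row["t"], row["value_lag_1"], row["value_delta_1"])
```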

For more information on time-series, see the API Reference and the time-series user guide.

API References#