Derived Features#
Objectives: what you will take away#
Definitions and an understanding of how derived features work and how they can be used to accomplish novel tasks, with or without time-series data
How-To use derived features in and out of time-series workflows
Prerequisites: before you begin#
You have successfully installed Howso Engine
Data#
Our example dataset for this guide is the well-known Adult dataset, accessible via the fetch_data() function of the pmlb package installed in the prerequisites.
Concepts & Terminology#
This guide explains the concept of Derived Features, which include both derived action features and derived context features. Derived features give you extra control by letting you create additional features from existing ones, which adds flexibility to feature engineering. To follow along, you should be familiar with the following concepts:
Derived Feature Codes#
How each derived feature is computed is determined by its derived feature code. This is a snippet of Amalgam-style code that specifies how the derivation should be performed. One example of a derived feature code is:
(* #hours-per-week 0 52)
This would use the multiplication opcode (*) to derive a feature that is 52 times the "hours-per-week" feature for each case.
For a full list of opcodes that are available for derived feature codes, refer to the Amalgam Language Documentation.
Note
When a derived feature code references a feature value (e.g., #hours-per-week 0), the number that follows the reference is an offset. For a non-time-series Trainee, the offset can only be 0. For a time-series Trainee, the offset can be larger, in which case it refers to that many cases earlier in the series (e.g., #hours-per-week 1 refers to the previous case in the same series).
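The offset semantics described in the note can be mimicked in plain Python. This is only a minimal sketch and not how Howso Engine or Amalgam evaluates derived feature codes: the lookup() helper and the sample series are hypothetical, and returning None when no earlier case exists is an assumption made for illustration.

```python
def lookup(series, index, feature, offset=0):
    """Return `feature` from the case `offset` positions before `index`.

    Hypothetical helper: returns None when no such earlier case exists.
    """
    target = index - offset
    if target < 0:
        return None
    return series[target][feature]


# A made-up, time-ordered series of cases.
series = [
    {"hours-per-week": 40},
    {"hours-per-week": 35},
    {"hours-per-week": 45},
]

# Equivalent of (* #hours-per-week 0 52): offset 0 reads the current case.
hours_per_year = [
    lookup(series, i, "hours-per-week", 0) * 52 for i in range(len(series))
]

# Equivalent of referencing #hours-per-week 1: offset 1 reads the previous case.
previous_hours = [
    lookup(series, i, "hours-per-week", 1) for i in range(len(series))
]

print(hours_per_year)
print(previous_hours)
```

With offset 0 every case reads its own value, while with offset 1 the first case of a series has nothing to refer back to.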
How-To Guide#
For this guide, we will add a feature called "hours-per-year" to the Adult dataset. This dataset already contains a feature called "hours-per-week", so the relationship between the feature we have and the feature we want is purely mathematical and should not be broken when making predictions. Derived features can be added to a Trainee either before or after training.
Adding Derived Features Before Training#
To add a derived feature to a Trainee before training, simply modify the feature attributes:
from howso.utilities import infer_feature_attributes

# Assumes `df` holds the Adult data and `trainee` is an existing Trainee.
features = infer_feature_attributes(df)
hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)",
}
features["hours-per-year"] = hpy_features
trainee.train(df, features=features)
trainee.analyze()
That’s quite a lot of code, so let’s break it down. After inferring feature attributes, we set up the feature attributes for our derived feature.
hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)",
}
First, we note that this feature is continuous. Second, we set this feature to be auto-derived on train. This means that the feature will be computed using its derived feature code as soon as cases are trained into the model. If we did not do this, we would have to specify it manually in the derived_features parameter of Trainee.train() to ensure that the feature is created by the Trainee. Finally, we set the derived feature code. This is a small piece of Amalgam-like code that determines how to derive the feature. In this case, we use the multiplication opcode (*) to multiply each case's value of hours-per-week (#hours-per-week 0, where the 0 is an offset meaning the current case) by 52, the number of weeks in a year.
Adding Derived Features After Training#
Adding a derived feature to a Trainee that has already been trained is quite simple. It can be handled with a single call to Trainee.add_feature():
# `trainee` has already been trained and analyzed
hpy_features = {
    "type": "continuous",
    "auto_derive_on_train": True,
    "derived_feature_code": "(* #hours-per-week 0 52)",
}
trainee.add_feature("hours-per-year", feature_attributes=hpy_features)
Using Derived Features in Reacts#
Once a model has one or more derived features, they can be used in reacts:
# Derived action feature: hours-per-year is derived from the predicted action values
reaction = trainee.react(
    contexts=df[trainee.features.get_names(without=["hours-per-week", "hours-per-year"])],
    action_features=["hours-per-week", "hours-per-year"],
    derived_action_features=["hours-per-year"],
)
print(reaction["action"])

# Derived context feature: hours-per-year is derived from the supplied contexts
reaction = trainee.react(
    contexts=df[trainee.features.get_names(without=["target"])],
    derived_context_features=["hours-per-year"],
    action_features=["target"],
)
print(reaction["action"])
Note that derived_action_features and derived_context_features must be subsets of action_features and context_features, respectively.
A derived context feature is derived from the contexts that are input to a react(), while a derived action feature is derived from the actions that are output from a react().
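The difference between the two kinds of derived features comes down to when the derivation runs relative to the prediction. The following is a toy sketch in plain Python, not Howso Engine code: toy_predict() is a hypothetical stand-in for a model, and both react_* helpers are invented for illustration only.

```python
WEEKS_PER_YEAR = 52


def derive_hours_per_year(case):
    """Plain-Python mirror of the derived feature code (* #hours-per-week 0 52)."""
    return case["hours-per-week"] * WEEKS_PER_YEAR


def toy_predict(contexts):
    """Hypothetical stand-in for a model: predicts hours-per-week from age."""
    return {"hours-per-week": 40 if contexts["age"] < 60 else 20}


def react_with_derived_action(contexts):
    # The action is predicted first; the derived action feature is then
    # computed from the predicted (output) values.
    action = toy_predict(contexts)
    action["hours-per-year"] = derive_hours_per_year(action)
    return action


def react_with_derived_context(contexts):
    # The derived context feature is computed from the supplied (input)
    # contexts before they would be handed to a model.
    enriched = dict(contexts)
    enriched["hours-per-year"] = derive_hours_per_year(contexts)
    return enriched


action = react_with_derived_action({"age": 35})
contexts = react_with_derived_context({"age": 35, "hours-per-week": 38})
print(action)
print(contexts)
```

In both cases the derived value is computed rather than predicted independently, so the 52x relationship between the two features always holds.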
Derived Features for Time-Series#
Derived features are used in time-series Trainees and are automatically created by infer_feature_attributes() when the time_feature_name and id_feature_name parameters are supplied. When Trainee.react_series() is used, the lag features are used as derived context features and the delta/rate features are used as derived action features. Since Trainee.react() supports more explainability details than Trainee.react_series(), it can be useful to replicate the behavior of react_series() using react().
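As a rough illustration of what lag and delta features represent, the sketch below computes them by hand for a small multi-series dataset. It is plain Python under stated assumptions, not the Howso Engine implementation: the add_lag_and_delta() helper, the column names "lag_1" and "delta_1", and the sample data are all hypothetical, and rate features are not shown.

```python
from collections import defaultdict


def add_lag_and_delta(cases, id_feature, time_feature, value_feature):
    """Return copies of `cases` with a one-step lag and delta per series.

    Hypothetical helper: "lag_1" is the previous value in the same series,
    "delta_1" is the current value minus that previous value.
    """
    grouped = defaultdict(list)
    for case in cases:
        grouped[case[id_feature]].append(case)

    result = []
    for series in grouped.values():
        # Order each series by its time feature before looking back.
        series.sort(key=lambda c: c[time_feature])
        previous = None
        for case in series:
            row = dict(case)
            row["lag_1"] = previous
            row["delta_1"] = (
                None if previous is None else case[value_feature] - previous
            )
            previous = case[value_feature]
            result.append(row)
    return result


# Made-up cases from two series ("a" and "b"), deliberately out of order.
cases = [
    {"id": "a", "t": 2, "value": 13},
    {"id": "b", "t": 1, "value": 5},
    {"id": "a", "t": 1, "value": 10},
    {"id": "b", "t": 2, "value": 9},
    {"id": "a", "t": 3, "value": 11},
]

rows = add_lag_and_delta(cases, "id", "t", "value")
by_key = {(r["id"], r["t"]): r for r in rows}
print(by_key[("a", 2)])
```

The lag only looks back within the same series, which is why both an ID feature and a time feature are needed.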
For more information on time-series, see the API Reference and the time-series user guide.