orf: ordered random forests
Description
A Python implementation of the Ordered Forest estimator as developed in
Lechner & Okasa (2019). The Ordered Forest flexibly estimates the conditional
probabilities of models with ordered categorical outcomes (so-called ordered
choice models). In addition to the functionality of common machine learning
algorithms, the orf package provides functions for estimating marginal effects
and for statistical inference on them, and thus produces output similar to that
of standard econometric models for ordered choice. The core forest algorithm
relies on the fast forest implementation from the scikit-learn
(Pedregosa et al., 2011) library. For the R version of the orf package
(Lechner & Okasa, 2020), see the CRAN repository.
Installation
To install the orf package, run
pip install orf
in the terminal. orf requires the following dependencies:
- numpy (>=1.21.0)
- pandas (>=1.3.5)
- scipy (>=1.7.2)
- scikit-learn (>=1.0.2)
- joblib (>=1.0.1)
- plotnine (>=0.8.0)
In case of an installation failure due to dependency issues or conflicts with an Anaconda distribution, consider installing the package in a virtual environment.
The implementation relies on Python 3 and is compatible with versions 3.8, 3.9 and 3.10.
Examples
The following examples demonstrate the basic usage of the orf package with
default settings for the Ordered Forest estimator.
# load orf package
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# estimate Ordered Forest with default settings
oforest = orf.OrderedForest()
oforest.fit(X=features, y=outcome)
# show summary of the orf estimation
oforest.summary()
# evaluate the prediction performance
oforest.performance()
# plot the estimated probability distributions
oforest.plot()
# predict ordered probabilities
oforest.predict()
# evaluate marginal effects
oforest.margins()
Release Notes
- Version 0.1.0: Initial release of the orf Python package.
- Version 0.1.1: Resolved minor issues regarding the requirements of the package dependencies and updated the description accordingly.
- Version 0.2.0: Major update of BaseOrderedForest class. Tree building is now entirely based on scikit-learn and no longer depends on the econML package. This promises faster estimation and a more stable installation process.
Authors
Michael Lechner, Fabian Muny & Gabriel Okasa
References
- Lechner, M., & Okasa, G. (2019). Random Forest Estimation of the Ordered Choice Model. arXiv preprint arXiv:1907.02436. https://arxiv.org/abs/1907.02436
- Lechner, M., & Okasa, G. (2020). orf: Ordered Random Forests. R package version 0.1.3, https://CRAN.R-project.org/package=orf
- Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830.
Classes
class OrderedForest (n_estimators=1000, min_samples_leaf=5, max_features=None, replace=False, sample_fraction=0.5, honesty=True, honesty_fraction=0.5, inference=False, n_jobs=-1, random_state=None)
-
Ordered Random Forests class labeled OrderedForest. Initializes parameters for estimation.
Parameters
n_estimators
:int
- The number of trees in the forest. The default is 1000.
min_samples_leaf
:int
-
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
The default is 5.
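The int/float rule for min_samples_leaf described above can be sketched in a few lines. The helper name resolve_min_samples_leaf is hypothetical and not part of orf; it only mirrors the documented rule.

```python
import math

def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Return the minimum leaf size as an absolute number of samples.

    Hypothetical helper that mirrors the documented rule; not part of orf.
    """
    if isinstance(min_samples_leaf, int):
        # an integer is used as the minimum number directly
        return min_samples_leaf
    # a float is read as a fraction of the training sample
    return math.ceil(min_samples_leaf * n_samples)

print(resolve_min_samples_leaf(5, 1000))       # integer input is kept as-is
print(resolve_min_samples_leaf(0.0125, 1000))  # 12.5 samples, rounded up
```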
max_features
:float, int or NoneType
-
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
- If None, then max_features=ceil(sqrt(n_features)).
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
The default is None.
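As a quick check of the three max_features cases listed above, the following sketch computes the number of features considered per split. The function resolve_max_features is a hypothetical illustration, not an orf function.

```python
import math

def resolve_max_features(max_features, n_features):
    """Number of features considered at each split (documented rule only).

    Hypothetical helper for illustration; not part of orf.
    """
    if max_features is None:
        # default: ceil(sqrt(n_features))
        return math.ceil(math.sqrt(n_features))
    if isinstance(max_features, int):
        # an integer is used directly
        return max_features
    # a float is a fraction of all features
    return round(max_features * n_features)

for setting in (None, 3, 0.33):
    print(setting, resolve_max_features(setting, n_features=10))
```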
replace
:bool
- If True, sampling with replacement (i.e. bootstrap) is used to grow the trees, otherwise subsampling without replacement is used. The default is False.
sample_fraction
:float
- Subsampling rate, i.e. the share of samples to draw from the training data to build each tree. The default is 0.5.
honesty
:bool
- If True, an honest forest is built using sample splitting. The default is True.
honesty_fraction
:float
- Share of observations assigned to the honest sample, which is not used for growing the forest. The default is 0.5.
inference
:bool
- If True, weight-based inference (i.e. variance estimation and uncertainty quantification of the estimates) is conducted. Note that this is a computationally intensive procedure and slows down the program. The default is False.
n_jobs
:int or None
-
The number of parallel jobs to be used for multithreading in .fit(), .predict() and .margins(). Follows joblib semantics:
- n_jobs=-1 means all - 1 available cpu cores.
- n_jobs=None and n_jobs=1 mean no parallelism.
The default is -1.
random_state
:int, None or numpy.random.RandomState object
- Random seed used to initialize the pseudo-random number generator. See numpy documentation for details. The default is None.
Returns
None. Initializes parameters for OrderedForest.
Notes
OrderedForest includes methods to .fit() the model, .predict() the probabilities and evaluate marginal effects via .margins(). Furthermore, it provides functions to interpret the estimation outputs such as .summary(), .plot() and .performance().
The Ordered Forest estimates the conditional ordered choice probabilities, i.e. P[Y=m|X=x]. Additionally, weight-based inference for the probability predictions can be conducted. If inference is desired, the Ordered Forest must be estimated with honesty and subsampling. If only prediction is desired, estimation without honesty and with bootstrapping is recommended for optimal prediction performance.
In order to estimate the Ordered Forest, users must supply the data in the form of an array-like matrix of features X and an array-like vector of outcomes y to the .fit() function. These data inputs are also the only inputs that must be specified by the user without any defaults. Further optional arguments for the OrderedForest class include the classical forest hyperparameters such as the number of trees, n_estimators, the number of randomly selected features at each split, max_features, and the minimum leaf size, min_samples_leaf. The forest building scheme is regulated by the replace argument, meaning bootstrapping if replace=True or subsampling if replace=False. For the case of subsampling, the sample_fraction argument regulates the subsampling rate. Further, an honest forest is estimated if the honesty argument is set to True, which is also the default. Similarly, the fraction of the sample used for the honest estimation is regulated by the honesty_fraction argument. The default setting conducts a 50:50 sample split, which is also generally advised for optimal performance. The inference procedure of the Ordered Forest is based on the forest weights and is controlled by the inference argument. Note that such weight-based inference is computationally demanding due to the estimation of the forest weights, so longer computation times are to be expected. To speed up the estimation, n_jobs provides an option for multithreading via the joblib library. Lastly, the random_state argument allows setting the seed for replicability.
For further details, see examples below.
Examples
# load orf package
import orf
# initialize Ordered Forest with default parameters
oforest = orf.OrderedForest()
# initialize Ordered Forest with own tuning parameters
oforest = orf.OrderedForest(n_estimators=2000, min_samples_leaf=10,
                            max_features=3)
# initialize Ordered Forest with bootstrapping and without honesty
oforest = orf.OrderedForest(replace=True, honesty=False)
# initialize Ordered Forest with subsampling and with honesty
oforest = orf.OrderedForest(replace=False, honesty=True)
# initialize Ordered Forest with subsampling and with honesty
# with own tuning for subsample fraction and honesty fraction
oforest = orf.OrderedForest(replace=False, sample_fraction=0.5,
                            honesty=True, honesty_fraction=0.5)
# initialize Ordered Forest with subsampling, honesty and
# inference (for inference, subsampling and honesty are required)
oforest = orf.OrderedForest(replace=False, honesty=True, inference=True)
# initialize Ordered Forest with all custom settings
oforest = orf.OrderedForest(n_estimators=2000, min_samples_leaf=10,
                            max_features=3, replace=True,
                            sample_fraction=1, honesty=False,
                            honesty_fraction=0, inference=False)
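Since inference requires both honesty and subsampling, it can be convenient to check a combination of settings before starting a long estimation. The sketch below is a hypothetical user-side check; how orf itself reacts to an invalid combination is not stated here, and check_inference_settings is not part of the package.

```python
def check_inference_settings(replace, honesty, inference):
    """Validate the documented requirement for weight-based inference.

    Hypothetical helper: inference needs honesty=True and subsampling
    (replace=False). Returns True if the combination is consistent.
    """
    if inference and (replace or not honesty):
        raise ValueError(
            "inference=True requires honesty=True and replace=False"
        )
    return True

# consistent: subsampling, honesty and inference
print(check_inference_settings(replace=False, honesty=True, inference=True))
# consistent: prediction only with bootstrapping and without honesty
print(check_inference_settings(replace=True, honesty=False, inference=False))
```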
Ancestors
- orf._OrderedRandomForest.OrderedRandomForest
- orf._BaseOrderedForest.BaseOrderedForest
- sklearn.base.BaseEstimator
Methods
def fit(self, X, y)
-
Estimation of the ordered choice model via the Ordered Forest estimator of class OrderedForest.
Parameters
X
:array-like of shape (n_samples, n_features)
- The training input samples (i.e. the matrix of covariates). Internally, its dtype will be converted to dtype=np.float32.
y
:array-like of shape (n_samples,)
- The ordinal outcome values as integers ranging from 1 up to nclass.
Returns
self
:object
- The fitted estimator.
Notes
.fit() estimates the ordered choice model via the Ordered Forest estimator and outputs the conditional ordered choice probabilities, i.e. P[Y=m|X=x]. The user must supply the data in the form of an array-like matrix of features X and an array-like vector of outcomes y of ordered classes.
Examples
# load orf package
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# initialize Ordered Forest with default parameters
oforest = orf.OrderedForest()
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with own tuning parameters
oforest = orf.OrderedForest(n_estimators=2000, min_samples_leaf=10,
                            max_features=3)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with bootstrapping and without honesty
oforest = orf.OrderedForest(replace=True, honesty=False)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with subsampling and with honesty
oforest = orf.OrderedForest(replace=False, honesty=True)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with subsampling and with honesty
# with own tuning for subsample fraction and honesty fraction
oforest = orf.OrderedForest(replace=False, sample_fraction=0.5,
                            honesty=True, honesty_fraction=0.5)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with subsampling, honesty and
# inference (for inference, subsampling and honesty are required)
oforest = orf.OrderedForest(replace=False, honesty=True, inference=True)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
# initialize Ordered Forest with all custom settings
oforest = orf.OrderedForest(n_estimators=2000, min_samples_leaf=10,
                            max_features=3, replace=True,
                            sample_fraction=1, honesty=False,
                            honesty_fraction=0, inference=False)
# estimate Ordered Forest
oforest.fit(X=features, y=outcome)
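To give some intuition for what happens with outcomes coded as integers from 1 to nclass, the sketch below follows the idea described in Lechner & Okasa (2019): binary indicators 1{Y<=m} are formed for each class boundary, their conditional means are estimated, and the class probabilities P[Y=m|X=x] are obtained by differencing, setting negative values to zero and normalizing. This is a plain-Python illustration of that idea only, not the package's code; both function names are hypothetical, and the forest step is replaced by given numbers.

```python
def cumulative_indicators(y, nclass):
    """For each m = 1..nclass-1, build the binary indicator 1{y <= m}."""
    return [[1 if yi <= m else 0 for yi in y] for m in range(1, nclass)]

def class_probabilities(cum_probs):
    """Turn estimates of P[Y<=m|X=x], m = 1..nclass-1, into P[Y=m|X=x]."""
    # add the trivial bounds P[Y<=0] = 0 and P[Y<=nclass] = 1
    bounds = [0.0] + list(cum_probs) + [1.0]
    # difference adjacent cumulative probabilities
    diffs = [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]
    # negative differences can occur in finite samples: set them to zero
    clipped = [max(d, 0.0) for d in diffs]
    # normalize so that the probabilities sum to one
    total = sum(clipped)
    return [c / total for c in clipped]

# indicators for outcomes coded 1..3
print(cumulative_indicators([1, 2, 3, 2], nclass=3))
# well-behaved cumulative estimates
print(class_probabilities([0.2, 0.7]))
# non-monotone cumulative estimates are clipped and renormalized
print(class_probabilities([0.6, 0.5]))
```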
def margins(self, X=None, X_cat=None, X_eval=None, eval_point='mean', window=0.1, verbose=True)
-
Evaluation of marginal effects based on the estimated Ordered Forest of class OrderedForest.
Parameters
X
:array-like of shape (n_samples, n_features) or NoneType
- Matrix of new features/covariates or None if covariates from the fit function should be used. If new data are provided, they must have the same number of features as the X in the .fit() function.
X_cat
:list or tuple or NoneType
- List or tuple indicating the columns with categorical covariates, e.g. X_cat=(1,) or X_cat=[1] if the second column includes categorical values. If not defined, covariates with integer values and fewer than 10 unique values are considered to be categorical by default.
X_eval
:list or tuple or NoneType
- List or tuple indicating the columns with covariates for which the marginal effect should be evaluated, e.g. X_eval=(0,) or X_eval=[0] if the effect for the covariate in the first column should be evaluated. This can significantly speed up the program. If not defined, all covariates are considered by default.
eval_point
:string
- Defines the evaluation point for marginal effects. This can be one of "mean", "atmean", or "atmedian". The default is "mean".
window
:float
- The share of the standard deviation of X to be used for evaluation of the marginal effect. The default is 0.1.
verbose
:bool
- Whether the results should be printed to the console. The default is True.
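The default rule for X_cat (integer-valued covariates with fewer than 10 unique values are treated as categorical) can be previewed on a dataset before calling .margins(). The following is a minimal sketch of that rule on a list of rows; guess_categorical_columns is a hypothetical helper, and the package's own detection may differ in details such as the handling of missing values.

```python
def guess_categorical_columns(rows, max_unique=10):
    """Return indices of columns that look categorical under the documented rule.

    A column qualifies if all of its values are integer-valued and it has
    fewer than max_unique distinct values. Hypothetical helper, not part of orf.
    """
    n_cols = len(rows[0])
    categorical = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        all_integer = all(float(v).is_integer() for v in column)
        if all_integer and len(set(column)) < max_unique:
            categorical.append(j)
    return categorical

rows = [
    [0.5, 1, 0],
    [1.2, 2, 1],
    [-0.3, 3, 0],
    [2.0, 1, 1],
]
# column 0 is continuous, columns 1 and 2 are integer-valued with few levels
print(guess_categorical_columns(rows))
```

The returned indices can be compared with the columns one would pass explicitly as X_cat.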
Returns
result
:dict
- Dictionary containing results of marginal effects estimation. Use result.get("...") with "effects", "variances", "std_errors", "t-values", "p-values", "ci-up" or "ci-down" to extract arrays of marginal effects, variances, standard errors, t-values, p-values or upper and lower confidence intervals, respectively. All of these arrays are of shape (n_samples, nclass).
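The relation between the returned quantities can be illustrated for a single effect and its variance. The sketch below assumes a normal approximation and a 95% confidence level; both are assumptions made for illustration, and the helper inference_summary is hypothetical rather than part of orf.

```python
import math
from statistics import NormalDist

def inference_summary(effect, variance, level=0.95):
    """Derive standard error, t-value, p-value and confidence bounds.

    Assumes a normal approximation and a two-sided test; the confidence
    level is an assumption, not a documented orf setting.
    """
    std_error = math.sqrt(variance)
    t_value = effect / std_error
    # two-sided p-value under the standard normal distribution
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))
    # critical value for the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return {
        "std_error": std_error,
        "t-value": t_value,
        "p-value": p_value,
        "ci-up": effect + z * std_error,
        "ci-down": effect - z * std_error,
    }

print(inference_summary(effect=0.1, variance=0.0025))
```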
Notes
.margins() evaluates marginal effects at the mean, at the median, or the mean marginal effects, depending on the eval_point argument. For greater flexibility in the marginal effects computation, the argument X_eval controls for which features the marginal effects should be evaluated. If not defined, the marginal effects of all features are computed, which might be computationally expensive. Additionally, the evaluation window for the marginal effects can be regulated through the window argument. Furthermore, the user might specify which features/covariates should be explicitly handled as categorical via the X_cat argument. Moreover, new test data for which marginal effects should be evaluated can be supplied via the X argument as long as it lies within the support of the training X data. In addition to the estimation of the marginal effects, weight-based inference for the effects is supported as well; this is inherited from the OrderedForest class arguments. Note that the inference procedure is computationally demanding due to the computation of the forest weights. It is advised to increase the number of subsampling replications in the supplied OrderedForest object, as the estimation of the marginal effects is a more demanding exercise than a simple Ordered Forest estimation/prediction.
Examples
# load packages
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# estimate Ordered Forest
oforest = orf.OrderedForest().fit(X=features, y=outcome)
# estimate default (mean) marginal effects for all features
marg = oforest.margins()
# return mean marginal effects as array
print(marg.get("effects"))
# estimate mean marginal effects, explicitly defining the second
# column of the features as categorical
marg = oforest.margins(X_cat=[1])
# estimate mean marginal effects for the first and third column of X
marg = oforest.margins(X_eval=[0, 2])
# estimate marginal effects at the mean and at the median
marg_atmean = oforest.margins(eval_point="atmean")
marg_atmedian = oforest.margins(eval_point="atmedian")
# estimate Ordered Forest using inference
oforest = orf.OrderedForest(inference=True).fit(X=features, y=outcome)
# estimate mean marginal effects for the first column of X
marg = oforest.margins(X_eval=[0])
# return marginal effects as array
print(marg.get("effects"))
# return variances as array
print(marg.get("variances"))
# return standard errors as array
print(marg.get("std_errors"))
# return t-values as array
print(marg.get("t-values"))
# return p-values as array
print(marg.get("p-values"))
# return upper confidence intervals as array
print(marg.get("ci-up"))
# return lower confidence intervals as array
print(marg.get("ci-down"))
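The role of the window argument can be illustrated with a finite-difference sketch for a continuous covariate: the class probabilities are evaluated slightly above and below the evaluation point, at a distance of window times the standard deviation, and the difference is scaled by the distance between the two points. This is an assumption-laden illustration, not the package's code: the toy probability function stands in for a fitted forest, and the treatment of categorical covariates and of the boundaries of the support is left out.

```python
import math

def toy_probs(x):
    """Three-class ordered probabilities for one covariate x (toy model)."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    c1 = sigmoid(-1.0 - x)  # stands in for P[Y<=1|X=x]
    c2 = sigmoid(1.0 - x)   # stands in for P[Y<=2|X=x]
    return [c1, c2 - c1, 1.0 - c2]

def marginal_effect(prob_fn, x, sd, window=0.1):
    """Finite-difference effect of x on each class probability."""
    x_up = x + window * sd
    x_down = x - window * sd
    p_up = prob_fn(x_up)
    p_down = prob_fn(x_down)
    return [(u - d) / (x_up - x_down) for u, d in zip(p_up, p_down)]

effects = marginal_effect(toy_probs, x=0.0, sd=1.0, window=0.1)
# the effects across classes sum to (numerically) zero,
# because the class probabilities always sum to one
print(effects, sum(effects))
```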
def performance(self)
-
Print the prediction performance of the Ordered Forest object of class OrderedForest.
Parameters
None.
Returns
None. Prints MSE, RPS, Classification accuracy and confusion matrix.
Notes
.performance() evaluates the probability and class predictions in terms of Ranked Probability Score (RPS), Mean Squared Error (MSE) and Classification Accuracy (CA). In addition, it prints the confusion matrix.
Examples
# load package
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# estimate Ordered Forest
oforest = orf.OrderedForest().fit(X=features, y=outcome)
# print the prediction performance measures
oforest.performance()
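For readers who want to see what the three measures capture, the sketch below computes MSE, RPS and classification accuracy from predicted probabilities and outcomes coded 1 to nclass. The scaling used here (MSE averaged over classes, RPS divided by nclass - 1) follows common conventions and is an assumption; the numbers printed by .performance() may be scaled differently. The function performance_measures is hypothetical.

```python
def performance_measures(probs, y):
    """Compute MSE, RPS and accuracy for ordered probability predictions.

    probs: list of rows with one probability per class (classes 1..nclass)
    y: list of observed classes coded as integers 1..nclass
    """
    n = len(y)
    nclass = len(probs[0])
    mse = rps = correct = 0.0
    for p, yi in zip(probs, y):
        onehot = [1.0 if m == yi else 0.0 for m in range(1, nclass + 1)]
        # squared error between predicted probabilities and the one-hot outcome
        mse += sum((pm - om) ** 2 for pm, om in zip(p, onehot)) / nclass
        # squared error between cumulative distributions (ranked probability score)
        cum_p = cum_o = score = 0.0
        for pm, om in zip(p, onehot):
            cum_p += pm
            cum_o += om
            score += (cum_p - cum_o) ** 2
        rps += score / (nclass - 1)
        # predicted class: highest probability, first class wins ties
        predicted = max(range(nclass), key=lambda m: p[m]) + 1
        correct += 1.0 if predicted == yi else 0.0
    return {"mse": mse / n, "rps": rps / n, "accuracy": correct / n}

print(performance_measures([[0.5, 0.5, 0.0], [0.1, 0.2, 0.7]], [1, 3]))
```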
def plot(self)
-
Plot the probability distributions estimated by the Ordered Forest object of class OrderedForest.
Parameters
None.
Returns
fig
:object of type ggplot
- Plot of probability distributions.
Notes
.plot() generates probability distributions, i.e. density plots of the estimated ordered probabilities by the Ordered Forest for each outcome class considered. The plots effectively visualize the estimated probability density in contrast to a real observed ordered outcome class and as such provide a visual inspection of the overall in-sample estimation accuracy. The dashed lines locate the means of the respective probability distributions.
Examples
# load package
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# estimate Ordered Forest
oforest = orf.OrderedForest().fit(X=features, y=outcome)
# plot the estimated probability distributions
oforest.plot()
def predict(self, X=None, prob=True)
-
Prediction for new observations based on the estimated Ordered Forest of class OrderedForest.
Parameters
X
:array-like of shape (n_samples, n_features) or NoneType
- Matrix of new features/covariates or None if covariates from the fit function should be used. If new data are provided, they must have the same number of features as the X in the .fit() function.
prob
:bool
- If True, ordered probabilities are predicted. Otherwise, ordered classes are predicted instead. Note that inference is only available for probability predictions. The default is True.
Returns
result
:dict
- Dictionary containing prediction results. Use result.get("predictions") to extract the array of predictions and result.get("variances") to extract the array of variances. Both of these arrays are of shape (n_samples, nclass).
Notes
.predict() estimates the conditional ordered choice probabilities, i.e. P[Y=m|X=x], for new data points (array-like matrix of features X containing new test observations) based on the estimated Ordered Forest object of class OrderedForest. Furthermore, weight-based inference for the probability predictions can be conducted as well; this is inherited from the OrderedForest class arguments. If inference is desired, the supplied Ordered Forest must be estimated with honesty and subsampling. If only prediction is desired, estimation without honesty and with bootstrapping is recommended for optimal predictive performance. In addition to the probability predictions, class predictions can be obtained by setting prob=False. In this case, for each observation the class with the highest predicted probability is returned.
Examples
# load packages
import orf
from sklearn.model_selection import train_test_split
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# generate train and test set
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.2, random_state=123)
# estimate Ordered Forest
oforest = orf.OrderedForest().fit(X=X_train, y=y_train)
# predict the probabilities with the estimated Ordered Forest
pred = oforest.predict(X=X_test)
# return predictions as array
print(pred.get("predictions"))
# predict the classes with estimated Ordered Forest
pred_class = oforest.predict(X=X_test, prob=False)
# return predictions as array
pred_class.get("predictions")
# estimate Ordered Forest using inference
oforest = orf.OrderedForest(inference=True).fit(X=X_train, y=y_train)
# predict the probabilities together with variances
pred_inf = oforest.predict(X=X_test)
# return predictions as array
print(pred_inf.get("predictions"))
# return variances as array
print(pred_inf.get("variances"))
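The rule behind prob=False, returning the class with the highest predicted probability for each observation, can be reproduced from a matrix of probability predictions. The sketch below works on plain lists and assumes classes are coded 1 to nclass and that ties go to the lower class; the tie rule and the helper name classes_from_probabilities are assumptions for illustration.

```python
def classes_from_probabilities(probs):
    """Return, for each row of class probabilities, the most likely class.

    Classes are numbered 1..nclass; on ties the lowest class is returned
    (an assumption, the package's tie handling is not documented here).
    """
    classes = []
    for row in probs:
        best = max(range(len(row)), key=lambda m: row[m])
        classes.append(best + 1)
    return classes

predicted = classes_from_probabilities([
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
print(predicted)
```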
def summary(self, item=None)
-
Summary of the estimated Ordered Forest object of class OrderedForest.
Parameters
item
:dict or NoneType
- Object that should be summarized: Either prediction or margins output or None. If None, then forest parameters will be printed. The default is None.
Returns
None. Prints summary to console.
Notes
.summary() provides a short summary of the Ordered Forest estimation, including the input information regarding the values of hyperparameters as well as the output information regarding the prediction accuracy.
Examples
# load package
import orf
# get example data
features, outcome = orf.make_ordered_regression(seed=123)
# estimate Ordered Forest
oforest = orf.OrderedForest().fit(X=features, y=outcome)
# print summary of estimation
oforest.summary()
# predict the probabilities with the estimated Ordered Forest
pred = oforest.predict()
# print summary of the Ordered Forest predictions
oforest.summary(pred)
# estimate marginal effects for first feature
marg = oforest.margins(X_eval=[0])
# print summary of the marginal effects
oforest.summary(marg)
Functions
def make_ordered_regression(n_samples=1000, y_classes=3, p_cont=1, p_cat=1, cat_classes=3, p_binary=1, noise=True, seed=None)
-
Generate example data for Ordered Forest estimation.
Parameters
n_samples
:integer
- The number of observations. The default is 1000.
y_classes
:integer
- The number of classes of the outcome variable. The default is 3.
p_cont
:integer
- The number of continuous covariates drawn from a normal distribution. The default is 1.
p_cat
:integer
- The number of categorical covariates drawn from a binomial distribution. The default is 1.
cat_classes
:integer
- The number of classes of the categorical variable(s). The default is 3.
p_binary
:integer
- The number of binary covariates drawn from a binomial distribution. The default is 1.
noise
:boolean
- Whether to include a continuous noise variable that does not influence the outcome. The default is True.
seed
:integer or NoneType
- Set seed for reproducibility. The default is None.
Returns
X
:ndarray
- The generated covariates/features.
y
:ndarray
- The generated outcomes.
Notes
This function generates an example dataset of size n_samples, consisting of an ordered outcome variable with y_classes classes and an array of features of different types. The data-generating process (DGP) may include continuous (p_cont), binary (p_binary) and categorical (p_cat) features. In addition, it is possible to include a noise variable in the DGP that does not affect the outcome variable by specifying noise=True.
Example
# load orf package
import orf
# generate data consisting of 2000 observations and 4 outcome classes
features, outcome = orf.make_ordered_regression(n_samples=2000,
                                                y_classes=4, seed=123)