stpredictions.models.OK3 package

Subpackages

Submodules

stpredictions.models.OK3.base module

class stpredictions.models.OK3.base.StructuredOutputMixin

Bases: object

Mixin to mark estimators that support structured prediction.

stpredictions.models.OK3.kernel module

class stpredictions.models.OK3.kernel.Gaussian_Kernel(gamma=1)

Bases: stpredictions.models.OK3.kernel.Kernel

evaluate(y1, y2)
get_Gram_matrix(list_y_1, list_y_2=None)
get_sq_norms(list_y)
class stpredictions.models.OK3.kernel.Gini_Kernel

Bases: stpredictions.models.OK3.kernel.Mean_Dirac_Kernel

Identical to ‘Mean_Dirac_Kernel’, but signals that decoding is not performed over a candidate set but is an exhaustive search.

class stpredictions.models.OK3.kernel.Kernel(name)

Bases: object

evaluate(obj1, obj2)
get_Gram_matrix(objects_1, objects_2=None)
get_name()
get_sq_norms(objects)
class stpredictions.models.OK3.kernel.Laplacian_Kernel(gamma=1)

Bases: stpredictions.models.OK3.kernel.Kernel

evaluate(y1, y2)
get_Gram_matrix(list_y_1, list_y_2=None)
get_sq_norms(list_y)
class stpredictions.models.OK3.kernel.Linear_Kernel

Bases: stpredictions.models.OK3.kernel.Kernel

evaluate(y1, y2)
get_Gram_matrix(list_y_1, list_y_2=None)
get_sq_norms(list_y)
class stpredictions.models.OK3.kernel.MSE_Kernel

Bases: stpredictions.models.OK3.kernel.Linear_Kernel

Identical to ‘Linear_Kernel’, but signals that decoding is not performed over a candidate set but is an exact solution.

class stpredictions.models.OK3.kernel.Mean_Dirac_Kernel

Bases: stpredictions.models.OK3.kernel.Kernel

evaluate(y1, y2)
get_Gram_matrix(list_y_1, list_y_2=None)
get_sq_norms(list_y)
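As a rough guide, the kernels above can be sketched as standalone NumPy functions. These are illustrative re-implementations under assumed formulas, not the package's code (in particular, the exact normalizations used by the package are not shown here):

```python
import numpy as np

def linear_k(y1, y2):
    # Linear_Kernel sketch: plain dot product of the vector representations.
    return float(np.dot(y1, y2))

def gaussian_k(y1, y2, gamma=1.0):
    # Gaussian_Kernel sketch: k(y1, y2) = exp(-gamma * ||y1 - y2||^2).
    d2 = np.sum((np.asarray(y1, float) - np.asarray(y2, float)) ** 2)
    return float(np.exp(-gamma * d2))

def mean_dirac_k(y1, y2):
    # Mean_Dirac_Kernel sketch: fraction of coordinates on which y1 and y2 agree.
    return float(np.mean(np.asarray(y1) == np.asarray(y2)))

def gram(kernel, ys_1, ys_2=None):
    # get_Gram_matrix sketch: pairwise kernel evaluations; if ys_2 is None,
    # the Gram matrix of ys_1 with itself is returned.
    ys_2 = ys_1 if ys_2 is None else ys_2
    return np.array([[kernel(a, b) for b in ys_2] for a in ys_1])

K = gram(mean_dirac_k, [[1, 0, 1], [1, 1, 1]])
```

`get_sq_norms` then corresponds to the diagonal `[kernel(y, y) for y in list_y]`.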

stpredictions.models.OK3.structured_object module

class stpredictions.models.OK3.structured_object.StructuredObject

Bases: object

get_name()
is_equal_to(other)
similarity_with(other, kernel)

Module contents

The stpredictions.models.OK3 module includes decision tree-based models for classification and regression.

class stpredictions.models.OK3.BaseKernelizedOutputTree(*, criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes, random_state, min_impurity_decrease, min_impurity_split, ccp_alpha=0.0, kernel)

Bases: stpredictions.models.OK3.base.StructuredOutputMixin, sklearn.base.BaseEstimator

Base class for regression trees with a kernel in the output space.

Warning: This class should not be used directly. Use derived classes instead.

apply(X, check_input=True)

Return the index of the leaf that each sample is predicted as.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

check_input : bool, default=True

Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

X_leaves : array-like of shape (n_samples,)

For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.
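Assuming apply behaves like sklearn's DecisionTreeRegressor.apply, a minimal illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two well-separated clusters: the tree needs a single split.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
leaves = tree.apply(X)  # leaf index for every sample

# Samples that share a leaf get the same index; indices lie in
# [0; tree.tree_.node_count), possibly with gaps.
assert leaves[0] == leaves[1] and leaves[2] == leaves[3]
assert leaves.max() < tree.tree_.node_count
```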

cost_complexity_pruning_path(X, y, sample_weight=None)

Compute the pruning path during Minimal Cost-Complexity Pruning.

See minimal_cost_complexity_pruning for details on the pruning process.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The target values (class labels) as integers or strings.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

ccp_path : Bunch

Dictionary-like object, with the following attributes.

ccp_alphas : ndarray

Effective alphas of subtree during pruning.

impurities : ndarray

Sum of the impurities of the subtree leaves for the corresponding alpha value in ccp_alphas.
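Assuming the method mirrors sklearn's cost_complexity_pruning_path, a quick illustration of the returned Bunch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 1.5, 3.0])

path = DecisionTreeRegressor(random_state=0).fit(X, y) \
    .cost_complexity_pruning_path(X, y)

# ccp_alphas comes in increasing order; impurities grows as the tree
# is pruned harder (fewer leaves -> larger total leaf impurity).
assert np.all(np.diff(path.ccp_alphas) >= 0)
assert np.all(np.diff(path.impurities) >= 0)
assert len(path.ccp_alphas) == len(path.impurities)
```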

decision_path(X, check_input=True)

Return the decision path in the tree.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

check_input : bool, default=True

Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

indicator : sparse matrix of shape (n_samples, n_nodes)

Returns a node indicator CSR matrix in which non-zero elements indicate that the sample goes through the corresponding node.
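Assuming decision_path behaves like sklearn's, a short illustration of the indicator matrix:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
indicator = tree.decision_path(X)  # CSR matrix (n_samples, n_nodes)

# Every sample passes through the root (node 0), and the matrix has one
# column per node of the tree.
assert np.all(indicator.toarray()[:, 0] == 1)
assert indicator.shape == (len(X), tree.tree_.node_count)
```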

decode(X, candidates=None, check_input=True, return_top_k=1)

Synonym for predict.

decode_tree(candidates=None, return_top_k=1)

Decode each leaf prediction of the tree, AND store the array of the decoded outputs as an attribute of the estimator: self.leaves_preds.

WARNING: the predictions corresponding to nodes that are not leaves are meaningless: they are arbitrary. They are deliberately not computed, to save time.

candidates : array of shape (nb_candidates, vectorial_repr_len), default=None

The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it will be set to the training output matrix.

return_top_k : int, default=1

The number of decoded outputs to return for each example (or for each leaf). The default is 1: select the output that minimizes the “distance” to the predicted value in the Hilbert space. Returning, for example, the 5 best candidates is useful for evaluating a top-5 accuracy metric.

leaves_preds : array-like of shape (node_count, vector_length)

For each leaf, return the vectorial representation of the output in ‘candidates’ that minimizes the “distance” to the “exact” prediction in the Hilbert space.

leaves_preds[i*return_top_k : (i+1)*return_top_k] is the unordered list of the decoded outputs of node i among the candidates.
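The decoding step itself can be sketched in NumPy. This is a minimal sketch assuming the leaf prediction is a weighted combination of training outputs; decode_top_k is a hypothetical helper, not the package's API:

```python
import numpy as np

def decode_top_k(weights, K_cand_train, cand_sq_norms, k=1):
    # weights:       (n_train,) leaf weights over the training outputs
    # K_cand_train:  (n_cand, n_train) kernel values between candidates
    #                and training outputs
    # cand_sq_norms: (n_cand,) values k(c, c) for each candidate
    #
    # Squared Hilbert distance, up to a constant independent of c:
    #   ||phi(c) - sum_i w_i phi(y_i)||^2 = k(c,c) - 2 sum_i w_i k(c,y_i) + const
    scores = cand_sq_norms - 2.0 * K_cand_train @ weights
    return np.argsort(scores)[:k]  # indices of the k closest candidates

# Linear-kernel toy check: training outputs and candidates in R^2.
Y = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
w = np.array([0.5, 0.5, 0.0])  # leaf containing the first two samples
best = decode_top_k(w, C @ Y.T, np.sum(C ** 2, axis=1), k=1)
```

With return_top_k > 1, the same scores array simply yields the k smallest entries instead of one.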

property feature_importances_

Return the feature importances.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

feature_importances_ : ndarray of shape (n_features,)

Normalized total reduction of criteria by feature (Gini importance).

fit(X, y, sample_weight=None, check_input=True, X_idx_sorted='deprecated', kernel=None, in_ensemble=False, Gram_y=None)
kernel : optional

If not provided, the kernel attribute of the estimator is used. If provided, the kernel attribute of the estimator is updated.

Some possible values:

“gini_clf” : y is a matrix of labels for multilabel classification,

of shape (n_train_samples, n_labels). We compute the corresponding Gram matrix, equivalent to using a classification tree with the Gini index as impurity. Exact solution search is performed.

“mse_reg” : y is a matrix of real vectors for multioutput regression,

of shape (n_train_samples, n_outputs). We compute the corresponding Gram matrix, equivalent to using a regression tree with the MSE as impurity. Exact solution search is performed.

“mean_dirac” : y is a matrix of vectors (vectorial representations of structured objects),

of shape (n_train_samples, vector_length). The similarity between two objects is computed with a mean dirac equality kernel.

“linear” : y is a matrix of vectors (vectorial representations of structured objects),

of shape (n_train_samples, vector_length). The similarity between two objects is computed with a linear kernel.

“gaussian” : y is a matrix of vectors (vectorial representations of structured objects),

of shape (n_train_samples, vector_length). The similarity between two objects is computed with a Gaussian kernel.

in_ensemble : bool, default=False

Flag to set to True when the estimator is used within an ensemble method. If True, the Gram matrix of the outputs is also given (as Gram_y) and doesn’t have to be computed by the tree, avoiding this heavy calculation for every tree.

Gram_y : ndarray, default=None

The output Gram matrix. Allows skipping the Gram matrix calculation if we already have it (useful when in_ensemble=True).
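For ensemble use, the output Gram matrix can be precomputed once and passed to every tree via Gram_y. A hypothetical sketch of such a precomputation (output_gram is illustrative, not part of the package):

```python
import numpy as np

def output_gram(Y, kernel="linear"):
    # Hypothetical helper: compute the output Gram matrix once so each tree
    # of an ensemble can receive it via fit(..., in_ensemble=True, Gram_y=G)
    # instead of recomputing it per tree.
    Y = np.asarray(Y, dtype=float)
    if kernel == "linear":       # "mse_reg" relies on the same linear Gram
        return Y @ Y.T
    if kernel == "mean_dirac":   # "gini_clf" relies on the same mean-dirac Gram
        return np.mean(Y[:, None, :] == Y[None, :, :], axis=2)
    raise ValueError(f"unknown kernel: {kernel!r}")

Y = np.array([[1, 0], [1, 1], [0, 0]])
G = output_gram(Y, kernel="mean_dirac")  # (n_train_samples, n_train_samples)
```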

get_depth()

Return the depth of the decision tree.

The depth of a tree is the maximum distance between the root and any leaf.

self.tree_.max_depth : int

The maximum depth of the tree.

get_leaves_weights()

Gives the weighted training samples in each leaf.

Returns a (n_nodes, n_training_samples) array which gives, for each node (row) and for each training sample, the sample’s weight in the node (0 if the sample doesn’t fall in the node, and a non-negative value depending on ‘sample_weight’ otherwise).

get_n_leaves()

Return the number of leaves of the decision tree.

self.tree_.n_leaves : int

Number of leaves.

predict(X, candidates=None, check_input=True, return_top_k=1)

Predict structured objects for X.

The predicted structured objects based on X are returned. Performs an argmin search among the possible outputs.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

candidates : array of shape (nb_candidates, vectorial_repr_len), default=None

The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it will be set to the training output matrix.

check_input : bool, default=True

Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

return_top_k : int, default=1

The number of decoded outputs to return for each example (or for each leaf). The default is 1: select the output that minimizes the “distance” to the predicted value in the Hilbert space. Returning, for example, the 5 best candidates is useful for evaluating a top-5 accuracy metric.

output : array of shape (n_samples, vectorial_repr_len)

Array containing the vectorial representations of the structured output objects (found in the set of candidates, or, if it is not given, among the training outputs).

predict_weights(X, check_input=True)

Predict the output for X as weighted combinations of training outputs. This is, in a sense, the representation in the Hilbert space.

Returns a (len(X), n_training_samples) array which gives, for each test example (row) and for each training sample, the sample’s weight (0 if the training sample doesn’t fall in the same leaf as the test example, and a non-negative value depending on ‘sample_weight’ otherwise).
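A hypothetical NumPy sketch of how such a weight matrix can be derived from leaf assignments (leaf_weights is illustrative, not the package's code):

```python
import numpy as np

def leaf_weights(train_leaves, test_leaves, sample_weight=None):
    # Sketch of predict_weights: row i gives, for test example i, the weight
    # of each training sample (0 unless they share a leaf), normalized so
    # each row sums to 1.
    train_leaves = np.asarray(train_leaves)
    test_leaves = np.asarray(test_leaves)
    sample_weight = (np.ones(len(train_leaves)) if sample_weight is None
                     else np.asarray(sample_weight, dtype=float))
    same_leaf = test_leaves[:, None] == train_leaves[None, :]
    W = same_leaf * sample_weight
    return W / W.sum(axis=1, keepdims=True)

# Training samples 0 and 1 fall in leaf 1, sample 2 in leaf 2;
# the first test example reaches leaf 1, the second leaf 2.
W = leaf_weights(train_leaves=[1, 1, 2], test_leaves=[1, 2])
```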

r2_score_in_Hilbert(X, y, sample_weight=None)

Computes the R2 score WITHOUT decoding.

Return the coefficient of determination R^2 of the prediction in the Hilbert space. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True outputs for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

score : float

R2 score of the predictions in the Hilbert space wrt. the embedded values of y.
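For the linear kernel the feature map is the identity, so the Hilbert-space R^2 reduces to the classical 1 - u/v formula on the raw output vectors. A minimal sketch (pooling the sums over all outputs, which is an assumption about the package's convention):

```python
import numpy as np

def r2_in_hilbert_linear(y_true, y_pred):
    # 1 - u/v with u the residual sum of squares and v the total sum of
    # squares, both pooled over all output dimensions.
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    u = np.sum((y_true - y_pred) ** 2)
    v = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - u / v

perfect = r2_in_hilbert_linear([[0.0], [1.0], [2.0]], [[0.0], [1.0], [2.0]])
```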

score(X, y, candidates=None, metric='accuracy', sample_weight=None)

Computes the score AFTER decoding.

Return either:

-the coefficient of determination R^2 of the prediction (if self.kernel=”mse_reg”).

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().

-the mean accuracy of the predictions if metric=”accuracy” (requires that all labels match to count as positive in the multilabel case).

-the mean Hamming score of the predictions if metric=”hamming” (well suited for multilabel classification).

-the mean top-k accuracy score if metric=”top_”+str(k); it works with any value of k.

It is possible to set the ‘sample_weight’ parameter for all these metrics.

All these score metrics are highly dependent on the candidate set because they deal with the decoded predictions (which are drawn from this set). If you want to compute a score based only on the tree structure, you can use the method ‘r2_score_in_Hilbert’.

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True outputs for X.

candidates : array-like of shape (nb_candidates, n_outputs)

Possible decoded outputs for X.

metric : str, default=”accuracy”

The way to compute the score.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

score : float

Chosen score between self.predict(X) and y.
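The decoded-score metrics above can be sketched with NumPy. These are hypothetical helpers illustrating the definitions, not the package's implementation:

```python
import numpy as np

def exact_match_accuracy(Y_true, Y_pred):
    # metric="accuracy": in the multilabel case, all labels must match
    # for an example to count as positive.
    eq = np.asarray(Y_true) == np.asarray(Y_pred)
    return float(np.mean(np.all(eq, axis=1)))

def hamming_score(Y_true, Y_pred):
    # metric="hamming": mean fraction of matching labels.
    return float(np.mean(np.asarray(Y_true) == np.asarray(Y_pred)))

def top_k_accuracy(Y_true, Y_pred_top_k):
    # metric="top_k": a hit if the true output appears among the k returned
    # predictions for the example; Y_pred_top_k has shape (n, k, n_labels).
    hits = [np.any(np.all(preds == y, axis=1))
            for y, preds in zip(np.asarray(Y_true), np.asarray(Y_pred_top_k))]
    return float(np.mean(hits))

Y_true = np.array([[1, 0], [1, 1]])
Y_pred = np.array([[1, 0], [1, 0]])
acc = exact_match_accuracy(Y_true, Y_pred)   # one exact match out of two
ham = hamming_score(Y_true, Y_pred)          # three labels out of four match
```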

class stpredictions.models.OK3.BaseOKForest(base_estimator, n_estimators=100, *, estimator_params=(), bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, max_samples=None, kernel='linear')

Bases: stpredictions.models.OK3.base.StructuredOutputMixin, sklearn.ensemble._base.BaseEnsemble

Base class for forests of ok-trees.

Warning: This class should not be used directly. Use derived classes instead.

apply(X)

Apply trees in the forest to X, return leaf indices.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

X_leaves : ndarray of shape (n_samples, n_estimators)

For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

decision_path(X)

Return the decision path in the forest.


X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

indicator : sparse matrix of shape (n_samples, n_nodes)

Returns a node indicator matrix in which non-zero elements indicate that the sample goes through the corresponding node. The matrix is in CSR format.

n_nodes_ptr : ndarray of shape (n_estimators + 1,)

The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.

property feature_importances_

The impurity-based feature importances.

The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

feature_importances_ : ndarray of shape (n_features,)

The values of this array sum to 1, unless all trees are single node trees consisting of only the root node, in which case it will be an array of zeros.

fit(X, y, sample_weight=None)

Build a forest of ok-trees from the training set (X, y).

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The target values (class labels in classification, real numbers in regression).

sample_weight : array-like of shape (n_samples,), default=None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

self : object

class stpredictions.models.OK3.ExtraOK3Regressor(*, criterion='mse', splitter='random', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', random_state=None, min_impurity_decrease=0.0, min_impurity_split=None, max_leaf_nodes=None, ccp_alpha=0.0, kernel='linear')

Bases: stpredictions.models.OK3._classes.OK3Regressor

An extremely randomized tree regressor.

Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally random decision tree.

Warning: Extra-trees should only be used within ensemble methods.

criterion : {“mse”, “friedman_mse”, “mae”}, default=”mse”

The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error.

splitter : {“random”, “best”}, default=”random”

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : int, float, {“auto”, “sqrt”, “log2”} or None, default=”auto”

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=n_features.

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

random_state : int, RandomState instance, default=None

Used to pick randomly the max_features used at each split. See Glossary for details.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

min_impurity_split : float, default=0

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.

max_leaf_nodes : int, default=None

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

kernel : string, tuple (string, params), or instance of the Kernel class, default=”linear”

The type of kernel to use to compare the output data. Changing this parameter also implicitly changes the nature of the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in the kernel module); the optional params set particular parameter values for the chosen kernel type.

max_features_ : int

The inferred value of max_features.

n_features_ : int

The number of features when fit is performed.

feature_importances_ : ndarray of shape (n_features,)

Return impurity-based feature importances (the higher, the more important the feature).

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

tree_ : Tree

The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of Tree object and sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py for basic usage of these attributes.

ExtraTreeClassifier : An extremely randomized tree classifier.

sklearn.ensemble.ExtraTreesClassifier : An extra-trees classifier.

sklearn.ensemble.ExtraTreesRegressor : An extra-trees regressor.

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
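The weighted impurity decrease formula given above for min_impurity_decrease can be checked numerically (the numbers below are hypothetical):

```python
# N = 100 samples in total, N_t = 40 at the current node,
# N_t_L = 30 in the left child, N_t_R = 10 in the right child.
N, N_t, N_t_L, N_t_R = 100.0, 40.0, 30.0, 10.0
impurity, left_impurity, right_impurity = 0.5, 0.2, 0.1

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
# 0.4 * (0.5 - 0.25 * 0.1 - 0.75 * 0.2) = 0.4 * 0.325 = 0.13
```

With min_impurity_decrease=0.13 or less, this split would be performed; above that threshold it would be rejected.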

[1] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import BaggingRegressor
>>> from sklearn.tree import ExtraTreeRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, random_state=0)
>>> extra_tree = ExtraTreeRegressor(random_state=0)
>>> reg = BaggingRegressor(extra_tree, random_state=0).fit(
...     X_train, y_train)
>>> reg.score(X_test, y_test)
0.33...
class stpredictions.models.OK3.OK3Regressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0, kernel='linear')

Bases: stpredictions.models.OK3._classes.BaseKernelizedOutputTree

A decision tree regressor for the OK3 method.

criterion : {“mse”}, default=”mse”

The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.

splitter : {“best”, “random”}, default=”best”

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : int, float or {“auto”, “sqrt”, “log2”}, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=n_features.

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

random_state : int, RandomState instance, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.

max_leaf_nodes : int, default=None

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

min_impurity_split : float, default=0

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

kernel : string, tuple (string, params), or instance of the Kernel class, default=”linear”

The type of kernel to use to compare the output data. Changing this parameter also implicitly changes the nature of the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in the kernel module); the optional params set particular parameter values for the chosen kernel type.

feature_importances_ : ndarray of shape (n_features,)

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

max_features_ : int

The inferred value of max_features.

n_features_ : int

The number of features when fit is performed.

tree_ : Tree

The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of Tree object and sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py for basic usage of these attributes.

leaves_preds : array of shape (n_nodes, n_components),

where n_nodes is the number of nodes of the grown tree and n_components is the number of values used to represent an output.

This array stores for each leaf of the tree, the decoded predictions in Y.

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

[1] Pierre Geurts, Louis Wehenkel, Florence d’Alché-Buc. “Kernelizing the output of tree-based methods.” Proc. of the 23rd International Conference on Machine Learning, 2006, United States. pp. 345-352. doi:10.1145/1143844.1143888.

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import cross_val_score
>>> from stpredictions.models.OK3 import OK3Regressor
>>> X, y = load_diabetes(return_X_y=True)
>>> regressor = OK3Regressor(random_state=0)
>>> cross_val_score(regressor, X, y, cv=10)
array([-0.39..., -0.46...,  0.02...,  0.06..., -0.50...,
       0.16...,  0.11..., -0.73..., -0.30..., -0.00...])
fit(X, y, sample_weight=None, check_input=True, X_idx_sorted='deprecated', kernel=None, in_ensemble=False, Gram_y=None)

Build a decision tree regressor from the training set (X, y).

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The target values (real numbers). Use dtype=np.float64 and order='C' for maximum efficiency.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

check_input : bool, default=True

Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

X_idx_sorted : deprecated, default=”deprecated”

This parameter is deprecated and has no effect. It will be removed in v0.26.

kernel : string, tuple (string, params), or instance of the Kernel class, default="linear"

The type of kernel used to compare the output data. Changing this parameter also implicitly changes the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in Kernel.py); the optional params set particular parameter values for the chosen kernel type. This parameter can also be set here in the fit method instead of in __init__.

self : OK3Regressor

Fitted estimator.
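The kernel classes listed above all reduce to computing Gram matrices over output objects. As a hedged illustration, assuming the Gaussian kernel has the usual form exp(-gamma * ||y1 - y2||^2) (this page does not show the exact formula, so the function below is a sketch, not Gaussian_Kernel.get_Gram_matrix itself):

```python
import numpy as np

def gaussian_gram(list_y_1, list_y_2=None, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||y1_i - y2_j||^2).

    Assumed to mirror what Gaussian_Kernel(gamma).get_Gram_matrix computes;
    the second argument defaults to the first, as in the documented signature.
    """
    Y1 = np.asarray(list_y_1, dtype=float)
    Y2 = Y1 if list_y_2 is None else np.asarray(list_y_2, dtype=float)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (Y1 ** 2).sum(1)[:, None] + (Y2 ** 2).sum(1)[None, :] - 2.0 * Y1 @ Y2.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))

K = gaussian_gram([[0.0, 0.0], [1.0, 0.0]], gamma=1.0)
```

With list_y_2=None the result is symmetric with a unit diagonal, matching the get_sq_norms convention of evaluating each output against itself.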

class stpredictions.models.OK3.OKForestRegressor(base_estimator, n_estimators=100, *, estimator_params=(), bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, max_samples=None, kernel='linear')

Bases: stpredictions.models.OK3._forest.BaseOKForest

Base class for forest of ok-trees-based regressors.

Warning: This class should not be used directly. Use derived classes instead.

predict(X, candidates=None, return_top_k=1, precomputed_weights=None)

Predict structured objects for X.

The structured objects predicted from X are returned. Performs an argmin search over the possible outputs.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.

candidates : array of shape (nb_candidates, vectorial_repr_len), default=None

The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it is set to the training output matrix.

check_input : bool, default=True

Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.

return_top_k : int, default=1

Indicates how many decoded outputs to return for each example (or for each leaf). The default is one: select the output that minimises the "distance" to the predicted value in the Hilbert space. Returning, for example, the 5 best candidates is useful for evaluating a top-5 accuracy metric.

output

array of shape (n_samples, vectorial_repr_len) containing the vectorial representations of the structured output objects (found in the candidate set or, if none is given, among the training outputs).
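The decoding argmin can be expressed with the kernel trick alone: expanding ||h(x) - phi(y_c)||^2 with h(x) = sum_i w_i phi(y_i) leaves only Gram-matrix entries, since the ||h(x)||^2 term is constant per sample. A minimal sketch of this search (the helper name and signature are illustrative, not the library's internal API):

```python
import numpy as np

def decode(weights, K_train_cand, cand_sq_norms, top_k=1):
    """Indices of the top_k candidates minimising
    ||h(x) - phi(y_c)||^2 = k(y_c, y_c) - 2 * sum_i w_i k(y_i, y_c) + const.

    weights       : (n_samples, n_train) predicted training-sample weights
    K_train_cand  : (n_train, n_candidates) Gram entries k(y_i, y_c)
    cand_sq_norms : (n_candidates,) squared norms k(y_c, y_c)
    """
    # ||h(x)||^2 is the same for every candidate, so it is dropped.
    dist = cand_sq_norms[None, :] - 2.0 * weights @ K_train_cand
    return np.argsort(dist, axis=1)[:, :top_k]

# Linear kernel on scalars: training outputs [0, 2], candidates [0, 1, 2].
# h(x) = 0.5*0 + 0.5*2 = 1, so candidate index 1 is closest.
best = decode(np.array([[0.5, 0.5]]),
              np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 4.0]]),
              np.array([0.0, 1.0, 4.0]))
```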

predict_weights(X)

Predict weights (on the training samples) for X.

The predicted weights of an input sample are computed as the mean predicted weights of the trees in the forest.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

A (X.shape[0], n_training_samples) array giving, for each test example (row) and each training sample, its weight in the node (0 if the sample does not fall in any of the same leaves as the test example, and a non-negative value depending on sample_weight otherwise).
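A hedged sketch of what these averaged co-leaf weights could look like, assuming each tree assigns a training sample the weight 1/|leaf| when it shares the test sample's leaf (uniform sample_weight) and 0 otherwise; the actual routine works on fitted trees, not raw leaf-id arrays:

```python
import numpy as np

def forest_weights(test_leaf_ids, train_leaf_ids):
    """Average co-leaf weights over trees.

    test_leaf_ids, train_leaf_ids : (n_trees, n_samples) arrays holding the
    leaf index of each sample in each tree (as tree.apply would return).
    """
    n_trees, n_test = test_leaf_ids.shape
    n_train = train_leaf_ids.shape[1]
    W = np.zeros((n_test, n_train))
    for t in range(n_trees):
        # Which training samples share a leaf with each test sample in tree t
        same = test_leaf_ids[t][:, None] == train_leaf_ids[t][None, :]
        leaf_sizes = same.sum(axis=1, keepdims=True)
        W += np.where(leaf_sizes > 0, same / np.maximum(leaf_sizes, 1), 0.0)
    return W / n_trees
```

Each row then sums to 1 (per tree, weights within a leaf sum to 1), matching the interpretation of a weighted mean over training outputs.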

r2_score_in_Hilbert(X, y, sample_weight=None)

Compute the R2 score WITHOUT decoding.

Return the coefficient of determination R^2 of the prediction in the Hilbert space. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True outputs for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

score : float

R2 score of the predictions in the Hilbert space with respect to the embedded values of y.
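Because u and v only involve inner products in the Hilbert space, the whole R^2 can be computed from Gram matrices without ever decoding. A sketch under that kernel-trick formulation (function name and argument layout are illustrative):

```python
import numpy as np

def r2_in_hilbert(K_test, K_test_train, K_train, W):
    """R^2 = 1 - u/v computed entirely from Gram matrices.

    K_test       : (n, n)            k(y_i, y_j) between true test outputs
    K_test_train : (n, n_train)      k(y_i, y_j^train)
    K_train      : (n_train, n_train) Gram matrix of the training outputs
    W            : (n, n_train)      predicted weights, h(x_i) = sum_j W[i, j] phi(y_j^train)
    """
    n = K_test.shape[0]
    # u: residual sum of squares ||phi(y_i) - h(x_i)||^2, expanded by bilinearity
    u = (np.diag(K_test)
         - 2.0 * np.einsum('ij,ij->i', W, K_test_train)
         + np.einsum('ij,jk,ik->i', W, K_train, W)).sum()
    # v: total sum of squares around the mean embedding of the true outputs
    v = np.trace(K_test) - K_test.sum() / n
    return 1.0 - u / v
```

A quick sanity check: with W the identity and all three Gram matrices equal (each prediction embeds exactly onto its true output), u vanishes and the score is 1.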

score(X, y, candidates=None, metric='accuracy', sample_weight=None, precomputed_weights=None)

Compute the score after decoding.

Return either:

  • the coefficient of determination R^2 of the prediction, if self.kernel="mse_reg". The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum();

  • the mean accuracy of the predictions, if metric="accuracy" (in the multilabel case, all labels must match for a prediction to count as positive);

  • the mean Hamming score of the predictions, if metric="hamming" (well suited to multilabel classification);

  • the mean top-k accuracy score, if metric="top_"+str(k), for any desired value of k.

It is possible to set the sample_weight parameter for all these metrics.

All these score metrics depend heavily on the candidate set, because they deal with the decoded predictions (which are drawn from this set). To compute a score based only on the tree structure, use the r2_score_in_Hilbert method instead.
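The accuracy-style metrics listed above are standard and can be sketched directly; these helpers are illustrative, not the library's implementation:

```python
import numpy as np

def hamming_score(Y_true, Y_pred):
    """Mean fraction of matching labels (well suited to multilabel outputs)."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return (Y_true == Y_pred).mean()

def top_k_accuracy(Y_true, Y_topk):
    """Fraction of samples whose true output is among the k returned candidates.

    Y_topk : (n_samples, k, n_outputs), as predict(..., return_top_k=k)
    would produce under the shapes documented above.
    """
    Y_true = np.asarray(Y_true)[:, None, :]
    hits = np.all(np.asarray(Y_topk) == Y_true, axis=2).any(axis=1)
    return hits.mean()
```

Plain accuracy is the k=1 special case of top_k_accuracy, with the all-labels-must-match convention noted above.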

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True outputs for X.

candidates : array-like of shape (nb_candidates, n_outputs)

Possible decoded outputs for X.

metric : str, default="accuracy"

The way to compute the score.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

score : float

Chosen score between self.predict(X) and y.

class stpredictions.models.OK3.RandomOKForestRegressor(n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, kernel='linear')

Bases: stpredictions.models.OK3._forest.OKForestRegressor

A random ok-forest regressor.

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Read more in the User Guide.

n_estimators : int, default=100

The number of trees in the forest.

criterion : {"mse"}, default="mse"

The function to measure the quality of a split. The only supported criterion is "mse" (mean squared error), which is equal to variance reduction as a feature selection criterion.

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.
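The int-or-float convention shared by min_samples_split and min_samples_leaf can be made concrete with a small helper (illustrative, not the library's code):

```python
from math import ceil

def resolve_min_samples(value, n_samples):
    """Resolve an int-or-float min_samples_split / min_samples_leaf value.

    Ints are taken as-is; floats are fractions of n_samples, rounded up
    with ceil, as described above.
    """
    if isinstance(value, float):
        return ceil(value * n_samples)
    return value
```

For example, with 442 samples, a fraction of 0.01 yields ceil(4.42) = 5 samples, while an int value of 2 is used unchanged.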

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=n_features.

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
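The resolution rules above can be sketched as a helper (illustrative; note that "auto" maps to n_features for this regressor, as discussed in the notes below):

```python
from math import sqrt, log2

def resolve_max_features(value, n_features):
    """Resolve the max_features option to an integer feature count."""
    if value is None or value == "auto":
        return n_features
    if value == "sqrt":
        return max(1, int(sqrt(n_features)))
    if value == "log2":
        return max(1, int(log2(n_features)))
    if isinstance(value, float):
        # Fractions use round(max_features * n_features), per the list above
        return max(1, round(value * n_features))
    return value
```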

max_leaf_nodes : int, default=None

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by relative reduction in impurity. If None, the number of leaf nodes is unlimited.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
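The formula translates directly to code (a sketch; the counts may be weighted sums when sample_weight is passed):

```python
def impurity_decrease(N, N_t, N_t_L, N_t_R, impurity, left_impurity, right_impurity):
    """Weighted impurity decrease of a candidate split, per the formula above."""
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)
```

For instance, a root split of 100 samples into two halves that takes impurity from 0.5 to 0.25 on each side yields a decrease of 0.25, so it is accepted for any min_impurity_decrease up to that value.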

min_impurity_split : float, default=None

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

bootstrap : bool, default=True

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_score : bool, default=False

Whether to use out-of-bag samples to estimate the R^2 on unseen data.

n_jobs : int, default=None

The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int or RandomState, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.

verbose : int, default=0

Controls the verbosity when fitting and predicting.

warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

max_samples : int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

  • If None (default), then draw X.shape[0] samples.

  • If int, then draw max_samples samples.

  • If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).

base_estimator_ : DecisionTreeRegressor

The child estimator template used to create the collection of fitted sub-estimators.

estimators_ : list of DecisionTreeRegressor

The collection of fitted sub-estimators.

feature_importances_ : ndarray of shape (n_features,)

The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

n_features_ : int

The number of features when fit is performed.

n_outputs_ : int

The number of outputs when fit is performed.

oob_score_ : float

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

oob_prediction_ : ndarray of shape (n_samples,)

Prediction computed with out-of-bag estimate on the training set. This attribute exists only when oob_score is True.

See also: OK3Regressor, ExtraOKTreesRegressor

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

The default value max_features="auto" uses n_features rather than n_features / 3. The latter was originally suggested in [1], whereas the former was more recently justified empirically in [2].

1

L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

2

P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

>>> from stpredictions.models.OK3 import RandomOKForestRegressor
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_features=4, n_informative=2,
...                        random_state=0, shuffle=False)
>>> regr = RandomOKForestRegressor(max_depth=2, random_state=0, kernel="linear")
>>> regr.fit(X, y)
RandomOKForestRegressor(...)
>>> print(regr.predict([[0, 0, 0, 0]]))
[-8.32987858]