stpredictions.models.OK3 package
Subpackages
- stpredictions.models.OK3.tests package
- Submodules
- stpredictions.models.OK3.tests.digits_image_completion module
- stpredictions.models.OK3.tests.exemple_utilisation module
- stpredictions.models.OK3.tests.exemple_utilisation_forest module
- stpredictions.models.OK3.tests.test_classification module
- stpredictions.models.OK3.tests.test_classification_forest module
- stpredictions.models.OK3.tests.test_export module
- stpredictions.models.OK3.tests.test_forest_clf_and_reg module
- stpredictions.models.OK3.tests.test_regression module
- stpredictions.models.OK3.tests.test_regression_forest module
- stpredictions.models.OK3.tests.test_tree_clf_and_reg module
- Module contents
Submodules
stpredictions.models.OK3.base module
- class stpredictions.models.OK3.base.StructuredOutputMixin
Bases: object

Mixin to mark estimators that support structured prediction.
stpredictions.models.OK3.kernel module
- class stpredictions.models.OK3.kernel.Gaussian_Kernel(gamma=1)
Bases: stpredictions.models.OK3.kernel.Kernel

- evaluate(y1, y2)
- get_Gram_matrix(list_y_1, list_y_2=None)
- get_sq_norms(list_y)
- class stpredictions.models.OK3.kernel.Gini_Kernel
Bases: stpredictions.models.OK3.kernel.Mean_Dirac_Kernel

Identical to 'Mean_Dirac_Kernel', but signals that decoding is not performed over a candidate set but is an exhaustive search.
- class stpredictions.models.OK3.kernel.Kernel(name)
Bases: object

- evaluate(obj1, obj2)
- get_Gram_matrix(objects_1, objects_2=None)
- get_name()
- get_sq_norms(objects)
- class stpredictions.models.OK3.kernel.Laplacian_Kernel(gamma=1)
Bases: stpredictions.models.OK3.kernel.Kernel

- evaluate(y1, y2)
- get_Gram_matrix(list_y_1, list_y_2=None)
- get_sq_norms(list_y)
- class stpredictions.models.OK3.kernel.Linear_Kernel
Bases: stpredictions.models.OK3.kernel.Kernel

- evaluate(y1, y2)
- get_Gram_matrix(list_y_1, list_y_2=None)
- get_sq_norms(list_y)
- class stpredictions.models.OK3.kernel.MSE_Kernel
Bases: stpredictions.models.OK3.kernel.Linear_Kernel

Identical to 'Linear_Kernel', but signals that decoding is not performed over a candidate set but is an exact solution.
- class stpredictions.models.OK3.kernel.Mean_Dirac_Kernel
Bases: stpredictions.models.OK3.kernel.Kernel

- evaluate(y1, y2)
- get_Gram_matrix(list_y_1, list_y_2=None)
- get_sq_norms(list_y)
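As a rough sketch of what these kernel classes compute (a minimal pure-Python reimplementation for illustration, not the package's actual code), the Gaussian kernel's evaluate returns k(y1, y2) = exp(-gamma * ||y1 - y2||^2), and get_Gram_matrix fills the pairwise matrix:

```python
import math

def gaussian_evaluate(y1, y2, gamma=1.0):
    # k(y1, y2) = exp(-gamma * ||y1 - y2||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(y1, y2))
    return math.exp(-gamma * sq_dist)

def gaussian_gram_matrix(list_y_1, list_y_2=None, gamma=1.0):
    # Pairwise Gram matrix; with list_y_2=None, compare list_y_1 with itself.
    if list_y_2 is None:
        list_y_2 = list_y_1
    return [[gaussian_evaluate(y1, y2, gamma) for y2 in list_y_2]
            for y1 in list_y_1]

gram = gaussian_gram_matrix([(0.0, 0.0), (1.0, 0.0)])
# diagonal entries are exp(0) = 1.0; the off-diagonal entry is exp(-1)
```

get_sq_norms would correspond to the diagonal of this matrix (identically 1 for a Gaussian kernel).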
stpredictions.models.OK3.structured_object module
Module contents
The sklearn.tree module includes decision tree-based models for
classification and regression.
- class stpredictions.models.OK3.BaseKernelizedOutputTree(*, criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes, random_state, min_impurity_decrease, min_impurity_split, ccp_alpha=0.0, kernel)
Bases: stpredictions.models.OK3.base.StructuredOutputMixin, sklearn.base.BaseEstimator

Base class for regression trees with a kernel in the output space.
Warning: This class should not be used directly. Use derived classes instead.
- apply(X, check_input=True)
Return the index of the leaf that each sample is predicted as.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
- check_input : bool, default=True
Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.
- X_leaves : array-like of shape (n_samples,)
For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.
- cost_complexity_pruning_path(X, y, sample_weight=None)
Compute the pruning path during Minimal Cost-Complexity Pruning.
See minimal_cost_complexity_pruning for details on the pruning process.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.
- ccp_path : Bunch
Dictionary-like object, with the following attributes.
- ccp_alphas : ndarray
Effective alphas of subtree during pruning.
- impurities : ndarray
Sum of the impurities of the subtree leaves for the corresponding alpha value in ccp_alphas.
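For intuition about the ccp_alphas returned above, here is a hedged sketch of the standard minimal cost-complexity pruning quantity (this mirrors the textbook rule, not necessarily this package's exact code): the effective alpha of an internal node is the impurity saved by its subtree divided by the number of leaves the collapse would remove.

```python
def effective_alpha(node_impurity, subtree_leaf_impurities):
    # (R(node) - sum of subtree leaf impurities) / (n_leaves - 1):
    # the alpha at which collapsing the subtree into its root becomes worthwhile.
    n_leaves = len(subtree_leaf_impurities)
    return (node_impurity - sum(subtree_leaf_impurities)) / (n_leaves - 1)

# Hypothetical subtree: root (weighted) impurity 0.5, three leaves at 0.1 each
alpha = effective_alpha(0.5, [0.1, 0.1, 0.1])  # (0.5 - 0.3) / 2 = 0.1
```

Pruning with ccp_alpha above this value would collapse that subtree.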
- decision_path(X, check_input=True)
Return the decision path in the tree.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
- check_input : bool, default=True
Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.
- indicator : sparse matrix of shape (n_samples, n_nodes)
Return a node indicator CSR matrix where nonzero elements indicate that the samples go through the nodes.
- decode(X, candidates=None, check_input=True, return_top_k=1)
Synonym of predict.
- decode_tree(candidates=None, return_top_k=1)
Decode the predictions of each leaf of the tree, AND store the array of the decoded outputs as an attribute of the estimator: self.leaves_preds.
WARNING: the predictions corresponding to nodes that are not leaves are meaningless: they are arbitrary. They are deliberately not computed, to save computation time.
- candidates : array of shape (nb_candidates, vectorial_repr_len), default=None
The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it will be set to the output training matrix.
- return_top_k : int, default=1
Indicates how many decoded outputs to return for each example (or for each leaf). The default is one: select the output that gives the minimum "distance" with the predicted value in the Hilbert space. But it is useful to be able to return, for example, the 5 best candidates in order to evaluate a top-5 accuracy metric.
- leaves_preds : array-like of shape (node_count, vector_length)
For each leaf, return the vectorial representation of the output in 'candidates' that minimizes the "distance" with the "exact" prediction in the Hilbert space.
leaves_preds[i*return_top_k : (i+1)*return_top_k] is the unordered list of the decoded outputs of node i among the candidates.
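The flat indexing described above can be illustrated with a small pure-Python sketch (hypothetical data; decode_leaf is an illustrative helper, not part of the API):

```python
def decode_leaf(pred_sq_dists, candidates, return_top_k=1):
    # Return the return_top_k candidates closest (in the Hilbert space)
    # to one leaf's prediction, given precomputed squared distances.
    order = sorted(range(len(candidates)), key=lambda j: pred_sq_dists[j])
    return [candidates[j] for j in order[:return_top_k]]

# Hypothetical distances of 4 candidates to the predictions of 2 leaves:
dists = [[0.9, 0.1, 0.5, 0.3],   # leaf 0
         [0.2, 0.8, 0.1, 0.7]]   # leaf 1
candidates = ["a", "b", "c", "d"]
return_top_k = 2
leaves_preds = []
for leaf_dists in dists:
    leaves_preds.extend(decode_leaf(leaf_dists, candidates, return_top_k))

# The outputs for leaf i live in leaves_preds[i*return_top_k:(i+1)*return_top_k]
```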
- property feature_importances_
Return the feature importances.
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- feature_importances_ : ndarray of shape (n_features,)
Normalized total reduction of criteria by feature (Gini importance).
- fit(X, y, sample_weight=None, check_input=True, X_idx_sorted='deprecated', kernel=None, in_ensemble=False, Gram_y=None)
- kernel : optional input. If not provided, the kernel attribute of the estimator is used. If provided, the kernel attribute of the estimator is updated.
Some possible values:
- "gini_clf" : y is a matrix of labels for multilabel classification, of shape (n_train_samples, n_labels). We have to compute the corresponding Gram matrix, equivalent to the use of a classification tree with the Gini index as impurity. Exact solution search is performed.
- "mse_reg" : y is a matrix of real vectors for multioutput regression, of shape (n_train_samples, n_outputs). We have to compute the corresponding Gram matrix, equivalent to the use of a regression tree with the MSE as impurity. Exact solution search is performed.
- "mean_dirac" : y is a matrix of vectors (vectorial representations of structured objects), of shape (n_train_samples, vector_length). The similarity between two objects is computed with a mean dirac equality kernel.
- "linear" : y is a matrix of vectors (vectorial representations of structured objects), of shape (n_train_samples, vector_length). The similarity between two objects is computed with a linear kernel.
- "gaussian" : y is a matrix of vectors (vectorial representations of structured objects), of shape (n_train_samples, vector_length). The similarity between two objects is computed with a gaussian kernel.
- in_ensemble : boolean, default=False
Flag to set to True when the estimator is used within an ensemble method. If True, the Gram matrix of the outputs is also given (as K_y) and doesn't have to be computed by the tree, which avoids this heavy calculation for every tree.
- Gram_y : the output Gram matrix, default=None
Allows avoiding the Gram matrix calculation if we already have it (useful when in_ensemble=True).
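To make the kernel strings concrete, here is a hedged sketch of the Gram matrices they correspond to: "linear" (and "mse_reg", per the MSE_Kernel note above) uses the inner product, while "mean_dirac" (and "gini_clf", per the Gini_Kernel note) uses the fraction of matching labels:

```python
def linear_gram(Y):
    # K[i][j] = <y_i, y_j>  -- "linear" / "mse_reg"
    return [[sum(a * b for a, b in zip(yi, yj)) for yj in Y] for yi in Y]

def mean_dirac_gram(Y):
    # K[i][j] = fraction of positions l where y_i[l] == y_j[l]
    # -- "mean_dirac" / "gini_clf"
    return [[sum(a == b for a, b in zip(yi, yj)) / len(yi) for yj in Y]
            for yi in Y]

Y = [(1, 0, 1), (1, 1, 1)]
K_lin = linear_gram(Y)        # inner products of the label vectors
K_dirac = mean_dirac_gram(Y)  # fractions of positions where the labels agree
```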
- get_depth()
Return the depth of the decision tree.
The depth of a tree is the maximum distance between the root and any leaf.
- self.tree_.max_depth : int
The maximum depth of the tree.
- get_leaves_weights()
Gives the weights of the training samples in each leaf.
A (n_nodes, n_training_samples) array which gives, for each node (row) and for each training sample, its weight in the node (0 if the sample doesn't fall in the node, and a non-negative value depending on 'sample_weight' otherwise).
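A minimal sketch of how such a weight matrix can be assembled from leaf assignments (a hypothetical helper for illustration; the actual method derives this from the fitted tree):

```python
def leaves_weight_matrix(leaf_of_sample, n_nodes, sample_weight=None):
    # Build the (n_nodes, n_training_samples) matrix: entry [i][j] is sample
    # j's weight if it falls in node i, and 0 otherwise.
    n = len(leaf_of_sample)
    if sample_weight is None:
        sample_weight = [1.0] * n
    W = [[0.0] * n for _ in range(n_nodes)]
    for j, (leaf, w) in enumerate(zip(leaf_of_sample, sample_weight)):
        W[leaf][j] = w
    return W

# Hypothetical tree with 5 nodes: samples 0 and 1 fall in leaf 2, sample 2 in leaf 4
W = leaves_weight_matrix([2, 2, 4], n_nodes=5)
```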
- get_n_leaves()
Return the number of leaves of the decision tree.
- self.tree_.n_leaves : int
Number of leaves.
- predict(X, candidates=None, check_input=True, return_top_k=1)
Predict structured objects for X.
The predicted structured objects based on X are returned. Performs an argmin search among the possible outputs.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
- candidates : array of shape (nb_candidates, vectorial_repr_len), default=None
The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it will be set to the output training matrix.
- check_input : bool, default=True
Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.
- return_top_k : int, default=1
Indicates how many decoded outputs to return for each example (or for each leaf). The default is one: select the output that gives the minimum "distance" with the predicted value in the Hilbert space. But it is useful to be able to return, for example, the 5 best candidates in order to evaluate a top-5 accuracy metric.
- output
array of shape (n_samples, vectorial_repr_len) containing the vectorial representations of the structured output objects (found in the set of candidates, or if it is not given, among the training outputs).
- predict_weights(X, check_input=True)
Predict the output for X as weighted combinations of training outputs. This is essentially their representation in the Hilbert space.
A (len(X), n_training_samples) array which gives, for each test example (row) and for each training sample, its weight (0 if the sample doesn't fall in the same leaf as the test example, and a non-negative value depending on 'sample_weight' otherwise).
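These weights are enough to compare the implicit Hilbert-space prediction with any candidate output using only kernel evaluations, via the usual expansion ||phi(c) - sum_i w_i phi(y_i)||^2 = k(c, c) - 2 sum_i w_i k(c, y_i) + sum_ij w_i w_j k(y_i, y_j). A hedged, self-contained sketch (hypothetical data, linear kernel):

```python
def sq_dist_to_pred(candidate, weights, Y_train, k):
    # ||phi(c) - sum_i w_i phi(y_i)||^2 expanded with the kernel trick
    # (the last term is constant across candidates; kept for completeness).
    cc = k(candidate, candidate)
    cross = sum(w * k(candidate, y) for w, y in zip(weights, Y_train))
    const = sum(wi * wj * k(yi, yj)
                for wi, yi in zip(weights, Y_train)
                for wj, yj in zip(weights, Y_train))
    return cc - 2.0 * cross + const

linear = lambda a, b: sum(x * y for x, y in zip(a, b))
Y_train = [(0.0, 1.0), (1.0, 0.0)]
weights = [0.5, 0.5]  # e.g. two equally weighted samples sharing a leaf
# the candidate equal to the barycenter of the leaf has the smallest distance
d_mid = sq_dist_to_pred((0.5, 0.5), weights, Y_train, linear)
d_far = sq_dist_to_pred((1.0, 1.0), weights, Y_train, linear)
```

Decoding then amounts to taking the argmin of this distance over the candidate set.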
- r2_score_in_Hilbert(X, y, sample_weight=None)
Compute the R2 score WITHOUT decoding.
Return the coefficient of determination R^2 of the prediction in the Hilbert space. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True outputs for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- score : float
R2 score of the predictions in the Hilbert space w.r.t. the embedded values of y.
- score(X, y, candidates=None, metric='accuracy', sample_weight=None)
Compute the score after decoding.
Return either:
- the coefficient of determination R^2 of the prediction (if self.kernel="mse_reg"). The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
- the mean accuracy of the predictions if metric="accuracy" (requires that all labels match to count as positive in the multilabel case).
- the mean hamming score of the predictions if metric="hamming" (well suited for multilabel classification).
- the mean top-k accuracy score if metric="top_"+str(k). It works with any wanted value of k.
It is possible to set the 'sample_weight' parameter for all these metrics.
All these score metrics are highly dependent on the candidate set because they deal with the decoded predictions (which are among this set). If you want to compute a score based only on the tree structure, you can use the method 'r2_score_in_Hilbert'.
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True outputs for X.
- candidates : array-like of shape (nb_candidates, n_outputs)
Possible decoded outputs for X.
- metric : str, default="accuracy"
The way to compute the score.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- score : float
Chosen score between self.predict(X) and y.
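The "hamming" and "top_"+str(k) metrics mentioned above can be sketched in a few lines (illustrative reimplementations, without sample_weight support):

```python
def hamming_score(Y_true, Y_pred):
    # Mean fraction of matching labels per example (multilabel-friendly).
    per_example = [sum(a == b for a, b in zip(yt, yp)) / len(yt)
                   for yt, yp in zip(Y_true, Y_pred)]
    return sum(per_example) / len(per_example)

def top_k_accuracy(Y_true, Y_topk_preds, k):
    # Fraction of examples whose true output appears among the k decoded
    # candidates returned for it (Y_topk_preds[i] holds example i's k outputs).
    hits = sum(yt in preds[:k] for yt, preds in zip(Y_true, Y_topk_preds))
    return hits / len(Y_true)

Y_true = [(1, 0), (0, 1)]
Y_pred = [(1, 0), (1, 1)]
# hamming: example 0 matches fully, example 1 matches one of its two labels
```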
- class stpredictions.models.OK3.BaseOKForest(base_estimator, n_estimators=100, *, estimator_params=(), bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, max_samples=None, kernel='linear')
Bases: stpredictions.models.OK3.base.StructuredOutputMixin, sklearn.ensemble._base.BaseEnsemble

Base class for forests of ok-trees.
Warning: This class should not be used directly. Use derived classes instead.
- apply(X)
Apply trees in the forest to X, return leaf indices.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- X_leaves : ndarray of shape (n_samples, n_estimators)
For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.
- decision_path(X)
Return the decision path in the forest.
New in version 0.18.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- indicator : sparse matrix of shape (n_samples, n_nodes)
Return a node indicator matrix where nonzero elements indicate that the samples go through the nodes. The matrix is in CSR format.
- n_nodes_ptr : ndarray of shape (n_estimators + 1,)
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] give the indicator values for the i-th estimator.
- property feature_importances_
The impurity-based feature importances.
The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- feature_importances_ : ndarray of shape (n_features,)
The values of this array sum to 1, unless all trees are single node trees consisting of only the root node, in which case it will be an array of zeros.
- fit(X, y, sample_weight=None)
Build a forest of ok-trees from the training set (X, y).
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels in classification, real numbers in regression).
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
self : object
- class stpredictions.models.OK3.ExtraOK3Regressor(*, criterion='mse', splitter='random', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', random_state=None, min_impurity_decrease=0.0, min_impurity_split=None, max_leaf_nodes=None, ccp_alpha=0.0, kernel='linear')
Bases: stpredictions.models.OK3._classes.OK3Regressor

An extremely randomized tree regressor.
Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally random decision tree.
Warning: Extra-trees should only be used within ensemble methods.
- criterion : {"mse", "friedman_mse", "mae"}, default="mse"
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
- splitter : {"random", "best"}, default="random"
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
- max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
- min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_features : int, float, {"auto", "sqrt", "log2"} or None, default="auto"
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- random_state : int, RandomState instance, default=None
Used to randomly pick the max_features used at each split. See Glossary for details.
- min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
- min_impurity_split : float, default=0
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.
- max_leaf_nodes : int, default=None
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
- kernel : string, tuple (string, params) or instance of the Kernel class, default="linear"
The type of kernel to use to compare the output data. Changing this parameter also implicitly changes the nature of the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in Kernel.py); the optional params are here to set particular parameter values for the chosen kernel type.
- max_features_ : int
The inferred value of max_features.
- n_features_ : int
The number of features when fit is performed.
- feature_importances_ : ndarray of shape (n_features,)
Return impurity-based feature importances (the higher, the more important the feature).
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- tree_ : Tree
The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of the Tree object and sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py for basic usage of these attributes.
ExtraTreeClassifier : An extremely randomized tree classifier. sklearn.ensemble.ExtraTreesClassifier : An extra-trees classifier. sklearn.ensemble.ExtraTreesRegressor : An extra-trees regressor.
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
- 1
P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006.
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import BaggingRegressor
>>> from sklearn.tree import ExtraTreeRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, random_state=0)
>>> extra_tree = ExtraTreeRegressor(random_state=0)
>>> reg = BaggingRegressor(extra_tree, random_state=0).fit(
...     X_train, y_train)
>>> reg.score(X_test, y_test)
0.33...
- class stpredictions.models.OK3.OK3Regressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0, kernel='linear')
Bases: stpredictions.models.OK3._classes.BaseKernelizedOutputTree

A decision tree regressor for the OK3 method.
- criterion : {"mse"}, default="mse"
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.
- splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
- max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
- min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_features : int, float or {"auto", "sqrt", "log2"}, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
- max_leaf_nodes : int, default=None
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
- min_impurity_split : float, default=0
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.
- ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
- kernel : string, tuple (string, params) or instance of the Kernel class, default="linear"
The type of kernel to use to compare the output data. Changing this parameter also implicitly changes the nature of the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in Kernel.py); the optional params are here to set particular parameter values for the chosen kernel type.
- feature_importances_ : ndarray of shape (n_features,)
The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- max_features_ : int
The inferred value of max_features.
- n_features_ : int
The number of features when fit is performed.
- tree_ : Tree
The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of the Tree object and sphx_glr_auto_examples_tree_plot_unveil_tree_structure.py for basic usage of these attributes.
- leaves_preds : array of shape (n_nodes, n_components)
where n_nodes is the number of nodes of the grown tree and n_components is the number of values used to represent an output.
This array stores, for each leaf of the tree, the decoded predictions in Y.
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
- 1
Pierre Geurts, Louis Wehenkel, Florence d'Alché-Buc. "Kernelizing the output of tree-based methods." Proc. of the 23rd International Conference on Machine Learning, 2006, United States. pp. 345-352, 10.1145/1143844.1143888.
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import cross_val_score
>>> from ??? import OK3Regressor
>>> X, y = load_diabetes(return_X_y=True)
>>> regressor = OK3Regressor(random_state=0)
>>> cross_val_score(regressor, X, y, cv=10)
...
array([-0.39..., -0.46..., 0.02..., 0.06..., -0.50..., 0.16..., 0.11..., -0.73..., -0.30..., -0.00...])
- fit(X, y, sample_weight=None, check_input=True, X_idx_sorted='deprecated', kernel=None, in_ensemble=False, Gram_y=None)
Build a decision tree regressor from the training set (X, y).
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (real numbers). Use dtype=np.float64 and order='C' for maximum efficiency.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.
- check_input : bool, default=True
Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.
- X_idx_sorted : deprecated, default="deprecated"
This parameter is deprecated and has no effect. It will be removed in v0.26.
- kernel : string, tuple (string, params), or instance of the Kernel class, default="linear"
The type of kernel used to compare the output data. Changing this parameter also implicitly changes the nature of the Hilbert space in which the output data are embedded. The string describes the type of Kernel to use (defined in Kernel.py); the optional params set particular parameter values for the chosen kernel type. This parameter can also be set here in the fit method instead of __init__.
- self : OK3Regressor
Fitted estimator.
- class stpredictions.models.OK3.OKForestRegressor(base_estimator, n_estimators=100, *, estimator_params=(), bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, max_samples=None, kernel='linear')
Bases:
stpredictions.models.OK3._forest.BaseOKForest
Base class for forest of ok-trees-based regressors.
Warning: This class should not be used directly. Use derived classes instead.
- predict(X, candidates=None, return_top_k=1, precomputed_weights=None)
Predict structured objects for X.
The predicted structured objects based on X are returned. Performs an argmin search over the possible outputs.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
- candidates : array of shape (nb_candidates, vectorial_repr_len), default=None
The candidate outputs for the minimisation problem of decoding the predictions in the Hilbert space. If not given or None, it will be set to the training output matrix.
- check_input : bool, default=True
Allows bypassing several input checks. Don't use this parameter unless you know what you are doing.
- return_top_k : int, default=1
Indicates how many decoded outputs to return for each example (or for each leaf). The default is one: select the output that gives the minimum "distance" to the predicted value in the Hilbert space. Returning, for example, the 5 best candidates is useful to evaluate a top-5 accuracy metric.
- output : array of shape (n_samples, vectorial_repr_len)
Contains the vectorial representations of the structured output objects (found in the set of candidates or, if not given, among the training outputs).
- predict_weights(X)
Predict weights (on the training samples) for X.
The predicted weights of an input sample are computed as the mean predicted weights of the trees in the forest.
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
A (X.shape[0], n_training_samples) array which gives, for each test example (row) and for each training sample, its weight in the node (0 if the sample doesn't fall in any of the same leaves as the test example, and a non-negative value depending on 'sample_weight' otherwise).
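As an illustration of the weight semantics just described, here is a hypothetical single-tree sketch; the function `leaf_weights` and its exact normalisation are assumptions for illustration, not the library's implementation (the forest method averages such weights over its trees):

```python
import numpy as np

# Hypothetical sketch, NOT the library's code: a test sample's weight on
# training sample j is j's share of the total sample weight in the leaf
# they share, and 0 if they fall in different leaves.
def leaf_weights(train_leaves, test_leaves, sample_weight=None):
    """train_leaves: (n_train,) leaf id of each training sample in one tree.
    test_leaves: (n_test,) leaf id of each test sample in the same tree."""
    train_leaves = np.asarray(train_leaves)
    n_train = len(train_leaves)
    sw = np.ones(n_train) if sample_weight is None else np.asarray(sample_weight, float)
    weights = np.zeros((len(test_leaves), n_train))
    for i, leaf in enumerate(test_leaves):
        mask = train_leaves == leaf          # training samples sharing the leaf
        total = sw[mask].sum()
        if total > 0:
            weights[i, mask] = sw[mask] / total
    return weights

train_leaves = [0, 0, 1, 1, 1]   # leaf ids of 5 training samples
test_leaves = [0, 1]             # leaf ids of 2 test samples
W = leaf_weights(train_leaves, test_leaves)
# Each row sums to 1; samples outside the shared leaf get weight 0.
```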
- r2_score_in_Hilbert(X, y, sample_weight=None)
Computes the R2 score WITHOUT decoding.
Return the coefficient of determination R^2 of the prediction in the Hilbert space. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True outputs for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- score : float
R2 score of the predictions in the Hilbert space wrt. the embedded values of y.
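The R^2 definition quoted above can be written out directly. A plain-NumPy sketch on raw vectors (the method itself applies the same formula to the kernel embeddings, not to decoded outputs):

```python
import numpy as np

# R^2 = 1 - u/v, with u the residual sum of squares and v the total sum
# of squares, exactly as in the definition above.
def r2(y_true, y_pred):
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - u / v

print(r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction -> 1.0
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean -> 0.0
```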
- score(X, y, candidates=None, metric='accuracy', sample_weight=None, precomputed_weights=None)
Computes the score after decoding.
Return either:
- the coefficient of determination R^2 of the prediction (if self.kernel="mse_reg"). The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
- the mean accuracy of the predictions if metric="accuracy". (In the multilabel case, all labels must match for a prediction to count as correct.)
- the mean hamming score of the predictions if metric="hamming" (well suited for multilabel classification).
- the mean top-k accuracy score if metric="top_"+str(k). It works for any desired value of k.
It is possible to set the 'sample_weight' parameter for all these metrics.
All these score metrics are highly dependent on the candidate set because they deal with the decoded predictions (which are drawn from this set). To compute a score based only on the tree structure, use the 'r2_score_in_Hilbert' method.
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True outputs for X.
- candidates : array-like of shape (nb_candidates, n_outputs)
Possible decoded outputs for X.
- metric : str, default="accuracy"
The way to compute the score.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- score : float
Chosen score between self.predict(X) and y.
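To make the metric choices above concrete, here are illustrative NumPy versions (not the library's code) of the "accuracy", "hamming", and "top_k" scores for multilabel 0/1 outputs:

```python
import numpy as np

# Illustrative versions (NOT the library's code) of the score metrics
# named above, for multilabel rows of 0/1 label indicators.
def exact_match_accuracy(y_true, y_pred):
    # metric="accuracy": all labels of a row must match to count as correct
    return np.mean(np.all(y_true == y_pred, axis=1))

def hamming_score(y_true, y_pred):
    # metric="hamming": fraction of individual labels predicted correctly
    return np.mean(y_true == y_pred)

def top_k_accuracy(y_true, top_k_preds):
    # metric="top_"+str(k): correct if the true row appears among the k
    # returned candidates; top_k_preds has shape (n_samples, k, n_labels)
    return np.mean([any(np.array_equal(t, p) for p in preds)
                    for t, preds in zip(y_true, top_k_preds)])

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1]])  # second row has one wrong label
print(exact_match_accuracy(y_true, y_pred))  # 0.5 (only the first row is exact)
print(hamming_score(y_true, y_pred))         # 5/6 of individual labels correct
```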
- class stpredictions.models.OK3.RandomOKForestRegressor(n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, kernel='linear')
Bases:
stpredictions.models.OK3._forest.OKForestRegressor
A random ok-forest regressor.
A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Read more in the User Guide.
- n_estimators : int, default=100
The number of trees in the forest.
- criterion : {"mse"}, default="mse"
The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion.
- max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
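How the fractional forms of min_samples_split and min_samples_leaf resolve to sample counts can be checked with plain arithmetic (the numbers are illustrative):

```python
from math import ceil

n_samples = 250            # illustrative training-set size
min_samples_split = 0.01   # given as a fraction
min_samples_leaf = 0.005   # given as a fraction

print(ceil(min_samples_split * n_samples))  # ceil(2.5)  -> 3 samples to split a node
print(ceil(min_samples_leaf * n_samples))   # ceil(1.25) -> 2 samples per leaf
```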
- min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
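The number of features inspected per split for each max_features setting can be checked with plain arithmetic (illustrative numbers):

```python
from math import sqrt, log2

n_features = 64  # illustrative

print(int(sqrt(n_features)))     # "sqrt"     -> 8 features per split
print(int(log2(n_features)))     # "log2"     -> 6 features per split
print(round(0.25 * n_features))  # float 0.25 -> 16 features per split
# "auto" and None both give n_features (64) for this regressor.
```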
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
- min_impurity_split : float, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
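The weighted impurity decrease formula above can be checked with plain arithmetic (the helper name and the numbers are illustrative):

```python
# N_t / N * (impurity - N_t_R / N_t * right_impurity
#                     - N_t_L / N_t * left_impurity)
def impurity_decrease(N, N_t, impurity, N_t_L, left_imp, N_t_R, right_imp):
    return N_t / N * (impurity
                      - N_t_R / N_t * right_imp
                      - N_t_L / N_t * left_imp)

# A node holding 100 of 200 samples, split into children of 60 and 40:
dec = impurity_decrease(N=200, N_t=100, impurity=0.5,
                        N_t_L=60, left_imp=0.3, N_t_R=40, right_imp=0.2)
print(dec)  # 0.5 * (0.5 - 0.08 - 0.18) = 0.12
```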
- bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the R^2 on unseen data.
- n_jobs : int, default=None
The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- random_state : int or RandomState, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
- verbose : int, default=0
Controls the verbosity when fitting and predicting.
- warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest. See the Glossary.
- ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
- max_samples : int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
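How max_samples resolves to a per-tree draw count when bootstrap=True (illustrative arithmetic; the exact rounding used for the float case is an assumption):

```python
n = 1000                    # X.shape[0], illustrative

draw_none = n               # max_samples=None -> draw all n samples
draw_int = 256              # max_samples=256  -> draw exactly 256 samples
draw_frac = round(0.8 * n)  # max_samples=0.8  -> 0.8 * n samples
print(draw_none, draw_int, draw_frac)  # 1000 256 800
```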
- base_estimator_ : DecisionTreeRegressor
The child estimator template used to create the collection of fitted sub-estimators.
- estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
- feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- n_features_ : int
The number of features when fit is performed.
- n_outputs_ : int
The number of outputs when fit is performed.
- oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_prediction_ : ndarray of shape (n_samples,)
Prediction computed with out-of-bag estimate on the training set. This attribute exists only when oob_score is True.
See also: OK3Regressor, ExtraOKTreesRegressor
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
The default value max_features="auto" uses n_features rather than n_features / 3. The latter was originally suggested in [1], whereas the former was more recently justified empirically in [2].
- 1
Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
- 2
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.
>>> from stpredictions.models.OK3 import RandomOKForestRegressor
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_features=4, n_informative=2,
...                        random_state=0, shuffle=False)
>>> regr = RandomOKForestRegressor(max_depth=2, random_state=0, kernel="linear")
>>> regr.fit(X, y)
RandomOKForestRegressor(...)
>>> print(regr.predict([[0, 0, 0, 0]]))
[-8.32987858]