lightgbm.LGBMRegressor — LightGBM 3.3.2 documentation (2023)

class lightgbm.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent='warn', importance_type='split', **kwargs)[source]

Bases: lightgbm.compat._LGBMRegressorBase, lightgbm.sklearn.LGBMModel

LightGBM regressor.

__init__(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent='warn', importance_type='split', **kwargs)

Construct a gradient boosting model.

Parameters
  • boosting_type (str, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.

  • num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.

  • max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.

  • learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use the callbacks parameter of the fit method to shrink/adapt the learning rate during training using the reset_parameter callback. Note that this will ignore the learning_rate argument in training.

  • n_estimators (int, optional (default=100)) – Number of boosted trees to fit.

  • subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.

  • objective (str, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.

  • class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification tasks; for binary classification you may use the is_unbalance or scale_pos_weight parameters. Note that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

  • min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

  • min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).

  • min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).

  • subsample (float, optional (default=1.)) – Subsample ratio of the training instance.

  • subsample_freq (int, optional (default=0)) – Frequency of subsample, <=0 means no enable.

  • colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.

  • reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.

  • reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.

  • random_state (int, RandomState object or None, optional (default=None)) – Random number seed. If int, this number is used to seed the C++ code. If RandomState object (numpy), a random integer is picked based on its state to seed the C++ code. If None, default seeds in C++ code are used.

  • n_jobs (int, optional (default=-1)) – Number of parallel threads.

  • silent (bool, optional (default=True)) – Whether to print messages while running boosting.

  • importance_type (str, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.

  • **kwargs

    Other parameters for the model. Check http://lightgbm.readthedocs.io/en/latest/Parameters.html for more parameters.
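As a minimal sketch (not part of the original reference; the hyperparameter values are illustrative assumptions, not tuned recommendations), the constructor can be called with a subset of the parameters described above:

    from lightgbm import LGBMRegressor

    reg = LGBMRegressor(
        boosting_type='gbdt',   # traditional Gradient Boosting Decision Tree
        num_leaves=31,          # maximum leaves for base learners
        max_depth=-1,           # <= 0 means no depth limit
        learning_rate=0.05,
        n_estimators=500,
        subsample=0.8,          # row subsampling ratio
        subsample_freq=1,       # perform bagging at every iteration
        colsample_bytree=0.8,   # column subsampling ratio per tree
        reg_alpha=0.1,          # L1 regularization
        reg_lambda=0.1,         # L2 regularization
        random_state=42,
        n_jobs=-1,
    )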

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess or objective(y_true, y_pred, group) -> grad, hess:

y_true : array-like of shape = [n_samples]

The target values.

y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)

The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

group : array-like

Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)

The value of the first order derivative (gradient) of the loss with respect to the elements of y_pred for each sample point.

hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)

The value of the second order derivative (Hessian) of the loss with respect to the elements of y_pred for each sample point.

For multi-class task, the y_pred is grouped by class_id first, then grouped by row_id. If you want to get the i-th row of y_pred in the j-th class, the access way is y_pred[j * num_data + i], and you should group grad and hess in this way as well.
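For regression, a hedged sketch of a custom objective matching the objective(y_true, y_pred) -> grad, hess signature described above (the function name and the random data are illustrative assumptions; the objective simply re-implements plain squared error):

    import numpy as np
    from lightgbm import LGBMRegressor

    def l2_objective(y_true, y_pred):
        # Gradient and Hessian of 0.5 * (y_pred - y_true)**2 with respect to y_pred.
        grad = y_pred - y_true
        hess = np.ones_like(y_true)
        return grad, hess

    X = np.random.rand(200, 4)
    y = np.random.rand(200)
    reg = LGBMRegressor(objective=l2_objective, n_estimators=20).fit(X, y)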

Methods

__init__([boosting_type, num_leaves, ...])

Construct a gradient boosting model.

fit(X, y[, sample_weight, init_score, ...])

Build a gradient boosting model from the training set (X, y).

get_params([deep])

Get parameters for this estimator.

predict(X[, raw_score, start_iteration, ...])

Return the predicted value for each sample.

set_params(**params)

Set the parameters of this estimator.

Attributes

best_iteration_

The best iteration of fitted model if early_stopping() callback has been specified.

best_score_

The best score of fitted model.

booster_

The underlying Booster of this model.

evals_result_

The evaluation results if validation sets have been specified.

feature_importances_

The feature importances (the higher, the more important).

feature_name_

The names of features.

n_features_

The number of features of fitted model.

n_features_in_

The number of features of fitted model.

objective_

The concrete objective used while fitting this model.

property best_iteration_

The best iteration of fitted model if early_stopping() callback has been specified.

Type

int or None

property best_score_

The best score of fitted model.

Type

dict

property booster_

The underlying Booster of this model.

Type

Booster

property evals_result_

The evaluation results if validation sets have been specified.

Type

dict or None

property feature_importances_

The feature importances (the higher, the more important).

Note

importance_type attribute is passed to the function to configure the type of importance values to be extracted.

Type

array of shape = [n_features]

property feature_name_

The names of features.

Type

array of shape = [n_features]
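A hedged sketch of reading a few of these fitted attributes (the synthetic data and split are illustrative assumptions; best_iteration_ is only meaningful when the early_stopping() callback is used):

    import numpy as np
    import lightgbm
    from lightgbm import LGBMRegressor

    X = np.random.rand(500, 6)
    y = X[:, 0] + np.random.rand(500)
    X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

    reg = LGBMRegressor(n_estimators=200)
    reg.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
            callbacks=[lightgbm.early_stopping(10)])

    print(reg.best_iteration_)        # best iteration found by early stopping
    print(reg.best_score_)            # dict of eval results
    print(reg.feature_importances_)   # array of shape [n_features]
    print(reg.feature_name_)          # auto-generated names like 'Column_0'
    print(reg.n_features_)            # number of features seen during fit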

fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_metric=None, early_stopping_rounds=None, verbose='warn', feature_name='auto', categorical_feature='auto', callbacks=None, init_model=None)[source]

Build a gradient boosting model from the training set (X, y).

Parameters
  • X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.

  • y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).

  • sample_weight (array-like of shape = [n_samples] or None, optional (default=None)) – Weights of training data.

  • init_score (array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) or shape = [n_samples, n_classes] (for multi-class task) or None, optional (default=None)) – Init score of training data.

  • eval_set (list or None, optional (default=None)) – A list of (X, y) tuple pairs to use as validation sets.

  • eval_names (list of str, or None, optional (default=None)) – Names of eval_set.

  • eval_sample_weight (list of array, or None, optional (default=None)) – Weights of eval data.

  • eval_init_score (list of array, or None, optional (default=None)) – Init score of eval data.

  • eval_metric (str, callable, list or None, optional (default=None)) – If str, it should be a built-in evaluation metric to use. If callable, it should be a custom evaluation metric, see note below for more details. If list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix of both. In either case, the metric from the model parameters will be evaluated and used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’ for LGBMRanker.

  • early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds round(s) to continue training. Requires at least one validation data and one metric. If there’s more than one, will check all of them. But the training data is ignored anyway. To check only the first metric, set the first_metric_only parameter to True in additional parameters **kwargs of the model constructor.

  • verbose (bool or int, optional (default=True)) –

    Requires at least one evaluation data. If True, the eval metric on the eval set is printed at each boosting stage. If int, the eval metric on the eval set is printed at every verbose boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed.

    Example

    With verbose = 4 and at least one item in eval_set, an evaluation metric is printed every 4 (instead of 1) boosting stages.

  • feature_name (list of str, or 'auto', optional (default='auto')) – Feature names. If ‘auto’ and data is pandas DataFrame, data column names are used.

  • categorical_feature (list of str or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

  • callbacks (list of callable, or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.

  • init_model (str, pathlib.Path, Booster, LGBMModel or None, optional (default=None)) – Filename of LightGBM model, Booster instance or LGBMModel instance used to continue training.

Returns

self – Returns self.

Return type

object

Note

Custom eval function expects a callable with the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and returns (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):

y_true : array-like of shape = [n_samples]

The target values.

y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)

The predicted values. In case of custom objective, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

weight : array-like of shape = [n_samples]

The weight of samples.

group : array-like

Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

eval_name : str

The name of evaluation function (without whitespace).

eval_result : float

The eval result.

is_higher_better : bool

Whether a higher eval result is better, e.g. AUC is is_higher_better.

For multi-class task, the y_pred is grouped by class_id first, then grouped by row_id. If you want to get the i-th row of y_pred in the j-th class, the access way is y_pred[j * num_data + i].
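A hedged sketch of a custom eval_metric with the func(y_true, y_pred) signature described above (the metric name and the synthetic training data are illustrative assumptions):

    import numpy as np
    from lightgbm import LGBMRegressor

    def custom_mae(y_true, y_pred):
        # Returns (eval_name, eval_result, is_higher_better); lower MAE is better.
        return 'custom_mae', float(np.mean(np.abs(y_true - y_pred))), False

    X = np.random.rand(200, 5)
    y = X[:, 0] * 3.0 + np.random.rand(200)

    reg = LGBMRegressor(n_estimators=50)
    reg.fit(X, y, eval_set=[(X, y)], eval_metric=custom_mae)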

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, optional (default=True)) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property n_features_

The number of features of fitted model.

Type

int

property n_features_in_

The number of features of fitted model.

Type

int

property objective_

The concrete objective used while fitting this model.

Type

str or callable

predict(X, raw_score=False, start_iteration=0, num_iteration=None, pred_leaf=False, pred_contrib=False, **kwargs)

Return the predicted value for each sample.

Parameters
  • X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.

  • raw_score (bool, optional (default=False)) – Whether to predict raw scores.

  • start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.

  • num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).

  • pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

  • pred_contrib (bool, optional (default=False)) –

    Whether to predict feature contributions.

    Note

    If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

  • **kwargs – Other parameters for the prediction.

Returns
  • predicted_result (array-like of shape = [n_samples] or shape = [n_samples, n_classes]) – The predicted values.

  • X_leaves (array-like of shape = [n_samples, n_trees] or shape = [n_samples, n_trees * n_classes]) – If pred_leaf=True, the predicted leaf of every tree for each sample.

  • X_SHAP_values (array-like of shape = [n_samples, n_features + 1] or shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects) – If pred_contrib=True, the feature contributions for each sample.
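A hedged sketch of the prediction options listed above (the synthetic data, model settings, and variable names are illustrative assumptions):

    import numpy as np
    from lightgbm import LGBMRegressor

    X = np.random.rand(300, 8)
    y = X[:, 0] * 2.0 + X[:, 1] + np.random.rand(300)
    reg = LGBMRegressor(n_estimators=100).fit(X, y)

    X_new = np.random.rand(5, 8)
    preds   = reg.predict(X_new)                      # predicted values, shape [n_samples]
    raw     = reg.predict(X_new, raw_score=True)      # raw scores before any transformation
    leaves  = reg.predict(X_new, pred_leaf=True)      # leaf indices, shape [n_samples, n_trees]
    contrib = reg.predict(X_new, pred_contrib=True)   # contributions, shape [n_samples, n_features + 1]; last column is the expected value
    first_k = reg.predict(X_new, num_iteration=50)    # use only the first 50 boosting iterations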

set_params(**params)

Set the parameters of this estimator.

Parameters

**params – Parameter names with their new values.

Returns

self – Returns self.

Return type

object

FAQs

How can I improve my LightGBM accuracy? ›

In order to get better accuracy, one can use a large max_bin, use a small learning rate with a large num_iterations, and use more training data. One can also use a large num_leaves, but it may lead to overfitting. Speaking of overfitting, one way to deal with it is increasing path_smooth.
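A hedged sketch of those accuracy-oriented settings on the scikit-learn wrapper (the specific values are illustrative assumptions, not tuned recommendations; max_bin and path_smooth are forwarded through **kwargs to the core parameters):

    from lightgbm import LGBMRegressor

    reg = LGBMRegressor(
        max_bin=511,          # larger max_bin: finer-grained histograms
        learning_rate=0.01,   # small learning rate ...
        n_estimators=5000,    # ... paired with a large number of boosting iterations
        num_leaves=63,        # more leaves can raise accuracy but may overfit
        path_smooth=1.0,      # path smoothing, one way to counteract overfitting
    )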

Is LightGBM good for regression? ›

Light Gradient Boosting Machine (LightGBM) increases model efficiency, reduces memory usage, and is one of the fastest and most accurate libraries for regression tasks.

What is Max bin in LightGBM? ›

Binning is a technique for representing data in a discrete view (histogram). LightGBM uses a histogram-based algorithm to find the optimal split point while creating a weak learner. Therefore, each continuous numeric feature (e.g. the number of views for a video) is split into discrete bins, and max_bin controls the maximum number of bins the feature values are bucketed into.

Is LightGBM better than random forest? ›

A properly tuned LightGBM will most likely win in terms of performance and speed compared with a random forest. GBM advantages: the ecosystem is more developed, and many new features built into modern GBM implementations (XGBoost, LightGBM, CatBoost) improve their performance, speed, and scalability.

Is LightGBM better than XGBoost? ›

LightGBM is significantly faster than XGBoost but delivers almost equivalent performance.

Why is LightGBM so fast? ›

LightGBM achieves this by bundling features together (Exclusive Feature Bundling). We generally work with high-dimensionality data. Such data has many features which are mutually exclusive, i.e. they rarely take non-zero values simultaneously.

Can LightGBM run on GPU? ›

In LightGBM, the main computation cost during training is building the feature histograms. We use an efficient algorithm on GPU to accelerate this process. The implementation is highly modular, and works for all learning tasks (classification, ranking, regression, etc).
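A hedged sketch of requesting GPU training through the scikit-learn wrapper (this assumes a GPU-enabled LightGBM build; the device keyword is an alias of device_type and is forwarded through **kwargs to the core parameters):

    from lightgbm import LGBMRegressor

    # Training will fail at fit() time if LightGBM was not compiled with GPU support.
    reg = LGBMRegressor(device='gpu')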

How do I stop overfitting LightGBM? ›

The min_data_in_leaf parameter is a way to reduce overfitting. It requires each leaf to have the specified number of observations so that the model does not become too specific. With a larger value, the validation loss stays almost the same while its gap to the training loss shrinks, which means the degree of overfitting is reduced.
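In the scikit-learn wrapper, min_data_in_leaf is exposed as min_child_samples (they are aliases). A hedged sketch with illustrative values (the 50-sample threshold and depth cap are assumptions, not recommendations):

    from lightgbm import LGBMRegressor

    reg = LGBMRegressor(
        min_child_samples=50,  # each leaf must contain at least 50 observations
        num_leaves=31,
        max_depth=8,           # an explicit depth cap also limits tree complexity
    )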

Can LightGBM handle categorical variables? ›

Categorical Feature Support

LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. This often performs better than one-hot encoding. Use categorical_feature to specify the categorical features.
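A hedged sketch of passing categorical features to fit() through the scikit-learn wrapper (the DataFrame, column names, and targets are illustrative assumptions):

    import pandas as pd
    from lightgbm import LGBMRegressor

    df = pd.DataFrame({
        'city': pd.Categorical(['nyc', 'sf', 'la', 'sea'] * 100),  # unordered categorical column
        'views': range(400),
    })
    y = [0.5, 1.5, 2.5, 3.5] * 100

    reg = LGBMRegressor(n_estimators=20)
    # With a pandas DataFrame, categorical_feature='auto' also picks up categorical
    # columns automatically; an explicit list of column names works as well.
    reg.fit(df, y, categorical_feature=['city'])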

How does LightGBM algorithm work? ›

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of GBDT.

How do I use LightGBM in regression? ›

  1. Step 1 - Import the library. ...
  2. Step 2 - Setting up the Data for Classifier. ...
  3. Step 3 - Using LightGBM Classifier and calculating the scores. ...
  4. Step 4 - Setting up the Data for Regressor. ...
  5. Step 5 - Using LightGBM Regressor and calculating the scores. ...
  6. Step 6 - Plotting the model.
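A minimal end-to-end sketch of the regression steps above, assuming scikit-learn is installed (the synthetic dataset and hyperparameters are illustrative assumptions; the plotting step is omitted):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score
    from lightgbm import LGBMRegressor

    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    reg = LGBMRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
    reg.fit(X_train, y_train)
    print('R^2 on held-out data:', r2_score(y_test, reg.predict(X_test)))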

Does LightGBM require scaling? ›

Tree-based models like XGBoost, LightGBM, and random forest do not require feature scaling, given the way they compute their splits. However, when regularization is applied, one generally has to scale the features to account for any improper penalization.

What is LightGBM booster? ›

The "B" in "LightGBM" stands for "Boosting". The Booster class is the core model object for LightGBM. It holds the current state of the model and has methods for doing things like continuing the training process ( . update() ), creating predictions on new data ( .

How does LightGBM deal with categorical features? ›

LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. This often performs better than one-hot encoding. So we can assume that LightGBM does not one-hot encode these categorical features.

Is LightGBM faster than random forest? ›

We found that the Light Gradient Boosted model is marginally more accurate, with a 0.01 and 0.059 increase in overall accuracy compared to Support Vector and Random Forest models, respectively, and it also performed around 25% quicker on average.

Is LightGBM an ensemble method? ›

LightGBM is an ensemble model of decision trees for classification and regression prediction. We demonstrate its utility in genomic selection-assisted breeding with a large dataset of inbred and hybrid maize lines.

Is XGBoost better than random forest? ›

One of the most important differences between XGBoost and random forest is that XGBoost always gives more importance to functional space when reducing the cost of a model, while random forest tries to give more preference to hyperparameters to optimize the model.

Can LightGBM handle missing values? ›

LightGBM, XGBoost, RuleFit

Driverless AI treats missing values natively. (I.e., a missing value is treated as a special value.)

Is LightGBM tree based? ›

LightGBM is a gradient boosting ensemble method that is used by the Train Using AutoML tool and is based on decision trees. As with other decision tree-based methods, LightGBM can be used for both classification and regression. LightGBM is optimized for high performance with distributed systems.

Is LightGBM prone to overfitting? ›

Overfitting: LightGBM splits the tree leaf-wise, which can lead to overfitting as it produces much more complex trees. Compatibility with datasets: LightGBM is sensitive to overfitting and thus can easily overfit small datasets.

How XGBoost works? ›

XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions).

What is LightGBM model? ›

LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks.

How do you use LGBM? ›

The LGBM model can be installed using the Python pip utility with the command "pip install lightgbm". LGBM also has its own API in addition to the scikit-learn one, and using it we can implement both classifier and regression algorithms, where both models operate in a similar fashion.

What is level wise tree growth? ›

The level-wise strategy grows the tree level by level. In this strategy, each node splits the data prioritizing the nodes closer to the tree root. The leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change.

Which is better random forest or XGBoost? ›

If the field of study is bioinformatics or multiclass object detection, random forest is the best choice as it is easy to tune and works well even if there are lots of missing data and noise, and overfitting will not happen easily. While XGBoost gives accurate results, it is hard to work with if there is a lot of noise.

Does LightGBM require hot encoding? ›

LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories. This often performs better than one-hot encoding. So we can assume that LightGBM does not one-hot encode these categorical features.

Which is the best way to encode categorical variables? ›

Target encoding converts a categorical value into the mean of the target variable for that category. It is a type of Bayesian encoding method, where the target variable is used to encode the categorical value.

When should I not use XGBoost? ›

XGBoost can be avoided in the following scenarios: Noisy data: in the case of noisy data, boosting models may overfit. In such cases, Random Forest can provide better results than boosting models, as Random Forest models reduce variance.
...
  1. Large-p, small-n cases in Tabular data. ...
  2. Computer Vision problems.
  3. NLP.

Is XGBoost the best? ›

In many cases, XGBoost is better than standard gradient boosting implementations. The Python implementation gives access to a vast number of inner parameters to tweak for better precision and accuracy. Some important features of XGBoost are: Parallelization: the model is implemented to train with multiple CPU cores.

Why is XGBoost so powerful? ›

It has both a linear model solver and tree learning algorithms. So, what makes it fast is its capacity to do parallel computation on a single machine. It also has additional features for doing cross-validation and finding important variables.
