GBDT核心源码解析

【文章发布的比较早，新版sklearn已经使用Rust重写了，只能用来凑热闹了】

sklearn中对GBDT的实现是完全遵从论文 Greedy Function Approximation的，我们一起来看一下是怎么实现的。GBDT源码最核心的部分应该是对Loss Function的处理，因为除去Loss部分的代码其他的都是非常直觉且标准的程序逻辑，反正我们就从sklearn对loss的实现开始看吧～～

Loss Function 的实现

以二分类任务为例，loss采用Binomial Deviance，看这个loss很陌生，其实跟我们熟悉的negative log-likelihood / cross entropy 是一回事，因为是二分类问题嘛，模型最终输出其实就是$P(y=1|x)$，即样本$x$是正例的概率，我们把这个概率标记成$p(x)$，那么Binomial Deviance等于

$$\ell(y, F(x)) = -\left [ y\log(p(x)) + (1 – y)\log(1-p(x)) \right ]$$

就～跟Cross Entropy是一个东西～其中$p(x)$的形式不是固定的，只要是概率的形式就可以，sklearn中把$p(x)$定义为sigmoid函数

$$p(x) = \frac{1}{1 + e^{-F(x)}}$$

我们把它带入$\ell(y,F(x))$后一顿简化就变成这样了

$$\ell(y,F(x)) = yF(x) – \log(1 + e^{F(x)})$$

化简的时候，先把$\log$拆开再捯饬捯饬最后把$p(x)$带入会比较好化简～～它的负梯度方向

$$-\frac{\partial \ell(y,F(x))}{\partial F(x)} = y + \frac{1}{1 + e^{-F(x)}}$$

先不着急往下继续往下看公式，我们先看下Loss和负梯度的实现，BinomialDeviance Loss Function定义在src\ensemble_gb_losses.py 里，继承了GlassificationLossFunction:

class BinomialDeviance(ClassificationLossFunction):
    """Binomial deviance loss function for binary classification.

    Binary classification is a special case; here, we only need to
    fit one tree instead of ``n_classes`` trees.

    Parameters
    ----------
    n_classes : int
        Number of classes.
    """

Loss对应的代码定义在__call__方法里：

def __call__(self, y, raw_predictions, sample_weight=None):
 """Compute the deviance (= 2 * negative log-likelihood).
    logaddexp(0, v) == log(1.0 + exp(v))
 """
 raw_predictions = raw_predictions.ravel()
 if sample_weight is None:
  return -2 * np.mean(
   (y * raw_predictions) - np.logaddexp(0, raw_predictions)
  )
 else:
  return (
   -2
   / sample_weight.sum()
   * np.sum(
    sample_weight
    * ((y * raw_predictions) - np.logaddexp(0, raw_predictions))
   )
  )

注意，这里乘以2可能起到放大loss的作用，因为后面负梯度的计算中并没有把2放进去，负梯度的实现：

def negative_gradient(self, y, raw_predictions, **kargs):
    """Compute half of the negative gradient.
    """
    return y - expit(raw_predictions.ravel())

expit方法来自scipy中，计算sigmoid。

更新叶子结点

在上一篇文章GBDT原理中说过，在一轮迭代中regression tree会根据负梯度方向把样本划分到$J$个不相交的空间里去，之后分别对$J$个空间践行优化

$$\gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} y_i (F_{m-1}(x_i) + \gamma) – \log ( 1 + e^{F_{m-1}(x_i) + \gamma})$$

$F_{m-1}(x_i)$在代码里面叫raw_predictions，在本轮迭代中是已知的。碰到求最值，肯定直接一阶导数等于0呀，而且当前情况下是有解析解的～

但是很遗憾～～

sklearn并没有采用解析解，而是用二阶泰勒展开进行近似计算，论文说是朝着牛顿方向迭代一次，其实是一个意思。大概你也知道我其实不信邪，于是实验了几把，发现同样迭代5轮使用牛顿法更新叶子结点Loss衰减的更快，AUC也会更好点～～如果不熟悉牛顿法可以参考我之前写的文章牛顿法，看完了你大概会知道为啥$\gamma$会等于

$$\gamma = \frac{g(\gamma)}{h(\gamma)}$$

$g(\gamma)$和$h(\gamma)$分别是gradient和hessian matrix，即一阶二阶导

$$g(\gamma) = \frac{\partial \ell(y, \gamma)}{\partial \gamma} = y – \text{sigmoid}(F_{m-1}(x_i) + \gamma)$$

$$h(\gamma) = \frac{\partial^{2} \ell(y, \gamma)}{\partial \gamma ^2} = (1 – \text{sigmoid}(F_{m-1}(x_i) + \gamma)) \text{sigmoid}(F_{m-1}(x_i) + \gamma)$$

代码实现在方法_update_terminal_region里:


def _update_terminal_region(self,tree,terminal_regions,leaf,X,y,residual,raw_predictions,sample_weight,):
    """Make a single Newton-Raphson step.

    our node estimate is given by:

        sum(w * (y - prob)) / sum(w * prob * (1 - prob))

    we take advantage that: y - prob = residual
    """

好了，核心的Loss部分我们就看完了，用于GBDT的Loss Function大概结构都差不多的。除去loss部分，其他代码都是比较直觉的，就不一一细说了，大家都是成年人了，应该看得懂～～我把训练和预测阶段的主要流程图给画出来贴在下面，您有兴趣就慢慢品吧～大概如此

训练阶段的源码实现

预测阶段的源码实现

参考文献：

Friedman J H . Greedy Function Approximation: A Gradient Boosting Machine[J]. Annals of Statistics, 2001, 29(5):1189-1232.
Jerome F , Robert T , Trevor H . Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors)[J]. Ann. Statist. 2000.(FHT00)
sklearn官方文档
sklearn 源码

Loss Function 的实现

更新叶子结点

训练阶段的源码实现

预测阶段的源码实现

By tensorzen

发表回复取消回复

You Missed

Step by Step实现RAG

timeScale vs fixedDeltaTime

Difference between Gradient and Derivative

Fixed update with Physics.Simulate in Unity

Loss Function 的实现

更新叶子结点

训练阶段的源码实现

预测阶段的源码实现

By tensorzen

Related Post

发表回复 取消回复

You Missed

发表回复取消回复