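Stop Memorizing Formulas! A Hands-On Python Walkthrough for Actually "Computing" Your Way to Information Entropy and Information Gain (with Working Code)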

Breaking Down Information Entropy with Python: Mathematical Intuition from Weather Forecasts to Takeout Decisions

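Every time you see the information entropy formulas in a machine learning textbook, does it feel like reading hieroglyphics? All those Greek letters you cannot even draw properly, plus the subscripts and superscripts, make you want to throw the book out the window. Don't panic: today we will use Python and a few everyday examples to turn these abstract concepts into a numbers game you can see and touch.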

1. Information entropy: from coin flips to weather forecasts

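Imagine playing a coin-guessing game with your roommate. If the coin has been rigged to land heads every time, there is nothing to guess: the outcome is always certain and the game has no suspense. With a fair coin, heads and tails each have a 50% chance, so guessing is as hard as it gets and the game is at its most exciting.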

That is the core idea of information entropy: the greater the uncertainty, the higher the entropy. Let's simulate the scenario in Python:

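    import numpy as np

    def entropy(probabilities):
        """Shannon entropy in bits; outcomes with zero probability are skipped."""
        # Negating each term (rather than the whole sum) avoids printing -0.0
        # for a certain outcome.
        return sum(-p * np.log2(p) for p in probabilities if p > 0)

    # Coin that always lands heads
    print(entropy([1.0, 0.0]))  # output: 0.0

    # Fair coin
    print(entropy([0.5, 0.5]))  # output: 1.0

    # Slightly biased coin
    print(entropy([0.7, 0.3]))  # output: about 0.8813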
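To see "more uncertainty, higher entropy" as a trend rather than three isolated points, here is a small sketch that sweeps the coin's bias from 0 to 1. It uses only the standard library, and the names `entropy_bits`, `biases`, and `coin_entropies` are made up for this illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Sweep the probability of heads from 0.0 to 1.0 in steps of 0.1.
biases = [i / 10 for i in range(11)]
coin_entropies = {p: entropy_bits([p, 1 - p]) for p in biases}

for p, h in coin_entropies.items():
    print(f"P(heads) = {p:.1f} -> entropy = {h:.4f} bits")

# The bias with the largest entropy should be the fair coin.
most_uncertain = max(coin_entropies, key=coin_entropies.get)
print("Most uncertain coin: P(heads) =", most_uncertain)
```

The printed values rise from 0 at either extreme to a peak of 1 bit at P(heads) = 0.5, and they are symmetric: a coin that lands heads 30% of the time is exactly as unpredictable as one that lands heads 70% of the time.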

Now extend the idea to weather forecasting. Suppose your city has three kinds of weather:

Weather	Probability	Entropy term
Sunny	0.5	-0.5*log2(0.5)
Cloudy	0.3	-0.3*log2(0.3)
Rainy	0.2	-0.2*log2(0.2)

Compute the entropy of this weather system:

    weather_entropy = entropy([0.5, 0.3, 0.2])
    print(f"Entropy of the weather system: {weather_entropy:.4f} bits")

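You get a value of about 1.4855 bits, the average uncertainty in predicting tomorrow's weather. It is higher than the 0 bits of a certain outcome and the 1 bit of a fair coin because there are now three possible outcomes, but it stays below log2(3) ≈ 1.585 bits, the maximum for three equally likely outcomes, because sunny days are more probable than the others.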
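If you want to see where that number comes from, the sketch below evaluates each row of the table separately and then adds the terms up. It is a standard-library illustration; the English labels and the variable names (`weather_probs`, `terms`, `max_entropy`) are chosen here, not taken from the code above.

```python
import math

# Weather probabilities from the table above.
weather_probs = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# Each outcome contributes -p * log2(p) bits to the total.
terms = {name: -p * math.log2(p) for name, p in weather_probs.items()}
for name, term in terms.items():
    print(f"{name:>6}: {term:.4f} bits")

weather_entropy = sum(terms.values())
print(f" total: {weather_entropy:.4f} bits")

# Upper bound for three outcomes: all of them equally likely.
max_entropy = math.log2(len(weather_probs))
print(f"   max: {max_entropy:.4f} bits")
```

Notice that the rarest outcome (rainy) has the largest log2 factor but still contributes less than cloudy, because its term is weighted by its small probability.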

2. Conditional entropy: when the weather affects your takeout order

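Now for the more interesting part: conditional entropy. Take a takeout-prediction scenario, and suppose you have logged ten days of takeout records together with each day's weather: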

    import pandas as pd

    # 天气 = weather (晴 = sunny, 阴 = cloudy, 雨 = rainy)
    # 点外卖 = ordered takeout (1 = yes, 0 = no)
    data = {
        '天气': ['晴', '晴', '阴', '雨', '晴', '阴', '雨', '晴', '阴', '雨'],
        '点外卖': [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
    }
    df = pd.DataFrame(data)
    print(df)

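To measure the uncertainty of ordering takeout under different weather conditions, we need to:

Compute the probability of each kind of weather
Compute the conditional probability of ordering takeout under each kind of weather
Combine the uncertainty of these parts, weighted by how often each kind of weather occurs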

    def conditional_entropy(df, feature, target):
        """Entropy of `target` that remains once `feature` is known, in bits."""
        total = len(df)
        cond_entropy = 0
        for value in df[feature].unique():
            subset = df[df[feature] == value]
            prob = len(subset) / total  # P(feature == value)
            # Entropy of the target within this subset
            target_entropy = entropy(subset[target].value_counts(normalize=True))
            cond_entropy += prob * target_entropy
        return cond_entropy

    weather_cond_entropy = conditional_entropy(df, '天气', '点外卖')
    print(f"Conditional entropy of takeout given weather: {weather_cond_entropy:.4f}")

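This value, 0.6000 bits for the ten records above, is the uncertainty that remains about the takeout decision once we know the weather. Comparing it with the original entropy reveals how much information the weather feature carries.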
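The three steps can also be traced by hand. The sketch below redoes the calculation with plain Python lists so that each weather group's weight and entropy are visible. The weather values are written in English (sunny = 晴, cloudy = 阴, rainy = 雨), and `entropy_bits`, `groups`, and `conditional` are names invented for this illustration.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

# Same ten records as the DataFrame above.
weather = ["sunny", "sunny", "cloudy", "rainy", "sunny",
           "cloudy", "rainy", "sunny", "cloudy", "rainy"]
ordered = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]

# Steps 1 and 2: group the takeout labels by weather.
groups = defaultdict(list)
for w, y in zip(weather, ordered):
    groups[w].append(y)

# Step 3: weight each group's entropy by how often that weather occurs.
conditional = 0.0
for w, labels in groups.items():
    weight = len(labels) / len(ordered)
    h = entropy_bits(labels)
    conditional += weight * h
    print(f"{w:>6}: weight={weight:.1f}, labels={labels}, entropy={h:.4f}")

print(f"H(order)           = {entropy_bits(ordered):.4f} bits")
print(f"H(order | weather) = {conditional:.4f} bits")
```

Rainy days contribute nothing to the total because takeout was ordered on every one of them; all of the remaining uncertainty comes from sunny and cloudy days.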

3. Information gain: finding the most decisive factor

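Information gain is the original entropy minus the conditional entropy; it quantifies how much uncertainty a feature removes. In our takeout example: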

    total_entropy = entropy(df['点外卖'].value_counts(normalize=True))
    info_gain = total_entropy - weather_cond_entropy
    print(f"Information gain of the weather feature: {info_gain:.4f} bits")

For a fuller picture, we can compare the information gain of several features. Suppose the dataset now also has a "mood" (心情) feature:

    # 心情 = mood (好 = good, 差 = bad)
    expanded_data = {
        '天气': ['晴', '晴', '阴', '雨', '晴', '阴', '雨', '晴', '阴', '雨'],
        '心情': ['好', '差', '好', '差', '好', '差', '好', '差', '好', '差'],
        '点外卖': [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
    }
    expanded_df = pd.DataFrame(expanded_data)

    features = ['天气', '心情']
    for feature in features:
        cond_entropy = conditional_entropy(expanded_df, feature, '点外卖')
        gain = total_entropy - cond_entropy
        print(f"Information gain of {feature}: {gain:.4f} bits")

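For this data, weather yields a gain of about 0.3710 bits while mood yields 0 bits, because takeout was ordered on 3 out of 5 days in both moods, so weather is clearly the more valuable feature for predicting takeout orders. In a decision tree algorithm, the feature with the largest information gain is chosen as the split criterion at the current node.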

4. Hands-on: building a mini decision tree

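Now let's pull these concepts together and implement a simplified version of decision tree construction in Python. We use the classic iris dataset but, to keep things simple, only two of its features (all three classes are kept):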

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    # Keep only the first two features: 花萼长度 = sepal length, 花萼宽度 = sepal width
    iris_df = pd.DataFrame(iris.data[:, :2], columns=['花萼长度', '花萼宽度'])
    iris_df['类别'] = iris.target  # 类别 = class (0, 1, 2)

    def find_best_split(df, features, target):
        """Return the feature with the largest information gain, and that gain."""
        best_gain = -1
        best_feature = None
        total_entropy = entropy(df[target].value_counts(normalize=True))
        for feature in features:
            current_gain = total_entropy - conditional_entropy(df, feature, target)
            if current_gain > best_gain:
                best_gain = current_gain
                best_feature = feature
        return best_feature, best_gain

    best_feature, best_gain = find_best_split(iris_df, ['花萼长度', '花萼宽度'], '类别')
    print(f"Best split feature: {best_feature}, information gain: {best_gain:.4f}")

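This simple implementation shows the core idea of the decision tree algorithm. In practice we apply the process recursively until a stopping condition is met, such as reaching a maximum depth or the information gain falling below a threshold. One caveat: conditional_entropy treats every distinct measurement as its own category, so on continuous features like these the gain is inflated; real implementations bin the values or search for a threshold split instead.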
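To make the recursive step concrete, here is a minimal ID3-style sketch on the small takeout data, which has categorical features and so avoids the continuous-value issue. It is an illustration under stated assumptions, not a production implementation: the function names (`build_tree`, `info_gain`), the parameters `max_depth` and `min_gain`, and the English feature values are all invented for this example.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, feature, target):
    """Entropy of the target minus its entropy after splitting on `feature`."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy_bits(subset)
    return entropy_bits(labels) - remainder

def build_tree(rows, features, target, depth=0, max_depth=2, min_gain=1e-9):
    """Return a nested dict {feature: {value: subtree}} or a leaf label."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: the node is pure, no features are left, or the depth limit is hit.
    if len(set(labels)) == 1 or not features or depth >= max_depth:
        return majority
    gains = {f: info_gain(rows, f, target) for f in features}
    best = max(gains, key=gains.get)
    # Stop: even the best split barely reduces uncertainty.
    if gains[best] < min_gain:
        return majority
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target,
                                     depth + 1, max_depth, min_gain)
    return {best: branches}

weather = ["sunny", "sunny", "cloudy", "rainy", "sunny",
           "cloudy", "rainy", "sunny", "cloudy", "rainy"]
mood = ["good", "bad", "good", "bad", "good", "bad", "good", "bad", "good", "bad"]
ordered = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
rows = [{"weather": w, "mood": m, "order": y}
        for w, m, y in zip(weather, mood, ordered)]

tree = build_tree(rows, ["weather", "mood"], "order")
print(tree)
```

The root splits on weather, the rainy branch becomes a leaf immediately because it is already pure, and the other branches fall back to mood. Leaves that still hold mixed labels return the majority label, and ties are broken arbitrarily in this sketch.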

5. Gain ratio: the improvement and its limits

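Information gain is a powerful tool, but it has an obvious bias: it favors features with many distinct values. Picture an extreme case: if we used "order ID" as a feature, each ID would map to exactly one takeout decision, so this feature would have the largest possible information gain, yet it would be useless for predicting new data.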
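The bias is easy to reproduce. The sketch below attaches a made-up `order_id` column (simply the numbers 1 to 10, an assumption for illustration) to the ten takeout records and compares its information gain with that of weather. The helper names are hypothetical and only the standard library is used.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy_bits(subset)
    return entropy_bits(labels) - remainder

weather = ["sunny", "sunny", "cloudy", "rainy", "sunny",
           "cloudy", "rainy", "sunny", "cloudy", "rainy"]
ordered = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
order_id = list(range(1, 11))  # hypothetical: one unique ID per record

gain_weather = info_gain(weather, ordered)
gain_order_id = info_gain(order_id, ordered)

print(f"Information gain of weather:  {gain_weather:.4f} bits")
print(f"Information gain of order_id: {gain_order_id:.4f} bits")
print(f"Entropy of the target:        {entropy_bits(ordered):.4f} bits")
```

Every single-record subset is pure, so splitting on `order_id` drives the conditional entropy to zero and the gain equals the full entropy of the target, the highest value any feature can reach.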

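To address this, the C4.5 algorithm introduces the gain ratio, which divides the information gain by the feature's intrinsic value: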

    def intrinsic_value(df, feature):
        """Compute the intrinsic value (split information) of a feature."""
        total = len(df)
        return -sum((count / total) * np.log2(count / total)
                    for count in df[feature].value_counts())

    def gain_ratio(df, feature, target):
        """Information gain divided by the feature's intrinsic value."""
        total_entropy = entropy(df[target].value_counts(normalize=True))
        cond_entropy = conditional_entropy(df, feature, target)
        info_gain = total_entropy - cond_entropy
        iv = intrinsic_value(df, feature)
        # A feature with a single value has zero intrinsic value; report 0
        return info_gain / iv if iv != 0 else 0

    for feature in ['天气', '心情']:
        ratio = gain_ratio(expanded_df, feature, '点外卖')
        print(f"Gain ratio of {feature}: {ratio:.4f}")

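This adjustment helps the algorithm weigh features more fairly, especially when they differ a lot in how many values they take. The ratio has a weakness of its own: it can over-reward features whose intrinsic value is very small, which is why C4.5 only compares gain ratios among features whose information gain is at least average.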
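How much does the ratio actually help on the takeout data? The sketch below compares gain and gain ratio for weather, mood, and a made-up `order_id` column (the numbers 1 to 10, an assumption for illustration). It relies on the fact that the intrinsic value is simply the entropy of the feature's own value distribution. All helper names are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (bits) of a list of discrete values."""
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(values).values())

def info_gain(feature_values, labels):
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy_bits(subset)
    return entropy_bits(labels) - remainder

def gain_ratio(feature_values, labels):
    # Intrinsic value = entropy of the feature's own value distribution.
    iv = entropy_bits(feature_values)
    return info_gain(feature_values, labels) / iv if iv > 0 else 0.0

ordered = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
features = {
    "weather": ["sunny", "sunny", "cloudy", "rainy", "sunny",
                "cloudy", "rainy", "sunny", "cloudy", "rainy"],
    "mood": ["good", "bad", "good", "bad", "good",
             "bad", "good", "bad", "good", "bad"],
    "order_id": list(range(1, 11)),  # hypothetical unique IDs
}

gains = {name: info_gain(vals, ordered) for name, vals in features.items()}
ratios = {name: gain_ratio(vals, ordered) for name, vals in features.items()}

for name in features:
    print(f"{name:>8}: gain={gains[name]:.4f}  ratio={ratios[name]:.4f}")
```

On these ten records the ID's lead over weather shrinks from roughly 2.6 times (by gain) to roughly 1.2 times (by ratio), but it does not disappear. With so few rows the penalty is not large enough to reverse the ranking, which is a good reminder that the gain ratio reduces the many-values bias rather than eliminating it.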