From Theory to Practice: A Full Walkthrough of the Watermelon Book SVM Exercise with Python and LIBSVM

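In machine learning, the support vector machine (SVM) has long been valued for its strong classification performance and its clear mathematical foundations. Professor Zhou Zhihua's Machine Learning (known as the "Watermelon Book", 西瓜书) is a classic Chinese textbook whose Chapter 6 explains SVM theory accessibly, yet many learners get stuck when they try to turn that theory into code. This article reproduces the key exercise of Chapter 6: using Python and the LIBSVM library to train linear-kernel and Gaussian-kernel SVMs on watermelon dataset 3.0a and compare the two.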

1. Environment Setup and Tool Installation

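Before writing any code, we need a working development environment. Python 3.7 or later is recommended; a currently maintained Python 3 release is the safer choice, because recent versions of NumPy and scikit-learn no longer support 3.7.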

First, install the required libraries:

pip install numpy matplotlib scikit-learn libsvm-official

Note: libsvm-official provides the Python interface to LIBSVM, which is easier to work with than the original C version.

Verify that the installation succeeded:

import numpy as np
import matplotlib.pyplot as plt
from libsvm.svm import svm_parameter, svm_problem
from libsvm.svmutil import svm_train, svm_predict

print("All libraries imported successfully")

If this runs without errors, the environment is ready. Next, create the project directory structure:

/svm_experiment
    /data        # datasets
    /notebooks   # Jupyter notebooks
    /scripts     # Python scripts
    /results     # experiment results and figures
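The layout can also be created from Python instead of by hand. The following is a minimal sketch: `create_layout` and `SUBDIRS` are hypothetical names introduced here, and the demo writes into a temporary directory so that nothing touches a real project folder.

```python
from pathlib import Path
import tempfile

# Sub-directory names taken from the layout described in the text
SUBDIRS = ("data", "notebooks", "scripts", "results")


def create_layout(root):
    """Create root and its sub-directories; return the created paths."""
    root = Path(root)
    created = []
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(path)
    return created


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for p in create_layout(Path(tmp) / "svm_experiment"):
        print(p.name, p.is_dir())
```

To create the real folders, call `create_layout("svm_experiment")` from the directory where the project should live.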

2. Data Preparation and Preprocessing

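Watermelon dataset 3.0a (written 3.0α in the book) is a small example dataset used throughout the Watermelon Book. It contains 17 watermelon samples in two classes, each described by two features: density and sugar content. We enter it by hand so that it matches the textbook exactly.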

# Watermelon dataset 3.0a: columns are density and sugar content
X = np.array([
    [0.697, 0.460], [0.774, 0.376], [0.634, 0.264],
    [0.608, 0.318], [0.556, 0.215], [0.403, 0.237],
    [0.481, 0.149], [0.437, 0.211], [0.666, 0.091],
    [0.243, 0.267], [0.245, 0.057], [0.343, 0.099],
    [0.639, 0.161], [0.657, 0.198], [0.360, 0.370],
    [0.593, 0.042], [0.719, 0.103],
])
# Labels: +1 = good melon (first 8 samples), -1 = bad melon (last 9 samples)
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1])

To get a feel for the data, plot it first:

plt.scatter(X[y == 1, 0], X[y == 1, 1], c='r', label='good melon')
plt.scatter(X[y == -1, 0], X[y == -1, 1], c='b', label='bad melon')
plt.xlabel('density')
plt.ylabel('sugar content')
plt.legend()
plt.title('Watermelon dataset 3.0a')
plt.show()

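The scatter plot shows that the two classes are only roughly linearly separable: a straight line can split most of the samples, but the classes overlap in places. That makes the dataset a good case for comparing kernel functions.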

3. Linear-Kernel SVM: Implementation and Analysis

The linear kernel is the simplest SVM kernel and suits data that is approximately linearly separable. First configure the LIBSVM parameters:

# Linear-kernel settings: classification task, linear kernel, penalty C=4, probability estimates enabled
param_linear = svm_parameter('-s 0 -t 0 -c 4 -b 1')

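Key parameters:

-s 0: C-SVC classifier
-t 0: linear kernel
-c 4: penalty coefficient C; a larger value penalizes margin violations more heavily and makes the boundary "harder"
-b 1: enable probability estimates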
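Option strings like this are easy to mistype, so it can help to assemble them from named arguments. The sketch below uses a hypothetical helper, `build_options`, that is not part of LIBSVM; it only formats the `-s`, `-t`, `-c`, `-g` and `-b` flags (`-g` is the kernel parameter gamma used by non-linear kernels).

```python
def build_options(svm_type=0, kernel_type=0, cost=1.0, gamma=None, probability=False):
    """Assemble a LIBSVM option string from keyword arguments (hypothetical helper)."""
    parts = [f"-s {svm_type}", f"-t {kernel_type}", f"-c {cost:g}"]
    if gamma is not None:
        parts.append(f"-g {gamma:g}")   # only emitted when a gamma is given
    if probability:
        parts.append("-b 1")            # enable probability estimates
    return " ".join(parts)


# Reproduces the linear-kernel settings used in this section
print(build_options(cost=4, probability=True))
```

The returned string can be passed to `svm_parameter` exactly like a hand-written one.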

Train the model and retrieve the support vectors:

prob = svm_problem(y, X)
model_linear = svm_train(prob, param_linear)

# Retrieve the support vectors
sv_indices = model_linear.get_sv_indices()  # 1-based indices into the training set
sv_coef = model_linear.get_sv_coef()
sv = model_linear.get_SV()

print(f"Found {len(sv_indices)} support vectors")
print("Support vector indices:", sv_indices)
print("Support vector coefficients:", sv_coef)

Visualize the decision boundary and the support vectors:

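# Build a grid of points covering the feature space
xx, yy = np.meshgrid(np.linspace(0.2, 0.8, 200), np.linspace(0, 0.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()].tolist()

# Predict every grid point in one quiet call (the labels passed in are dummies);
# calling svm_predict once per point is very slow and prints a line each time
labels, _, _ = svm_predict([0] * len(grid), grid, model_linear, '-q')
Z = np.array(labels).reshape(xx.shape)

# get_sv_indices() returns a 1-based Python list; convert it to 0-based array indices
sv_idx = np.array(sv_indices) - 1

# Plot the result
plt.contourf(xx, yy, Z, alpha=0.2)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='r', label='good melon')
plt.scatter(X[y == -1, 0], X[y == -1, 1], c='b', label='bad melon')
plt.scatter(X[sv_idx, 0], X[sv_idx, 1], s=100, facecolors='none', edgecolors='k', label='support vectors')
plt.xlabel('density')
plt.ylabel('sugar content')
plt.legend()
plt.title('Linear-kernel SVM decision boundary')
plt.show()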

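The linear-kernel SVM finds a straight dividing line that classifies most of the samples correctly, although the overlap between the classes means it cannot separate all of them. The support vectors lie mainly where the two classes meet; these "key samples" alone determine the final decision boundary.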
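To make "the support vectors determine the boundary" concrete: for a linear kernel the decision function is f(x) = w·x + b, where w is the sum of each support vector scaled by its coefficient and b = -rho. The sketch below works on the data shapes that the LIBSVM Python interface returns (`get_sv_coef()` as a list of one-element tuples, `get_SV()` as a list of `{feature_index: value}` dicts with 1-based indices). The helper names are hypothetical, the numbers are illustrative rather than output from the watermelon model, and reading rho as `model.rho[0]` is an assumption about the interface.

```python
def linear_weights(sv_coef, support_vectors, rho, n_features):
    """Return (w, b) for a linear-kernel model: w = sum_i coef_i * sv_i, b = -rho."""
    w = [0.0] * n_features
    for (coef,), sv in zip(sv_coef, support_vectors):
        for index, value in sv.items():
            if index >= 1:                    # LIBSVM feature indices are 1-based
                w[index - 1] += coef * value
    return w, -rho


def decision_value(w, b, x):
    """Evaluate f(x) = w . x + b; its sign gives the predicted side of the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Illustrative numbers only, NOT taken from the trained watermelon model
sv_coef = [(0.5,), (-0.5,)]
support_vectors = [{1: 2.0}, {2: 1.0}]       # features that are zero are simply absent
w, b = linear_weights(sv_coef, support_vectors, rho=0.25, n_features=2)
print("w =", w, "b =", b)
print("f([1, 1]) =", decision_value(w, b, [1.0, 1.0]))
```

With a real model, the inputs would be `model_linear.get_sv_coef()` and `model_linear.get_SV()`. A positive decision value corresponds to the first label LIBSVM encountered in the training data, which is +1 for the dataset as entered here.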

4. Gaussian-Kernel SVM: Implementation and Comparison

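The Gaussian (RBF) kernel can handle more complex, nonlinear classification problems. We keep the other settings unchanged and switch only the kernel type, which adds one parameter, gamma: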

# Gaussian kernel with gamma=0.5; all other settings as before
param_rbf = svm_parameter('-s 0 -t 2 -c 4 -g 0.5 -b 1')

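New parameters:

-t 2: Gaussian (RBF) kernel
-g 0.5: kernel parameter gamma, which controls how far the influence of a single sample reaches (larger gamma means a shorter reach)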
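The role of gamma is easiest to see by evaluating the kernel directly. LIBSVM's RBF kernel is K(x, z) = exp(-gamma · ‖x − z‖²); the sketch below computes it in plain Python for two samples from the dataset (sample 1, a good melon, and sample 11, a bad melon). The function name `rbf_kernel` is introduced here for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


a = [0.697, 0.460]   # sample 1 of the dataset (good melon)
b = [0.245, 0.057]   # sample 11 of the dataset (bad melon)

# The similarity between the same two points shrinks as gamma grows
for gamma in (0.01, 0.5, 10, 100):
    print(f"gamma={gamma:<5} K(a, b)={rbf_kernel(a, b, gamma):.4f}")
```

Because both features lie between 0 and 1, squared distances are small, so a gamma of 0.5 still leaves distant samples fairly similar under the kernel; only much larger values make each sample's influence local.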

Train the Gaussian-kernel model:

model_rbf = svm_train(prob, param_rbf)

# Retrieve the support vector information
sv_indices_rbf = model_rbf.get_sv_indices()  # 1-based indices into the training set
sv_coef_rbf = model_rbf.get_sv_coef()
sv_rbf = model_rbf.get_SV()

print(f"The Gaussian kernel found {len(sv_indices_rbf)} support vectors")

Visualize the decision boundary of the Gaussian kernel:

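# Predict the class of every grid point in one quiet call
grid = np.c_[xx.ravel(), yy.ravel()].tolist()
labels_rbf, _, _ = svm_predict([0] * len(grid), grid, model_rbf, '-q')
Z_rbf = np.array(labels_rbf).reshape(xx.shape)

# get_sv_indices() returns a 1-based Python list; convert it to 0-based array indices
sv_idx_rbf = np.array(sv_indices_rbf) - 1

# Plot the result
plt.contourf(xx, yy, Z_rbf, alpha=0.2)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='r', label='good melon')
plt.scatter(X[y == -1, 0], X[y == -1, 1], c='b', label='bad melon')
plt.scatter(X[sv_idx_rbf, 0], X[sv_idx_rbf, 1], s=100, facecolors='none', edgecolors='k', label='support vectors')
plt.xlabel('density')
plt.ylabel('sugar content')
plt.legend()
plt.title('Gaussian-kernel SVM decision boundary')
plt.show()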

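The Gaussian kernel can produce a more flexible, curved decision boundary that follows the distribution of the data more closely. Interestingly, in this experiment both kernels ended up with exactly the same support vectors, which reflects the characteristics of this dataset and the chosen parameters rather than a general property of the two kernels.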
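Whether two models really share their support vectors is easy to check by comparing the index lists returned by `get_sv_indices()`. The sketch below uses a hypothetical helper, `compare_support_vectors`, and made-up index lists; substitute the lists from the two trained models to check an actual run.

```python
def compare_support_vectors(indices_a, indices_b):
    """Split two support-vector index lists into shared and model-specific indices."""
    a, b = set(indices_a), set(indices_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Hypothetical index lists for illustration, NOT results from the watermelon models
linear_idx = [1, 2, 5, 9, 13]
rbf_idx = [1, 2, 5, 9, 14]

result = compare_support_vectors(linear_idx, rbf_idx)
print(result)
print("identical:", not result["only_a"] and not result["only_b"])
```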

5. Model Evaluation and Parameter Tuning

To compare the two kernels properly we need quantitative metrics. First define an evaluation function:

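def evaluate_model(model, X_test, y_test):
    # svm_predict returns (labels, (accuracy %, MSE, squared correlation), decision values)
    y_true = np.asarray(y_test).tolist()
    p_label, p_acc, p_val = svm_predict(y_true, np.asarray(X_test).tolist(), model, '-q')

    # LIBSVM does not report precision or recall, so count them for the positive class (+1)
    tp = sum(1 for t, p in zip(y_true, p_label) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, p_label) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, p_label) if t == 1 and p != 1)
    return {
        'accuracy': p_acc[0] / 100.0,  # convert percent to a fraction
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }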

Compare the models with 5-fold cross-validation:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
linear_scores = []
rbf_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    prob_train = svm_problem(y_train, X_train)

    # Train and score the linear-kernel model on this fold
    fold_linear = svm_train(prob_train, param_linear)
    linear_scores.append(evaluate_model(fold_linear, X_test, y_test))

    # Train and score the Gaussian-kernel model on this fold
    fold_rbf = svm_train(prob_train, param_rbf)
    rbf_scores.append(evaluate_model(fold_rbf, X_test, y_test))

# Average each metric over the folds
avg_linear = {k: np.mean([s[k] for s in linear_scores]) for k in linear_scores[0]}
avg_rbf = {k: np.mean([s[k] for s in rbf_scores]) for k in rbf_scores[0]}

print("Linear kernel, average scores:", avg_linear)
print("Gaussian kernel, average scores:", avg_rbf)
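With only 17 samples, each test fold is tiny, so per-fold scores move in large steps. The arithmetic below follows the rule documented for scikit-learn's `KFold` (the first `n_samples % n_splits` folds get one extra sample); `fold_sizes` is a hypothetical helper written for this illustration.

```python
def fold_sizes(n_samples, n_splits):
    """Test-fold sizes under the KFold rule: the first n % k folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = fold_sizes(17, 5)
print(sizes)  # [4, 4, 3, 3, 3]

# Accuracy change (in percentage points) caused by a single misclassified sample per fold
print([round(100 / s, 1) for s in sizes])
```

A single error in a three-sample fold costs about 33 points of accuracy, so the averages printed by the cross-validation loop should be read as rough indications rather than precise estimates.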

Parameter tuning is a key step in applying SVMs. A grid search can look for the best combination of C and gamma:

from libsvm.svmutil import svm_train

# Candidate values for each parameter
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1, 10],
}

# Grid search: with -v 5, svm_train runs 5-fold cross-validation and returns
# the cross-validation accuracy (in percent) instead of a model
best_score = 0
best_params = {}
for C in param_grid['C']:
    for gamma in param_grid['gamma']:
        options = f'-s 0 -t 2 -c {C} -g {gamma} -v 5 -q'
        current_score = svm_train(y.tolist(), X.tolist(), options)
        if current_score > best_score:
            best_score = current_score
            best_params = {'C': C, 'gamma': gamma}

print(f"Best parameters: {best_params}, cross-validation accuracy: {best_score}%")

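In a real project, tuning may call for a finer search range and a more careful validation strategy, but for a teaching example this is enough to show the core idea.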
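One common way to widen the search is an exponentially spaced grid. The sketch below generates the powers-of-two ranges suggested in the LIBSVM authors' practical guide (C from 2^-5 to 2^15 and gamma from 2^-15 to 2^3, in steps of two exponents); `power_of_two_grid` is a hypothetical helper, and whether these ranges suit a dataset this small is an open question.

```python
from itertools import product


def power_of_two_grid(start, stop, step=2):
    """Return [2**start, 2**(start+step), ..., 2**stop] as floats."""
    return [2.0 ** e for e in range(start, stop + 1, step)]


C_grid = power_of_two_grid(-5, 15)       # 2^-5, 2^-3, ..., 2^15
gamma_grid = power_of_two_grid(-15, 3)   # 2^-15, 2^-13, ..., 2^3

# Every (C, gamma) pair to evaluate with cross-validation
candidates = list(product(C_grid, gamma_grid))
print(len(C_grid), len(gamma_grid), len(candidates))
```

Each pair in `candidates` can be formatted into an option string and scored with `-v 5` in the same way as the small grid.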

6. Further Discussion and Extended Experiments

With the basic implementation understood, a few further questions are worth exploring:

1. Number of support vectors versus model complexity

Vary the penalty parameter C and observe how the number of support vectors changes:

C_values = [0.01, 0.1, 1, 10, 100]
sv_counts = []
for C in C_values:
    # Train a linear-kernel model for each C and record its support vector count
    param = svm_parameter(f'-s 0 -t 0 -c {C} -q')
    model = svm_train(prob, param)
    sv_counts.append(len(model.get_sv_indices()))

plt.plot(C_values, sv_counts, 'bo-')
plt.xscale('log')
plt.xlabel('penalty parameter C (log scale)')
plt.ylabel('number of support vectors')
plt.title('Support vector count versus C')
plt.show()
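The reason C affects the support vector count can be read off the standard soft-margin (C-SVC) primal problem. It is given here in commonly used notation, with m training samples, slack variables ξ_i and feature map φ, rather than in symbols taken from the code:

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\qquad \text{s.t.}\quad y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m
```

A small C makes slack cheap, so the optimizer accepts a wide margin with many samples on or inside it, and every such sample is a support vector. A large C makes slack expensive, which typically narrows the margin and reduces the count, though the exact trend depends on the data.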

2. Effect of the kernel on the decision boundary

Compare the Gaussian kernel across different gamma values:

gamma_values = [0.01, 0.1, 1, 10]
grid = np.c_[xx.ravel(), yy.ravel()].tolist()
colors = ['r' if label == 1 else 'b' for label in y]

plt.figure(figsize=(12, 8))
for i, gamma in enumerate(gamma_values):
    param = svm_parameter(f'-s 0 -t 2 -c 1 -g {gamma} -q')
    model = svm_train(prob, param)

    # Predict all grid points in one quiet call
    labels, _, _ = svm_predict([0] * len(grid), grid, model, '-q')
    Z = np.array(labels).reshape(xx.shape)

    plt.subplot(2, 2, i + 1)
    plt.contourf(xx, yy, Z, alpha=0.2)
    plt.scatter(X[:, 0], X[:, 1], c=colors)
    plt.title(f'gamma={gamma}')

plt.tight_layout()
plt.show()

3. Extending to multi-class problems

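The watermelon dataset is a binary problem, but the code extends easily to multi-class settings. LIBSVM supports multi-class classification natively, using the one-against-one strategy (a binary classifier for every pair of classes, combined by voting):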

# Suppose we have a three-class problem
y_multi = np.array([0, 0, 1, 1, 2, 2])  # example labels
X_multi = np.random.rand(6, 2)          # example features (random, for illustration only)

param_multi = svm_parameter('-s 0 -t 0')
prob_multi = svm_problem(y_multi, X_multi)
model_multi = svm_train(prob_multi, param_multi)

# Prediction handles the multi-class case automatically
p_label, p_acc, p_val = svm_predict(y_multi, X_multi, model_multi)
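For multi-class C-SVC, LIBSVM trains one binary classifier per pair of classes (one-against-one), so k classes give k(k-1)/2 classifiers, and each row of `p_val` holds that many decision values. The sketch below illustrates the voting step in plain Python. The function names are hypothetical, the decision values are made up, and the assumption is that values are ordered by class pair (0 vs 1, 0 vs 2, 1 vs 2) with a positive value voting for the first class of the pair.

```python
from itertools import combinations


def pairwise_classifiers(labels):
    """List the class pairs that one-against-one trains a binary classifier for."""
    return list(combinations(labels, 2))


def vote(labels, decision_values):
    """Pick the label with the most pairwise wins; ties go to the earliest label."""
    votes = {label: 0 for label in labels}
    for (first, second), value in zip(pairwise_classifiers(labels), decision_values):
        votes[first if value > 0 else second] += 1
    # max() returns the first label that reaches the highest vote count
    return max(labels, key=lambda label: votes[label])


labels = [0, 1, 2]
print(pairwise_classifiers(labels))       # three pairs for three classes
print(vote(labels, [0.8, 0.4, -0.2]))     # made-up decision values for one sample
```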