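Scikit-learn Complete Tutorial - Python Beginner Tutorials

📘 Scikit-learn Complete Tutorial

Scikit-learn is one of the most widely used machine learning libraries for Python. It provides a broad set of classical machine learning algorithms and supporting tools. This tutorial starts from scratch and works systematically through Scikit-learn's core concepts, algorithms, and practical applications.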

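Core features of Scikit-learn:

🔄 Consistent API: every estimator implements fit; predictors add predict and transformers add transform
📊 Broad algorithm coverage: classification, regression, clustering, dimensionality reduction, and more
🔧 Complete tooling: preprocessing, model selection, evaluation metrics
🚀 Efficient: built on NumPy and SciPy
📚 Well documented: detailed documentation and many examples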

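🌳 Scikit-learn knowledge map

Scikit-learn
↓
Supervised learning: classification, regression
Unsupervised learning: clustering, dimensionality reduction
Model selection: cross-validation, grid search
Data preprocessing: standardization, encoding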

I. Machine Learning Workflow

Following a standard machine learning project workflow is key to a successful project.

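🔄 Complete workflow

1. Data collection
2. Data exploration and visualization
3. Data preprocessing
4. Feature engineering
5. Model selection
6. Model training
7. Performance target met? (no → step 8, yes → step 9)
8. Model tuning
9. Model evaluation
10. Deployment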

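📋 Stage details

Stage	Main tasks	Common tools	Output
1. Data collection	Acquire raw data	Pandas, SQL, APIs	Raw dataset
2. Data exploration	Understand distributions and correlations	Matplotlib, Seaborn	EDA report
3. Data preprocessing	Cleaning, imputation, scaling	sklearn.preprocessing, sklearn.impute	Clean data
4. Feature engineering	Feature selection, construction, dimensionality reduction	sklearn.feature_selection, sklearn.decomposition	Feature matrix
5-6. Model training	Choose an algorithm, train the model	sklearn.ensemble, sklearn.svm	Trained model
7-9. Model evaluation	Cross-validation, hyperparameter tuning, evaluation	sklearn.model_selection, sklearn.metrics	Best model
10. Deployment	Model persistence, API service	joblib, pickle, Flask	Production model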
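
The deployment row lists joblib for model persistence. A minimal sketch of saving and reloading a fitted estimator might look like the following; the file name `model.joblib` and the use of a temporary directory are illustrative assumptions, not a required convention.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a small model so there is something to persist.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted estimator to disk, then load it back.
# "model.joblib" is an illustrative file name chosen for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(f"Predictions match after reload: {same_predictions}")
```

Persisted models are generally only safe to load with the same scikit-learn version that wrote them, and pickle-based files should be loaded only from trusted sources.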

II. Scikit-learn API Design

Understanding Scikit-learn's consistent API design is key to using the library efficiently.

🔄 API call sequence

Participants: user code → Estimator → Model

1. fit(X, y): train the model
2. predict(X): make predictions
3. score(X, y): evaluate

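📋 Core API methods

Method	Parameters	Returns	Description
`fit(X, y)`	X: features, y: labels	self	Trains the model and learns its parameters
`predict(X)`	X: new samples	Predictions	Predicts targets for new samples
`predict_proba(X)`	X: new samples	Array of probabilities	Predicts class probabilities (classifiers that support it)
`score(X, y)`	X: features, y: labels	Score	Default metric: accuracy for classifiers, R² for regressors
`transform(X)`	X: data	Transformed data	Applies a fitted transformation
`fit_transform(X)`	X: data	Transformed data	Fits, then transforms, in one call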
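
The transformer methods and `predict_proba` in the table can be sketched together as follows. This is a minimal illustration on the iris dataset; the choice of `StandardScaler` and `LogisticRegression` is an assumption made for the example, and any transformer or probabilistic classifier would follow the same pattern.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit_transform learns the mean and standard deviation on the training set
# and scales it; transform reuses those statistics on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit returns the estimator itself, which is what allows call chaining.
clf = LogisticRegression(max_iter=1000)
fitted = clf.fit(X_train_scaled, y_train)
print(f"fit returns self: {fitted is clf}")

# predict_proba returns one row per sample and one column per class.
proba = clf.predict_proba(X_test_scaled)
print(f"Probability array shape: {proba.shape}")
print(f"Test accuracy: {clf.score(X_test_scaled, y_test):.4f}")
```

Fitting the scaler on the training split only, and merely applying it to the test split, keeps test-set statistics from leaking into training.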

💻 API usage example

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Prepare the data (the iris dataset stands in for your own X and y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 3. Train the model (fit)
model.fit(X_train, y_train)

# 4. Predict (predict)
y_pred = model.predict(X_test)

# 5. Evaluate (score); for classifiers, score returns the mean accuracy
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

# 6. Inspect feature importances
importances = model.feature_importances_

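III. Supervised Learning Algorithms

Supervised learning trains models on labeled data and covers two main tasks: classification and regression.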

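🌳 Supervised learning algorithms

Supervised learning
↓
Classification algorithms: logistic regression, decision tree, random forest, SVM, KNN, naive Bayes
Regression algorithms: linear regression, ridge regression, Lasso, decision tree regression, random forest regression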

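📋 Classification algorithms compared

Algorithm	Class	Strengths	Weaknesses	Typical use
Logistic regression	`LogisticRegression`	Simple, interpretable, outputs probabilities	Linear decision boundary unless features are engineered	Binary classification, baseline model
Decision tree	`DecisionTreeClassifier`	Easy to understand, can be visualized	Prone to overfitting	Small datasets, interpretability requirements
Random forest	`RandomForestClassifier`	High accuracy, less prone to overfitting than a single tree	Large models, slower	General purpose, a common default
SVM	`SVC`	Effective in high dimensions, kernel trick	Slow on large datasets, sensitive to hyperparameters	Small samples, high-dimensional data
KNN	`KNeighborsClassifier`	Simple, no real training phase	Slow prediction, sensitive to outliers	Small datasets, baseline
Gradient boosting	`GradientBoostingClassifier`	High accuracy, flexible	Many hyperparameters, needs tuning	Competitions, high-accuracy requirements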

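📋 Regression algorithms compared

Algorithm	Class	Regularization	Typical use
Linear regression	`LinearRegression`	None	Baseline model
Ridge regression	`Ridge`	L2	Multicollinearity
Lasso	`Lasso`	L1	Feature selection
Elastic net	`ElasticNet`	L1 + L2	Many correlated features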
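
The "feature selection" entry for Lasso follows from the L1 penalty, which can drive coefficients to exactly zero, while unregularized and L2-regularized fits normally keep every coefficient nonzero. The sketch below illustrates this on synthetic data; the dataset sizes and the `alpha` values are arbitrary assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=10.0),   # L2 penalty shrinks coefficients
    "Lasso": Lasso(alpha=5.0),    # L1 penalty can zero them out
}

# Count how many coefficients each model sets to exactly zero.
zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {zero_counts[name]} of {X.shape[1]} coefficients are zero")
```

The features whose Lasso coefficients remain nonzero are the ones the model has effectively selected.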

💻 Classifier comparison example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define several classifiers
classifiers = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', probability=True),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Gradient boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate each classifier on the same split
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"{name}: {accuracy:.4f}")

# Pick the model with the highest test accuracy
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (accuracy: {results[best_model]:.4f})")

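IV. Unsupervised Learning Algorithms

Unsupervised learning works with unlabeled data and includes tasks such as clustering and dimensionality reduction.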

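🌳 Unsupervised learning algorithms

Unsupervised learning
↓
Clustering algorithms: K-Means, DBSCAN, hierarchical clustering, Gaussian mixture
Dimensionality reduction algorithms: PCA, t-SNE, UMAP (provided by the separate umap-learn package, not by scikit-learn), NMF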
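
As a small illustration of the dimensionality-reduction branch, the sketch below projects the four iris features onto two principal components with PCA. Standardizing the features first is an assumption made for this example; it is common practice because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Put all features on a comparable scale before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance explained per component: {explained}")
print(f"Total variance explained: {explained.sum():.4f}")
```

The `explained_variance_ratio_` attribute is a convenient way to judge how much information a given number of components retains.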

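📋 Clustering algorithms compared

Algorithm	Class	Needs K	Noise handling	Typical use
K-Means	`KMeans`	Yes	Sensitive	Spherical clusters, large datasets
DBSCAN	`DBSCAN`	No (needs eps and min_samples instead)	Good	Arbitrary shapes, noisy data
Hierarchical clustering	`AgglomerativeClustering`	Yes (or a distance threshold)	Moderate	Small datasets, hierarchical structure
Gaussian mixture	`GaussianMixture`	Yes	Moderate	Probabilistic (soft) clustering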
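
The "probabilistic clustering" entry for `GaussianMixture` means each sample receives a membership probability for every component rather than a single hard label. A minimal sketch on synthetic blobs, with the data parameters chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Four well-separated synthetic clusters.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit a mixture of four Gaussians and get hard labels.
gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)

# Soft assignments: one probability per sample and component.
probs = gmm.predict_proba(X)
print(f"Hard labels shape: {labels.shape}")
print(f"Soft assignment shape: {probs.shape}")
print(f"Membership probabilities of the first sample: {np.round(probs[0], 3)}")
```

Samples that lie between clusters show up as rows whose probability mass is split across several components.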

💻 Complete clustering example

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate clustering data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# 1. K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)
print(f"K-Means silhouette score: {kmeans_score:.4f}")

# 2. DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# Exclude noise points (label -1) before computing the silhouette score;
# the score is only defined when at least two clusters remain
mask = dbscan_labels != -1
if len(set(dbscan_labels[mask])) >= 2:
    dbscan_score = silhouette_score(X[mask], dbscan_labels[mask])
    print(f"DBSCAN silhouette score: {dbscan_score:.4f}")

# 3. Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=4)
hierarchical_labels = hierarchical.fit_predict(X)
hierarchical_score = silhouette_score(X, hierarchical_labels)
print(f"Hierarchical clustering silhouette score: {hierarchical_score:.4f}")

# Visual comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title(f'K-Means (score={kmeans_score:.3f})')
axes[1].scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN')
axes[2].scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis')
axes[2].set_title(f'Hierarchical (score={hierarchical_score:.3f})')
plt.tight_layout()
plt.show()

V. Data Preprocessing

Data preprocessing is a critical step in any machine learning project.

🔄 Preprocessing flow

Raw data → missing-value handling → feature scaling → categorical encoding → preprocessed data

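📋 Preprocessing methods compared

Method	Class	Formula	Typical use
Standardization	`StandardScaler`	(x - μ) / σ	Features that are roughly normally distributed
Min-max normalization	`MinMaxScaler`	(x - min) / (max - min)	Data with known bounds
Robust scaling	`RobustScaler`	(x - median) / IQR	Data with outliers
One-hot encoding	`OneHotEncoder`	category → binary vector	Nominal categorical features
Label encoding	`LabelEncoder`	category → integer	Target labels y (use `OrdinalEncoder` for ordinal features)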
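
The three scaling formulas in the table can be checked against a hand computation. The sketch below uses a single made-up column with one outlier and compares each scaler's output with the formula written out in NumPy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (values invented for illustration).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)
minmax = MinMaxScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# The same transformations written out by hand.
manual_standard = (x - x.mean()) / x.std()              # population std (ddof=0)
manual_minmax = (x - x.min()) / (x.max() - x.min())
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual_robust = (x - median) / (q3 - q1)                # IQR = q3 - q1

print("StandardScaler matches formula:", np.allclose(standard, manual_standard))
print("MinMaxScaler matches formula:", np.allclose(minmax, manual_minmax))
print("RobustScaler matches formula:", np.allclose(robust, manual_robust))
```

With the outlier present, min-max scaling squeezes the four ordinary values close to zero, whereas robust scaling keeps them spread out because the median and IQR are barely affected by the extreme value.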

💻 Complete preprocessing example

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

# Create sample data
data = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50],
    'salary': [50000, 60000, 75000, np.nan, 90000],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai'],
    'department': ['Engineering', 'Sales', 'Engineering', 'Engineering', 'Sales'],
    'target': [1, 0, 1, 1, 0]
})

# 1. Separate features and labels
X = data.drop('target', axis=1)
y = data['target']

# 2. Identify numeric and categorical features
numeric_features = ['age', 'salary']
categorical_features = ['city', 'department']

# 3. Build the numeric preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the median
    ('scaler', StandardScaler())  # standardize to zero mean and unit variance
])

# 4. Build the categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # one-hot encode; ignore unseen categories
])

# 5. Combine both pipelines into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# 6. Apply the preprocessing
X_processed = preprocessor.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Processed shape: {X_processed.shape}")
print(f"Feature names: {preprocessor.get_feature_names_out()}")
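
A preprocessor like the one above is often chained with a model in a single `Pipeline`, so the imputation, scaling, and encoding statistics are learned only from the training data and reapplied automatically at prediction time. The sketch below is one way to do this; the toy rows, the column subset, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data, invented for this sketch.
train = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 50, 35],
    "salary": [50000, 60000, 75000, np.nan, 90000, 65000],
    "city": ["Beijing", "Shanghai", "Beijing", "Guangzhou", "Shanghai", "Beijing"],
})
y_train = [1, 0, 1, 1, 0, 0]

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and the classifier are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(train, y_train)

# New rows may contain missing values or a city never seen during training.
new_rows = pd.DataFrame({
    "age": [28, np.nan],
    "salary": [58000, 70000],
    "city": ["Shenzhen", "Beijing"],
})
predictions = model.predict(new_rows)
print(f"Predictions for new rows: {predictions}")
```

Because the whole pipeline is a single estimator, it can also be passed directly to `cross_val_score` or `GridSearchCV`, which refit the preprocessing inside each fold.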

VI. Model Selection and Tuning

Choosing a suitable model and hyperparameters is key to improving performance.

🔄 Cross-validation flow

Full dataset → split into K folds → train on K-1 folds, validate on the remaining fold → repeat K times → average the scores

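📋 Cross-validation methods compared

Method	Class	Strengths	Weaknesses
K-fold cross-validation	`KFold`	Every sample is used for both training and validation	Costs K model fits
Stratified K-fold	`StratifiedKFold`	Preserves class proportions in each fold	Classification only
Leave-one-out	`LeaveOneOut`	Nearly unbiased estimate	Very expensive (one fit per sample), high variance
Random shuffle split	`ShuffleSplit`	Flexible control over the number and size of splits	Test sets can overlap across iterations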
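
The difference between `KFold` and `StratifiedKFold` is easiest to see on imbalanced labels that happen to be sorted. The sketch below uses a made-up label vector with 15 negatives followed by 5 positives and counts the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 20 samples with sorted, imbalanced labels: 15 of class 0, then 5 of class 1.
y = np.array([0] * 15 + [1] * 5)
X = np.zeros((20, 1))  # feature values are irrelevant to the split itself

# Number of class-1 samples in each test fold.
kfold_counts = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]
strat_counts = [
    int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)
]

print(f"KFold positives per fold:           {kfold_counts}")   # [0, 0, 0, 1, 4]
print(f"StratifiedKFold positives per fold: {strat_counts}")   # [1, 1, 1, 1, 1]
```

Plain `KFold` without shuffling leaves three folds with no positive samples at all, which is why stratification is usually preferred for classification.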

💻 Grid search and randomized search

from scipy.stats import randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# 1. Cross-validation
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"5-fold cross-validation mean score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# 2. Grid search (exhaustive: tries every combination in the grid)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X, y)

print(f"\nGrid search best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# 3. Randomized search (usually cheaper: samples a fixed number of candidates)
# Distributions are sampled by the search itself, so random_state makes it reproducible
param_dist = {
    'n_estimators': randint(50, 300),       # integers in [50, 300)
    'max_depth': [None, 10, 20, 30, 40],    # lists are sampled uniformly
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,  # evaluate 50 randomly sampled candidates
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X, y)

print(f"\nRandomized search best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")
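
The best cross-validation score reported by a search tends to be optimistic, because the same folds were used to pick the hyperparameters. One common remedy, sketched below under the assumption of a simple 80/20 split and a deliberately small grid, is to run the search on the training portion only and score the refitted best model on data the search never saw.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A small grid, chosen only to keep the sketch fast.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# With refit=True (the default), score uses the best model refitted on X_train.
test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Cross-validation score: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```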

VII. Case Studies

Two worked examples show how the pieces of Scikit-learn fit together in practice.

📊 Case 1: Iris classification (beginner)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load the data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 4. Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# 5. Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion matrix')
plt.show()

🏠 Case 2: House price prediction (intermediate)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# 1. Load the data (downloaded on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train the model
reg = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
reg.fit(X_train, y_train)

# 4. Evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# The target is the median house value in units of $100,000
print(f"RMSE: ${rmse*100000:.2f}")
print(f"R²: {r2:.4f}")

# 5. Feature importances
importances = reg.feature_importances_
for name, importance in zip(housing.feature_names, importances):
    print(f"{name}: {importance:.4f}")