Kaggle Titanic Survival Prediction

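Contents

  • Preface
  • Background
  • Dataset overview
  • Loading the dataset
  • Exploratory data analysis (EDA)
    • Visualizing the relationship between features and the target
    • Missing value analysis
  • Data preprocessing
    • Data cleaning
      • Handling missing values
      • Removing noise and normalizing text
    • Data transformation
  • Data splitting
  • Modeling
    • Logistic regression model
    • Decision tree classifier
    • Random forest model
    • Gradient boosting tree model
  • Prediction
  • Complete Python code for LR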

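Preface

Competition page: Titanic - Machine Learning from Disaster | Kaggle
Notebook: Titanic Analysis Predictions | LR, DT, RF, GBT | Kaggle
(Versions 1-3 of the notebook contain the analysis. Only the complete Python code for the logistic regression model is included at the end of this article.)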

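Background

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others.

The challenge is to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (name, age, sex, socio-economic class, and so on).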

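Dataset overview

The data is split into two groups:

  • Training set (train.csv)
  • Test set (test.csv)

Training set: contains details for a subset of the passengers on board (891 to be exact) and, importantly, reveals whether each of them survived. This label is also known as the "ground truth".

Test set: contains the same kind of information but does not disclose the ground truth for each passenger. Predicting these outcomes is your job.

Column       Meaning
PassengerId  Passenger ID
Survived     Survival (0: died, 1: survived)
Pclass       Ticket class
Name         Name
Sex          Sex
Age          Age
SibSp        Number of siblings and spouses aboard
Parch        Number of parents and children aboard
Ticket       Ticket number
Fare         Ticket fare
Cabin        Cabin number
Embarked     Port of embarkation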

Loading the dataset

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
# Load the training set
df = pd.read_csv("./titanic/train.csv")
df.sample(5, random_state=0)

Exploratory data analysis (EDA)

df.info()

Visualizing the relationship between features and the target

from matplotlib import pyplot as plt
import seaborn as sns

features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
fig, axes = plt.subplots(1, 5, figsize=(15, 3), tight_layout=True)
for feature, ax in zip(features, axes):
    plt.sca(ax)
    sns.kdeplot(df.loc[df["Survived"] == 1, feature], label="1", fill=True)
    sns.kdeplot(df.loc[df["Survived"] == 0, feature], label="0", fill=True)
    plt.legend(title="Survived")
plt.show()

Missing value analysis

df.isnull().sum()

# Drop missing values
data = df['Age'].dropna()

# Plot a density-normalized histogram
sns.histplot(data, color='skyblue', label='Histogram', stat='density')

# Overlay a kernel density estimate (a smoothed empirical density, not a fitted normal curve)
sns.kdeplot(data, color='r', label='KDE')
plt.legend()
plt.show()

sum(df['Cabin'].isnull()) / len(df)

plt.pie(x=df['Embarked'].value_counts().values, labels=df['Embarked'].value_counts().index, autopct='%1.1f%%')
plt.show()

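  • Strategy for handling missing values
    • Age: the distribution is roughly bell-shaped, so fill each missing value with the mean Age of the passengers who share the same title (Mr, Mrs, Miss, and so on) extracted from Name
    • Cabin: about 77% of the values are missing, which is too many to impute, so drop the column
    • Embarked: fill the 2 missing values with S, the most frequent port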
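
The title-based fill for Age can be illustrated on a tiny example. The sketch below uses hypothetical rows (not taken from train.csv) and a helper named extract_title that mirrors the title extraction used later in this article; it only shows the mechanics of a group-mean fill.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical passengers, three of them with a missing Age
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Anna",
        "Brown, Mr. Peter",
        "Jones, Miss. Mary",
        "Jones, Miss. Kate",
        "Clark, Mrs. Ellen",
    ],
    "Age": [40.0, 35.0, np.nan, 20.0, np.nan, np.nan],
})

def extract_title(name):
    # "Smith, Mr. John" -> "Mr": take the text before the first period,
    # then the last space-separated word of that text
    return name.split(".")[0].split(" ")[-1]

toy["Title"] = toy["Name"].apply(extract_title)

# Mean Age per title, computed from the rows where Age is known
title_means = toy.groupby("Title")["Age"].mean()

# Replace each missing Age with the mean of its title group
toy["AgeFilled"] = toy["Age"].fillna(toy["Title"].map(title_means))
print(toy[["Title", "Age", "AgeFilled"]])
```

A title whose ages are all missing would still be left empty by this approach, so a fallback such as the overall mean may be needed on other data.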

Data preprocessing

Data cleaning

Handling missing values

import re

def name_title(x):
    # "Braund, Mr. Owen Harris" -> "Mr": the word right before the first period
    return x.split('.')[0].split(' ')[-1]

df['Name'].apply(name_title).value_counts()

def remove_noise(x):
    return re.sub(r'[".,()]+', '', x)

df['NameTitle'] = df['Name'].apply(name_title)
df.sample(5, random_state=0)

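# Compute the mean Age for each title
group_means = df.groupby('NameTitle')['Age'].mean()

# Fill missing Age values with the mean of the matching title
df['Age'] = df['Age'].fillna(df['NameTitle'].map(group_means))
df.sample(5, random_state=0)
# Drop Cabin and fill missing Embarked values with the most frequent port
df = df.drop('Cabin', axis=1)
df['Embarked'] = df['Embarked'].fillna('S')
df.head()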

# Collect the non-letter characters found in each Name
symbols_per_cell = df['Name'].apply(lambda x: ''.join([char for char in x if not char.isalpha()]))

# Get the set of distinct symbols
unique_symbols = set(''.join(symbols_per_cell))
unique_symbols

Removing noise and normalizing text

def ticket_pref(x):
    if len(x.split(' ')) == 1:
        return 'nan'
    else:
        x = ".".join(x.split(' ')[:-1])
        return re.sub(r'[./]+', '', x).lower()

def ticket_ID(x):
    x = x.split(' ')[-1]
    return int(x) if x.isdigit() else 0

df['Name'] = df['Name'].apply(remove_noise)
df['TicketPref'] = df['Ticket'].apply(ticket_pref)
df['TicketID'] = df['Ticket'].apply(ticket_ID)
df.sample(5, random_state=0)
y = df['Survived']
X = df.drop(['PassengerId', 'Survived', 'Ticket', 'NameTitle'], axis=1)
X.sample(5, random_state=0)
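
To make the ticket parsing concrete, the sketch below repeats the two helpers from above and applies them to a few sample strings written in the style of the Ticket column (the samples are illustrative assumptions). It shows that a ticket with no prefix gets the placeholder string 'nan', and that a ticket whose last token is not numeric gets the ID 0.

```python
import re

def ticket_pref(x):
    # A ticket with a single token has no prefix
    if len(x.split(' ')) == 1:
        return 'nan'
    # Join everything except the last token, then drop '.' and '/' and lowercase
    x = ".".join(x.split(' ')[:-1])
    return re.sub(r'[./]+', '', x).lower()

def ticket_ID(x):
    # The last token is the ticket number when it is purely numeric
    x = x.split(' ')[-1]
    return int(x) if x.isdigit() else 0

samples = ["A/5 21171", "STON/O2. 3101282", "113803", "LINE"]
parsed = {s: (ticket_pref(s), ticket_ID(s)) for s in samples}
for ticket, (pref, number) in parsed.items():
    print(f"{ticket!r:22} prefix={pref!r:10} id={number}")
```

Because 'nan' here is an ordinary string, the one-hot encoder used later treats "no prefix" as its own category rather than as a missing value.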

Data transformation

  • Handling text data
    • Name: extract features with TF-IDF (Term Frequency-Inverse Document Frequency)
    • Sex, Embarked, TicketPref: encode with one-hot encoding

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector

# Select the numeric feature columns
numeric_columns = selector(dtype_include='number')

# Define the steps of each sub-pipeline
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Use ColumnTransformer to specify how each group of columns is processed
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Name'),
        ('categorical', categorical_transformer, ['Sex', 'Embarked', 'TicketPref']),
        ('numeric', numeric_transformer, numeric_columns)
    ])

# Create the complete pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline on the data and transform it
X_processed = pipeline.fit_transform(X)

Data splitting

from sklearn.model_selection import train_test_split

# Split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
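
Note that the preprocessing pipeline above is fitted on all rows before the split, so the scaler and TF-IDF statistics also reflect the held-out rows. One common alternative, sketched below on synthetic data (the toy columns, labels, and step names are assumptions for illustration), is to put the preprocessor and the model into a single Pipeline so that each cross-validation fold fits the transformers on its own training part only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features (not real data)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Synthetic label loosely tied to Sex, only so the example has some signal
y_toy = ((toy["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ("numeric", StandardScaler(), ["Fare"]),
])

# Preprocessing and model in one estimator: every fold refits both
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

scores = cross_val_score(model, toy, y_toy, cv=5, scoring="accuracy")
print(scores)
```

With this layout the same model object can also be passed to GridSearchCV, using parameter names prefixed by the step name (for example clf__C).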

Modeling

Logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, classification_report
import numpy as np

# Create the logistic regression model
lr = LogisticRegression()

# Define the parameter grid
param_grid = {
    'C': np.logspace(-3, 3, 7),
    'max_iter': list(range(5, 40, 5)),
}

# Set the scoring metric (accuracy)
scorer = make_scorer(accuracy_score)

# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)

# Run grid search
grid_search.fit(X_train, y_train)

# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)

# Evaluate the model on the validation set
lr_model = grid_search.best_estimator_
y_pred = lr_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
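
As a rough guide to the cost of the search above, the sketch below enumerates the same logistic regression grid with ParameterGrid and counts the model fits that 5-fold cross-validation implies. The variable names n_candidates and n_fits are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": np.logspace(-3, 3, 7),          # 0.001, 0.01, ..., 1000
    "max_iter": list(range(5, 40, 5)),   # 5, 10, ..., 35
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)   # every combination of C and max_iter
n_fits = n_candidates * 5  # one fit per candidate per fold

print(param_grid["C"])
print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

GridSearchCV performs one additional fit at the end, refitting the best candidate on the whole training split. The same counting explains why the tree-based grids later in this article take noticeably longer to run.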

Decision tree classifier

from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree classifier
dt_classifier = DecisionTreeClassifier()

# Define parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(5, 25, 5)),
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}

# Set the scoring metric
scorer = make_scorer(accuracy_score)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=dt_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)

# Run grid search
grid_search.fit(X_train, y_train)

# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)

# Evaluate the model on the test set
dt_model = grid_search.best_estimator_
y_pred = dt_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

Random forest model

from sklearn.ensemble import RandomForestClassifier

# Create Random Forest classifier
rf_classifier = RandomForestClassifier()

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}

# Set the scoring metric
scorer = make_scorer(accuracy_score)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)

# Run grid search
grid_search.fit(X_train, y_train)

# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)

# Evaluate the model on the test set
rf_model = grid_search.best_estimator_
y_pred = rf_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

Gradient boosting tree model

from sklearn.ensemble import GradientBoostingClassifier

# Create Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier()

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}

# Set the scoring metric
scorer = make_scorer(accuracy_score)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=gb_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)

# Run grid search
grid_search.fit(X_train, y_train)

# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)

# Evaluate the model on the test set
gb_model = grid_search.best_estimator_
y_pred = gb_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix

# y_test holds the true labels; y_scores holds each model's predicted probability of the positive class
models = [lr_model, dt_model, rf_model, gb_model]
y_scores = [model.predict_proba(X_test)[:, 1] for model in models]

fig, axes = plt.subplots(2, 4, figsize=(15, 7), tight_layout=True)
fig.suptitle('ROC Curve & Confusion matrix', size=16)
for i in range(4):
    # Compute the points of the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[i])
    # Compute the AUC (Area Under the Curve)
    auc = roc_auc_score(y_test, y_scores[i])
    plt.sca(axes[0][i])
    plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
    plt.plot([0, 1], [0, 1], 'k--', label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(models[i].__class__.__name__)
    plt.legend()

    plt.sca(axes[1][i])
    y_pred = models[i].predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', linewidths=.5)
    plt.title(models[i].__class__.__name__)
    plt.xlabel('Predicted Labels')
    plt.ylabel('Real Labels')
plt.show()
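
For readers who want to check how the quantities in these plots are computed, here is a minimal sketch on four hand-made scores (toy values, not model output). It computes the ROC AUC from probabilities and a confusion matrix from labels thresholded at 0.5.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve points and the area under the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")

# Hard predictions at the 0.5 threshold
y_hat = (y_score >= 0.5).astype(int)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

The AUC depends only on how the scores rank the samples, while the confusion matrix depends on the chosen threshold, which is why two models with similar AUC can show different confusion matrices.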

Prediction

# Load the test set
test_data = pd.read_csv("./titanic/test.csv")

# Preprocess with the same steps as the training data
test_data['NameTitle'] = test_data['Name'].apply(name_title)
# Reuse the per-title mean ages from the training data and map them onto the test titles
group_means = df.groupby('NameTitle')['Age'].mean()
test_data['Age'] = test_data['Age'].fillna(test_data['NameTitle'].map(group_means))
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
test_data['Name'] = test_data['Name'].apply(remove_noise)
test_data['TicketPref'] = test_data['Ticket'].apply(ticket_pref)
test_data['TicketID'] = test_data['Ticket'].apply(ticket_ID)
test = test_data.drop(['PassengerId', 'Ticket', 'Cabin', 'NameTitle'], axis=1)
test.sample(5, random_state=0)

# Transform the data with the fitted pipeline
X_test_processed = pipeline.transform(test)
X_test_processed.shape

# Predict with the model
val = lr_model.predict(X_test_processed)
sub = pd.read_csv("./titanic/gender_submission.csv")
sub['Survived'] = val
sub.to_csv('./titanic/submission.csv', index=False)
print("Your submission was successfully saved!")
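
The code above reuses gender_submission.csv as a template for the output file. An alternative that does not depend on that file is to build the two-column table directly from the test set's PassengerId values, as in the sketch below. The IDs and predictions here are hypothetical placeholders standing in for test_data['PassengerId'] and the model output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data['PassengerId'] and lr_model.predict(...)
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 1, 0])

# The two columns used by the submission file in this article
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# Sanity checks before writing the file
assert len(submission) == len(passenger_ids)
assert set(submission["Survived"].unique()) <= {0, 1}

# Render as CSV text; pass a file path to to_csv to write it to disk instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Building the frame this way keeps each prediction tied to the ID from the same test row instead of relying on the row order of a separate file.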

Complete Python code for LR

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, classification_report
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")

'''Step 1: Load dataset'''
df = pd.read_csv("titanic/train.csv")


'''Step 2: Data Preprocessing'''
def name_title(x):
    return x.split('.')[0].split(' ')[-1]

def remove_noise(x):
    return re.sub(r'[".,()]+', '', x)

def ticket_pref(x):
    if len(x.split(' ')) == 1:
        return 'nan'
    else:
        x = ".".join(x.split(' ')[:-1])
        return re.sub(r'[./]+', '', x).lower()

def ticket_ID(x):
    x = x.split(' ')[-1]
    return int(x) if x.isdigit() else 0

# data preprocessing
def preprocessing(df):
    df = df.copy()

    # Missing Data Handling
    df['NameTitle'] = df['Name'].apply(name_title)
    
    # Fill in missing values
    df['Age'] = df['Age'].fillna(df['NameTitle'].map(df.groupby('NameTitle')['Age'].mean()))
    df['Embarked'] = df['Embarked'].fillna('S')

    # Remove Noise
    df['Name'] = df['Name'].apply(remove_noise)

    # Standardize Text Content
    df['TicketPref'] = df['Ticket'].apply(ticket_pref)
    df['TicketID'] = df['Ticket'].apply(ticket_ID)

    return df

train_df = preprocessing(df)
y = train_df['Survived']
X = train_df.drop(['PassengerId', 'Survived', 'Ticket', 'NameTitle', 'Cabin'], axis=1)


'''Step 3: Data Transformation'''
# Extracting columns with numerical features
numeric_columns = selector(dtype_include='number')

# Define each step in the pipeline
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Use ColumnTransformer to specify the processing method for each column
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Name'),
        ('categorical', categorical_transformer, ['Sex', 'Embarked', 'TicketPref']),
        ('numeric', numeric_transformer, numeric_columns)
    ])

# Create a complete pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Use a pipeline to process data
X_processed = pipeline.fit_transform(X)


'''Step 4: Data Splitting'''
# Splitting the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=0)


'''Step 5: Modeling'''
# Create Logistic Regression 
lr = LogisticRegression()

# Define parameter grid
param_grid = {
    'C': np.logspace(-3, 3, 7),
    'max_iter': list(range(5, 50, 1)),
}

# Set the scoring metric
scorer = make_scorer(accuracy_score)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    scoring=scorer,
    cv=5 # Using 5-fold cross-validation
)

# Run grid search
grid_search.fit(X_train, y_train)

# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)

# Evaluate the model on the test set
lr_model = grid_search.best_estimator_
y_pred = lr_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))


'''Step 6: Predicting'''
test_data = pd.read_csv("titanic/test.csv")
test_data.head()

def preprocess_data(df):
    df = df.copy()

    # Missing Data Handling
    df['NameTitle'] = df['Name'].apply(name_title)

    # Fill in missing values
    df['Age'] = df['Age'].fillna(df['NameTitle'].map(train_df.groupby('NameTitle')['Age'].mean()))
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())

    # Remove Noise
    df['Name'] = df['Name'].apply(remove_noise)

    # Standardize Text Content
    df['TicketPref'] = df['Ticket'].apply(ticket_pref)
    df['TicketID'] = df['Ticket'].apply(ticket_ID)
    
    df = df.drop(['PassengerId', 'Ticket', 'NameTitle', 'Cabin'], axis=1)
    return df

# Data preprocessing
test = preprocess_data(test_data)

# Data Transformation
X_test_processed = pipeline.transform(test)

# Predicting
val = lr_model.predict(X_test_processed)
sub = pd.read_csv("titanic/gender_submission.csv")
sub['Survived'] = val
sub.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")