import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew

#check the files available in the directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
sample_submission.csv
test.csv
train.csv
#Now let's import and put the train and test datasets in pandas dataframes
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

##display the first five rows of the train dataset.
train.head(5)
##display the first five rows of the test dataset.
test.head(5)
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)

The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
#Plot the SalePrice distribution with a fitted normal curve
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
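Because the target is now log-transformed, any model trained on it predicts log prices, and its output must be mapped back to the original dollar scale before submission. A minimal sketch using numpy's inverse of log1p; the `model` variable here is hypothetical:

# Hypothetical fitted model: invert the log1p transform on its predictions
final_predictions = np.expm1(model.predict(test))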
#Check the new distribution
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
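The missing-value plot below relies on `all_data_na`, a Series of per-feature missing percentages that is not defined in this excerpt, and on `all_data`, which is assumed to be the concatenation of the train and test sets (see the split step later). A minimal sketch of how `all_data_na` could be computed under those assumptions:

# Assumed setup: all_data is the combined train+test dataframe
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
# keep only features that actually have missing values, sorted by missing ratio
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)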
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Text(0.5,1,'Percent missing data by feature')
Data Correlation
#Correlation map to see how features are correlated with SalePrice
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
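Beyond the full heatmap, it can help to rank features by the strength of their correlation with the target; a minimal sketch on the `corrmat` just computed (the cutoff of 10 is arbitrary):

# Ten features most correlated (in absolute value) with SalePrice
top_corr = corrmat['SalePrice'].abs().sort_values(ascending=False).head(10)
print(top_corr)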
Imputing Missing Values
Missing values are filled according to the nature of each feature.

For categorical features whose data description includes an NA category, missing values are filled with the string None, as in the sketch below.

PoolQC: the data description says NA means "no pool". This makes sense given the high missing ratio for this feature: the vast majority of houses simply have no pool.
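A minimal sketch of this fill; the column list follows the features whose data description documents an NA category, but the exact set is an assumption and may vary:

# Categorical features where NA means the feature is absent: fill with "None"
for col in ('PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
            'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
            'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
            'MasVnrType'):
    all_data[col] = all_data[col].fillna('None')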
#Group by neighborhood and fill in missing values with the median LotFrontage of the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
For continuous features, missing values are filled with 0.

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath: for these, a missing value most likely means the house has no basement, so 0 is the natural fill.
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
Label encode the categorical variables whose levels carry ordering information (for example, quality ratings that run from poor to excellent).
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# Check the skew of all numerical features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Box Cox Transformation of Highly Skewed Features
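For reference, scipy.special.boxcox1p(x, λ) computes ((1 + x)^λ − 1) / λ (and log(1 + x) in the limit λ = 0), so it is defined for zero-valued entries; with the small λ = 0.15 used below it acts as a mild, log-like compression of large values.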
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)

#all_data[skewed_features] = np.log1p(all_data[skewed_features])
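The split below depends on two things not shown in this excerpt: the remaining nominal (unordered) categoricals still need a numeric encoding, and `ntrain`/`all_data` come from an earlier step that concatenated the train and test sets. A minimal sketch, with the assumed earlier step reproduced only as a comment:

# Assumed earlier step (not shown above): train and test were concatenated
# so all features could be processed together, e.g.
#   ntrain = train.shape[0]
#   y_train = train.SalePrice.values
#   all_data = pd.concat((train, test)).reset_index(drop=True)
#   all_data.drop(['SalePrice'], axis=1, inplace=True)

# One-hot encode the remaining (unordered) categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)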
train = all_data[:ntrain]
test = all_data[ntrain:]
5. Models
Import the required libraries
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
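A cross-validated RMSE helper makes it easy to compare the models built below. This is a sketch of such a scorer, assuming `y_train` holds the (log-transformed) target values and `train` the processed feature matrix; since the target is on a log scale, RMSE here corresponds to root-mean-squared-log-error on prices:

n_folds = 5

def rmsle_cv(model):
    # Root-mean-squared error across k shuffled folds (lower is better)
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse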
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
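A hypothetical usage, assuming base model instances such as `ENet`, `GBoost`, `KRR` and `lasso` have been defined elsewhere (they are not shown in this excerpt):

# Average four hypothetical base regressors and score the ensemble
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print("Averaged base models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))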
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    #Do the predictions of all base models on the test data and use the averaged predictions as
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_
        ])
        return self.meta_model_.predict(meta_features)
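As before, a hypothetical usage with placeholder base learners and meta-learner. Note that fit indexes X and y positionally (X[train_index]), so numpy arrays such as train.values should be passed rather than dataframes:

# Stack three hypothetical base regressors with a lasso meta-model and score it
stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR),
                                                 meta_model=lasso)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))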