An Introduction to Encoding Methods, Using the KaggleDays Dataset as an Example

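Editor's note: Wojciech Rosinski, a machine learning scientist at the University of Warsaw, introduces the main methods of categorical encoding.
Introduction
This is the first post in a series on feature engineering methods. In machine learning practice, feature engineering is one of the most important and yet most loosely defined areas. It can be regarded as an art: there are no strict rules, and creativity is the key.
Feature engineering is about creating a better representation of the information for a machine learning model. Even with a nonlinear algorithm, we cannot model all the interactions (relationships) between the variables of a dataset if we feed it raw data. We therefore need to explore and process the data by hand.
This raises a question: what about deep learning? Deep learning aims to minimize the need for manual data processing by letting the model learn a suitable representation of the data on its own. It tends to perform better on data such as images, speech, and text, where no additional "metadata" is given. On tabular data, however, gradient boosted tree methods such as xgboost or lightgbm are very hard to beat. Machine learning competitions bear this out: in almost every winning solution on tabular data, tree-based models perform best, while deep learning models usually cannot reach results as good (although they blend very well with tree-based models ;-) ).
Feature engineering is shaped by domain knowledge, which is why each dataset calls for a different approach depending on the problem being solved. Still, some methods are widely used and are at least worth trying to see whether they improve model performance. A talk by HJ van Veen covers a great deal of practical information, and some of the methods below are implemented following the descriptions in that talk.
This post uses the KaggleDays dataset as an example; the descriptions of the encoding methods follow the talk mentioned above.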
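Dataset
The data comes from Reddit and consists of questions and answers. The goal is to predict the number of upvotes an answer receives. This dataset makes a good example because it contains both text and standard features.
Import the required libraries: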
import gc
import numpy as np
import pandas as pd
Load the data:
x = pd.read_csv('../input/train.csv', sep='\t', index_col='id')
Columns:
['question_id',
'subreddit',
'question_utc',
'question_text',
'question_score',
'answer_utc',
'answer_text',
'answer_score']
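Each question_id corresponds to one specific question (see question_text). A question_id may appear multiple times, since each row holds a different answer to that question (see answer_text). The date and time of the question and the answer are given by the _utc columns. The subreddit in which the question was posted is also included. question_score is the number of upvotes of the question, and answer_score is the number of upvotes of the answer. answer_score is the target variable.
The data needs to be split into a training subset and a validation subset by question_id, mimicking the way Kaggle splits the training and test sets.
question_ids = x.question_id.unique()
question_ids_train = set(pd.Series(question_ids).sample(frac=0.8))
question_ids_valid = set(question_ids).difference(question_ids_train)
x_train = x[x.question_id.isin(question_ids_train)]
x_valid = x[x.question_id.isin(question_ids_valid)]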
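As a quick sanity check of the split by question_id, the sketch below runs the same logic on a small made-up frame and confirms that no question ends up in both subsets. The toy data and the random_state value are assumptions added for illustration; they are not part of the original dataset or code.

```python
import pandas as pd

# Toy stand-in for the real data: 5 questions with several answers each (made up).
x = pd.DataFrame({
    'question_id': ['q1', 'q1', 'q2', 'q3', 'q3', 'q3', 'q4', 'q5', 'q5', 'q5'],
    'answer_score': [3, 10, 1, 7, 2, 5, 4, 8, 6, 9],
})

question_ids = x.question_id.unique()
# random_state is an added assumption so that the split is reproducible.
question_ids_train = set(
    pd.Series(question_ids).sample(frac=0.8, random_state=0))
question_ids_valid = set(question_ids).difference(question_ids_train)

x_train = x[x.question_id.isin(question_ids_train)]
x_valid = x[x.question_id.isin(question_ids_valid)]

# Every row lands in exactly one subset, and no question is shared between them.
shared = set(x_train.question_id) & set(x_valid.question_id)
print(len(x_train), len(x_valid), shared)
```

Splitting on question_id rather than on rows keeps all answers to one question together, so the validation set only contains questions the model has never seen.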
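Categorical and numerical features
Machine learning models can only work with numbers. Numerical (continuous, quantitative) variables can take any value in a finite or infinite interval; they are naturally represented as numbers and can be used in a model directly. Raw categorical variables usually come as strings and must be transformed before being passed to a model.
subreddit is a good example of a categorical variable. It contains 41 distinct categories, for example: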
['askreddit', 'jokes', 'politics', 'explainlikeimfive', 'gaming']
Let's look at the most popular categories (x.subreddit.value_counts()[:5]):
askreddit    275667
politics     123003
news          42271
worldnews     40016
gaming        32117
Name: subreddit, dtype: int64
question_score is an example of a numerical variable; x.question_score.describe() gives a summary:
mean       770.891169
std       3094.752794
min          1.000000
25%          2.000000
50%         11.000000
75%        112.000000
max      48834.000000
Name: question_score, dtype: float64
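Encoding categorical features
The two basic methods of categorical encoding are one-hot encoding and label encoding. One-hot encoding can be done with pandas.get_dummies. Encoding a variable with k categories yields a binary matrix with k columns, where a 1 in the i-th column means that the observation belongs to the i-th category.
Label encoding converts categories directly to numbers. pandas.factorize provides this; alternatively, a pandas column of category dtype exposes cat.codes. Label encoding preserves the original dimensionality.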
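The two basic encodings can be sketched on a tiny made-up column as follows. The sample values are assumptions for illustration; the calls are the pandas functions named above.

```python
import pandas as pd

# Made-up sample of a categorical column.
s = pd.Series(['politics', 'gaming', 'politics', 'news'], name='subreddit')

# One-hot encoding: one binary column per category (k = 3 here).
one_hot = pd.get_dummies(s, prefix='subreddit')

# Label encoding via factorize: codes follow the order of first appearance.
codes, uniques = pd.factorize(s)

# Label encoding via the category dtype: codes follow the sorted category order.
cat_codes = s.astype('category').cat.codes

print(one_hot.shape)
print(list(codes), list(uniques))
print(list(cat_codes))
```

Note that the two label encodings assign different integers to the same categories, so whichever is chosen has to be applied consistently to the training, validation, and test sets.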
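Some less standard encoding methods are also worth trying, as they may improve model performance. Three are presented here:
Count encoding
LabelCount encoding
Target encoding
Count encoding
Count encoding replaces each category with its count, computed on the training set. The method is sensitive to outliers, so the result can be normalized or transformed (for example with a log transform). Unseen categories can be replaced with 1.
Although it is not very likely, some categories may have the same count, which leads to a collision: two categories encoded as the same value. It is hard to say whether this degrades or improves the model, but in principle it is undesirable.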
def count_encode(x, categorical_features, normalize=False):
    print('count encoding: {}'.format(categorical_features))
    x_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # Replace each category with the number of times it occurs in x.
        x_[cat_feature] = x[cat_feature].astype(
            'object').map(x[cat_feature].value_counts())
        if normalize:
            # Scale by the count of the most frequent category.
            x_[cat_feature] = x_[cat_feature] / np.max(x_[cat_feature])
    x_ = x_.add_suffix('_count_encoded')
    if normalize:
        x_ = x_.astype(np.float32)
        x_ = x_.add_suffix('_normalized')
    else:
        x_ = x_.astype(np.uint32)
    return x_
Let's encode the subreddit column:
train_count_subreddit = count_encode(x_train, ['subreddit'])
and look at the result. The 5 most popular subreddits:
askreddit    221941
politics      98233
news          33559
worldnews     32010
gaming        25567
Name: subreddit, dtype: int64
are encoded as:
221941    221941
98233      98233
33559      33559
32010      32010
25567      25567
Name: subreddit_count_encoded, dtype: int64
In essence, this replaces the subreddit categories with their counts. We can also divide by the count of the most frequent category to obtain normalized values:
1.000000    221941
0.442609     98233
0.151207     33559
0.144228     32010
0.115197     25567
Name: subreddit_count_encoded_normalized, dtype: int64
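The count_encode function only encodes the frame it is given. The sketch below shows one way to carry training-set counts over to a validation set, with unseen categories replaced by 1 and an optional log transform, as described earlier. The toy data and variable names are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Made-up training and validation columns; 'videos' never appears in training.
train = pd.Series(['askreddit'] * 6 + ['politics'] * 3 + ['news'])
valid = pd.Series(['politics', 'videos', 'askreddit'])

# Counts come from the training set only.
train_counts = train.value_counts()

# Map the training counts onto the validation set; unseen categories become 1.
valid_encoded = valid.map(train_counts).fillna(1).astype(np.uint32)

# A log transform dampens the effect of very frequent categories.
valid_encoded_log = np.log(valid_encoded)

print(valid_encoded.tolist())
print(valid_encoded_log.round(3).tolist())
```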
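LabelCount encoding
The method described next is called LabelCount encoding. It ranks the categories by their count in the training set (in ascending or descending order). Compared with standard count encoding, LabelCount has particular advantages: it is not sensitive to outliers, and it never assigns the same code to different categories.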
def labelcount_encode(x, categorical_features, ascending=False):
    print('labelcount encoding: {}'.format(categorical_features))
    x_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # value_counts() returns categories sorted from most to least frequent.
        cat_feature_value_counts = x[cat_feature].value_counts()
        value_counts_list = cat_feature_value_counts.index.tolist()
        if ascending:
            # Ascending: the most frequent category gets the highest rank.
            value_counts_range = list(
                reversed(range(len(cat_feature_value_counts))))
        else:
            # Descending: the most frequent category gets rank 0.
            value_counts_range = list(range(len(cat_feature_value_counts)))
        labelcount_dict = dict(zip(value_counts_list, value_counts_range))
        x_[cat_feature] = x[cat_feature].map(
            labelcount_dict)
    x_ = x_.add_suffix('_labelcount_encoded')
    if ascending:
        x_ = x_.add_suffix('_ascending')
    else:
        x_ = x_.add_suffix('_descending')
    x_ = x_.astype(np.uint32)
    return x_
Encode:
train_lc_subreddit = labelcount_encode(x_train, ['subreddit'])
Descending order is the default here. The 5 most popular categories of the subreddit column become:
0    221941
1     98233
2     33559
3     32010
4     25567
Name: subreddit_labelcount_encoded_descending, dtype: int64
askreddit is the most frequent category, so it is converted to 0, i.e. first place.
With ascending order, the same 5 categories are encoded as follows:
40    221941
39     98233
38     33559
37     32010
36     25567
Name: subreddit_labelcount_encoded_ascending, dtype: int64
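To illustrate the collision point, here is a minimal sketch on a made-up column in which two categories are equally frequent. Count encoding gives them the same value, while the rank used by LabelCount encoding stays unique. The data is an assumption for illustration only.

```python
import pandas as pd

# Made-up column in which 'news' and 'gaming' occur equally often.
s = pd.Series(['askreddit'] * 3 + ['news'] * 2 + ['gaming'] * 2)

counts = s.value_counts()

# Count encoding: 'news' and 'gaming' collide on the value 2.
count_codes = counts.to_dict()

# LabelCount encoding (descending): the position in the frequency table is
# unique, although the order among tied categories is arbitrary.
labelcount_codes = dict(zip(counts.index, range(len(counts))))

print(count_codes)
print(labelcount_codes)
```

The trade-off is that the rank discards how large the gap between two neighbouring categories is, which is exactly what makes it insensitive to outliers.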
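Target encoding
Last comes the trickiest method: target encoding. It encodes a categorical variable using the mean of the target variable. We compute a statistic of the target variable (here, the mean) for each group in the training set, and then merge it into the validation and test sets to capture the relationship between the group and the target.
To give a more concrete example, we can compute the mean answer_score within each subreddit, which gives a rough estimate of how many upvotes a post in a particular subreddit can expect.
When using the target variable, it is very important not to leak any information from the validation set. All target-encoding-based features should be computed on the training set and then only merged or joined onto the validation and test sets. Even though the target variable is present in the validation set, it must not be used in any encoding computation; otherwise the estimate of the validation error will be overly optimistic.
With k-fold cross-validation, target-based features should be computed within the folds. With a single split, target encoding should be done after the training and validation sets have been separated.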
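One possible reading of "computed within the folds" is out-of-fold encoding: each row is encoded with group means computed on the other folds only. The sketch below illustrates this on made-up data. The fold construction, the number of folds, and the fallback to the global mean are assumptions for illustration, not the author's implementation.

```python
import numpy as np
import pandas as pd

# Made-up training data.
train = pd.DataFrame({
    'subreddit': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b'],
    'answer_score': [10, 20, 30, 1, 2, 3, 40, 4],
})

n_folds = 4  # hypothetical choice
rng = np.random.RandomState(0)
folds = np.array_split(rng.permutation(len(train)), n_folds)

global_mean = train['answer_score'].mean()
encoded = pd.Series(np.nan, index=train.index, dtype=np.float64)

for holdout_idx in folds:
    holdout_index = train.index[holdout_idx]
    # Group means are computed without the rows that are being encoded.
    rest = train.drop(index=holdout_index)
    group_mean = rest.groupby('subreddit')['answer_score'].mean()
    encoded.loc[holdout_index] = (
        train.loc[holdout_index, 'subreddit'].map(group_mean).values)

# Categories missing from the other folds fall back to the global mean.
train['subreddit_target_encoded'] = encoded.fillna(global_mean)
print(train)
```

Because no row contributes to its own encoding, the feature behaves on the training set more like it will on unseen data.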
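In addition, smoothing can be used to avoid encoding particular categories as 0. Another option is to add random noise to counter possible overfitting.
When handled properly, target encoding is arguably the best encoding for both linear and nonlinear models.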
def target_encode(x, x_valid, categorical_features, x_test=None,
                  target_feature='target'):
    print('target encoding: {}'.format(categorical_features))
    x_ = pd.DataFrame()
    x_valid_ = pd.DataFrame()
    if x_test is not None:
        x_test_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # Group means are computed on the training set only.
        group_target_mean = x.groupby([cat_feature])[target_feature].mean()
        x_[cat_feature] = x[cat_feature].map(group_target_mean)
        x_valid_[cat_feature] = x_valid[cat_feature].map(group_target_mean)
        if x_test is not None:
            # Mapped inside the loop so that every feature is encoded,
            # not only the last one.
            x_test_[cat_feature] = x_test[cat_feature].map(group_target_mean)
    x_ = x_.astype(np.float32)
    x_ = x_.add_suffix('_target_encoded')
    x_valid_ = x_valid_.astype(np.float32)
    x_valid_ = x_valid_.add_suffix('_target_encoded')
    if x_test is not None:
        x_test_ = x_test_.astype(np.float32)
        x_test_ = x_test_.add_suffix('_target_encoded')
        return x_, x_valid_, x_test_
    return x_, x_valid_
Encode:
train_tm_subreddit, valid_tm_subreddit = target_encode(
    x_train, x_valid, categorical_features=['subreddit'],
    target_feature='answer_score')
If we look at the encoded values, we see clear differences in the average number of upvotes across subreddits:
23.406061    220014
13.082699     98176
19.020845     33916
17.521887     31869
18.235424     25520
21.535477     24692
18.640282     20416
23.688890     20009
3.159401      18695
Name: subreddit_target_encoded, dtype: int64
The corresponding subreddits are:
askreddit        220014
politics          98176
news              33916
worldnews         31869
gaming            25520
todayilearned     24692
funny             20416
videos            20009
teenagers         18695
Name: subreddit, dtype: int64
Answers in askreddit receive 23.4 upvotes on average, whereas answers in politics and teenagers receive only 13.1 and 3.2 respectively. Features like this can be very powerful, because they let us explicitly encode some information about the target in the feature set.
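The smoothing and noise ideas mentioned earlier are not part of target_encode. A minimal sketch of one common variant follows: each group mean is shrunk toward the global mean as if m extra observations at the global mean had been added. The formula, the value of m, and the noise scale are assumptions chosen for illustration, not the author's method.

```python
import numpy as np
import pandas as pd

# Made-up training data: category 'b' has a single answer with score 0.
train = pd.DataFrame({
    'subreddit': ['a'] * 4 + ['b'],
    'answer_score': [10, 20, 30, 40, 0],
})

m = 2.0  # hypothetical smoothing strength (acts like m pseudo-observations)
global_mean = train['answer_score'].mean()

stats = train.groupby('subreddit')['answer_score'].agg(['mean', 'count'])
# Shrink each group mean toward the global mean; rare groups shrink the most.
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

encoded = train['subreddit'].map(smoothed)

# Optional: small Gaussian noise on the training encoding only (assumed scale).
rng = np.random.RandomState(0)
encoded_noisy = encoded + rng.normal(0, 0.01 * encoded.std(), size=len(encoded))

print(smoothed)
```

Without smoothing, category 'b' would be encoded as exactly 0 on the basis of a single answer; with smoothing it is pulled toward the global mean instead.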
Getting the encoded values for categories
Without modifying the encoding functions, we can merge the obtained values into the validation or test set as follows:
encoded = train_lc_subreddit.subreddit_labelcount_encoded_descending.value_counts().index.values
raw = x_train.subreddit.value_counts().index.values
encoding_dict = dict(zip(raw, encoded))
x_valid['subreddit_labelcount_encoded_descending'] = x_valid['subreddit'].map(encoding_dict)
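The approach above relies on the two value_counts() results coming back in the same order. An alternative sketch that avoids this assumption pairs each raw category with its code row by row. The toy frames and the stand-in for the encoder output are assumptions for illustration.

```python
import pandas as pd

# Made-up training and validation frames; 'd' never appears in training.
x_train = pd.DataFrame({'subreddit': ['a', 'a', 'a', 'b', 'b', 'c']})
x_valid = pd.DataFrame({'subreddit': ['b', 'c', 'a', 'd']})

# Stand-in for the output of labelcount_encode (descending): a -> 0, b -> 1, c -> 2.
train_lc = pd.Series([0, 0, 0, 1, 1, 2], index=x_train.index)

# Pair each raw category with its code row by row; the dict keeps one entry
# per category, so no ordering assumption is needed.
encoding_dict = dict(zip(x_train['subreddit'], train_lc))

# assign() returns a new frame, which avoids writing into a slice of x.
x_valid = x_valid.assign(
    subreddit_labelcount_encoded_descending=x_valid['subreddit'].map(encoding_dict))

print(x_valid)
```

Categories that never occurred in the training set come out as NaN here and still need a fill value chosen for the encoding in use.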
