Handling Categorical Variables in Modeling (Notes, Part 1)
- March 3, 2020
- Notes
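This post is based on the first section of Chapter 4, "Representing Data and Engineering Features", of the reference book Introduction to Machine Learning with Python (《Python机器学习基础教程》).
My own simplest understanding: mathematical modeling rests on mathematical expressions, and those expressions only accept numbers (continuous variables), not strings (categorical variables). Converting the strings in collected data into numbers is part of what goes by the rather fancy name of feature engineering. Take the example data used in this section: the 1994 US adult income dataset, where the task is to predict whether a worker's income is above or below 50,000 US dollars. The variables in the dataset include: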
- age
- workclass
- education
- gender
- hours-per-week
- occupation
- income
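Here age and hours-per-week are continuous features, while workclass, education, gender, and occupation are categorical variables. One way to handle the categorical ones is one-hot encoding (also called one-out-of-N encoding, or dummy variables). The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1, which are meaningful inside a mathematical formula. For example, the dataset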
seq | gender | income | hours-per-week |
---|---|---|---|
1 | Male | 50,000 | 50 |
2 | Female | 60,000 | 40 |
becomes the following after conversion:
seq | Male | Female | income | hours-per-week |
---|---|---|---|---|
1 | 1 | 0 | 50,000 | 50 |
2 | 0 | 1 | 60,000 | 40 |
One way to perform this conversion in Python is the get_dummies() function in pandas.
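To make the toy table above concrete, here is a minimal sketch that rebuilds it and one-hot encodes the gender column. The variable names `toy` and `encoded` and the `dtype=int` argument are my own choices for illustration (recent pandas versions otherwise return boolean columns). Note that get_dummies() names the new columns gender_Female and gender_Male rather than plain Female and Male.

```python
import pandas as pd

# Rebuild the two-row example table (names here are illustrative only)
toy = pd.DataFrame({
    "seq": [1, 2],
    "gender": ["Male", "Female"],
    "income": [50000, 60000],
    "hours-per-week": [50, 40],
})

# Encode only the gender column; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the two gender columns, which is where the name one-hot comes from.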
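Next, I reproduce the example from the book.
Step 1: download the dataset
Searching for the keyword adult.data turns up the download address http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You can copy the contents into a text file, or fetch the file with Python, which makes for a very simple web-scraping exercise.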
Python download code:

    from urllib.request import urlopen

    # Fetch the raw dataset and save it locally as adult.data
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    with urlopen(url) as response:
        adult_data = response.read().decode('utf-8')

    with open('adult.data', "w", encoding="utf-8") as fw:
        fw.write(adult_data)
References
- https://blog.csdn.net/xman4code/article/details/80989601
- https://www.jianshu.com/p/cfbdacbeac6e
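Step 2: data processing and modeling

    import pandas as pd

    # adult.data has no header row, so the column names are supplied explicitly.
    # Fields are separated by ", "; skipinitialspace=True strips the leading
    # space so values read as "Private" and "?" rather than " Private" and " ?".
    df = pd.read_csv(
        'adult.data', header=None, index_col=False, skipinitialspace=True,
        names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
               'marital-status', 'occupation', 'relationship', 'race', 'gender',
               'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
               'income'])
    df.head()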
Output

       age         workclass  fnlwgt  education  education-num
    0   39         State-gov   77516  Bachelors             13
    1   50  Self-emp-not-inc   83311  Bachelors             13
    2   38           Private  215646    HS-grad              9
    3   53           Private  234721       11th              7
    4   28           Private  338409  Bachelors             13

           marital-status         occupation   relationship   race  gender
    0       Never-married       Adm-clerical  Not-in-family  White    Male
    1  Married-civ-spouse    Exec-managerial        Husband  White    Male
    2            Divorced  Handlers-cleaners  Not-in-family  White    Male
    3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
    4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

       capital-gain  capital-loss  hours-per-week native-country income
    0          2174             0              40  United-States  <=50K
    1             0             0              13  United-States  <=50K
    2             0             0              40  United-States  <=50K
    3             0             0              40  United-States  <=50K
    4             0             0              40           Cuba  <=50K
Select the variables of interest
df = df[['age','workclass','education','gender','hours-per-week','occupation','income']]
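To check the categorical data for anomalies, use the value_counts() function, which shows each unique value and how often it occurs.

    # Print the unique values and their counts for every string-typed column
    for col in df.columns:
        if df[col].dtype == "object":
            print(df[col].value_counts())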
Output

    Private             22696
    Self-emp-not-inc     2541
    Local-gov            2093
    ?                    1836
    State-gov            1298
    Self-emp-inc         1116
    Federal-gov           960
    Without-pay            14
    Never-worked            7
    Name: workclass, dtype: int64
    HS-grad         10501
    Some-college     7291
    Bachelors        5355
    Masters          1723
    Assoc-voc        1382
    11th             1175
    Assoc-acdm       1067
    10th              933
    7th-8th           646
    Prof-school       576
    9th               514
    12th              433
    Doctorate         413
    5th-6th           333
    1st-4th           168
    Preschool          51
    Name: education, dtype: int64
    Male      21790
    Female    10771
    Name: gender, dtype: int64
    Prof-specialty       4140
    Craft-repair         4099
    Exec-managerial      4066
    Adm-clerical         3770
    Sales                3650
    Other-service        3295
    Machine-op-inspct    2002
    ?                    1843
    Transport-moving     1597
    Handlers-cleaners    1370
    Farming-fishing       994
    Tech-support          928
    Protective-serv       649
    Priv-house-serv       149
    Armed-Forces            9
    Name: occupation, dtype: int64
    <=50K    24720
    >50K      7841
    Name: income, dtype: int64
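The output shows that the workclass and occupation variables contain "?" entries, which mark unknown values. Next, remove the rows that contain a question mark.

    # Keep only the rows whose occupation and workclass are known
    df = df[df['occupation'] != "?"]
    df = df[df['workclass'] != "?"]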
References
- https://www.cnblogs.com/cocowool/p/8421997.html
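As an alternative to comparing against the string "?", the question marks can be turned into real missing values at load time and dropped with dropna(). This is a swapped-in technique, not what the book does, and the sketch below runs on three made-up rows held in memory rather than on the real adult.data file; the names `sample`, `toy`, and `cleaned` are hypothetical.

```python
import io
import pandas as pd

# Three made-up rows in the same ", "-separated layout as adult.data
sample = io.StringIO(
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "38, Private, Handlers-cleaners\n"
)

# skipinitialspace strips the space after each comma, so "?" matches na_values
toy = pd.read_csv(
    sample, header=None, names=["age", "workclass", "occupation"],
    skipinitialspace=True, na_values="?",
)

# Drop every row where either categorical column is missing
cleaned = toy.dropna(subset=["workclass", "occupation"])
print(cleaned)
```

One possible advantage of this approach is that isna() and dropna() then work on the missing entries directly, instead of treating "?" as just another category.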
Use the get_dummies() function to convert the categorical variables

    df_dummies = pd.get_dummies(df)
    print("Features after get_dummies:\n", list(df_dummies.columns))
Output
Features after get_dummies: ['age', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'gender_Female', 'gender_Male', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'income_<=50K', 'income_>50K']
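The column list above follows a simple pattern: numeric columns such as age pass through unchanged, and each categorical column is replaced by one `<column>_<value>` column per unique value. A small sketch on made-up data (the `toy` frame and the helper names are mine, not from the book) checks that the resulting width is the number of numeric columns plus the number of distinct categories:

```python
import pandas as pd

# Made-up rows with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
    "gender": ["Male", "Male", "Female", "Female"],
})

toy_dummies = pd.get_dummies(toy, dtype=int)

# Columns that are not numeric are the ones get_dummies() expands here
categorical = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]
numeric_count = toy.shape[1] - len(categorical)
expected_width = numeric_count + sum(toy[c].nunique() for c in categorical)

print(list(toy_dummies.columns))
print(toy_dummies.shape[1], expected_width)
```

The same count applies to the real output above: 2 numeric columns plus 7, 16, 2, 14, and 2 categories for workclass, education, gender, occupation, and income give 43 columns.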
Next, train a logistic regression classifier
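    # Label-based slice: .loc includes both endpoints, so this keeps every
    # column from 'age' through 'occupation_Transport-moving' and leaves out
    # the two income columns, which hold the target.
    features = df_dummies.loc[:, 'age':'occupation_Transport-moving']
    X = features.values
    Y = df_dummies['income_>50K'].values
    print("X.shape: {} Y.shape:{}".format(X.shape, Y.shape))
    # Output
    # X.shape: (30718, 41) Y.shape:(30718,)

The book writes this statement with .ix instead of .loc. When I ran it that way I got the warning below; .ix was later removed entirely in pandas 1.0, so .loc is the form to use.

    C:\Users\mingy\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
    .ix is deprecated. Please use
    .loc for label based indexing or
    .iloc for positional indexing

    See the documentation here:
    http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
      """Entry point for launching an IPython kernel.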
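Since the column slice was the part I found hardest to read, here is a minimal sketch of label-based slicing on a made-up frame (the `toy` frame and its values are illustrative only). Unlike ordinary Python slices, a .loc slice includes the end label, so everything from the first named column through the last named column is kept.

```python
import pandas as pd

# Made-up frame laid out like df_dummies: features first, income columns last
toy = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "gender_Female": [0, 1],
    "gender_Male": [1, 0],
    "income_<=50K": [1, 0],
    "income_>50K": [0, 1],
})

# All rows, columns from 'age' through 'gender_Male' inclusive
features = toy.loc[:, "age":"gender_Male"]
X = features.values
y = toy["income_>50K"].values

print(list(features.columns))
print(X.shape, y.shape)
```

Because the two income columns sit after the end label, they stay out of X, and only income_>50K is used as the target.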
Train the model

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hold out a test set (train_test_split defaults to a 75/25 split)
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    # Output
    # LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    #     intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
    #     penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
    #     verbose=0, warm_start=False)
    print("Test score:{:.2f}".format(logreg.score(X_test, y_test)))
    # Output
    # Test score:0.81