Handling Categorical Variables in Modeling (Notes, Part 1)

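This post is based on Section 1 of Chapter 4, "Representing Data and Engineering Features", of the reference book Introduction to Machine Learning with Python (《Python机器学习基础教程》).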

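My own rough understanding: mathematical modeling is built on mathematical expressions, and mathematical expressions only accept numbers (continuous variables), not strings (categorical variables). Converting the strings in the collected data into numbers is part of what scientists have given the rather grand name feature engineering.

Take the example data used in this section: the incomes of adults in the United States in 1994. The task for this dataset is to predict whether a worker's income is above or below 50,000 US dollars. The variables in the dataset include: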

  • age
  • workclass
  • education
  • gender
  • hours-per-week
  • occupation
  • income

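Here age and hours-per-week are continuous features, while workclass (type of employer), education, gender and occupation are categorical variables.

One way to handle this situation is one-hot encoding (also called one-out-of-N encoding, or dummy variables). The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1, because 0 and 1 are values a mathematical formula can work with. Take the following data, for example: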

seq  gender  income  hours-per-week
1    Male    50,000  50
2    Female  60,000  40

After the conversion it takes this form:

seq  Male  Female  income  hours-per-week
1    1     0       50,000  50
2    0     1       60,000  40

One way to perform this conversion in Python is the get_dummies() function in pandas.
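As a minimal sketch of the conversion described above, the two-row example can be built by hand and passed to get_dummies(). The names `toy` and `encoded` are made up for illustration. `dtype=int` is passed so that the new columns hold 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hand-built copy of the two-row example table (illustrative only)
toy = pd.DataFrame({
    "seq": [1, 2],
    "gender": ["Male", "Female"],
    "income": [50000, 60000],
    "hours-per-week": [50, 40],
})

# Encode only the categorical column; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)
```

Note that pandas names the new columns gender_Female and gender_Male (original column name plus value), rather than plain Male and Female.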

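Next, I reproduce the example from the book.

Step 1: Download the dataset

Search for the keyword adult.data with a search engine to find the download address http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data . You can copy the contents into a text file, or fetch the file with Python, which makes for a very simple web-scraping example.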

  • Python code for fetching the file
from urllib.request import urlopen

# Download the raw file and decode it as text
with urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data") as response:
    adult_data = response.read().decode('utf-8')

# Save it locally as adult.data
with open('adult.data', "w", encoding="utf-8") as fw:
    fw.write(adult_data)
References
  • https://blog.csdn.net/xman4code/article/details/80989601
  • https://www.jianshu.com/p/cfbdacbeac6e
Step 2: Data processing and modeling
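import pandas as pd

# adult.data has no header row, so the column names are supplied by hand.
# Fields in the raw file are separated by ", "; skipinitialspace=True drops the
# leading space so that values read as "Private" and "?" rather than " Private" and " ?".
df = pd.read_csv('adult.data', header=None, index_col=False, skipinitialspace=True,
                 names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                        'marital-status', 'occupation', 'relationship', 'race', 'gender',
                        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
df.head()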
Output
   age         workclass  fnlwgt  education  education-num
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13

       marital-status         occupation   relationship   race  gender
0       Never-married       Adm-clerical  Not-in-family  White    Male
1  Married-civ-spouse    Exec-managerial        Husband  White    Male
2            Divorced  Handlers-cleaners  Not-in-family  White    Male
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

   capital-gain  capital-loss  hours-per-week native-country income
0          2174             0              40  United-States  <=50K
1             0             0              13  United-States  <=50K
2             0             0              40  United-States  <=50K
3             0             0              40  United-States  <=50K
4             0             0              40           Cuba  <=50K

Select the variables of interest

df = df[['age','workclass','education','gender','hours-per-week','occupation','income']]

To check the categorical data for anomalies, use the value_counts() function, which shows each unique value and how many times it occurs.

# Print the value counts of every string-valued (object dtype) column
for col in df.columns:
    if df[col].dtype == "object":
        print(df[col].value_counts())

Output

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
Male      21790
Female    10771
Name: gender, dtype: int64
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
<=50K    24720
>50K      7841
Name: income, dtype: int64
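To make the check concrete, here is a small sketch of value_counts() on a hand-made Series. The data and the names `gender` and `counts` are invented for illustration; the point is only that an unexpected label such as "?" shows up as its own row in the result.

```python
import pandas as pd

# Toy column with one placeholder value mixed in (illustrative data)
gender = pd.Series(["Male", "Female", "Male", "?"])

# One row per unique value, with the number of occurrences
counts = gender.value_counts()
print(counts)
```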

The output shows that the workclass and occupation variables contain "?" entries. Next, delete the rows that contain a question mark.

df = df[df['occupation'] != "?"]
df = df[df['workclass'] != "?"]
References
  • https://www.cnblogs.com/cocowool/p/8421997.html
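An alternative to filtering on "?" after loading is to let read_csv treat "?" as a missing value and then drop the incomplete rows. The sketch below is not from the book. It reads three rows from an in-memory string instead of the real file (the middle row is made up), and the names `raw`, `sample` and `cleaned` are illustrative.

```python
import io
import pandas as pd

# Three rows in the same ", "-separated layout as adult.data (middle row is invented)
raw = io.StringIO(
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "38, Private, Handlers-cleaners\n"
)

sample = pd.read_csv(
    raw,
    header=None,
    names=["age", "workclass", "occupation"],
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # read "?" as a missing value (NaN)
)

# Drop every row that has at least one missing value
cleaned = sample.dropna()
print(cleaned)
```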

Convert the categorical variables with the get_dummies() function

# Encode every string-valued column; numeric columns are left as they are
df_dummies = pd.get_dummies(df)
print("Features after get_dummies:\n", list(df_dummies.columns))

Output

Features after get_dummies:   ['age', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'gender_Female', 'gender_Male', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'income_<=50K', 'income_>50K']
Next, train a logistic regression classifier.
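# The book uses .ix here; .loc is its label-based replacement.
# A label slice with .loc includes both endpoints, so this selects every column
# from 'age' through 'occupation_Transport-moving': all the features, but not
# the two income columns.
features = df_dummies.loc[:, 'age':'occupation_Transport-moving']
X = features.values
Y = df_dummies['income_>50K'].values
print("X.shape: {} Y.shape:{}".format(X.shape, Y.shape))
# Output
# X.shape: (30718, 41) Y.shape:(30718,)

When I ran the original version with .ix, I got the warning below (.ix has since been removed from pandas, as of version 1.0, so .loc is used above):

C:\Users\mingy\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.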
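The difference between the two replacements named in the warning can be seen on a tiny frame. This is only a sketch; the one-row DataFrame and the column name job_A are made up. A label slice with .loc includes the end label, while a positional slice with .iloc excludes the end position.

```python
import pandas as pd

# One-row frame standing in for df_dummies (illustrative columns)
frame = pd.DataFrame({"age": [39], "hours": [40], "job_A": [1], "income_>50K": [0]})

# Label-based: both 'age' and 'job_A' are included
by_label = frame.loc[:, "age":"job_A"]

# Position-based: columns 0, 1, 2 (position 3 is excluded)
by_position = frame.iloc[:, 0:3]

print(list(by_label.columns))
print(list(by_position.columns))
```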

Train the model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split into training and test sets, then fit the classifier on the training set
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Output
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
#           penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
#           verbose=0, warm_start=False)

# Accuracy on the held-out test set
print("Test score:{:.2f}".format(logreg.score(X_test, y_test)))
# Output
# Test score:0.81
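For a classifier, score() returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below checks this on a tiny synthetic dataset rather than on the adult data. `X_demo`, `y_demo`, `model` and `manual_accuracy` are illustrative names, and the numbers have no connection to the result above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny one-feature dataset (synthetic, for illustration only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

# Accuracy computed by hand: share of predictions equal to the true labels
manual_accuracy = np.mean(model.predict(X_demo) == y_demo)

print(manual_accuracy, model.score(X_demo, y_demo))
```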