scala-sparkML study notes: struct type tinyint size int indices array

  • November 3, 2019
  • Notes

Full title: scala-sparkML study notes: struct type tinyint size int indices array int values array double type

Error type:

CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.

    predictPredict.select("user_id", "probability", "label").coalesce(1)
      .write.format("com.databricks.spark.csv").mode("overwrite")
      .option("header", "true").option("delimiter", "\t").option("nullValue", Const.NULL)
      .save(fileName.predictResultFile + day)

Selecting the probability column of predictPredict and saving it produces the error '`probability`' is of struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt; type, because a DenseVector cannot be written directly to a CSV file. There are two ways to fix this, shown below. The main idea in both is to select the element of the DenseVector holding the predicted probability for class 1, which is a plain double:
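For reference, the same extraction can also be done without dropping to the RDD API at all, by wrapping the element lookup in a UDF. This is only a sketch I am adding here, not from the original post; it assumes the probability column of predictPredict is an org.apache.spark.ml.linalg.Vector and that an active SparkSession is in scope:

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // UDF pulling the class-1 probability out of the vector as a plain double,
    // which the CSV writer can handle directly.
    val positiveProb = udf { v: Vector => v(1) }

    val flat = predictPredict
      .select(col("user_id"), positiveProb(col("probability")).as("probability"), col("label"))

The resulting flat DataFrame can then be written with the same .write.format("com.databricks.spark.csv") call as above.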

    // Imports needed by the code below
    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.ml.linalg.DenseVector
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

    /*
    // Approach 1: map to a tuple RDD and convert back with toDF
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.debug.maxToStringFields", 500)
      .enableHiveSupport
      .appName("QDSpark Pipeline")
      .getOrCreate()
    import spark.implicits._

    val probabilityDataFrame = predictPredict.select("user_id", "probability", "label").rdd
      .map(row => (row.getInt(0), row.getAs[DenseVector](1)(1), row.getDouble(2)))
      .toDF

    probabilityDataFrame.select("_1", "_2", "_3").coalesce(1)
      .write.format("com.databricks.spark.csv").mode("overwrite")
      .option("header", "true").option("delimiter", "\t").option("nullValue", Const.NULL)
      .save(fileName.predictResultFile + day)
    */

    // Approach 2: map to a Row RDD and rebuild the DataFrame with an explicit schema
    val stages = new ArrayBuffer[StructField]()
    stages += StructField("user_id", IntegerType, true)
    stages += StructField("probability", DoubleType, true)
    stages += StructField("label", DoubleType, true)
    val schema = new StructType(stages.toArray)

    val probabilityNewRDD = predictPredict.select("user_id", "probability", "label").rdd
      .map(row => Row(row.getInt(0), row.getAs[DenseVector](1)(1), row.getDouble(2)))
    val probabilityDataFrame = SparkConfTrait.spark.createDataFrame(probabilityNewRDD, schema)

    probabilityDataFrame.select("user_id", "probability", "label").coalesce(1)
      .write.format("com.databricks.spark.csv").mode("overwrite")
      .option("header", "true").option("delimiter", "\t").option("nullValue", Const.NULL)
      .save(fileName.predictResultFile + day)
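The core transformation both approaches share, pulling the class-1 probability out of the per-class probability vector, can be sketched in plain Scala without Spark. The object name and the sample values below are made up for illustration:

```scala
object ProbabilityExtract {
  // A binary classifier's probability vector holds (P(class 0), P(class 1));
  // index 1 is the positive-class probability, a plain double.
  def positiveClassProbability(probs: Array[Double]): Double = probs(1)

  def main(args: Array[String]): Unit = {
    // made-up rows of (user_id, probability vector, label)
    val rows = Seq(
      (1001, Array(0.25, 0.75), 1.0),
      (1002, Array(0.90, 0.10), 0.0)
    )
    // flatten to (user_id, probability, label), mirroring the rdd.map above
    val flat = rows.map { case (id, probs, label) =>
      (id, positiveClassProbability(probs), label)
    }
    flat.foreach(println)
  }
}
```

This is exactly what row.getAs[DenseVector](1)(1) does in the Spark code: the first (1) selects the probability column, and the second (1) indexes into the vector.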