你遇到了吗？Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException)

2019 年 11 月 7 日
筆記

我在使用 Structured Streaming 的 ForeachWriter，写 HDFS 文件时，出现了这个异常

这个异常出现的原因是HDFS作为一个分布式文件系统，支持多线程读，但是不支持多线程写入。所以HDFS引入了一个时间类型的锁机制，也就是HDFS的租约机制（** lease holder**）。
这个知识点来源于这篇文章 http://blog.csdn.net/weixin_44252761/article/details/89517393

大数据计算时，多线程与分布式的并行可以很好的加速数据的处理速度。可在大数据存储时，分布式的文件存储系统对并发的写请求支持存在天然的缺陷。这是一对天然的矛盾，暂时无法解决，只能缓和。

怎么缓和呢？不得不崇拜Spark开发者的智商，非常的简单和实用。不能同时写一个文件，但是可以同时写多个文件啊，只要我（spark或者程序）认为这多个文件是一个文件，那写一个和多个就没有区别了。

按照这个想法，修改我的代码，真正代码篇幅太长，主要就是一个地方：
将val hdfsWritePath = new Path(path) 改为 val hdfsWritePath = new Path(path + "/" + partitionId) 即可。

有兴趣的朋友可以看看更全面的代码，原来的源代码如下：

       inputStream match {              case Some(is) =>                  is.writeStream                          .foreach(new ForeachWriter[Row]() {                              var successBufferedWriter: Option[BufferedWriter] = None                                def openHdfs(path: String, partitionId: Long, version: Long): Option[BufferedWriter] = {                                  val configuration: Configuration = new Configuration()                                  configuration.set("fs.defaultFS", hdfsAddr)                                    val fileSystem: FileSystem = FileSystem.get(configuration)                                  val hdfsWritePath = new Path(path)                                    val fsDataOutputStream: FSDataOutputStream =                                      if (fileSystem.exists(hdfsWritePath))                                          fileSystem.append(hdfsWritePath)                                      else                                          fileSystem.create(hdfsWritePath)                                    Some(new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8)))                              }                                override def open(partitionId: Long, version: Long): Boolean = {                                  successBufferedWriter =                                          if (successBufferedWriter.isEmpty) openHdfs(successPath, partitionId, version)                                          else successBufferedWriter                                  true                              }                                override def process(value: Row): Unit = {                                  successBufferedWriter.get.write(value.mkString(","))                                  successBufferedWriter.get.newLine()                              }                                override def close(errorOrNull: Throwable): Unit = {                                  successBufferedWriter.get.flush()                                  successBufferedWriter.get.close()                              }                          })                          .start()                          .awaitTermination()

上述代码初看没问题，却会导致标题错误，修改如下：

       inputStream match {              case Some(is) =>                  is.writeStream                          .foreach(new ForeachWriter[Row]() {                              var successBufferedWriter: Option[BufferedWriter] = None                                def openHdfs(path: String, partitionId: Long, version: Long): Option[BufferedWriter] = {                                  val configuration: Configuration = new Configuration()                                  configuration.set("fs.defaultFS", hdfsAddr)                                    val fileSystem: FileSystem = FileSystem.get(configuration)                                  val hdfsWritePath = new Path(path + "/" + partitionId)                                    val fsDataOutputStream: FSDataOutputStream =                                      if (fileSystem.exists(hdfsWritePath))                                          fileSystem.append(hdfsWritePath)                                      else                                          fileSystem.create(hdfsWritePath)                                    Some(new BufferedWriter(new OutputStreamWriter(fsDataOutputStream, StandardCharsets.UTF_8)))                              }                                override def open(partitionId: Long, version: Long): Boolean = {                                  successBufferedWriter =                                          if (successBufferedWriter.isEmpty) openHdfs(successPath, partitionId, version)                                          else successBufferedWriter                                  true                              }                                override def process(value: Row): Unit = {                                  successBufferedWriter.get.write(value.mkString(","))                                  successBufferedWriter.get.newLine()                              }                                override def close(errorOrNull: Throwable): Unit = {                                  successBufferedWriter.get.flush()                                  successBufferedWriter.get.close()                              }                          })                          .start()                          .awaitTermination()

如此轻松（其实困扰了我一天）就解决了这个可能大家都会遇到的问题，读取时路径到 successPath 即可，分享出来。

如果有什么问题或不足，希望大家可以与我联系，共同进步。

完~~~~

你遇到了吗？Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException)

VirMach 便宜 VPS

QNews

你遇到了吗？Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException)

分享此文：

Related Posts

Codeforces1131G Most Dangerous Shark

java中 字符串的构造方法和直接创建

屌丝的程序员的梦想 (九) 回老家创业失败 顺便结了个婚

Linux下搭建.NetCore3.0环境及创建Asp.NetCore3.0 Web 项目

VirMach 便宜 VPS

QNews

熱門搜尋

java中字符串的构造方法和直接创建

屌丝的程序员的梦想 (九) 回老家创业失败顺便结了个婚