[Yarn Architecture and Implementation Made Simple] 2-1 Overview of Yarn's Base Libraries

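Understanding Yarn's base libraries is a prerequisite for reading the Yarn source code later on. This section gives an overall introduction to those libraries.
It also briefly introduces what the third-party libraries Protocol Buffers and Avro are and how to use them.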

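I. Main libraries used

Protocol Buffers: Google's open-source serialization library. It is platform-neutral, fast, and has good compatibility. YARN uses it for RPC communication: by default, all parameters in YARN RPC are serialized / deserialized with Protocol Buffers.
Apache Avro: a serialization and RPC framework in the Hadoop ecosystem. It is platform-neutral and supports dynamic schemas (no code generation step required). Avro was originally motivated by the poor compatibility and extensibility of Hadoop's RPC.
RPC library: YARN still uses the RPC library from MRv1, but the default serialization method has been replaced with Protocol Buffers.
Service library and event library: YARN turns all objects into services so that they can be managed uniformly (for example, creation and destruction). Services communicate with each other through an event mechanism rather than through direct function calls as in MRv1.
State machine library: YARN uses finite state machines to describe the states of certain objects and the transitions between those states. With the state machine model, YARN's code structure is clearer and easier to follow than MRv1's.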
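
To make the event and state machine idea concrete, here is a minimal sketch of an event-driven finite state machine. It is not YARN's state machine library: the class name, the states, and the events are all hypothetical and chosen only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical names throughout; this is NOT YARN's state machine library.
class JobStateMachineSketch {
    enum State { NEW, RUNNING, SUCCEEDED, KILLED }
    enum Event { START, FINISH, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);

    static {
        addTransition(State.NEW, Event.START, State.RUNNING);
        addTransition(State.NEW, Event.KILL, State.KILLED);
        addTransition(State.RUNNING, Event.FINISH, State.SUCCEEDED);
        addTransition(State.RUNNING, Event.KILL, State.KILLED);
    }

    private static void addTransition(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
                .put(on, to);
    }

    private State current = State.NEW;

    State getCurrent() {
        return current;
    }

    // Handle one event: follow the registered transition, or reject the event.
    State handle(Event event) {
        Map<Event, State> byEvent = TRANSITIONS.get(current);
        State next = (byEvent == null) ? null : byEvent.get(event);
        if (next == null) {
            throw new IllegalStateException(
                    "Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        JobStateMachineSketch job = new JobStateMachineSketch();
        System.out.println(job.handle(Event.START));  // prints "RUNNING"
        System.out.println(job.handle(Event.FINISH)); // prints "SUCCEEDED"
    }
}
```

Keeping every legal transition in one table is what makes this style easy to read: an event that is not registered for the current state is rejected in a single place instead of being checked in scattered if/else branches.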

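II. Introduction to the third-party open-source libraries

I) Protocol Buffers

1. Brief introduction

Protocol Buffers is Google's open-source, language-neutral and platform-neutral mechanism for serializing structured data. Its small footprint, efficiency, and compatibility-friendly design have led to wide adoption.
[It is comparable to Java's built-in Serializable mechanism: both turn objects into bytes and back, although Protocol Buffers output is not tied to Java.]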

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Core features:

Language- and platform-neutral
Concise
High performance
Good compatibility
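
For reference, the Serializable comparison made earlier can be sketched with the JDK alone. The Student class here is a hypothetical stand-in written for this example. Unlike Protocol Buffers, the bytes produced this way embed Java class metadata and can only be read back by Java code that has a compatible class on its classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializableSketch {
    // Hypothetical class, used only for this illustration.
    static class Student implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int id;

        Student(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // Object -> bytes using Java's built-in serialization.
    static byte[] serialize(Student student) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(student);
        }
        return bytes.toByteArray();
    }

    // Bytes -> object; the Student class must be available on the classpath.
    static Student deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Student) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialize(new Student("San Zhang", 11111));
        Student copy = deserialize(data);
        System.out.println(copy.name + " " + copy.id + " (" + data.length + " bytes)");
    }
}
```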

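2. Setting up the environment

Using macOS as an example (for other platforms, please check the relevant documentation)

# 1) Install with brew
brew install protobuf

# Check the install location
$ which protoc
/opt/homebrew/bin/protoc


# 2) Configure the environment variable
vim ~/.zshrc

# Add the following two lines to ~/.zshrc:
# protoc (for hadoop)
export PROTOC="/opt/homebrew/bin/protoc"

source ~/.zshrc


# 3) Check the protobuf version
$ protoc --version
libprotoc 3.19.1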

3. Writing a demo

1) Create a Maven project and add the dependency

<dependencies>
  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.19.1</version>  <!-- must match the installed protoc version -->
  </dependency>
</dependencies>

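2) In the project root, create the protobuf message definition file student.proto

For the proto data types and syntax, see the tutorial "ProtoBuf 入門教程" (an introductory ProtoBuf tutorial).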

syntax = "proto3"; // declares this as a protobuf 3 definition file
package tutorial;

option java_package = "com.shuofxz.learning.student";  // package of the generated file
option java_outer_classname = "StudentProtos";         // name of the generated outer class

message Student {                    // the structured data being described
    string name = 1;
    int32 id = 2;
    optional string email = 3;       // optional: may be left unset; presence is tracked (hasEmail())

    message PhoneNumber {            // nested message
        string number = 1;
        optional int32 type = 2;
    }

    repeated PhoneNumber phone = 4;  // repeated field
}

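3) Use the protoc tool to generate the Java class for the message (run this in the directory containing the proto file; the src/main/java output directory must already exist)

protoc -I=. --java_out=src/main/java student.proto

The generated StudentProtos.java can then be found under src/main/java/com/shuofxz/learning/student/. It contains the serialization and deserialization methods. The class below uses it to write a Student to a file and read it back.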

import com.shuofxz.learning.student.StudentProtos;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class StudentExample {
    public static void main(String[] args) {
        // Build a Student message with two nested PhoneNumber entries
        StudentProtos.Student student1 = StudentProtos.Student.newBuilder()
                .setName("San Zhang")
                .setEmail("[email protected]")
                .setId(11111)
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("13911231231")
                        .setType(0))
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("01082345678")
                        .setType(1))
                .build();

        // Write to a file (the content is binary, despite the .txt name);
        // try-with-resources closes the stream even if writing fails
        try (FileOutputStream output = new FileOutputStream("example.txt")) {
            student1.writeTo(output);
        } catch (IOException e) {
            System.out.println("Write Error! " + e.getMessage());
        }

        // Read back from the file
        try (FileInputStream input = new FileInputStream("example.txt")) {
            StudentProtos.Student student2 = StudentProtos.Student.parseFrom(input);
            System.out.println("Student2:" + student2);
        } catch (IOException e) {
            System.out.println("Read Error! " + e.getMessage());
        }
    }
}

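That is the complete workflow for using Protocol Buffers. There is nothing difficult about it: we call a third-party serialization library to serialize an object to a file and then deserialize it back.
The only extra steps are defining the data structure in a proto file first and generating the corresponding classes.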
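
To give a feel for what ends up in example.txt, the sketch below hand-encodes two of the Student fields (name = 1 and id = 2) following the documented Protocol Buffers wire format: each field is a tag, computed as (field number << 3) | wire type, followed by a varint or a length-prefixed payload. The class and helper names are hypothetical and exist only for this illustration. Real code should keep using the generated StudentProtos class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustration only: hand-rolled encoder for two fields of the Student message.
class WireFormatSketch {
    static final int WIRE_VARINT = 0;            // int32, int64, bool, enum
    static final int WIRE_LENGTH_DELIMITED = 2;  // string, bytes, nested messages

    // Base-128 varint: 7 bits per byte, least-significant group first,
    // high bit set on every byte except the last one.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    // Encodes "string name = 1" followed by "int32 id = 2".
    static byte[] encodeStudent(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        writeTag(out, 1, WIRE_LENGTH_DELIMITED);
        writeVarint(out, nameBytes.length);
        out.write(nameBytes, 0, nameBytes.length);
        writeTag(out, 2, WIRE_VARINT);
        writeVarint(out, id);
        return out.toByteArray();
    }

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "0A 09 53 61 6E 20 5A 68 61 6E 67 10 E7 56"
        System.out.println(toHex(encodeStudent("San Zhang", 11111)));
    }
}
```

The name and the id together take 14 bytes, and no field names appear in the output. Only the field numbers from the proto file are written, which is why those numbers should stay stable once data has been stored.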

4. Use in Yarn

In YARN, the parameters of all RPC functions are defined with Protocol Buffers. The RPC layer itself is still the one from MRv1.
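
The split between "the RPC layer stays, the parameter serialization changes" can be sketched in-process as follows. This is not Hadoop's RPC engine or its API: EchoProtocol, Serializer, JavaSerializer and clientProxy are hypothetical names, there is no network involved, and Java serialization stands in for Protocol Buffers so that the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical sketch; NOT Hadoop's RPC implementation.
class RpcSketch {
    // Hypothetical protocol interface, not a real YARN protocol.
    interface EchoProtocol {
        String echo(String message);
    }

    // Pluggable argument serialization: the RPC layer only ever sees bytes.
    interface Serializer {
        byte[] write(Object[] args) throws IOException;
        Object[] read(byte[] data) throws IOException, ClassNotFoundException;
    }

    // Stand-in based on Java serialization; YARN uses Protocol Buffers in this role.
    static class JavaSerializer implements Serializer {
        public byte[] write(Object[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(args);
            }
            return bytes.toByteArray();
        }

        public Object[] read(byte[] data) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                return (Object[]) in.readObject();
            }
        }
    }

    // Builds a client-side proxy. A real RPC layer would send the request bytes
    // over the network; here the "server" half runs in the same process.
    static <T> T clientProxy(Class<T> protocol, T serverImpl, Serializer serializer) {
        InvocationHandler handler = (proxy, method, args) -> {
            byte[] request = serializer.write(args);      // client: marshal arguments
            Object[] decoded = serializer.read(request);  // server: unmarshal arguments
            return method.invoke(serverImpl, decoded);    // server: call the implementation
        };
        Object instance = Proxy.newProxyInstance(
                protocol.getClassLoader(), new Class<?>[] {protocol}, handler);
        return protocol.cast(instance);
    }

    public static void main(String[] args) {
        EchoProtocol server = message -> "echo: " + message;
        EchoProtocol client = clientProxy(EchoProtocol.class, server, new JavaSerializer());
        System.out.println(client.echo("hello")); // prints "echo: hello"
    }
}
```

Swapping JavaSerializer for another Serializer implementation would change the bytes on the wire without touching the proxy code, which is the design idea behind keeping the MRv1 RPC layer while changing the default serialization.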

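II) Apache Avro

1. Brief introduction

Apache Avro began as a subproject of Hadoop. It is both a serialization framework and an RPC implementation.
However, because Avro was not yet mature in the early days of the Yarn project, Yarn uses Avro only as a log serialization library: all events are serialized with Avro.
Features:

Rich data structure types;
A compact, fast binary data format;
A container file for storing persistent data;
Remote procedure call (RPC);
Simple integration with dynamic languages.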

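Compared with Apache Thrift and Google's Protocol Buffers, Apache Avro has the following characteristics:

Dynamic schemas. Avro does not require code generation, which helps when building generic data processing systems and avoids intrusive generated code.
Untagged data. Because the schema is available when the data is read, Avro needs to keep much less type information in the encoded data, which reduces the serialized size.
No manually assigned field IDs. Thrift and Protocol Buffers identify each field by a user-assigned integer, whereas Avro uses the field names directly, which is more intuitive and easier to extend.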
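
The last two points can be illustrated with a conceptual sketch in plain Java. It does not use the Avro library: schemas are reduced to ordered lists of field names, the "encoded" record is simply a list of values, and all names are hypothetical. Real Avro uses JSON schemas and a binary encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration only; this does NOT use the Avro library.
class SchemaResolutionSketch {
    // Writer side: values are stored in writer-schema order, with no per-field tags.
    static List<Object> encode(List<String> writerSchema, Map<String, Object> record) {
        List<Object> values = new ArrayList<>();
        for (String field : writerSchema) {
            values.add(record.get(field));
        }
        return values;
    }

    // Reader side: use the writer schema to name each value, then keep the fields
    // the reader schema asks for, falling back to the reader's default when the
    // writer did not have that field. Fields are matched by name, not by number.
    static Map<String, Object> decode(List<String> writerSchema, List<Object> values,
                                      Map<String, Object> readerFieldsWithDefaults) {
        Map<String, Object> written = new HashMap<>();
        for (int i = 0; i < writerSchema.size(); i++) {
            written.put(writerSchema.get(i), values.get(i));
        }
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFieldsWithDefaults.entrySet()) {
            String name = field.getKey();
            result.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> writerSchema = Arrays.asList("name", "id");
        Map<String, Object> record = new HashMap<>();
        record.put("name", "San Zhang");
        record.put("id", 11111);
        List<Object> encoded = encode(writerSchema, record);

        // The reader expects a different field order plus a newer "email" field.
        Map<String, Object> readerSchema = new LinkedHashMap<>();
        readerSchema.put("id", 0);
        readerSchema.put("name", "");
        readerSchema.put("email", "unknown");

        // prints "{id=11111, name=San Zhang, email=unknown}"
        System.out.println(decode(writerSchema, encoded, readerSchema));
    }
}
```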

2. Environment setup & demo

See: "Avro學習入門" (an introductory Avro tutorial)

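3. Use in Yarn

Apache Avro was originally an RPC framework tailor-made for Hadoop. For stability reasons, YARN for now uses Protocol Buffers as its serialization library and keeps the RPC from MRv1, while Avro serves as the log serialization library. In YARN MapReduce, all events are serialized / deserialized with Avro, and the definitions are in the Events.avpr file.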

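III. Summary

This section briefly introduced five important base libraries in Yarn. Knowing them helps in understanding Yarn's code logic and how data is passed around.
It also looked more closely at the two third-party open-source libraries: Protocol Buffers serializes and deserializes RPC function parameters, and Avro is the serialization library for logs and events.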
