從零寫一個編譯器(九):語義分析之構造抽象語法樹(AST)

  • 2019 年 10 月 3 日
  • 筆記

項目的完整代碼在 C2j-Compiler

前言

在上一篇完成了符號表的構建,下一步就是輸出抽象語法樹(Abstract Syntax Tree,AST)

抽象語法樹(abstract syntax tree 或者縮寫為 AST),是源代碼的抽象語法結構的樹狀表現形式,這裡特指編程語言的源代碼。樹上的每個節點都表示源代碼中的一種結構。

AST對於編譯器是至關重要的,現在的編譯型語言一般通過AST來生成IR,解釋型語言也可以不用虛擬機而直接遍歷AST來解釋執行,之後要寫解釋器和編譯器都依賴這個AST

這一篇主要文件有:

  • AstBuilder.java
  • AstNode.java
  • AstNodeImpl.java
  • NodeKey.java
  • NodeFactory.java

主要數據結構

AST節點的表示

public interface AstNode {        AstNode addChild(AstNode node);        AstNode getParent();        ArrayList<AstNode> getChildren();        void setAttribute(NodeKey key, Object value);        Object getAttribute(NodeKey key);        boolean isChildrenReverse();        void reverseChildren();        AstNode copy();    }

這是對AstNode接口的實現,並且繼承HashMap,這裡的NodeKey是

TokenType, VALUE, SYMBOL, PRODUCTION, TEXT

對應的value,

  1. TokenType就是非終結符的類型
  2. Text用來存儲解析對象的文本信息
  3. Symbol對應的就是變量的符號對象
  4. Value是對應對象解析的值,比如int a = 1,那麼value的值就為1
public class AstNodeImpl extends HashMap<NodeKey, Object> implements AstNode {      private Token type;      private AstNodeImpl parent;      private ArrayList<AstNode> children;      String name;        private boolean isChildrenReverse = false;        public AstNodeImpl(Token type) {          this.type = type;          this.parent = null;          this.children = new ArrayList<>();          setAttribute(NodeKey.TokenType, type);      }        @Override      public AstNode addChild(AstNode node) {          if (node != null) {              children.add(node);              ((AstNodeImpl) node).parent = this;          }            return node;      }        @Override      public AstNode getParent() {          return parent;      }        @Override      public void reverseChildren() {          if (isChildrenReverse) {              return;          }            Collections.reverse(children);          isChildrenReverse = true;      }        @Override      public boolean isChildrenReverse() {          return isChildrenReverse;      }        @Override      public ArrayList<AstNode> getChildren() {          reverseChildren();            return children;      }        @Override      public void setAttribute(NodeKey key, Object value) {          if (key == NodeKey.TEXT) {              name = (String) value;          }          put(key, value);      }        @Override      public Object getAttribute(NodeKey key) {          return get(key);      }        @Override      public String toString() {          String info = "";          if (get(NodeKey.VALUE) != null) {              info += "Node Value is " + get(NodeKey.VALUE).toString();          }            if (get(NodeKey.TEXT) != null) {              info += "nNode Text is " + get(NodeKey.TEXT).toString();          }            if (get(NodeKey.SYMBOL) != null) {              info += "nNode Symbol is " + get(NodeKey.SYMBOL).toString();          }            return info + "n Node Type is " + type.toString();      }        @Override      public AstNode copy() {          AstNodeImpl copy = (AstNodeImpl) NodeFactory.createICodeNode(type);          Set<Entry<NodeKey, Object>> attributes = entrySet();          Iterator<Map.Entry<NodeKey, Object>> it = attributes.iterator();            while (it.hasNext()) {              Map.Entry<NodeKey, Object> attribute = it.next();              copy.put(attribute.getKey(), attribute.getValue());          }            return copy;      }  }

NodeFactory

NodeFactory就是簡單的返回一個節點的實現

public class NodeFactory {      public static AstNode createICodeNode(Token type) {          return new AstNodeImpl(type);      }  }

構造AST

AST的創建也是需要在語法分析過程中根據reduce操作進行操作的。也就是在takeActionForReduce方法中調用AstBuilder的buildSyntaxTree方法

在AstBuilder裏面還是需要兩個堆棧來輔助操作

private Stack<AstNode> nodeStack = new Stack<>();    private LRStateTableParser parser = null;  private TypeSystem typeSystem = null;  private Stack<Object> valueStack = null;  private String functionName;  private HashMap<String, AstNode> funcMap = new HashMap<>();    private static AstBuilder instance;

構造AST的主要邏輯在buildSyntaxTree方法里,需要注意的是有一些節點在解釋執行和代碼生成的時候是不一樣的,有時代碼生成需要的節點解釋執行的話並不需要

在這裡提一下UNARY這個非終結符,這個非終結符和NAME很像,但是它一般是代表進行運算和一些操作的時候,比如數組,++,–或者函數調用的時候

其實構建AST的過程和符號表的構建過程有點兒類似,都是根據reduce操作來創建信息和組合信息,符號表是組合修飾符說明符等,而AST則是組合節點間的關係變成一棵樹

我們只看幾個操作

  • Specifiers_DeclList_Semi_TO_Def

    這個節點需要注意的是,從堆棧的什麼地方拿到Symbol,這個需要從reduce次數和推導式中得出

* DEF -> SPECIFIERS  DECL_LIST SEMI  * DECL -> VAR_DECL  * VAR_DECL -> NEW_NAME  *             | VAR_DECL LP RP  *             | VAR_DECL LP VAR_LIST RP  *             | LP VAR_DECL RP  *             | START VAR_DECL 

從推導式可以看出,DEF節點的符號應該在valueStack.size() – 3,但是DECL和VAR_DECL沒有做reduce操作,所以符號應該在valueStack.size() – 2。這其實和前面的符號表構建算出之前符號的位置是一樣的。

  • TO_UNARY

這裡則是變量、數字或者字符串的節點,如果是個變量的號,這個節點就需要一個Symbol的value了

case SyntaxProductionInit.Number_TO_Unary:  case SyntaxProductionInit.Name_TO_Unary:  case SyntaxProductionInit.String_TO_Unary:      node = NodeFactory.createICodeNode(Token.UNARY);      if (production == SyntaxProductionInit.Name_TO_Unary) {          assignSymbolToNode(node, text);      }        node.setAttribute(NodeKey.TEXT, text);      break;

其餘的節點無非是把一些語句拆分它的邏輯然後組成節點,真正的求值部分像Name_TO_Unary比較少,更多是比如把一個if else塊分成if節點、判斷節點、else節點,之後再按照這棵樹進行解釋執行或者代碼生成

public AstNode buildSyntaxTree(int production, String text) {      AstNode node = null;      Symbol symbol = null;      AstNode child = null;        if (Start.STARTTYPE == Start.INTERPRETER) {          int p1 = SyntaxProductionInit.Specifiers_DeclList_Semi_TO_Def;          int p2 = SyntaxProductionInit.Def_To_DefList;          int p3 = SyntaxProductionInit.DefList_Def_TO_DefList;            boolean isReturn = production == p1 || production == p2 || production == p3;          if (isReturn) {              return null;          }      }        switch (production) {          case SyntaxProductionInit.Specifiers_DeclList_Semi_TO_Def:              node = NodeFactory.createICodeNode(Token.DEF);              symbol = (Symbol) valueStack.get(valueStack.size() - 2);              node.setAttribute(NodeKey.SYMBOL, symbol);              break;          case SyntaxProductionInit.Def_To_DefList:              node = NodeFactory.createICodeNode(Token.DEF_LIST);              node.addChild(nodeStack.pop());              break;          case SyntaxProductionInit.DefList_Def_TO_DefList:              node = NodeFactory.createICodeNode(Token.DEF_LIST);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Number_TO_Unary:          case SyntaxProductionInit.Name_TO_Unary:          case SyntaxProductionInit.String_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              if (production == SyntaxProductionInit.Name_TO_Unary) {                  assignSymbolToNode(node, text);              }                node.setAttribute(NodeKey.TEXT, text);              break;            case SyntaxProductionInit.Unary_LP_RP_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Unary_LP_ARGS_RP_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Unary_Incop_TO_Unary:          case SyntaxProductionInit.Unary_DecOp_TO_Unary:          case SyntaxProductionInit.LP_Expr_RP_TO_Unary:          case SyntaxProductionInit.Start_Unary_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Unary_LB_Expr_RB_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());                break;            case SyntaxProductionInit.Uanry_TO_Binary:              node = NodeFactory.createICodeNode(Token.BINARY);              child = nodeStack.pop();              node.setAttribute(NodeKey.TEXT, child.getAttribute(NodeKey.TEXT));              node.addChild(child);              break;            case SyntaxProductionInit.Binary_TO_NoCommaExpr:          case SyntaxProductionInit.NoCommaExpr_Equal_NoCommaExpr_TO_NoCommaExpr:              node = NodeFactory.createICodeNode(Token.NO_COMMA_EXPR);              child = nodeStack.pop();              String t = (String) child.getAttribute(NodeKey.TEXT);              node.addChild(child);              if (production == SyntaxProductionInit.NoCommaExpr_Equal_NoCommaExpr_TO_NoCommaExpr) {                  child = nodeStack.pop();                  t = (String) child.getAttribute(NodeKey.TEXT);                  node.addChild(child);              }              break;            case SyntaxProductionInit.Binary_Plus_Binary_TO_Binary:          case SyntaxProductionInit.Binary_DivOp_Binary_TO_Binary:          case SyntaxProductionInit.Binary_Minus_Binary_TO_Binary:          case SyntaxProductionInit.Binary_Start_Binary_TO_Binary:              node = NodeFactory.createICodeNode(Token.BINARY);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Binary_RelOP_Binary_TO_Binray:              node = NodeFactory.createICodeNode(Token.BINARY);              node.addChild(nodeStack.pop());                AstNode operator = NodeFactory.createICodeNode(Token.RELOP);              operator.setAttribute(NodeKey.TEXT, parser.getRelOperatorText());              node.addChild(operator);                node.addChild(nodeStack.pop());                break;            case SyntaxProductionInit.NoCommaExpr_TO_Expr:              node = NodeFactory.createICodeNode(Token.EXPR);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Expr_Semi_TO_Statement:          case SyntaxProductionInit.CompountStmt_TO_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.LocalDefs_TO_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              if (Start.STARTTYPE == Start.CODEGEN) {                  node.addChild(nodeStack.pop());              }              break;            case SyntaxProductionInit.Statement_TO_StmtList:              node = NodeFactory.createICodeNode(Token.STMT_LIST);              if (nodeStack.size() > 0) {                  node.addChild(nodeStack.pop());              }              break;            case SyntaxProductionInit.FOR_OptExpr_Test_EndOptExpr_Statement_TO_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.StmtList_Statement_TO_StmtList:              node = NodeFactory.createICodeNode(Token.STMT_LIST);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Expr_TO_Test:              node = NodeFactory.createICodeNode(Token.TEST);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.If_Test_Statement_TO_IFStatement:              node = NodeFactory.createICodeNode(Token.IF_STATEMENT);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.IfElseStatemnt_Else_Statemenet_TO_IfElseStatement:              node = NodeFactory.createICodeNode(Token.IF_ELSE_STATEMENT);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.While_LP_Test_Rp_TO_Statement:          case SyntaxProductionInit.Do_Statement_While_Test_To_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Expr_Semi_TO_OptExpr:          case SyntaxProductionInit.Semi_TO_OptExpr:              node = NodeFactory.createICodeNode(Token.OPT_EXPR);              if (production == SyntaxProductionInit.Expr_Semi_TO_OptExpr) {                  node.addChild(nodeStack.pop());              }              break;            case SyntaxProductionInit.Expr_TO_EndOpt:              node = NodeFactory.createICodeNode(Token.END_OPT_EXPR);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.LocalDefs_StmtList_TO_CompoundStmt:              node = NodeFactory.createICodeNode(Token.COMPOUND_STMT);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.NewName_LP_RP_TO_FunctDecl:          case SyntaxProductionInit.NewName_LP_VarList_RP_TO_FunctDecl:              node = NodeFactory.createICodeNode(Token.FUNCT_DECL);              node.addChild(nodeStack.pop());              child = node.getChildren().get(0);              functionName = (String) child.getAttribute(NodeKey.TEXT);              symbol = assignSymbolToNode(node, functionName);                break;            case SyntaxProductionInit.NewName_TO_VarDecl:              nodeStack.pop();              break;            case SyntaxProductionInit.NAME_TO_NewName:              node = NodeFactory.createICodeNode(Token.NEW_NAME);              node.setAttribute(NodeKey.TEXT, text);              break;            case SyntaxProductionInit.OptSpecifiers_FunctDecl_CompoundStmt_TO_ExtDef:              node = NodeFactory.createICodeNode(Token.EXT_DEF);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              funcMap.put(functionName, node);              break;            case SyntaxProductionInit.NoCommaExpr_TO_Args:              node = NodeFactory.createICodeNode(Token.ARGS);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.NoCommaExpr_Comma_Args_TO_Args:              node = NodeFactory.createICodeNode(Token.ARGS);              node.addChild(nodeStack.pop());              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Return_Semi_TO_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              break;          case SyntaxProductionInit.Return_Expr_Semi_TO_Statement:              node = NodeFactory.createICodeNode(Token.STATEMENT);              node.addChild(nodeStack.pop());              break;            case SyntaxProductionInit.Unary_StructOP_Name_TO_Unary:              node = NodeFactory.createICodeNode(Token.UNARY);              node.addChild(nodeStack.pop());              node.setAttribute(NodeKey.TEXT, text);              break;            default:              break;      }        if (node != null) {          node.setAttribute(NodeKey.PRODUCTION, production);          nodeStack.push(node);      }        return node;  }

小結

其實構造AST和創建符號表上非常相似,都是依據reduce操作的信息來完成。在AST的構建中的主要任務就是對源代碼語句里的邏輯進行分塊,比如對於一個ifelse語句:

上面的圖是我依據這個意思話的,和上面構造出來的AST不完全一致

另外我的github博客:https://dejavudwh.cn/