Generating a High-Speed Parser, Part 2: Lemon

This article explains how to generate a high-performance text parser using Lemon.

The preceding article explained how re2c can be used to split text into a series of tokens. This article explains how a Lemon-generated parser can use those tokens to analyze text structure. To demonstrate how Lemon is used, this article’s example involves generating a parser for an abridged version of the C programming language.

A parser generated by Lemon performs LALR(1) parsing, where LALR stands for Look-Ahead LR (left-to-right scan, rightmost derivation). This means it matches rules to token sequences while looking ahead at most one token. The resulting parser executes quickly, and the SQLite project relies on a Lemon-generated parser to analyze SQL commands.

1. Preliminary Concerns

This article adopts a hands-on approach in explaining how to generate parsing code with Lemon. To understand this article, you don’t need to be familiar with parsing theory. But you should have a firm grasp of C programming.

The example code presented in this article has been compiled with gcc. However, it should be usable with any C/C++ build system.

One more thing. Many computer scientists take the topic of parsing deathly seriously. I’m not one of those people, so don’t be upset or surprised if the content of this article lacks the academic rigor you’d expect in a professional publication.

2. Overview of Lemon Usage

Like re2c, Lemon is freely available. Its main site provides source code (lemon.c and lempar.c) and a link to the tool’s documentation. To use Lemon, lemon.c needs to be compiled into an executable with a command such as the following:

gcc -o lemon lemon.c

The lemon executable accepts the name of a grammar file, whose suffix is usually *.y or *.yy. As the executable runs, it examines the content of the grammar file and generates parser code. For example, the following command accepts a grammar file called simple_c_grammar.y:

lemon simple_c_grammar.y

This executable will only work properly if lempar.c is in the working directory. If it completes successfully, Lemon will produce three output files:

  • simple_c_grammar.c – The parser source code
  • simple_c_grammar.h – A header file that defines token values
  • simple_c_grammar.out – Text file that describes the parser’s operation

The *.c file contains functions that make it possible to execute the parser. Before I discuss those functions, I want to explain how to write a grammar file.

3. Writing a Grammar File

A Lemon grammar file consists of two types of statements:

  1. Rules – construct grammar patterns (symbols) from other symbols and optionally execute C code
  2. Directives – configure Lemon's operation; each directive starts with %

These statements can appear in any order. This section discusses both types of statements, but before I get into the details, I want to discuss the example programming language.

3.1  The SimpleC Programming Language

To demonstrate how Lemon is used, this article explains how to generate a parser for an abridged version of C called SimpleC. SimpleC has the same syntax as the C language we know and love, but excludes many features, including custom types, if-then-else statements, and preprocessor directives.

The following statements give a basic idea of how SimpleC programs are structured:

  1. A program consists of declarations/definitions of functions and/or variables.
  2. A variable declaration consists of a data type, one or more identifiers separated by commas, optional initializations, and a semi-colon.
  3. A function definition consists of an optional type, an identifier, parentheses surrounding an optional parameter list, and a block of code within curly braces.

These rules don’t cover all of SimpleC, but the structure of every program can be boiled down to similar rules. As a result, the structure of every SimpleC program can be expressed using a tree structure called an abstract syntax tree (AST).

For example, consider the following SimpleC program:

int x = 0;

main() {
  while(x < 5) {
    x = x+1;
  }
}

Figure 1 gives an idea what an AST for this program might look like. As shown, ASTs don’t include punctuation such as parentheses, braces, and semicolons.

Figure 1: An Example Abstract Syntax Tree

Image 1

In this tree, the nodes without children hold tokens taken directly from the program's text. These dark-gray nodes are called terminal symbols because they appear at the end of each branch.

The light-gray nodes are nonterminal symbols, and they’re formed by combining their children together. The root node of the tree represents the entire program. In this simple example, the program consists of one variable declaration and one function definition, so the topmost node has two children.

3.2  Rules

Earlier, I presented statements that describe the high level structure of a SimpleC program. The purpose of a grammar file is to express those statements in a form that Lemon can understand. These statements are called rules, and each rule forms a nonterminal symbol from one or more symbols.

The general form of a rule in a Lemon grammar file is given as follows:

nonterminal_symbol ::= expression. { code }

Here, expression is a combination of zero or more symbols. { code } is an optional block of code to execute when the parser creates the nonterminal symbol.

By default, Lemon expects that the first rule’s nonterminal symbol corresponds to the root of the syntax tree. For example, the following rule prints text to the console when the program symbol is created from the declarations symbol:

program ::= declarations. { printf("Constructed the root symbol!\n"); }

Lemon supports recursion, so a nonterminal symbol can be included on both sides of the rule. Lemon’s documentation recommends left-recursion, and the following rules state that a declarations symbol may contain one or more declaration symbols:

declarations ::= declarations declaration.
declarations ::= declaration.

A grammar file may contain multiple rules for the same nonterminal symbol. When this happens, Lemon checks all of the rules and forms the symbol if any rule applies. Put another way, Lemon doesn't support ORing expressions with '|', so it's necessary to write a new rule for each alternative.
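For example, a type_specifier symbol that can be formed from any one of several type keywords needs one rule per alternative (these symbol and token names are illustrative; they aren't the full SimpleC grammar):

type_specifier ::= INT.
type_specifier ::= CHAR.
type_specifier ::= FLOAT.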

So far, every nonterminal symbol has been written in lowercase. This isn’t just a convention. Lemon distinguishes nonterminal symbols from terminal symbols by checking the case of the first letter: lowercase implies a nonterminal symbol and uppercase implies a terminal symbol. In general, terminal symbols are written in all caps.

For example, the following rule states that a nonterminal symbol representing an enum statement can be formed from an ENUM terminal symbol, a NAME terminal symbol, an LBRACE terminal symbol, an enumerator_list nonterminal symbol, and an RBRACE terminal symbol.

enum_specifier ::= ENUM NAME LBRACE enumerator_list RBRACE.

The ENUM, NAME, LBRACE, and RBRACE terminal symbols correspond to the ENUM, NAME, LBRACE, and RBRACE tokens provided by the lexer. Later in the article, I’ll explain how to connect the output of an re2c-generated lexer to the input of a Lemon-generated parser.

3.3  Directives

As mentioned earlier, Lemon assumes that the nonterminal symbol created by the first rule in the grammar file is the topmost node in the syntax tree. But if the program symbol should be the topmost node, the following statement should be added:

%start_symbol program

%start_symbol is a directive that configures how the Lemon-generated parser operates. Table 1 lists this and other directives:

Table 1: Special Directives for Lemon Grammar Files

  Directive           Description
  %start_symbol       The symbol that serves as the top of the syntax tree
  %token_prefix       Adds a prefix to the token names in the generated header file
  %name               Changes the names of the generated interface functions
  %token_type         Sets the data type of each token
  %token_destructor   Destructor code for terminal symbols
  %type               Sets the data type of a nonterminal symbol
  %destructor         Destructor code for nonterminal symbols
  %include            Code to be inserted at the top of the generated parser
  %left               Assigns left-associative precedence values
  %right              Assigns right-associative precedence values
  %nonassoc           Assigns non-associative precedence values
  %parse_accept       Code to execute after a successful parse
  %parse_failure      Code to execute after an unsuccessful parse
  %syntax_error       Code to execute after a syntax error
  %stack_size         Sets the size of the parser's stack
  %stack_overflow     Code to execute if the stack exceeds its size
  %extra_argument     Allows Parse to be called with a fourth parameter

When Lemon generates a parser, it also generates a header file that associates each terminal symbol with a number (similar to tokens.h in the re2c example). By default, Lemon will use the names given in the rules (LPAREN will be LPAREN in the header file, INT will be INT, and so on). But if %token_prefix is associated with text, that text will be prepended to each macro in the header file.

For example, the following directive tells Lemon to prepend TK_ to each token name in the generated header file.

%token_prefix TK_
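With this directive in place, the generated header contains definitions such as the following (a hypothetical excerpt; the actual names and numbers depend on the grammar):

#define TK_LPAREN   1
#define TK_RPAREN   2
#define TK_INT      3
#define TK_NAME     4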

The %name directive is similar, but instead of changing token names, it changes the names of the parser’s interface functions. As I’ll explain shortly, the functions that access the parser all start with Parse by default. But if %name is associated with a string, the functions will start with that string instead.

The %token_type directive is quite different from %token_prefix. As will be explained, when the parser receives a token from the lexer, it can also receive a data structure containing the token's data. For example, if the parser needs to receive identifier names as character strings, the token's data type can be set to char* with the following directive:

%token_type { char* }

If special code needs to be executed when the tokens’ data structures are destroyed, the code block can be associated with the %token_destructor directive. The %type and %destructor directives are similar to %token_type and %token_destructor, but they relate to nonterminal symbols instead of terminal symbols.

If a code block is associated with %include, Lemon will insert the code at the top of the generated parser code. For this reason, this code block usually contains #include statements and declarations of custom data structures. This is shown by the following directive:

%include {
#include <assert.h>
}
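Taken together, a stripped-down grammar file might look like the following sketch. The declaration rule is deliberately simplified, and the symbol names are illustrative rather than the full SimpleC grammar:

%include {
#include <stdio.h>
}
%token_prefix TK_
%token_type { char* }
%start_symbol program

program ::= declarations. { printf("Constructed the root symbol!\n"); }
declarations ::= declarations declaration.
declarations ::= declaration.
declaration ::= INT NAME SEMI.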

4. Interfacing the Parser with re2c

If simple_c_grammar.y is a suitable grammar file, the following command generates parser code in a file called simple_c_grammar.c:

lemon simple_c_grammar.y

The code in simple_c_grammar.c contains four functions that enable applications to access the parser. Table 2 lists the functions and provides a description of each:

Table 2: Interface Functions of a Lemon Parser

  Function                              Description
  ParseAlloc(malloc_func)               Allocates the parser structure using the given
                                        allocation routine and returns a pointer to it
  Parse(void* parser, int token_id,     Sends a token to the parser
        CustomType tokenData)
  ParseFree(void* parser, free_func)    Deallocates the parser using the given routine
  ParseTrace(FILE* fp, char* trace)     Enables parser tracing on the given stream

These functions are straightforward to use and understand. ParseAlloc accepts a function that should be used to allocate memory (usually malloc) and returns a pointer to the parser. The parser’s data type isn’t important, so applications usually treat the returned pointer as a void pointer, as shown in the following code:

void* parser = ParseAlloc(malloc);

In the Parse function, the first argument is the parser and the second argument is the token identifier (LPAREN, WHILE, FOR, and so on), which is an integer. The third argument is a data structure corresponding to the token. Its data type is set with the %token_type directive presented earlier.

The re2c example code calls the scan function in a loop. Each iteration of the loop corresponds to a token from the input text. The following code calls Parse inside the loop to send each token to the parser as it’s received. No data is sent with the token, so the third argument is set to NULL.

while(token = scan(&scanner, buff_end)) {
  Parse(parser, token, NULL);
}

After the lexer sends the last token to the parser, it needs to call Parse one more time with the token ID set to zero and the token value set to NULL. This final call tells the parser that the input is complete, so it can perform its remaining reductions and accept the parse. The following code shows how this works:

Parse(parser, 0, NULL);

When the parser is no longer needed, it can be destroyed with the ParseFree function. This accepts the void pointer returned by ParseAlloc and the deallocation routine to use (usually free).
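Putting the four interface functions together, a minimal driver might look like the following sketch. The Scanner structure, the scan function, and buff_end are assumed to come from the re2c lexer described in the preceding article, so their exact names and fields may differ from your code:

#include <stdlib.h>
#include "simple_c_grammar.h"            // token IDs generated by Lemon

// Interface functions generated by Lemon in simple_c_grammar.c
void* ParseAlloc(void* (*mallocProc)(size_t));
void  Parse(void* parser, int tokenID, char* tokenData);
void  ParseFree(void* parser, void (*freeProc)(void*));

// Assumed to come from the re2c-generated lexer (see Part 1)
typedef struct Scanner Scanner;
int scan(Scanner* scanner, char* buff_end);

// Drive the parser with every token the lexer produces
void parse_all(Scanner* scanner, char* buff_end) {
  void* parser = ParseAlloc(malloc);           // create the parser

  int token;
  while((token = scan(scanner, buff_end)) != 0) {
    Parse(parser, token, NULL);                // send each token; no token data here
  }
  Parse(parser, 0, NULL);                      // token 0 signals the end of input

  ParseFree(parser, free);                     // destroy the parser
}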

4.1  Sending/Receiving Token Data

It’s common for the lexer to send data along with a token ID. For example, the lexer may pass the strings corresponding to identifier tokens (NAME). This means setting %token_type to a character-string type (such as char*) and setting the third parameter of the Parse function to the string. The following code shows one way this can be accomplished:

int token, name_length;
char* name_str;

if(token == NAME) {

  // Get the length of the identifier
  name_length = scanner.cur - scanner.top;

  // Allocate memory (one extra byte for the terminator) and copy
  // the identifier from the scanner
  name_str = (char*)malloc(name_length + 1);
  strncpy(name_str, scanner.top, name_length);
  name_str[name_length] = '\0';

  // Send the token and the string to the parser
  Parse(parser, token, name_str);

  // Free the allocated memory
  free(name_str);
}
else {
  Parse(parser, token, "");
}

The parser can access this value by following the corresponding terminal symbol with a variable name inside parentheses. In the following rule, the code prints the value of the str variable associated with the NAME symbol:

direct_declarator ::= NAME(str). { printf("The identifier is %s\n", str); }

Nonterminal symbols can also have associated values. The following code associates the direct_declarator symbol with the value from the NAME token:

direct_declarator(strB) ::= NAME(strA). { strB = strA; }
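If a nonterminal symbol carries a value of a specific data type, that type is declared with the %type directive. A brief sketch, assuming the identifier string is carried around as a plain char*:

%type direct_declarator { char* }

direct_declarator(strB) ::= NAME(strA). { strB = strA; }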

Similar code can be used to pass data between rules. Another method is to call the Parse function with an extra argument. This is the topic of the following discussion.

4.2  Extra Argument

Lemon doesn’t allow global variables to be set in a grammar file. This helps to ensure thread safety, but it makes it hard for different rules to access the same data. To enable global data access, Lemon allows the lexer to add an optional fourth argument to the Parse function.

This extra argument can have any type, and it’s enabled by setting %extra_argument to a block containing the argument’s type and a variable name. In the SimpleC example grammar, the fourth argument of Parse allows the application to count the number of declarations and function definitions. This argument is a pointer to a ParserCount structure, which is defined in the following way:

typedef struct ParserCount {
  int numFunctionDefinitions;
  int numVariableDeclarations;
} ParserCount;

The lexer creates a ParserCount, initializes its fields, and sends it to the parser with the following code:

Parse(parser, token, "", &pCount);

In the grammar file, the ParserCount structure is defined in the %include directive so that it will be inserted at the top of the generated code. The following directive enables the pCount variable to be accessed throughout the parser:

%extra_argument { ParserCount *pCount }

The following rules access pCount to increment the number of function definitions and variable declarations:

external_declaration ::= function_definition. { pCount->numFunctionDefinitions++; }
external_declaration ::= declaration. { pCount->numVariableDeclarations++; }

The extra argument can be set equal to the value of the top-level nonterminal symbol. This value is a pointer to the syntax tree corresponding to the input text.
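For example, the extra argument can be combined with the %parse_accept directive from Table 1 to report the totals once parsing finishes. A sketch, assuming pCount has been passed in as shown above:

%parse_accept {
  printf("Parsed %d function definition(s) and %d variable declaration(s)\n",
         pCount->numFunctionDefinitions, pCount->numVariableDeclarations);
}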

5. Error Handling

If the parser completes its analysis without error, Lemon will execute the code block associated with the %parse_accept directive. If an error occurs, Lemon executes the code associated with %syntax_error. Afterward, it backtracks until it reaches a state where it can shift a nonterminal symbol called error. Then it attempts to continue parsing normally.

If the parser backtracks all the way to the beginning and can't shift the error symbol, it executes the code associated with the %parse_failure directive. Then it resets itself to its start state.

In my experience, the most useful tool for detecting errors is the ParseTrace function from Table 2. This prints messages to a stream that describe the parser’s current state. For example, the following code tells Lemon to print state messages to a file pointer called traceFile and precede each message with “parser >>“.

ParseTrace(traceFile, "parser >>");

As the parser runs, it prints messages such as the following:

parser >> Input 'LPAREN'
parser >> Reduce [direct_declarator ::= NAME], go to state 92.
parser >> Shift 'direct_declarator', go to state 135
parser >> Shift 'LPAREN', go to state 58
parser >> Return. Stack=[external_declaration_list declaration_specifiers direct_declarator LPAREN]

If the parser reaches a syntax error, it prints messages such as the following:

parser >> Shift 'logical_or_expression', go to state 141
parser >> Reduce [conditional_expression ::= logical_or_expression], go to state 41.
parser >> Shift 'conditional_expression'
parser >> Reduce [constant_expression ::= conditional_expression], go to state 41.
parser >> Shift 'constant_expression', go to state 179
parser >> Syntax Error!

By examining these messages, the application can get an idea of where the error occurred. It’s also a good idea to use %syntax_error with the extra argument of Parse to notify the lexer in the event of a syntax error.
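A sketch of what those directives might look like in the grammar file; the parseError field is a hypothetical addition to the ParserCount structure used as the extra argument:

%syntax_error {
  printf("Syntax error near token %d\n", yymajor);
  pCount->parseError = 1;   /* hypothetical flag the calling code can check afterward */
}

%parse_failure {
  printf("Parsing failed; the parser could not recover\n");
}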

6. Using the Code

The code for this article generates a parsing application for SimpleC. It’s provided in an archive called lemon_demo.zip, which contains four files:

  • simple_c.re – lexer definition file, re2c can use this to generate a lexer
  • simple_c_grammar.y – SimpleC grammar file, Lemon can use this to generate parsing code
  • test.dat – Contains C code that can be successfully parsed
  • Makefile – Contains instructions for generating the lexer and parser and building the application

If you have GNU’s make and gcc utilities, you can execute the make command. This reads the Makefile instructions and automatically builds the application.

If you want to build the application manually, you need to perform three steps:

  1. Use Lemon to generate parser code from the grammar (lemon simple_c_grammar.y)
  2. Use re2c to generate the lexer code (re2c -o simple_c.c simple_c.re)
  3. Compile the lexer and parser code (gcc -o simple_c simple_c.c simple_c_grammar.c)

The application parses the code in test.dat. In addition to printing messages to standard output, it uses ParseTrace to print state messages to trace.out.

Lemon Source Code Analysis: How the FOLLOW Sets of Items Are Computed

FOLLOW sets are normally defined for nonterminal symbols. In Lemon, however, each item (config) carries its own FOLLOW set, namely the FOLLOW set of the nonterminal on the left-hand side of the item's production.

With that definition in mind, the way these sets are computed follows naturally. Items fall into two classes, and their FOLLOW sets come from different sources:

A. Items whose dot is not at the leftmost position: the FOLLOW set comes from the item with the same production whose dot is one position to the left.

B. Items whose dot is at the leftmost position: the FOLLOW set is the FOLLOW set of the nonterminal on the left-hand side of the production.

 

The FOLLOW set of a nonterminal is computed with the usual rules:

1. For the grammar's start symbol S, put the end-of-input marker '$' into FOLLOW(S).

2. If A->αBβ is a production, add FIRST(β) \ {ε} to FOLLOW(B).

3. If A->αB is a production, or A->αBβ is a production where β derives ε, add FOLLOW(A) to FOLLOW(B).
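As a small worked example (not taken from the lemon source), consider SimpleC-style rules: program ::= declarations, declarations ::= declarations declaration, declarations ::= declaration, and declaration ::= INT NAME SEMI. Rule 1 puts '$' into FOLLOW(program). Rule 2 applied to "declarations ::= declarations declaration" adds FIRST(declaration) = {INT} to FOLLOW(declarations). Rule 3 applied to "program ::= declarations" (declarations is rightmost) adds FOLLOW(program) = {$} to FOLLOW(declarations), giving FOLLOW(declarations) = {INT, $}. Finally, declaration is rightmost in both declarations rules, so FOLLOW(declaration) receives FOLLOW(declarations) and is also {INT, $}.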

 

The computation of the item FOLLOW sets is easiest to follow alongside the process that generates the states and their items.

All items in the start state are class B items, so the FOLLOW set of the start symbol is computed first (rule 1 above). The corresponding code is in FindStates:

 

for(rp=sp->rule; rp; rp=rp->nextlhs)
{
  struct config *newcfp;
  rp->lhsStart = 1;
  newcfp = Configlist_addbasis(rp,0);
  SetAdd(newcfp->fws,0);  /* symbol 0 is '$', the end-of-input marker */
}

 

The closure operation on the items of the current state produces the class B items; rules 2 and 3 are applied here. The corresponding code is in the Configlist_closure function:

 

for(newrp=sp->rule; newrp; newrp=newrp->nextlhs)
{
  newcfp = Configlist_add(newrp,0);         /* create a class B item */
  for(i=dot+1; i<rp->nrhs; i++)
  {
    xsp = rp->rhs[i];
    if( xsp->type==TERMINAL )
    {
      SetAdd(newcfp->fws,xsp->index);       /* rule 2 */
      break;
    }
    else if( xsp->type==MULTITERMINAL )
    {
    }
    else
    {
      SetUnion(newcfp->fws,xsp->firstset);  /* rule 2 */
      if( xsp->lambda==LEMON_FALSE ) break;
    }
  }
  if( i==rp->nrhs ) Plink_add(&cfp->fplp,newcfp);  /* rule 3 */
}

 

When rule 3 applies, no FOLLOW symbols are actually copied yet; only a propagation link is recorded, and the accounts are settled later in one pass. This also shows that a config passes its own FOLLOW set on to the configs on its fplp list.

After the closure has been computed, propagation begins, and propagation produces the class A items. At this point the FOLLOW sets simply flow from the predecessor item down to its successor. The code lives in the buildshifts function:

 

for(bcfp=cfp; bcfp; bcfp=bcfp->next)
{
  if( bcfp->status==COMPLETE ) continue;    /* Already used */
  if( bcfp->dot>=bcfp->rp->nrhs ) continue; /* Can't shift this one */
  bsp = bcfp->rp->rhs[bcfp->dot];           /* Get symbol after dot */
  if( !same_symbol(bsp,sp) ) continue;      /* Must be same as for "cfp" */
  bcfp->status = COMPLETE;                  /* Mark this config as used */
  new = Configlist_addbasis(bcfp->rp,bcfp->dot+1);
  Plink_add(&new->bplp,bcfp);               /* how class A items get their FOLLOW sets */
}

 

Here, too, only a link is recorded; no FOLLOW sets are propagated yet. From Plink_add(&new->bplp,bcfp) we can see that bplp temporarily stores the items whose FOLLOW sets will later be propagated into this item.

Finally, the recorded links are turned into actual propagation.

First, the bplp links are converted into fplp links (FindLinks):

 

for(i=0; i<lemp->nstate; i++)
{
  stp = lemp->sorted[i];
  for(cfp=stp->cfp; cfp; cfp=cfp->next)     /* every item of every state */
  {
    for(plp=cfp->bplp; plp; plp=plp->next)
    {
      other = plp->cfp;
      Plink_add(&other->fplp,cfp);
    }
  }
}

 

Then each item's FOLLOW set is propagated to the items on its fplp list (FindFollowSets):

 

void FindFollowSets(struct lemon *lemp)
{
  do
  {
    progress = 0;
    for(i=0; i<lemp->nstate; i++)                        /* every state */
    {
      for(cfp=lemp->sorted[i]->cfp; cfp; cfp=cfp->next)  /* every item in the state */
      {
        if( cfp->status==COMPLETE ) continue;
        for(plp=cfp->fplp; plp; plp=plp->next)
        {
          change = SetUnion(plp->cfp->fws,cfp->fws);
          if( change )
          {
            plp->cfp->status = INCOMPLETE;
            progress = 1;
          }
        }
        cfp->status = COMPLETE;
      }
    }
  }
  while( progress );
}

Source: https://blog.csdn.net/tms_li/article/details/6305498

SQLite SQL Parsing: Lexical Analysis (6)

The SQLite Lexical Analysis Process

Entry Point of Lexical Analysis

As discussed previously, SQLite has a front end + virtual machine + back end architecture: when an SQL statement arrives, it is first compiled by the front end into virtual machine instructions, and the first task of that compilation is lexical analysis. SQLite's lexer is fairly simple, and its code was written by hand rather than generated by a tool such as Lex. To read the source, a few important functions must be mentioned:

int sqlite3_prepare(xxxx);
int sqlite3_prepare_v2(xxxx);
int sqlite3_prepare_v3(xxxx);

Each of these interfaces accepts a set of SQL statements as input and records the compiled result in an sqlite3_stmt structure.
After a few intermediate calls, control reaches the core compilation function, sqlite3RunParser.

Figure: the sqlite3_prepare call path into lexical analysis

Where Lexical Analysis Fits

sqlite3RunParser is a very important function: it implements the compilation of an SQL statement, that is, the entire lexical and grammar analysis process. As described earlier, the lexer scans the input string from left to right, splits the character stream into individual word symbols (tokens), and feeds them to the grammar analyzer. The execution of sqlite3RunParser follows exactly this pattern.

 

Figure: the parse loop

The implementation of sqlite3RunParser (simplified):

while( 1 ){
  ...
  /* lexical analysis: get the next token */
  n = sqlite3GetToken((u8*)zSql, &tokenType);
  ...
  /* feed the tokens one by one into the parser, driving the grammar analysis */
  sqlite3Parser(pEngine, tokenType, pParse->sLastToken, pParse);
  ...
}

SQLite Lexer Source Code

Inputs and Outputs of the Lexer Function

According to the comments in the source, the lexer function sqlite3GetToken takes a pointer into the SQL string as input. It is called repeatedly; each call returns the length of the token that begins at the current position and stores that token's type in the second parameter, tokenType.

 

/*
** Return the length (in bytes) of the token that begins at z[0].
** Store the token type in *tokenType before returning.
*/
int sqlite3GetToken(const unsigned char *z, int *tokenType)
{
  int i, c;
  switch( aiClass[*z] ){  /* Switch on the character-class of the first byte
                          ** of the token. See the comment on the CC_ defines
                          ** above. */
      /* whitespace */
    case CC_SPACE: {
     ...
      for(i=1; sqlite3Isspace(z[i]); i++){}
      *tokenType = TK_SPACE;
      return i;
    }
    /* left parenthesis */
    case CC_LP: {
      *tokenType = TK_LP;
      return 1;
    }
    /* right parenthesis */
    case CC_RP: {
      *tokenType = TK_RP;
      return 1;
    }
    ...
    /* letter or underscore */
    case CC_KYWD: {
      for(i=1; aiClass[z[i]]<=CC_KYWD; i++){}
      if( IdChar(z[i]) ){
        /* This token started out using characters that can appear in keywords,
        ** but z[i] is a character not allowed within keywords, so this must
        ** be an identifier instead */
        i++;
        break;
      }
      *tokenType = TK_ID;
      return keywordCode((char*)z, i, tokenType);
    }
...
  }
  while( IdChar(z[i]) ){ i++; }
  *tokenType = TK_ID;
  return i;
}

The lexer function above (shown simplified) is fairly straightforward. It analyzes the input character stream one character at a time; characters are divided into a number of broad classes, and a lookup in the aiClass table yields the current character's class very quickly and efficiently. The logic is clear. For example, case CC_SPACE detects whitespace, skips over all adjacent whitespace characters, and returns the type TK_SPACE (where TK_SPACE comes from is explained below). case CC_KYWD detects a letter or underscore and keeps reading until the next delimiting character; the recognized word is then run through the keywordCode function and the result is returned.

Background for Token Recognition

As the walkthrough above shows, the lexer keeps extracting keywords, special characters, operators, and so on from the character stream, converting each one into a number called tokenType before returning. What is the logic behind these numbers? What is the mapping between word symbols and numbers?
First, observe that the tokenType values returned by sqlite3GetToken are a set of macros whose names start with TK_, defined in the parse.h header file; noticing this is an important step. Moreover, the strings following TK_ are exactly the keywords of the SQL language. Does that suggest anything?

 

The tokenType values defined in parse.h:
#define TK_SEMI                             1
#define TK_EXPLAIN                          2
#define TK_QUERY                            3
#define TK_PLAN                             4
#define TK_BEGIN                            5
#define TK_TRANSACTION                      6
#define TK_DEFERRED                         7
...

A detailed explanation of why this is so belongs to the later chapter on grammar analysis. Briefly: the parser generator used by SQLite (lemon) assigns a unique numeric ID to every terminal and nonterminal symbol based on the grammar file, and it uses these numbers to express the rules during parsing. parse.h is one of its products and records the numeric IDs of the SQL terminal symbols; its other core product is the parser itself, parse.c. In other words, the parser expects to receive a stream of numeric IDs from the lexer rather than words, so besides splitting the input into words, the lexer's core task is to convert each word into an ID the parser recognizes.

Converting a Token to Its Type

For single special characters, the type conversion is simply hard-coded, which is easy. For words, the tokenType is obtained through the core function keywordCode, whose implementation is analyzed below:

 

/* Check to see if z[0..n-1] is a keyword. If it is, write the
** parser symbol code for that keyword into *pType.  Always
** return the integer n (the length of the token). */
static int keywordCode(const char *z, int n, int *pType){
  int i, j;
  const char *zKW;
  if( n>=2 ){
    i = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n) % 127;
    for(i=((int)aKWHash[i])-1; i>=0; i=((int)aKWNext[i])-1){
      if( aKWLen[i]!=n ) continue;
      j = 0;
      zKW = &zKWText[aKWOffset[i]];
#ifdef SQLITE_ASCII
      while( j<n && (z[j]&~0x20)==zKW[j] ){ j++; }
#endif
#ifdef SQLITE_EBCDIC
      while( j<n && toupper(z[j])==zKW[j] ){ j++; }
#endif
      if( j<n ) continue;
      ...
      *pType = aKWCode[i];
      break;
    }
  }
  return n;
}

What on earth does ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n) % 127 mean? At first glance it is completely incomprehensible. What is the principle behind it?
Calm down and look at the comment at the top of the source file:

/***** This file contains automatically generated code ******
**
** The code in this file has been automatically generated by
**
**    sqlite/tool/mkkeywordhash.c

Inside mkkeywordhash.c

Building and Debugging

This file can be compiled directly into a complete executable with the command:

gcc mkkeywordhash.c -o mkkeywordhash

Reading starts from the main function; its implementation is analyzed below.

1. Initial compression: remove unsupported keywords

 

/* Remove entries from the list of keywords that have mask==0 */
  for(i=j=0; i<nKeyword; i++){
    if( aKeywordTable[i].mask==0 ) continue;
    if( j<i ){
      aKeywordTable[j] = aKeywordTable[i];
    }
    j++;
  }
  nKeyword = j;

This code performs an initial compression of the keyword table by deleting entries whose mask is 0. The reason for this is that SQLite supports compile-time feature trimming: when certain SQL features are not needed, there is no point keeping their keywords in the lookup table, where they would enlarge the table and hurt memory usage and lookup efficiency. After this code runs, the array aKeywordTable of length nKeyword contains exactly the keywords supported by the current build.

2. Initialize len, hash, and id for the supported keywords

 

 /* Fill in the lengths of strings and hashes for all entries. */
  for(i=0; i<nKeyword; i++){
    Keyword *p = &aKeywordTable[i];
    p->len = (int)strlen(p->zName);
    assert( p->len<sizeof(p->zOrigName) );
    memcpy(p->zOrigName, p->zName, p->len+1);
    totalLen += p->len;
    p->hash = (charMap(p->zName[0])*4) ^
              (charMap(p->zName[p->len-1])*3) ^ (p->len*1);
    p->id = i+1;
  }

Each entry of aKeywordTable represents one SQL keyword, whose name is stored in the zName field. The author hashes each keyword with the following function (algorithm: (first character of the keyword * 4) ^ (last character of the keyword * 3) ^ (length of the keyword name)):

(charMap(p->zName[0])*4) ^ (charMap(p->zName[p->len-1])*3) ^ (p->len*1)

Looking at the keywords in aKeywordTable, hashing on just the first and last characters already produces very few collisions.

One assertion in the code is worth mentioning: assert( p->len<sizeof(p->zOrigName) ). What is the relationship between zOrigName and zName? It is not obvious yet: zName will end up holding the compressed keyword string, while zOrigName keeps the original keyword; the compression process is shown below.

3. Find keywords that are contained in other keywords, in preparation for further compression

 

/* Sort the table from shortest to longest keyword */
  qsort(aKeywordTable, nKeyword, sizeof(aKeywordTable[0]), keywordCompare1);

  /* Look for short keywords embedded in longer keywords */
  for(i=nKeyword-2; i>=0; i--){
    Keyword *p = &aKeywordTable[i];
    /* Starting from the longest keyword (array index nKeyword-1), check
    ** whether it completely contains keyword p; pOther must be longer
    ** than p, hence j>i. */
    for(j=nKeyword-1; j>i && p->substrId==0; j--){
      Keyword *pOther = &aKeywordTable[j];
      if( pOther->substrId ) continue;
      if( pOther->len<=p->len ) continue;
      for(k=0; k<=pOther->len-p->len; k++){
        /* pOther's name is longer than p's; to check whether p is a complete
        ** substring of pOther, compare p against pOther at every offset k
        ** from 0 to pOther->len - p->len. */
        if( memcmp(p->zName, &pOther->zName[k], p->len)==0 ){
          p->substrId = pOther->id;
          p->substrOffset = k;
          break;
        }
      }
    }
  }

First, the keyword array is sorted. The ordering is simple: by length, with equal lengths ordered by string comparison (the keywordCompare1 function needs no further explanation).
Then, starting from the longest string and working backwards, each shorter string is checked to see whether it is a substring of a longer one. For example, SET is a substring of OFFSET, so SET's substrId is set to OFFSET's id and SET's offset within OFFSET is recorded as substrOffset = 3. In the string compression step described below, SET and OFFSET can then be merged completely.

4. Find the longest suffix of each keyword name, in preparation for even further compression

 

  /* Compute the longestSuffix value for every word */
  for(i=0; i<nKeyword; i++){
    Keyword *p = &aKeywordTable[i];
    /* Already a substring of another keyword; it will be merged away later,
    ** so its longest suffix does not need to be computed. */
    if( p->substrId ) continue;
    for(j=0; j<nKeyword; j++){
      Keyword *pOther;
      if( j==i ) continue;
      pOther = &aKeywordTable[j];
      /* Keywords that are substrings of other keywords are skipped here too. */
      if( pOther->substrId ) continue;
      for(k=p->longestSuffix+1; k<p->len && k<pOther->len; k++){
        if( memcmp(&p->zName[p->len-k], pOther->zName, k)==0 ){
          p->longestSuffix = k;
        }
      }
    }
  }
/* Sort the table into reverse order by length */
  qsort(aKeywordTable, nKeyword, sizeof(aKeywordTable[0]), keywordCompare2);

The purpose of the longest suffix is this: among strings that overlap but do not completely contain one another, the length of the longest overlap is the longest suffix. For example, abcde overlaps with, but is not contained in, the three strings defgh, cdefg, and bcdefgei: the overlap with defgh has length 2 (de), with cdefg length 3 (cde), and with bcdefgei length 4 (bcde). Therefore the longestSuffix value of abcde is 4.
To compute this, each keyword is compared against every other keyword to find its longest suffix value, and the table is then sorted.

5. The core string compression algorithm

All keywords are packed tightly together, and each keyword's offset within the combined string is computed.

 

  /* At this point aKeywordTable is sorted by longest suffix */
  /* Fill in the offset for all entries */
  nChar = 0;
  for(i=0; i<nKeyword; i++){
    Keyword *p = &aKeywordTable[i];
    /* Skip entries that already have an offset or are substrings of others */
    if( p->offset>0 || p->substrId ) continue;
    p->offset = nChar;
    nChar += p->len;
    /* p is the current keyword with the longest suffix (length p->len). Look for
    ** a keyword whose prefix overlaps p's suffix without fully containing it and
    ** splice the two together. The comparison starts at k=p->len-1 and k decreases
    ** until a match is found; because aKeywordTable is sorted by longest suffix,
    ** this loop can normally always find one. */
    for(k=p->len-1; k>=1; k--){
      /* Scan the remaining, shorter keywords for the longest suffix match */
      for(j=i+1; j<nKeyword; j++){
        Keyword *pOther = &aKeywordTable[j];
        if( pOther->offset>0 || pOther->substrId ) continue;
        if( pOther->len<=k ) continue;
        /* Found a keyword whose prefix overlaps p's suffix; merge them */
        if( memcmp(&p->zName[p->len-k], pOther->zName, k)==0 ){
          /* pOther is merged into p; the next round looks for a keyword that
          ** overlaps pOther's suffix, so p now becomes pOther */
          p = pOther;
          /* pOther's offset starts k characters before the current end */
          p->offset = nChar - k;
          /* the total length of the merged string changes */
          nChar = p->offset + p->len;
          /* After the merge, zName keeps only the non-overlapping part; the
          ** overlapping part is absorbed into the preceding string. zName gets
          ** shorter by k, and the overlap length k is recorded in prefix. */
          p->zName += k;
          p->len -= k;
          p->prefix = k;
          j = i;
          k = p->len;
        }
      }
    }
  }
  /* For a complete substring, its offset is the offset of its parent string
  ** (the string that fully contains it) plus its offset within that parent */
  for(i=0; i<nKeyword; i++){
    Keyword *p = &aKeywordTable[i];
    if( p->substrId ){
      p->offset = findById(p->substrId)->offset + p->substrOffset;
    }
  }
  /* Sort the table by offset, smallest first */
  qsort(aKeywordTable, nKeyword, sizeof(aKeywordTable[0]), keywordCompare3);

This code is not easy to follow, so here is an example that illustrates the process:

  1. Suppose there are six strings: "abc", "bcd", "cde", "def", "efg", and "fgh", none of which completely contains another. Storing them separately takes abcbcdcdedefefgfgh, 18 bytes in total, which is wasteful, because the parts where they overlap could be compressed away.
  2. Merge the overlapping heads and tails of the six strings; the result is tmp = "abcdefgh". Now we only need to know that abc is the substring of tmp starting at offset 0 with length 3, bcd starts at offset 1 with length 3, cde starts at offset 2 with length 3, and so on.
  3. For a fully contained substring, say bc: to describe it in terms of tmp, we only need to know that bc is completely contained in abc or bcd. Suppose abc is chosen; bc is then a substring of abc at offset 1. bc's description in tmp becomes abc's offset in tmp (0) plus bc's offset within abc (1), i.e. offset 1 in tmp with length 2.

The core logic of the code above compresses the keywords according to exactly this principle.
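The following tiny C program (illustrative only, not SQLite code) shows how all six strings from the example can be recovered from the single merged string through (offset, length) pairs:

#include <stdio.h>

int main(void){
  const char *zKWText = "abcdefgh";            /* merged text: 8 bytes instead of 18 */
  struct { int offset; int len; } aKW[] = {    /* one (offset, len) pair per string   */
    {0,3}, {1,3}, {2,3}, {3,3}, {4,3}, {5,3}
  };
  for(int i=0; i<6; i++){
    printf("%.*s\n", aKW[i].len, &zKWText[aKW[i].offset]);   /* abc, bcd, cde, ...    */
  }
  return 0;
}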

After this compression, the zName of each keyword in aKeywordTable has become shorter, and aKeywordTable is now sorted by offset. A simple for loop that concatenates the zName of every keyword that is not a complete substring (substrId == 0) then yields the compressed combined string zKWText:

 

/* zKWText[] encodes 834 bytes of keyword text in 554 bytes */
/*   REINDEXEDESCAPEACHECKEYBEFOREIGNOREGEXPLAINSTEADDATABASELECT       */
/*   ABLEFTHENDEFERRABLELSEXCEPTRANSACTIONATURALTERAISEXCLUSIVE         */
/*   XISTSAVEPOINTERSECTRIGGEREFERENCESCONSTRAINTOFFSETEMPORARY         */
/*   UNIQUERYWITHOUTERELEASEATTACHAVINGROUPDATEBEGINNERECURSIVE         */
/*   BETWEENOTNULLIKECASCADELETECASECOLLATECREATECURRENT_DATEDETACH     */
/*   IMMEDIATEJOINSERTMATCHPLANALYZEPRAGMABORTVALUESVIRTUALIMITWHEN     */
/*   WHERENAMEAFTEREPLACEANDEFAULTAUTOINCREMENTCASTCOLUMNCOMMIT         */
/*   CONFLICTCROSSCURRENT_TIMESTAMPRIMARYDEFERREDISTINCTDROPFAIL        */
/*   FROMFULLGLOBYIFISNULLORDERESTRICTRIGHTROLLBACKROWUNIONUSING        */
/*   VACUUMVIEWINITIALLY                                                */
6. Compute the optimal hash table size and the hash values

 

  /* Figure out how big to make the hash table in order to minimize the
  ** number of collisions */
  bestSize = nKeyword;
  bestCount = nKeyword*nKeyword;
  for(i=nKeyword/2; i<=2*nKeyword; i++){
    for(j=0; j<i; j++) aKWHash[j] = 0;
    for(j=0; j<nKeyword; j++){
      h = aKeywordTable[j].hash % i;
      aKWHash[h] *= 2;
      aKWHash[h]++;
    }
    for(j=count=0; j<i; j++) count += aKWHash[j];
    if( count<bestCount ){
      bestCount = count;
      bestSize = i;
    }
  }

  /* Compute the hash */
  for(i=0; i<bestSize; i++) aKWHash[i] = 0;
  for(i=0; i<nKeyword; i++){
    h = aKeywordTable[i].hash % bestSize;
    aKeywordTable[i].iNext = aKWHash[h];
    aKWHash[h] = i+1;
  }

Once the tokenizer has extracted a keyword string, it needs to look up the string's tokenType; scanning the whole table linearly would not be efficient enough. To speed up the lookup, the author uses a hash index that gives constant-time complexity. Since the set of supported keywords can be trimmed by the user, how should the hash table size be chosen so that collisions are few and overall lookup efficiency is high?
The author limits the candidate table sizes to the range from half the number of keywords to twice the number of keywords and tries each size in that range as the modulus. Which size is better, i.e. produces fewer collisions? A "penalty factor" is used, here the value 2 (3, 4, or 5 would also do; in fact any positive integer other than 1 works). For every keyword, the value of its bucket is multiplied by this factor and then incremented, so a bucket that ends up holding m keywords contributes 2^m - 1 to the total: 1 with no collision, 3 for two keywords, 7 for three, 15 for four, growing exponentially. Summing all the bucket values therefore produces a total that grows sharply as collisions increase, so the size of the sum largely reflects how severe the collisions are. As the table size i increases, the hash values become more evenly spread and the sum tends toward the number of keywords; the size i that minimizes the sum is taken as the optimal hash modulus.
A hash value is then computed for every keyword remaining in aKeywordTable. If there are no collisions, each hash value maps to a unique aKWHash[h]; since the aKWHash[] array starts out all zero, every keyword's iNext value is 0. If there is a collision, aKWHash[h] already holds a nonzero value, and that value minus 1 is the index of the keyword that collides with the current one.

7. Automatic code generation

With the analysis above, the work done by the code generation step is clear: it produces the previously baffling function int keywordCode(const char *z, int n, int *pType) together with the arrays and hash table it uses. zKWText is the compressed keyword string; aKWHash is the keyword hash table; aKWLen[i] is the length of the i-th keyword; aKWOffset[i] is the offset of the i-th keyword within zKWText; and aKWCode[i] is the tokenType of the i-th keyword. Looking back at the keywordCode implementation, its intent is now quite clear.

 

static int keywordCode(const char *z, int n, int *pType){
  int i, j;
  const char *zKW;
  if( n>=2 ){
    /* Compute the hash of the keyword string produced by the lexer */
    i = ((charMap(z[0])*4) ^ (charMap(z[n-1])*3) ^ n) % 127;
    /* i = ((int)aKWHash[i])-1, so i is either the index of the matching keyword
    ** or the index of a keyword whose hash collides with it:
    **   1. with no collision, i is the keyword's own index;
    **   2. with a collision, it is the index of the colliding keyword.
    ** Whether a collision exists can be seen from aKWNext[i] (0 means none).
    ** Each candidate is then verified by comparing the length and the text
    ** (converted to upper case). */
    for(i=((int)aKWHash[i])-1; i>=0; i=((int)aKWNext[i])-1){
      if( aKWLen[i]!=n ) continue;
      j = 0;
      zKW = &zKWText[aKWOffset[i]];
#ifdef SQLITE_ASCII
      while( j<n && (z[j]&~0x20)==zKW[j] ){ j++; }
#endif
#ifdef SQLITE_EBCDIC
      while( j<n && toupper(z[j])==zKW[j] ){ j++; }
#endif
      if( j<n ) continue;
      ...
      /* Once the checks pass, the type is read directly through index i */
      *pType = aKWCode[i];
      break;
    }
  }
  return n;
}

This concludes the analysis of the core code of SQLite's lexical analysis. As can be seen, the author went to great lengths to optimize both execution speed and memory usage.

Source: https://www.jianshu.com/p/baee54840756

SQLite SQL Parsing: Grammar Analysis (7)

Background of SQLite Grammar Analysis

SQLite's parser is generated automatically by a program called lemon, a parser generator developed by the American computer scientist Richard Hipp. Compared with the well-known YACC and BISON, the Lemon parser generator is relatively obscure, but it has its own advantages: it is small, with only a little over four thousand lines of source code; the Lemon executable is only about 120 KB; and the lempar.c template file used alongside Lemon is only about 700 lines. All of it is very compact.
Lemon's core task is to read in meaningful units one after another and then work out the relationships between those units. So how does Lemon work?

First, in order to end up with the processing that people actually want, Lemon requires a grammar file that has been carefully worked out, considered, and reviewed. Such a grammar file is written as a context-free grammar, annotated with code describing what to do when the relationships between tokens match a particular sentential form.

Next, the Lemon program is invoked and told to read that grammar file, from which it automatically generates a concrete computer program: the parser.

Finally, actual program text is supplied. The lexer first splits it into words, obtaining each word's major token (the abstracted symbol type) and minor token (the concrete content of the unit). These are fed to the parser in order; the parser analyzes the input symbols and generates intermediate machine code that can then be executed.

Lemon Terminology

Token: a meaningful unit such as a variable name, constant, string, operator, punctuation mark, or delimiter in a language like C.
Token class: terminal symbols, also called major tokens. These are positive integers, each representing the type of one token in a statement; for example, 1234 is an int literal and is classified as the INT terminal.
Token value: the minor token; the token itself, or some data structure determined by it. For example, for the int literal 1234, the number 1234 itself is the value.
Item: a dot (or other marker) placed in the right-hand side of a production to show how much of the production has been seen during parsing. For example, A->XYZ yields four items: A->*XYZ, A->X*YZ, A->XY*Z, and A->XYZ*. The item A->X*YZ means X has been read and Y is expected next.
Reduce item: A->a*, where the dot has reached the far right of the production; the whole candidate has been recognized and can be reduced.
Accept item: for the start symbol S' of the augmented grammar, its unique reduce item S'->a* is called the accept item and indicates that the sentence has been completely parsed.
Shift item: A->a*bc, where b is a terminal; the terminal b is expected next, and when it is encountered it is shifted directly onto the parse stack.
Pending-reduce item: A->a*Bc, where B is a nonterminal; a has already been recognized, Bc still has to be recognized, and the next thing produced by a reduction will be the nonterminal B, which becomes part of A.
All the items linked together form an NFA: each item is a state, and recognizing a symbol moves the dot one position to the right, transferring control to another item.

Lemon's Core Workflow

Initializing Lemon

At the heart of lemon is the lemon structure, which holds all the state variables the parser generator needs. Several important substructures are initialized and attached to the lemon structure, including:

the s_x1node structure, which stores strings;
the symbol structure, which stores the grammar symbols (terminals and nonterminals) and their various attributes;
the rule structure, which stores the individual productions;
the config structure, which stores the items derived from the productions;
the action structure, which stores how the parser will act during parsing;
the state structure, which stores related items together with their corresponding action sets.

Lexical Scanning and Construction of the Grammar Elements

Note first that the lexical scanning here is performed on the .y grammar file itself, not the scanning a lexer performs on a program in some target language, although the two are similar in principle. The core task of this stage is to analyze the grammar file and, following the meaning of its elements in compiler theory, convert them into an internal representation. It mainly includes:

reading the grammar file into memory in one pass;
preprocessing and selecting code blocks such as %ifdef-%endif in the grammar file;
splitting the grammar file's character stream into individual lexical symbols;
settling these symbols into place to form the grammar file's internal representation;
reprinting (RePrint) the grammar file (when run with the -g option):

lemon parse.y -g    // prints the "pure" grammar obtained after scanning the grammar file

Computing the First Sets of Symbols

To eventually obtain an LALR(1) parser, we need to know, within a given sentential form (production), which terminals can follow a nonterminal. These lookahead symbols form a set of terminals called the Follow set. Before that set can be computed, however, the First set of every symbol has to be computed.
By the definition in compiler theory, the First set of a string is the set of terminals that can appear at the beginning of any string derivable from it. By repeatedly scanning the finite set of productions and applying the First-set computation rules until no First set grows any further, the exact First sets are obtained.

Figure: the First-set computation rules

The main work of this stage:

computing the precedence of each production;
finding the First set of each nonterminal.

Computing the LR(0) Automaton

Building the states of the LR(0) automaton that correspond to the grammar file is complex, detailed work. The process is roughly as follows:

  1. Determine the grammar file's start symbol. The relationships among the productions essentially form a syntax tree rooted at the start symbol, so identifying that symbol unambiguously is important, and the start symbol must not appear on the right-hand side of any production.
  2. Find the basis items of the first state. If the start symbol appears in only one production, the first state has a single basis item; if it appears in several productions, the first state has several basis items. All later computation of the LR(0) states starts from these basis items, and the code also attaches a Follow set to each basis item.
  3. Apply the closure operation to the basis item set of the first state to obtain the complete set of items, basis and derived; this complete item set is the full form of the first state.
  4. Perform a shift on some symbol over all the items of the first state; this simulates pushing that symbol onto the stack. The simulated push yields a series of new items, and these new items are the basis items of another state. Each new basis item's bplp field is linked back to the original item before the push. When the new items are shifted again, some produce new states and some lead to existing states; either way, the links are still recorded in the bplp field.
  5. Repeat step 4.

  Basis item sets: ./lemon parse.y -b
  Complete item sets: ./lemon parse.y -c

Computing the Follow Sets of Symbols

The item propagation links record the ordering relationship between each successor item and its predecessor items. Reversing them gives the relationship from predecessor to successor. Every item on a propagation chain carries a Follow set, so the elements of a predecessor item's Follow set can be obtained from the lookahead (First) sets of its successor items.

Reverse the propagation links so that each predecessor item's fplp field points at its successor items;
then compute the symbols' Follow sets.

Computing the LALR(1) Parser

During the construction of the LR(0) states, the shift actions have already been analyzed and computed. This part handles the analysis and computation of the reduce actions, the strategy for resolving conflicts, and how to compress the linked lists that represent the actions.

  1. Walk the item sets of all the states and pick out the items that can be reduced, i.e. those whose dot is at the far right of the production. Using the terminals in those items' Follow sets, bind each reduce item to its lookahead symbols and install them together in the reduce region of the action table.
  2. Find the parser's final state (the accept state).
  3. In the resulting action table, a single state may have more than one action for the same lookahead symbol, which creates conflicts. They are classified as shift-reduce conflicts and reduce-reduce conflicts; the program detects and resolves both kinds automatically and reports the ones it cannot handle.
  4. Once the action table is correct, the program compresses it to save storage and speed up execution.
  5. Print the action table to the .out file.

Generating the LALR(1) Parser

The preceding steps produced the grammar's shift (Goto) table and action (Action) table, which means the LALR(1) analysis described in compiler theory has been fully implemented. The next step is to use the Goto and Action tables to generate an actual parser. The lemon program cannot do this on its own: the author provides a template file named lempar.c, and only by combining it with lempar.c can lemon produce a parser. The rough process:

  1. Extract and assemble the header-include section from the template file and the grammar file.
  2. Define the parser's data types, convert the goto and action tables into arrays, compress them, and emit them into the parser.
  3. Generate and emit the shift, reduce, and accept handling functions based on the template file.
  4. Generate the error and accept handling functions.
  5. Generate the core Parse function of the parser.

Summary

SQLite's parser code, the parse.c file, is generated automatically by lemon from the grammar file parse.y and the parser template lempar.c. The parser's core function is sqlite3Parser: it repeatedly accepts the tokens produced by the lexer and, for each input, looks up whether accepting that word leads to a shift or a reduce. On a shift it updates its internal state and waits for the next input; on a reduce it runs the reduce action code, which is written after the corresponding production in parse.y. Each reduce action emits some machine instructions, which are recorded in the stmt structure. Once a complete SQL statement has been read in, all of its execution instructions have been generated, and calling the virtual machine's core function then carries out the actual work.

Source: https://www.jianshu.com/p/d1296f6dd8fd