[Python] PLY (Python Lex-Yacc) Notes - Lex

Programming Language/Python 2021. 4. 28. 10:08

For Yacc, see the companion Yacc notes post.

1. Lex

This part of PLY lives in lex.py.
It is used to split an input string into tokens.

1.1. The tokens list

Every token name the lexer can produce must be declared in a list (or tuple) named tokens.

tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

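1.2. Specification of tokens

Each rule is defined under a name made of the prefix t_ followed by the token name.
- Rules are written as regular expressions for the re module.
- Patterns are compiled with the re.VERBOSE flag, which allows multi-line patterns and comments for readability. As a consequence, unescaped whitespace in a pattern is ignored and # starts a comment, so write \s or [ ] for whitespace and \# or [#] for a literal #.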
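To see what re.VERBOSE changes, here is a small sketch that uses only the standard re module, not PLY itself. The pattern name FLOAT_PATTERN and the sample strings are made up for illustration.

```python
import re

# Hypothetical pattern, spread over several lines with comments.
FLOAT_PATTERN = r'''
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
'''

float_re = re.compile(FLOAT_PATTERN, re.VERBOSE)
matched = float_re.match("3.14 rest").group()

# In verbose mode an unescaped space inside the pattern is dropped,
# so r'a b' behaves like r'ab'. A character class keeps the space.
plain_space = re.compile(r'a b', re.VERBOSE)
class_space = re.compile(r'a[ ]b', re.VERBOSE)

print(matched)
print(plain_space.fullmatch("a b") is None)
print(class_space.fullmatch("a b") is not None)
```

This is also why the comment rules later in this post escape the hash character as \#.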

# ex1) A simple token can be defined in one line as a string.
# The token value is the matched text.
t_PLUS = r'\+'

# ex2) A rule can also be written as a function; its docstring is the regex.
# This example converts the matched text to an int.
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

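# ex3) To tell == and = apart, == must be tried first.
# String rules are sorted by decreasing regex length, so PLY handles this case automatically.
# Function rules are tried before string rules, in the order they are defined.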
t_DOUBLE_EQ = r'=='
t_EQ = r'='

# ex4) Reserved words can be handled with a lookup table inside the identifier rule.
# This avoids writing one rule per keyword, which keeps the number of regex rules small and the lexer fast.
reserved = {
    'if' : 'IF',
    'then' : 'THEN',
    'else' : 'ELSE',
    'while' : 'WHILE',
    ...
}
tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
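Why the order of the == and = rules matters can be shown with plain re, without PLY. The sketch below joins two rules into one alternation with named groups, which is roughly how a master regex behaves; the helper first_match and the rules dict are made up for illustration.

```python
import re

# Two string rules, as they might appear in a lexer module.
rules = {'EQ': r'=', 'DOUBLE_EQ': r'=='}

def first_match(ordered_names, text):
    """Join the rules into one alternation; the first alternative that matches wins."""
    master = re.compile('|'.join(f'(?P<{n}>{rules[n]})' for n in ordered_names))
    m = master.match(text)
    return m.lastgroup, m.group()

# Shorter pattern first: only the first '=' of '==' is consumed.
wrong = first_match(['EQ', 'DOUBLE_EQ'], '==')

# Sorted by decreasing pattern length, as PLY does for string rules.
by_length = sorted(rules, key=lambda n: len(rules[n]), reverse=True)
right = first_match(by_length, '==')

print(wrong)
print(right)
```

With function rules there is no such sorting, so the function for == has to be defined before the one for =.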

1.3. Discarded tokens

Tokens such as comments can be discarded either by writing a rule that returns nothing or by using the t_ignore_ prefix.

# Discard by not returning a value
def t_COMMENT(t):
    r'\#.*'
    pass

# Use the t_ignore_ prefix
t_ignore_COMMENT = r'\#.*'

1.4. Line numbers and positional information

By default lex.py knows nothing about line numbers.
To keep lexer.lineno up to date, write a t_newline() rule that counts newlines.

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
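Column numbers are not tracked either. A token's lexpos is an offset from the start of the whole input, so a column has to be derived from it. The helper below is a minimal sketch in plain Python; the name find_column and the sample text are only for illustration.

```python
def find_column(text, lexpos):
    """Return the 1-based column of the character at offset lexpos in text."""
    # Index just after the last newline before lexpos (0 if there is none).
    line_start = text.rfind('\n', 0, lexpos) + 1
    return lexpos - line_start + 1

source = "x = 1\ny = 22\n"
col = find_column(source, source.index("22"))
print(col)
```

In a real lexer the two arguments would be lexer.lexdata and tok.lexpos.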

1.5. Ignored characters

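The special t_ignore rule lists characters that should be skipped completely in the input.
It is normally used to skip whitespace and other unneeded characters.
The approach in 1.3 would also work, but t_ignore is faster because it is handled as a special case instead of going through the regex rules.
Characters listed in t_ignore are not skipped when they appear inside the pattern of another token rule.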

1.6. Literal characters

Use literals when a single character should be returned as is, as its own token.
Assign the characters to the literals variable.

# ex1
literals = [ '+','-','*','/' ]

# ex2
literals = "+-*/"

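Literals are checked after all of the defined regular expression rules.
- So if a rule starts with one of the literal characters, that rule always takes precedence.
For a literal token, both type and value are set to the character itself.
A literal can also be handled by a function rule, for example to run extra code when it is matched.
- In that case the function has to set the token type itself.

# Literal tokens produced by function rules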
literals = [ '{', '}' ]

def t_lbrace(t):
    r'\{'
    t.type = '{'      # Set token type to the expected literal
    return t

def t_rbrace(t):
    r'\}'
    t.type = '}'      # Set token type to the expected literal
    return t
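A minimal sketch of literals in a complete lexer, assuming PLY is installed and importable as ply.lex. The NUMBER token and the input string are made up for illustration.

```python
import ply.lex as lex

tokens = ('NUMBER',)
literals = '+-*/'      # each character becomes its own token
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('3 + 4*5')

# A lexer can be iterated; each item is a LexToken.
pairs = [(tok.type, tok.value) for tok in lexer]
print(pairs)
```

For the literal tokens, type and value are the same single character.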

1.7. Error handling

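The t_error() function handles errors that occur during lexing.
It is called when no rule matches; t.value then holds the rest of the input that has not been tokenized yet.

# Error handling rule
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)  # Skip the offending character and continue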

1.8. EOF handling

The t_eof() function handles the end-of-file (EOF) condition.
It receives a token of type 'eof' whose lineno and lexpos (the offset from the start of the input text, not the position within the line) are set.
Its main use is to give the lexer more input so that lexing can continue.

# EOF handling rule
def t_eof(t):
    # Get more input (example)
    more = input('... ')
    if more:
        t.lexer.input(more)
        return t.lexer.token()
    return None
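The same idea without an interactive prompt, as a sketch that assumes PLY is installed. The list pending_chunks is a hypothetical stand-in for whatever supplies more input.

```python
import ply.lex as lex

tokens = ('NUMBER',)
t_ignore = ' \t\n'

# Hypothetical source of further input, consumed one chunk per EOF.
pending_chunks = ['3 4', '5']

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_eof(t):
    # Called when the current input is exhausted.
    if pending_chunks:
        t.lexer.input(pending_chunks.pop(0))
        return t.lexer.token()
    return None  # no more input: lexing really ends here

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('1 2')
values = [tok.value for tok in lexer]
print(values)
```

The caller sees one continuous token stream even though the input arrived in three pieces.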

1.9. Building and using the lexer

import ply.lex as lex

# Build the lexer from the rules defined in the current module
lexer = lex.lex()

# Give the lexer an input string
lexer.input("string")

# Fetch the next token: returns a LexToken instance, or None at the end of the input
lexer.token()

1.10. Using the @TOKEN decorator

A rule's regex is normally written as a docstring literal, but sometimes you want to build it from the values of other variables.
The @TOKEN decorator makes that possible.

digit            = r'([0-9])'
nondigit         = r'([_A-Za-z])'
identifier       = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'        

from ply.lex import TOKEN

@TOKEN(identifier)
def t_ID(t):
  ...

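1.11. Optimized mode

For better performance you may want to run Python in optimized mode (the -O option).
However, when docstrings are stripped (which the -OO level does), lex.py cannot work normally, because it reads each function rule's regex from the docstring.
To handle this, build the lexer with PLY's own optimize option as shown below. In this mode lex() skips most error checking.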

lexer = lex.lex(optimize=1)
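The dependency on docstrings is easy to check in plain Python, without PLY. The sketch below only inspects the rule function; run it once normally and once with python -OO to see the difference.

```python
import sys

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# The regex of a function rule is nothing more than the function's docstring.
pattern = t_NUMBER.__doc__

# With python -OO (sys.flags.optimize == 2) the docstring is removed
# and pattern is None, so there is no regex left to read.
print(f"optimize level: {sys.flags.optimize}, docstring: {pattern!r}")
```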

1.12. Debugging

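lex() can be run in debug mode.
It prints debugging information such as all of the added rules, the master regular expressions used by the lexer, and the tokens generated during lexing.

# Build the lexer in debug mode
lexer = lex.lex(debug=1)

# Additionally, lex.runmain() gives a quick way to tokenize standard input
# (or a file named on the command line).
if __name__ == '__main__':
  lex.runmain()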

1.13. Using lexer rules defined in another module

To use lexer rules defined in another module, pass that module to lex() through the module parameter.

# module: tokrules.py
# This module contains only the token rules.

# The tokens variable must always be defined.
tokens = (
  'NUMBER',
  'PLUS',
  'MINUS',
  'TIMES',
  'DIVIDE',
  'LPAREN',
  'RPAREN',
)

# Regular expressions for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# Rules that need an extra action are written as functions
def t_NUMBER(t):
  r'\d+'
  t.value = int(t.value)
  return t

# Keep the line number up to date
def t_newline(t):
  r'\n+'
  t.lexer.lineno += len(t.value)

# Ignore spaces and tabs
t_ignore  = ' \t'

# On error, print a message and skip the offending character
def t_error(t):
  print("Illegal character '%s'" % t.value[0])
  t.lexer.skip(1)

# Using the tokrules.py module above from the Python interpreter
>>> import ply.lex as lex
>>> import tokrules
>>> lexer = lex.lex(module=tokrules)
>>> lexer.input("3 + 4")
>>> lexer.token()
LexToken(NUMBER,3,1,0)
>>> lexer.token()
LexToken(PLUS,'+',1,2)
>>> lexer.token()
LexToken(NUMBER,4,1,4)
>>> print(lexer.token())
None
>>>

Rules can be defined at file (module) level as in the example above, or inside a class.
- The following defines the same token rules as a class.

import ply.lex as lex

class MyLexer(object):
  # The tokens variable must always be defined.
  tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
  )

  # Regular expressions for simple tokens
  t_PLUS    = r'\+'
  t_MINUS   = r'-'
  t_TIMES   = r'\*'
  t_DIVIDE  = r'/'
  t_LPAREN  = r'\('
  t_RPAREN  = r'\)'

  # Rules that need an extra action are written as methods
  def t_NUMBER(self,t):
    r'\d+'
    t.value = int(t.value)
    return t

  # Keep the line number up to date
  def t_newline(self,t):
    r'\n+'
    t.lexer.lineno += len(t.value)

  # Ignore spaces and tabs
  t_ignore  = ' \t'

  # On error, print a message and skip the offending character
  def t_error(self,t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

  # Build the lexer
  def build(self,**kwargs):
    self.lexer = lex.lex(module=self, **kwargs)

  # Tokenize the given data and print each token
  def test(self,data):
    self.lexer.input(data)
    while True:
      tok = self.lexer.token()
      if not tok:
        break
      print(tok)

# Create the lexer and test it
m = MyLexer()
m.build()
m.test("3 + 4")

1.14. Maintaining state

There are several ways to keep state information in a lexer.

Using a global variable

num_count = 0
def t_NUMBER(t):
  r'\d+'
  global num_count
  num_count += 1
  t.value = int(t.value)    
  return t

Storing the state as an attribute of the lexer object created by lex()
- Useful when several lexer instances are created, since each one keeps its own state.

def t_NUMBER(t):
  r'\d+'
  t.lexer.num_count += 1     # Note use of lexer attribute
  t.value = int(t.value)    
  return t

lexer = lex.lex()
lexer.num_count = 0            # Set the initial count

Keeping the state in an instance of a class that defines the lexer

class MyLexer:
  ...
  def t_NUMBER(self,t):
    r'\d+'
    self.num_count += 1
    t.value = int(t.value)    
    return t

  def build(self, **kwargs):
    self.lexer = lex.lex(module=self, **kwargs)

  def __init__(self):
    self.num_count = 0

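1.15. Internal state variables

lexer.lexpos : the current position in the input text; inside a rule function it points to the first character after the matched text
lexer.lineno : the current line number; PLY never updates it on its own (see 1.4)
lexer.lexdata : the input text currently given to the lexer
lexer.lexmatch : the match object returned by re.match() for the current token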
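The difference between a token's own lexpos and lexer.lexpos can be observed from inside a rule. The sketch below assumes PLY is installed; the WORD token and the list named seen are made up for illustration.

```python
import ply.lex as lex

tokens = ('WORD',)
t_ignore = ' '

seen = []  # (value, start offset of the token, lexer position after the match)

def t_WORD(t):
    r'[a-z]+'
    seen.append((t.value, t.lexpos, t.lexer.lexpos))
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('ab cde')
for _ in lexer:
    pass

print(seen)
print(lexer.lexdata)
```

t.lexpos is where the match starts, while t.lexer.lexpos is where lexing will resume.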

1.16. Conditional lexing and start conditions

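Advanced parsing sometimes needs several distinct lexing states.
- For example, a certain token or syntactic construct can be used to trigger a different kind of lexing.
PLY lets the underlying lexer be put into a series of different states.
Each state can have its own tokens, lexing rules, and so on.
The implementation is largely based on the "start condition" feature of GNU flex (see the flex documentation on start conditions for details).
To use lexing states, first declare them in the states variable.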

# Declare two states, foo and bar, as exclusive and inclusive respectively.
# exclusive:
#   - Completely overrides the default behavior of the lexer.
#   - lex returns only the tokens defined for that state; the default rules are ignored.
# inclusive:
#   - Adds extra tokens and rules to the default rule set.
#   - lex returns both the default tokens and the ones added for the inclusive state.

states = (
  ('foo','exclusive'),
  ('bar','inclusive'),
)

Once a state is declared, tokens and rules for it are declared with the state name (foo or bar) included in the rule name.

# A rule can be attached to a single state.
t_foo_NUMBER = r'\d+'                      # Token 'NUMBER' in state 'foo'        
t_bar_ID     = r'[a-zA-Z_][a-zA-Z0-9_]*'   # Token 'ID' in state 'bar'
def t_foo_newline(t):
  r'\n'
  t.lexer.lineno += 1

# A rule can be attached to several states at once.
t_foo_bar_NUMBER = r'\d+'         # Defines token 'NUMBER' in both state 'foo' and 'bar'

# The special name ANY attaches a rule to all states.
t_ANY_NUMBER = r'\d+'         # Defines a token 'NUMBER' in all states

# A rule without a state name belongs to the INITIAL state, so these two declarations are equivalent.
t_NUMBER = r'\d+'
t_INITIAL_NUMBER = r'\d+'

# State names can also be used with t_ignore, t_error(), and t_eof().
t_foo_ignore = " \t\n"       # Ignored characters for state 'foo'
def t_bar_error(t):          # Special error handler for state 'bar'
  pass

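By default the lexer operates in the INITIAL state.
- The INITIAL state contains all of the rules defined without a state name.
- If no other states are used, the lexer always stays in INITIAL.
To change the lexing state during lexing or parsing, use the begin() method.

# Trigger that switches the lexer to the foo state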
def t_begin_foo(t):
  r'start_foo'
  t.lexer.begin('foo')             # Starts 'foo' state

# To leave the state, use begin() to switch back to INITIAL.
def t_foo_end(t):
  r'end_foo'
  t.lexer.begin('INITIAL')        # Back to the initial state

# Alternatively, manage states with a stack (this replaces the two rules above).
def t_begin_foo(t):
  r'start_foo'
  t.lexer.push_state('foo')             # Starts 'foo' state

def t_foo_end(t):
  r'end_foo'
  t.lexer.pop_state()                   # Back to the previous state
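Put together, a complete lexer using these two trigger rules might look like the sketch below. It assumes PLY is installed; the WORD and NUMBER tokens and the input string are made up for illustration.

```python
import ply.lex as lex

tokens = ('WORD', 'NUMBER')
states = (('foo', 'exclusive'),)

t_WORD = r'[a-z]+'         # INITIAL state only
t_foo_NUMBER = r'\d+'      # 'foo' state only
t_ignore = ' '
t_foo_ignore = ' '

# Function rules are tried before string rules, so 'start_foo'
# is not swallowed by t_WORD.
def t_begin_foo(t):
    r'start_foo'
    t.lexer.push_state('foo')     # enter 'foo', remember the old state

def t_foo_end(t):
    r'end_foo'
    t.lexer.pop_state()           # return to the remembered state

def t_error(t):
    t.lexer.skip(1)

def t_foo_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('abc start_foo 12 end_foo def')
kinds = [(tok.type, tok.value) for tok in lexer]
print(kinds)
```

The two trigger rules return nothing, so they change the state without producing tokens of their own.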

Managing states this way is handy when lexing nested constructs in source code.
- The example below lexes C code.
- It captures a whole code block, from the opening { to the matching }, as a single token.

from ply import lex

tokens = ["CCODE"]
# Declare the ccode state.
states = (
    ('ccode', 'exclusive'),
)
# Match the text up to a { on the current line and enter the ccode state there.
def t_ccode(t):
    r'.*\{'
    print(f"ccode lexpos: {t.lexer.lexpos}")
    t.lexer.code_start = t.lexer.lexpos - 1 # Record the starting position
    t.lexer.level = 1  # Initial brace level
    t.lexer.begin('ccode')  # Enter 'ccode' state

# Rules for the ccode state, used to lex the inside of the block
def t_ccode_lbrace(t):
    r'\{'
    t.lexer.level += 1

def t_ccode_rbrace(t):
    r'\}'
    t.lexer.level -= 1

    # When the block is closed, return everything from { to } as a single token.
    # Inside a rule, t.lexer.lexpos already points just past the matched }.
    if t.lexer.level == 0:
        t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos]
        t.type = "CCODE"
        t.lexer.lineno += t.value.count('\n')
        t.lexer.begin('INITIAL')
        return t

# C or C++ comment (ignore)
def t_ccode_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass

# C string
def t_ccode_string(t):
    r'\"([^\\\n]|(\\.))*?\"'

# C character literal
def t_ccode_char(t):
    r'\'([^\\\n]|(\\.))*?\''

# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
    r'[^\s\{\}\'\"]+'

# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"

# For bad characters, we just skip over it
def t_ccode_error(t):
    t.lexer.skip(1)

def t_error(t):
    print(f"error: {t.value}")
    t.lexer.skip(1)  # Advance past the bad character; otherwise PLY raises a LexError

lexer = lex.lex()
lexer.input(
   """int main() {
   printf("He(ll}o, World!");
   return 0;
}"""
)
token = lexer.token()
while token:
    print(token)
    token = lexer.token()

Reference

PLY documentation: https://www.dabeaz.com/ply/ply.html
