Tokenization of Python code

In this article we shall see how to tokenize Python code using tokenize, a module from the standard library.

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers”, including colorizers for on-screen displays.
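To see that comments really do come back as tokens, here is a minimal sketch. The one-line source string is a made-up example, not part of sample.py, and it is tokenized from memory with io.StringIO instead of from a file.

```python
import io
import tokenize

# Hypothetical one-line snippet, used only for illustration.
source = "total = 1 + 2  # add two numbers\n"

# generate_tokens() expects a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the tokens whose type is COMMENT.
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
print(comments)  # ['# add two numbers']
```

A pretty-printer or colorizer could use the same check to give comments their own style.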

What is tokenization?

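In general, tokenization means splitting a larger body of text into smaller units. For source code, those units are the lexical tokens of the language: names and keywords, numbers, strings, operators, comments, and layout markers such as newlines and indentation.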
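For source code, the pieces are lexical tokens. The sketch below splits a single made-up assignment statement (the names price and rate are hypothetical) and prints the name of each token type next to its text.

```python
import io
import tokenize

# Hypothetical statement, used only for illustration.
source = "price = rate * 2\n"

# Pair each token's type name with the text it covers.
pairs = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for name, text in pairs:
    print(name, repr(text))
```

The first five pairs are NAME 'price', OP '=', NAME 'rate', OP '*' and NUMBER '2'. They are followed by a NEWLINE token and a final ENDMARKER token.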

Get started

Let us learn how to tokenize Python programs with the following example. Consider the following Python file “sample.py”.

def addition(number1, number2):
    return number1 + number2


print(addition(5, 4))

Let us tokenize this Python file using the tokenize module.

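import tokenize

# tokenize.open() detects the file's encoding and opens it in text mode.
with tokenize.open('sample.py') as f:
    # generate_tokens() takes a readline callable that returns str lines.
    tokens = tokenize.generate_tokens(f.readline)
    for token in tokens:
        print(token)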

The output of this code is:

TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def addition(number1, number2):\n')
TokenInfo(type=1 (NAME), string='addition', start=(1, 4), end=(1, 12), line='def addition(number1, number2):\n')
TokenInfo(type=53 (OP), string='(', start=(1, 12), end=(1, 13), line='def addition(number1, number2):\n')
TokenInfo(type=1 (NAME), string='number1', start=(1, 13), end=(1, 20), line='def addition(number1, number2):\n')
TokenInfo(type=53 (OP), string=',', start=(1, 20), end=(1, 21), line='def addition(number1, number2):\n')
TokenInfo(type=1 (NAME), string='number2', start=(1, 22), end=(1, 29), line='def addition(number1, number2):\n')
TokenInfo(type=53 (OP), string=')', start=(1, 29), end=(1, 30), line='def addition(number1, number2):\n')
TokenInfo(type=53 (OP), string=':', start=(1, 30), end=(1, 31), line='def addition(number1, number2):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 31), end=(1, 32), line='def addition(number1, number2):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    return number1 + number2\n')
TokenInfo(type=1 (NAME), string='return', start=(2, 4), end=(2, 10), line='    return number1 + number2\n')
TokenInfo(type=1 (NAME), string='number1', start=(2, 11), end=(2, 18), line='    return number1 + number2\n')
TokenInfo(type=53 (OP), string='+', start=(2, 19), end=(2, 20), line='    return number1 + number2\n')
TokenInfo(type=1 (NAME), string='number2', start=(2, 21), end=(2, 28), line='    return number1 + number2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 28), end=(2, 29), line='    return number1 + number2\n')
TokenInfo(type=56 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n')
TokenInfo(type=56 (NL), string='\n', start=(4, 0), end=(4, 1), line='\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='print(addition(5, 4))\n')
TokenInfo(type=1 (NAME), string='print', start=(5, 0), end=(5, 5), line='print(addition(5, 4))\n')
TokenInfo(type=53 (OP), string='(', start=(5, 5), end=(5, 6), line='print(addition(5, 4))\n')
TokenInfo(type=1 (NAME), string='addition', start=(5, 6), end=(5, 14), line='print(addition(5, 4))\n')
TokenInfo(type=53 (OP), string='(', start=(5, 14), end=(5, 15), line='print(addition(5, 4))\n')
TokenInfo(type=2 (NUMBER), string='5', start=(5, 15), end=(5, 16), line='print(addition(5, 4))\n')
TokenInfo(type=53 (OP), string=',', start=(5, 16), end=(5, 17), line='print(addition(5, 4))\n')
TokenInfo(type=2 (NUMBER), string='4', start=(5, 18), end=(5, 19), line='print(addition(5, 4))\n')
TokenInfo(type=53 (OP), string=')', start=(5, 19), end=(5, 20), line='print(addition(5, 4))\n')
TokenInfo(type=53 (OP), string=')', start=(5, 20), end=(5, 21), line='print(addition(5, 4))\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 21), end=(5, 22), line='print(addition(5, 4))\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')


As we can see, the tokenizer splits the Python file into tokens: names, operators, numbers, and layout markers such as NEWLINE, INDENT and DEDENT. Each token carries the following information.

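  1. type (the numeric token type, printed with its name in parentheses, e.g. 1 (NAME); the numbers can differ between Python versions)
  2. string (the text of the token)
  3. start position (row and column where the token begins)
  4. end position (row and column where the token ends)
  5. line (the source line in which the token appears)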
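Each token is a named tuple, so these values can be read by attribute name. A small sketch follows. It tokenizes the last line of sample.py from an in-memory string, which is an assumption made here to keep the example self-contained.

```python
import io
import tokenize

# The last line of sample.py, held in a string for illustration.
source = "print(addition(5, 4))\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
first, second = tokens[0], tokens[1]

# Look up the type name instead of relying on the raw number.
print(tokenize.tok_name[first.type])   # NAME
print(first.string)                    # print
print(first.start, first.end)          # (1, 0) (1, 5)
print(repr(first.line))                # 'print(addition(5, 4))\n'

# Operators share the generic type OP; exact_type tells them apart.
print(tokenize.tok_name[second.type])        # OP
print(tokenize.tok_name[second.exact_type])  # LPAR
```

Rows in start and end are counted from 1 and columns from 0, which matches the positions in the output above.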

Conclusion

I hope this article was helpful. If you have any questions, leave them in the comment box below.

Happy coding!