.. -*- mode: rst -*-
====================
Write your own lexer
====================
If a lexer for your favorite language is missing in the Pygments package, you can
easily write your own and extend Pygments.
All you need can be found inside the `pygments.lexer` module. As you can read in
the `API documentation <api.txt>`_, a lexer is a class that is initialized with
some keyword arguments (the lexer options) and that provides a
`get_tokens_unprocessed()` method which is given a string or unicode object with
the data to parse.
The `get_tokens_unprocessed()` method must return an iterator or iterable
containing tuples in the form ``(index, token, value)``. Normally you don't
need to implement this method yourself, since there are numerous base lexer
classes you can subclass.
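For illustration, here is a minimal sketch of a lexer that implements
`get_tokens_unprocessed()` directly by subclassing `Lexer`; the class name and
its tokenization scheme (runs of whitespace vs. everything else) are invented
for this example:

.. sourcecode:: python

    import re

    from pygments.lexer import Lexer
    from pygments.token import Text, Whitespace

    class SpanLexer(Lexer):
        name = 'Span'

        def get_tokens_unprocessed(self, text):
            # Yield (index, tokentype, value) tuples, where index is
            # the starting position of the token in the input string.
            for match in re.finditer(r'\s+|\S+', text):
                if match.group().isspace():
                    yield match.start(), Whitespace, match.group()
                else:
                    yield match.start(), Text, match.group()
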
RegexLexer
==========
A very powerful (but quite easy to use) lexer is the `RegexLexer`. This lexer
base class allows you to define lexing rules in terms of *regular expressions*
for different *states*.
States are groups of regular expressions that are matched against the input
string at the *current position*. If one of these expressions matches, a
corresponding action is performed (normally yielding a token with a specific
type), the current position is set to where the last match ended and the
matching process continues with the first regex of the current state.
Lexer states are kept in a state stack: each time a new state is entered, the
new state is pushed onto the stack. The most basic lexers (like the
`DiffLexer`) just need one state.
Each state is defined as a list of tuples in the form (`regex`, `action`,
`new_state`) where the last item is optional. In the most basic form, `action`
is a token type (like `Name.Builtin`). That means: when `regex` matches, a
token with the match text and type `action` is emitted, and `new_state` is
pushed onto the state stack. If the new state is ``'#pop'``, the topmost state
is popped from the stack instead. (To pop more than one state, use
``'#pop:2'`` and so on.)
``'#push'`` is a synonym for pushing the current state on the
stack.
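For example, a lexer for a hypothetical language with nested parentheses could
use all three forms; this is only a sketch, with the class name and rules
invented for the example:

.. sourcecode:: python

    from pygments.lexer import RegexLexer
    from pygments.token import Punctuation, Text

    class ParenLexer(RegexLexer):
        name = 'Paren Example'

        tokens = {
            'root': [
                (r'\(', Punctuation, 'paren'),  # '(': push the 'paren' state
                (r'[^(]+', Text),
            ],
            'paren': [
                (r'\(', Punctuation, '#push'),  # nested '(': push again
                (r'\)', Punctuation, '#pop'),   # ')': pop one state
                (r'[^()]+', Text),
            ],
        }
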
The following example shows the `DiffLexer` from the builtin lexers. Note that
it contains some additional attributes `name`, `aliases` and `filenames` which
aren't required for a lexer. They are used by the builtin lexer lookup
functions.
.. sourcecode:: python
    from pygments.lexer import RegexLexer
    from pygments.token import *

    class DiffLexer(RegexLexer):
        name = 'Diff'
        aliases = ['diff']
        filenames = ['*.diff']

        tokens = {
            'root': [
                (r' .*\n', Text),
                (r'\+.*\n', Generic.Inserted),
                (r'-.*\n', Generic.Deleted),
                (r'@.*\n', Generic.Subheading),
                (r'Index.*\n', Generic.Heading),
                (r'=.*\n', Generic.Heading),
                (r'.*\n', Text),
            ]
        }
As you can see, this lexer only uses one state. When the lexer starts scanning
the text, it first checks whether the current character is a space. If so, it
scans everything up to the end of the line and yields the matched text as a
`Text` token. If that rule doesn't match, it checks whether the current
character is a plus sign, and so on.
If no rule matches at the current position, the current char is emitted as an
`Error` token that indicates a parsing error, and the position is increased by
1.
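To see the resulting token stream, you can instantiate the lexer and feed it
some input; this quick check is only an illustration of the
``(index, token, value)`` output described above:

.. sourcecode:: python

    lexer = DiffLexer()
    text = '--- a.txt\n+++ b.txt\n-old line\n+new line\n'
    for index, tokentype, value in lexer.get_tokens_unprocessed(text):
        print((index, tokentype, value))
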
Regex Flags
===========
You can either define regex flags in the regex (``r'(?x)foo bar'``) or by adding
a `flags` attribute to your lexer class. If no attribute is defined, it defaults
to `re.MULTILINE`. For more information about regular expression flags see the
`regular expressions`_ help page in the Python documentation.
.. _regular expressions: http://docs.python.org/lib/re-syntax.html
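As a sketch, a lexer for a case-insensitive language could set the attribute
like this (the class and its rules are hypothetical):

.. sourcecode:: python

    import re

    from pygments.lexer import RegexLexer
    from pygments.token import Comment, Text

    class CaseInsensitiveLexer(RegexLexer):
        # Keep the default multiline behavior, but also ignore case.
        flags = re.MULTILINE | re.IGNORECASE

        tokens = {
            'root': [
                (r'rem\b.*\n', Comment),  # matches 'REM ...', 'Rem ...', ...
                (r'.*\n', Text),
            ]
        }
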
Scanning multiple tokens at once
================================
Here is a more complex lexer that highlights INI files. INI files consist of
sections, comments and key = value pairs:
.. sourcecode:: python
    from pygments.lexer import RegexLexer, bygroups
    from pygments.token import *

    class IniLexer(RegexLexer):
        name = 'INI'
        aliases = ['ini', 'cfg']
        filenames = ['*.ini', '*.cfg']

        tokens = {
            'root': [
                (r'\s+', Text),
                (r';.*?$', Comment),
                (r'\[.*?\]$', Keyword),
                (r'(.*?)(\s*)(=)(\s*)(.*?)$',
                 bygroups(Name.Attribute, Text, Operator, Text, String))
            ]
        }
The lexer first looks for whitespace, comments and section names. Later it
looks for a line that resembles a key/value pair, separated by an ``'='``
sign and optional whitespace.
The `bygroups` helper makes sure that each group is yielded with a different
token type: first a `Name.Attribute` token, then a `Text` token for the
optional whitespace, after that an `Operator` token for the equals sign, and
then a `Text` token for the whitespace again. The rest of the line is returned
as `String`.
Note that for this to work, every part of the match must be inside a capturing
group (a ``(...)``), and there must not be any nested capturing groups. If you
nevertheless need a group, use a non-capturing group defined using this syntax:
``r'(?:some|words|here)'`` (note the ``?:`` after the beginning parenthesis).
If you find yourself needing a capturing group inside the regex which
shouldn't be part of the output but is used in the regular expression for
backreferencing (eg: ``r'(<(foo|bar)>)(.*?)(</\2>)'``), you can pass `None`
to the `bygroups` function and that group will be skipped in the output.
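Here is a sketch of that case, wrapping the regex from above in a hypothetical
lexer (the `TagLexer` name and its token choices are invented for this
example):

.. sourcecode:: python

    from pygments.lexer import RegexLexer, bygroups
    from pygments.token import Name, Text

    class TagLexer(RegexLexer):
        name = 'Tag Example'

        tokens = {
            'root': [
                # Group 2, (foo|bar), exists only so that \2 can
                # backreference it; passing None skips it in the output.
                (r'(<(foo|bar)>)(.*?)(</\2>)',
                 bygroups(Name.Tag, None, Text, Name.Tag)),
                (r'[^<]+', Text),
                (r'<', Text),
            ]
        }
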
Changing states
===============
Many lexers need multiple states to work as expected. For example, some
languages allow multiline comments to be nested. Since this is a recursive
pattern it's impossible to lex just using regular expressions.
Here is the solution:
.. sourcecode:: python
    from pygments.lexer import RegexLexer
    from pygments.token import *

    class ExampleLexer(RegexLexer):
        name = 'Example Lexer with states'

        tokens = {
            'root': [
                (r'[^/]+', Text),
                (r'/\*', Comment.Multiline, 'comment'),
                (r'//.*?$', Comment.Single),
                (r'/', Text)
            ],
            'comment': [
                (r'[^*/]', Comment.Multiline),
                (r'/\*', Comment.Multiline, '#push'),
                (r'\*/', Comment.Multiline, '#pop'),
                (r'[*/]', Comment.Multiline)
            ]
        }
This lexer starts lexing in the ``'root'`` state. It tries to match as much as
possible until it finds a slash (``'/'``). If the next character after the slash
is a star (``'*'``) the `RegexLexer` sends those two characters to the output
stream marked as `Comment.Multiline` and continues parsing with the rules
defined in the ``'comment'`` state.
If there wasn't a star after the slash, the `RegexLexer` checks whether it's a
single-line comment (i.e. the slash is followed by a second slash). If that
wasn't the case either, it must be a single slash (the separate regex for a
single slash must also be given, otherwise the slash would be marked as an
error token).
Inside the ``'comment'`` state, we do the same thing again: scan until the
lexer finds a star or slash. If it's the opening of a multiline comment, push
the ``'comment'`` state onto the stack and continue scanning, again in the
``'comment'`` state. Otherwise, check if it's the end of the multiline
comment. If yes, pop one state from the stack.
Note: If you pop from an empty stack you'll get an `IndexError`. (There is an
easy way to prevent this from happening: don't ``'#pop'`` in the root state).
If the `RegexLexer` encounters a newline that is flagged as an error token, the
stack is emptied and the lexer continues scanning in the ``'root'`` state. This
helps produce error-tolerant highlighting for erroneous input, e.g. when a
single-line string is not closed.
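Running the `ExampleLexer` defined above on a nested comment shows the pushing
and popping in action; this quick check is only illustrative:

.. sourcecode:: python

    lexer = ExampleLexer()
    code = 'a /* outer /* nested */ still outer */ b\n'
    for index, tokentype, value in lexer.get_tokens_unprocessed(code):
        print((index, tokentype, value))
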
Advanced state tricks
=====================
There are a few more things you can do with states.