.. -*- mode: rst -*-
====================
Write your own lexer
====================

If a lexer for your favorite language is missing in the Pygments package, you can
easily write your own and extend Pygments.

All you need can be found inside the `pygments.lexer` module. As you can read in
the `API documentation <api.txt>`_, a lexer is a class that is initialized with
some keyword arguments (the lexer options) and that provides a
`get_tokens_unprocessed()` method which is given a string or unicode object with
the data to parse.

The `get_tokens_unprocessed()` method must return an iterator or iterable
containing tuples in the form ``(index, token, value)``. Normally you don't need
to do this since there are numerous base lexers you can subclass.
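
To make the contract concrete, here is a minimal sketch of a lexer that
implements `get_tokens_unprocessed()` directly; the class name and behaviour
are invented for illustration and not part of Pygments:

.. sourcecode:: python

    from pygments.lexer import Lexer
    from pygments.token import Text

    class EchoLexer(Lexer):
        """Hypothetical lexer that emits its whole input as one Text token."""
        name = 'Echo'

        def get_tokens_unprocessed(self, text):
            # Each tuple is (index, tokentype, value); index is the
            # position of the value within the input string.
            yield 0, Text, text
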

RegexLexer
==========

A very powerful (but quite easy to use) lexer is the `RegexLexer`. This lexer
base class allows you to define lexing rules in terms of *regular expressions*
for different *states*.

States are groups of regular expressions that are matched against the input
string at the *current position*. If one of these expressions matches, a
corresponding action is performed (normally yielding a token with a specific
type), the current position is set to where the last match ended and the
matching process continues with the first regex of the current state.

Lexer states are kept in a state stack: each time a new state is entered, the
new state is pushed onto the stack. The most basic lexers (like the
`DiffLexer`) just need one state.

Each state is defined as a list of tuples in the form (`regex`, `action`,
`new_state`) where the last item is optional. In the most basic form, `action`
is a token type (like `Name.Builtin`). That means: when `regex` matches, emit a
token with the matched text and the type given by `action`, and push
`new_state` onto the state stack. If the new state is ``'#pop'``, the topmost
state is popped from the stack instead. (To pop more than one state, use
``'#pop:2'`` and so on.) ``'#push'`` is a synonym for pushing the current state
onto the stack.

The following example shows the `DiffLexer` from the builtin lexers. Note that
it contains some additional attributes `name`, `aliases` and `filenames` which
aren't required for a lexer. They are used by the builtin lexer lookup
functions.

.. sourcecode:: python

    from pygments.lexer import RegexLexer
    from pygments.token import *

    class DiffLexer(RegexLexer):
        name = 'Diff'
        aliases = ['diff']
        filenames = ['*.diff']

        tokens = {
            'root': [
                (r' .*\n', Text),
                (r'\+.*\n', Generic.Inserted),
                (r'-.*\n', Generic.Deleted),
                (r'@.*\n', Generic.Subheading),
                (r'Index.*\n', Generic.Heading),
                (r'=.*\n', Generic.Heading),
                (r'.*\n', Text),
            ]
        }

As you can see, this lexer only uses one state. When the lexer starts scanning
the text, it first checks if the current character is a space. If it is, it
scans everything up to and including the newline and returns it as a `Text`
token. If this rule doesn't match, it checks if the current character is a plus
sign, and so on.

If no rule matches at the current position, the current character is emitted as
an `Error` token that indicates a parsing error, and the position is increased
by one.
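
You can observe this with the `DiffLexer` defined above: a final line without a
trailing newline matches none of the ``.*\n`` rules, so each of its characters
is emitted as an `Error` token. (This quick check is not part of the original
example.)

.. sourcecode:: python

    # Assumes the DiffLexer class from the example above.
    tokens = list(DiffLexer().get_tokens_unprocessed('+ok\nbad'))
    # [(0, Token.Generic.Inserted, '+ok\n'),
    #  (4, Token.Error, 'b'),
    #  (5, Token.Error, 'a'),
    #  (6, Token.Error, 'd')]
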

Regex Flags
===========

You can either define regex flags in the regex (``r'(?x)foo bar'``) or by adding
a `flags` attribute to your lexer class. If no attribute is defined, it defaults
to `re.MULTILINE`. For more information about regular expression flags, see the
`regular expressions`_ help page in the Python documentation.

.. _regular expressions: http://docs.python.org/lib/re-syntax.html
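
For instance, a lexer for a language with case-insensitive keywords could set
the `flags` attribute as in the sketch below; the language and class name are
invented for illustration:

.. sourcecode:: python

    import re

    from pygments.lexer import RegexLexer
    from pygments.token import Comment, Text

    class RemCommentLexer(RegexLexer):
        # Hypothetical example: BASIC-style 'REM' comment lines in any case.
        name = 'RemComment'
        flags = re.MULTILINE | re.IGNORECASE

        tokens = {
            'root': [
                (r'^rem.*$', Comment),  # also matches 'REM'/'Rem' via IGNORECASE
                (r'.*\n', Text),
            ]
        }
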

Scanning multiple tokens at once
================================

Here is a more complex lexer that highlights INI files. INI files consist of
sections, comments and key = value pairs:

.. sourcecode:: python

    from pygments.lexer import RegexLexer, bygroups
    from pygments.token import *

    class IniLexer(RegexLexer):
        name = 'INI'
        aliases = ['ini', 'cfg']
        filenames = ['*.ini', '*.cfg']

        tokens = {
            'root': [
                (r'\s+', Text),
                (r';.*?$', Comment),
                (r'\[.*?\]$', Keyword),
                (r'(.*?)(\s*)(=)(\s*)(.*?)$',
                 bygroups(Name.Attribute, Text, Operator, Text, String))
            ]
        }

The lexer first looks for whitespace, comments and section names. Later, it
looks for a line that looks like a key/value pair, separated by an ``'='``
sign, with optional whitespace around it.

The `bygroups` helper makes sure that each group is yielded with a different
token type: first a `Name.Attribute` token, then a `Text` token for the
optional whitespace, after that an `Operator` token for the equals sign, then
a `Text` token for the whitespace again. The rest of the line is returned as a
`String` token.
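
Running the `IniLexer` defined above over a single assignment line shows how
the match is split up (a quick check, assuming the class from the previous
example):

.. sourcecode:: python

    # Assumes the IniLexer class from the example above.
    tokens = list(IniLexer().get_tokens_unprocessed('name = value\n'))
    # [(0, Token.Name.Attribute, 'name'),
    #  (4, Token.Text, ' '),
    #  (5, Token.Operator, '='),
    #  (6, Token.Text, ' '),
    #  (7, Token.String, 'value'),
    #  (12, Token.Text, '\n')]
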
Note that for this to work, every part of the match must be inside a capturing
group (a ``(...)``), and there must not be any nested capturing groups. If you
nevertheless need a group, use a non-capturing group defined using this syntax:
``r'(?:some|words|here)'`` (note the ``?:`` after the beginning parenthesis).

If you find yourself needing a capturing group inside the regex which shouldn't
be part of the output, but is used in the regular expression for backreferencing
(eg: ``r'(<(foo|bar)>)(.*?)(</\2>)'``), you can pass `None` to the `bygroups`
function and that group will be skipped in the output.
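
Here is a minimal sketch of that idea; the lexer, its name and the token
choices are invented for illustration:

.. sourcecode:: python

    from pygments.lexer import RegexLexer, bygroups
    from pygments.token import Name, Text

    class TagPairLexer(RegexLexer):
        # Hypothetical example: the nested group 2 exists only so that the
        # closing tag can backreference it; passing None to bygroups skips
        # that group in the output.
        name = 'TagPair'
        tokens = {
            'root': [
                (r'(<(foo|bar)>)(.*?)(</\2>)',
                 bygroups(Name.Tag, None, Text, Name.Tag)),
                (r'[^<]+', Text),
            ]
        }
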

Changing states
===============

Many lexers need multiple states to work as expected. For example, some
languages allow multiline comments to be nested. Since this is a recursive
pattern it's impossible to lex just using regular expressions.

Here is the solution:

.. sourcecode:: python

    from pygments.lexer import RegexLexer
    from pygments.token import *

    class ExampleLexer(RegexLexer):
        name = 'Example Lexer with states'

        tokens = {
            'root': [
                (r'[^/]+', Text),
                (r'/\*', Comment.Multiline, 'comment'),
                (r'//.*?$', Comment.Singleline),
                (r'/', Text)
            ],
            'comment': [
                (r'[^*/]', Comment.Multiline),
                (r'/\*', Comment.Multiline, '#push'),
                (r'\*/', Comment.Multiline, '#pop'),
                (r'[*/]', Comment.Multiline)
            ]
        }

This lexer starts lexing in the ``'root'`` state. It tries to match as much as
possible until it finds a slash (``'/'``). If the next character after the slash
is a star (``'*'``), the `RegexLexer` sends those two characters to the output
stream marked as `Comment.Multiline` and continues lexing with the rules defined
in the ``'comment'`` state.

If there wasn't a star after the slash, the `RegexLexer` checks whether it is a
single-line comment (i.e. the slash is followed by a second slash). If that
isn't the case either, it must be a single slash (the separate regex for a
single slash must also be given, otherwise the slash would be marked as an
error token).

Inside the ``'comment'`` state, we do the same thing again: scan until the
lexer finds a star or slash. If it's the opening of a multiline comment, push
the ``'comment'`` state onto the stack and continue scanning, again in the
``'comment'`` state. Otherwise, check if it's the end of the multiline comment.
If so, pop one state from the stack.
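
Feeding the `ExampleLexer` a nested comment makes the pushes and pops visible
(a quick check, assuming the class defined above):

.. sourcecode:: python

    # Assumes the ExampleLexer class from the example above.
    code = 'a /* b /* c */ d */ e\n'
    tokens = list(ExampleLexer().get_tokens_unprocessed(code))
    # Everything between the outermost '/*' and '*/' is yielded as
    # Comment.Multiline: the inner '/*' pushes a second 'comment' state,
    # so the first '*/' only pops back into the outer comment instead
    # of ending it; only 'a ' and ' e\n' come out as Text.
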
Note: If you pop from an empty stack you'll get an `IndexError`. (There is an
easy way to prevent this from happening: don't ``'#pop'`` in the root state).

If the `RegexLexer` encounters a newline that is flagged as an error token, the
stack is emptied and the lexer continues scanning in the ``'root'`` state. This
helps produce error-tolerant highlighting for erroneous input, e.g. when a
single-line string is not closed.
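
The sketch below demonstrates this recovery behaviour with an invented lexer
for single-quoted strings that may not span lines; none of these names come
from Pygments itself:

.. sourcecode:: python

    from pygments.lexer import RegexLexer
    from pygments.token import String, Text

    class QuoteLexer(RegexLexer):
        # Hypothetical example: a string must end on the line it starts on.
        name = 'Quote'
        tokens = {
            'root': [
                (r"'", String, 'string'),
                (r"[^']+", Text),
            ],
            'string': [
                (r"[^'\n]+", String),
                (r"'", String, '#pop'),
            ],
        }

    # For "'unterminated\nrest\n", no rule in 'string' matches the newline,
    # so the stack is reset to ['root'] there and 'rest\n' is lexed normally
    # as Text instead of being swallowed by the string state.
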

Advanced state tricks
=====================

There are a few more things you can do with states.