Reposted

Python Black Magic: Encoding & Decoding

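Before We Begin

This article is a general introduction to the topic.
The examples in this article ran successfully on Ubuntu 14.04 / Python 2.7.11. The Python 3+ interfaces differ slightly, so readers will need to adapt the code themselves.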

Prelude

Let's start with a piece of code:

example.py:

    # -*- coding=yi -*-
    从 math 导入 sin, pi
    打印 'sin(pi) =', sin(pi)

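What is this?! Is it Python? Will it even run? That is probably what you are asking.

I can tell you plainly: this is not Python, but it can be run by the Python interpreter. If you like, you can call it "Yython" (Yi language, 易语言, plus Python).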


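How is this done? You may have noticed the odd comment on the first line. That's right: the whole secret lies there.

To explain this black magic, we have to start with PEP 263.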

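The Ancient PEP 263

I believe 99% of Chinese Python developers have had a headache over one problem: character encoding. It is every beginner's nightmare.

Remember that day? You tried to make friends with Python through code:

    print '你好'

and it hit you over the head in return:

    SyntaxError: Non-ASCII character '\xe4' in file chi.py on line 1, but no encoding declared

[Completely baffled]

So you searched online for a fix, and soon found the answer:

    # -*- coding=utf-8 -*-
    print '你好'

The comment on the first line specifies the encoding used to parse the file.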

This feature comes from PEP 263 -- Defining Python Source Code Encodings, written in 2001. It was introduced to solve a widely reported problem:

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. Programmers can write their 8-bit strings using the favorite encoding, but are bound to the "unicode-escape" encoding for Unicode literals.

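Python 2 parses source files as ASCII by default, which caused developers outside the English-speaking world a good deal of trouble 15 years ago. It seems Guido was a bit self-centered and designed only with the English-speaking world in mind.

The proposers' idea: use a special comment at the top of the file to specify the encoding of the source. The regular expression for this comment looks like this:

    ^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

In other words, # -*- coding=utf-8 -*- is not the only valid form; it is merely the form Emacs recommends. Forms such as # coding=utf-8 and # encoding: utf-8 are equally valid, so don't be surprised when someone else's encoding declaration looks different from yours.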
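To see which spellings the pattern accepts, it can be tried against a few candidate lines. The following is a small Python 3 sketch; the helper name declared_encoding is made up for illustration, and the pattern is copied from the text above.

```python
import re

# The declaration pattern quoted above from PEP 263.
CODING_RE = re.compile(r'^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(line):
    """Return the encoding named on a comment line, or None (hypothetical helper)."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


samples = (
    '# -*- coding=utf-8 -*-',
    '# coding=utf-8',
    '# encoding: utf-8',
    '# vim: set fileencoding=latin-1 :',
    'print("no declaration here")',
)
for line in samples:
    print(repr(line), '->', declared_encoding(line))
```

Any comment in which "coding" is immediately followed by ":" or "=" matches, which is why "encoding:" and "fileencoding=" work as well.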

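The capture group ([-_.a-zA-Z0-9]+) is used as the name of the encoding to look up, and the codec found is then used to decode the file. In other words, import example is roughly equivalent to the following conversion behind the scenes:

    with open('example.py', 'r') as f:
        content = f.read()
        # extract_encoding_info is an illustrative helper: it parses the leading comment
        encoding = extract_encoding_info(content)
        exec(content.decode(encoding))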
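The conversion above can be made concrete. What follows is a minimal Python 3 sketch, not the interpreter's real implementation: extract_encoding_info and run_source are illustrative names, the 'ascii' fallback mirrors the Python 2 default described in this article, and the source is passed in as bytes rather than read from a file.

```python
import re

# Bytes version of the PEP 263 pattern, since undecoded source is bytes.
CODING_RE = re.compile(rb'^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def extract_encoding_info(raw, default='ascii'):
    """Look for a declaration on the first two lines of the raw source."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode('ascii')
    return default


def run_source(raw):
    """Decode the raw bytes with the declared encoding, then execute the text."""
    namespace = {}
    exec(raw.decode(extract_encoding_info(raw)), namespace)
    return namespace


# A tiny "file" saved as GBK, with a matching declaration.
raw = '# coding: gbk\ngreeting = "你好"\n'.encode('gbk')
namespace = run_source(raw)
print(namespace['greeting'])
```

The point of the sketch is that the declaration is only a name; all the real work happens in the decode call.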

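So the question really comes back to the familiar str.encode and str.decode.

But how is Python this powerful?! It recognizes almost every encoding! How does it do that? Is it the standard library, or is it built into the interpreter?

It is all the work of the codecs module.

codecs

codecs is a fairly obscure module; the encode / decode methods of str are far more commonly used, but under the hood they are calls into codecs.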
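The relationship can be checked directly. This Python 3 sketch looks up a codec by name and compares the result of its stateless functions with the familiar methods (in Python 3, encode lives on str and decode on bytes).

```python
import codecs

text = '你好'

# codecs.lookup returns the CodecInfo registered for a name.
info = codecs.lookup('utf-8')
print(info.name)

# The stateless functions return (result, amount of input consumed).
data, chars_consumed = info.encode(text)
decoded, bytes_consumed = info.decode(data)

print(data == text.encode('utf-8') == codecs.encode(text, 'utf-8'))
print(decoded == data.decode('utf-8') == text)
print(chars_consumed, bytes_consumed)
```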

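Open the /path/to/your/python/lib/encodings/ directory and you will find many .py files named after encodings, such as utf_8.py and latin_1.py. These are the predefined codecs, each implementing the logic for one encoding. In other words, a codec is just an ordinary module.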
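The same directory can be inspected from Python instead of the file system. A small Python 3 sketch, assuming a standard CPython installation where the encodings package is importable:

```python
import encodings
import pkgutil

# Every submodule of the encodings package is one codec (or a helper module).
codec_modules = sorted(m.name for m in pkgutil.iter_modules(encodings.__path__))

print(len(codec_modules), 'modules, for example:', codec_modules[:5])
print('utf_8' in codec_modules, 'latin_1' in codec_modules)
```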

Besides the built-in encodings, users can also define their own codecs. codecs exposes a register function for registering custom encodings. Its signature is:

codecs.register(search_function)

Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a CodecInfo object having the following attributes:

name: The name of the encoding;
encode: The stateless encoding function;
decode: The stateless decoding function;
incrementalencoder: An incremental encoder class or factory function;
incrementaldecoder: An incremental decoder class or factory function;
streamwriter: A stream writer class or factory function;
streamreader: A stream reader class or factory function.

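encode and decode are stateless functions; put simply, one string being encoded or decoded has no connection to the next. If you want to use the codecs system to do syntax-tree parsing, it is best not to put the parsing logic here, because there is no guarantee that the code arrives as one continuous piece. incremental* are stateful classes that make up for this shortcoming of encode and decode. stream* are stream-oriented classes whose behavior is usually the same as encode / decode.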
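The difference between the stateless and the incremental interface shows up as soon as the input arrives in pieces. In this Python 3 sketch a UTF-8 string is deliberately split in the middle of a character: the stateless decoder rejects the first piece, while the incremental decoder remembers the leftover byte.

```python
import codecs

data = '你好'.encode('utf-8')        # 6 bytes, 3 per character
first, second = data[:4], data[4:]   # the cut falls inside the second character

# Stateless: each call stands alone, so a dangling byte is an error.
try:
    codecs.decode(first, 'utf-8')
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False

# Incremental: the decoder object keeps the incomplete byte until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(first), decoder.decode(second, final=True)]

print(stateless_ok)  # False
print(pieces)        # ['你', '好']
```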

For how to write these six objects, see /path/to/your/python/lib/encodings/rot_13.py, which implements a simple cipher.
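Before reading rot_13.py it may help to see the codec in use. In Python 3 it is a text-to-text codec, so it is reached through codecs.encode / codecs.decode rather than through str.encode:

```python
import codecs

# rot_13 shifts each ASCII letter by 13 places and leaves everything else alone.
secret = codecs.encode('Hello, Python', 'rot_13')
print(secret)  # Uryyb, Clguba

# Applying the shift again restores the original text.
restored = codecs.decode(secret, 'rot_13')
print(restored)
```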

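So, it's time to reveal the truth.

The So-Called "Yython"

The black magic is not really mysterious: just follow the existing examples and define the corresponding interfaces. As an example, only the keywords actually used are handled here:

yi.py: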

    # encoding=utf8
    import codecs

    # Chinese keyword -> Python keyword
    yi_map = {
        u'从': 'from',
        u'导入': 'import',
        u'打印': 'print'
    }


    def encode(input):
        for key, value in yi_map.items():
            input = input.replace(value, key)
        return input.encode('utf8')


    def decode(input):
        input = input.decode('utf8')
        for key, value in yi_map.items():
            input = input.replace(key, value)
        return input


    class Codec(codecs.Codec):

        def encode(self, input, errors="strict"):
            # The second item is the amount of input consumed, not the output length.
            return (encode(input), len(input))

        def decode(self, input, errors="strict"):
            return (decode(input), len(input))


    class IncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return encode(input)


    class IncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            return decode(input)


    class StreamWriter(Codec, codecs.StreamWriter):
        pass


    class StreamReader(Codec, codecs.StreamReader):
        pass


    def register_entry(encoding):
        # Search function: only answer for the name 'yi'.
        if encoding != 'yi':
            return None
        return codecs.CodecInfo(
            name='yi',
            encode=Codec().encode,
            decode=Codec().decode,
            incrementalencoder=IncrementalEncoder,
            incrementaldecoder=IncrementalDecoder,
            streamwriter=StreamWriter,
            streamreader=StreamReader
        )

Register it in the interactive interpreter, and you can see the exciting result:

    >>> import codecs, yi
    >>> codecs.register(yi.register_entry)
    >>> import example
    sin(pi) = 1.22464679915e-16
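For readers on Python 3, the same idea can be tried without creating any files. This is a rough sketch and not a port of yi.py: the codec name yi3 and the function names are invented for the example, only the stateless encode / decode pair is supplied, and the source is decoded and executed by hand instead of being imported. The keyword replacement is deliberately naive and would also rewrite these words inside string literals.

```python
import codecs

# Chinese keyword -> Python keyword (print is a function in Python 3).
YI_MAP = {'从': 'from', '导入': 'import', '打印': 'print'}


def yi_decode(data, errors='strict'):
    # bytes.decode may hand the codec a memoryview, so convert it first.
    text = bytes(data).decode('utf-8', errors)
    for key, value in YI_MAP.items():
        text = text.replace(key, value)
    return text, len(data)


def yi_encode(text, errors='strict'):
    consumed = len(text)
    for key, value in YI_MAP.items():
        text = text.replace(value, key)
    return text.encode('utf-8', errors), consumed


def search(name):
    """Codec search function: only answers for the made-up name 'yi3'."""
    if name != 'yi3':
        return None
    return codecs.CodecInfo(name='yi3', encode=yi_encode, decode=yi_decode)


codecs.register(search)

source = (
    "从 math 导入 sin, pi\n"
    "result = sin(pi)\n"
    "打印('sin(pi) =', result)\n"
).encode('utf-8')

python_source = source.decode('yi3')
namespace = {}
exec(python_source, namespace)
```

After registration, 'yi3' behaves like any other encoding name: source.decode('yi3') yields ordinary Python text that exec can run.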

Closing Words

Sometimes, digging a little deeper into things you take for granted can lead to surprising discoveries.

References

codecs - Codec registry and base classes

Original article: https://segmentfault.com/a/1190000006037333
