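For now I still prefer pandas for data processing; it really is a joy to use. That ease of use comes from complex underlying data structures, and those cost performance. Usability matters a great deal, though, which is why tools such as Numba and Datatable have appeared to bring parallel execution to this kind of work, alongside distributed parallel frameworks such as dask. Numba, incidentally, already supports GPU acceleration, which is quite nice, but that is not the focus of this post. I covered parallelizing pandas in an earlier post.
This time let's look at another way to speed up Python: dropping down to C. Here we use Cython, which scikit-learn relies on for much of its heavy computation. You can of course extend Python directly in C, which is the approach taken by PyTorch, TensorFlow and NumPy, but that requires solid C/C++ experience, and development is probably slower than with Cython. Probably.
Why is Python slow? Because it is dynamically typed: at run time the interpreter spends a lot of time working out each object's type and, from that, which attributes and operations it supports. Strictly statically typed languages such as C do not pay this cost, so C is much more efficient, in some cases by several orders of magnitude. From the Cython documentation: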
This can make Python a very relaxed and comfortable language for rapid development, but with a price – the ‘red tape’ of managing data types is dumped onto the interpreter. At run time, the interpreter does a lot of work searching namespaces, fetching attributes and parsing argument and keyword tuples. This run-time ‘late binding’ is a major cause of Python’s relative slowness compared to ‘early binding’ languages such as C++.
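To make the late binding described in the quote concrete, here is a small pure-Python sketch. The `add` function and the `Meters` class are hypothetical names made up for this illustration. The point is that a single `a + b` has no fixed meaning until the interpreter has inspected the operands at run time.

```python
class Meters:
    """Hypothetical user-defined type that supplies its own '+'."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Called by the interpreter when the left operand of '+' is a Meters.
        return Meters(self.value + other.value)


def add(a, b):
    # Nothing here says what types a and b are. On every call the interpreter
    # has to look at the operands and find the matching __add__ implementation.
    return a + b


print(add(1, 2))                        # integer addition
print(add(1.5, 2))                      # float addition
print(add("py", "thon"))                # string concatenation
print(add(Meters(1), Meters(2)).value)  # user-defined addition
```

A C compiler would pick one machine instruction for `a + b` ahead of time; Python repeats the lookup on every call, which is the overhead that type declarations in Cython aim to remove.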
"This" refers to the convenience of Python's dynamic typing. Let's go straight to an example, taken from this article: https://pythonprogramming.net/introduction-and-basics-cython-tutorial
```python
# example_original.py
def test(x):
    y = 0
    for i in range(x):
        y += i
    return y
```

```python
# example_cython.pyx
cpdef int test(int x):
    cdef int y = 0
    cdef int i
    for i in range(x):
        y += i
    return y
```
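One trade-off of the typed version is worth a quick sketch: `cdef int` is a fixed-width C integer, while a Python `int` grows as needed. The snippet below does not use Cython at all. It simulates a C `int` with `ctypes.c_int`, assuming the usual 32-bit width, and the helper names `python_sum` and `c_int_sum` are made up for this illustration.

```python
import ctypes


def python_sum(x):
    """Sum of 0..x-1 as an arbitrary-precision Python int (closed form)."""
    return x * (x - 1) // 2


def c_int_sum(x):
    """The same sum stored in a C int (assumed 32-bit).

    ctypes does no overflow checking, so only the low bits are kept.
    """
    return ctypes.c_int(python_sum(x)).value


for x in (100, 100_000):
    print(x, python_sum(x), c_int_sum(x))
```

For `x = 100` both values agree. For `x = 100_000` the true sum no longer fits in 32 bits, so the simulated C value differs. The compiled Cython function would likely return a wrapped result for such inputs as well, so a wider type such as `long long` may be the safer choice when larger arguments are expected.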
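Cython source files use the `.pyx` extension. Compared with the pure-Python function, the Cython version adds C type declarations. Next we need a setup.py file to build the .pyx file: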
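```python
# setup.py
# setuptools replaces distutils, which is deprecated and was removed in Python 3.12.
from setuptools import setup
from Cython.Build import cythonize

# cythonize() translates the .pyx file to C and wraps it as an extension module.
setup(ext_modules=cythonize('example_cython.pyx'))
```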
All three files live in the same directory. From a shell, run:

```shell
python setup.py build_ext --inplace
```
If all goes well, ~~you will get a warning~~:

```text
FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release!
```
To get rid of it, add the following directive comment at the top of the .pyx file:

```python
# cython: language_level=3
```
Once the build goes through without problems, write a test script:
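```python
import time

import example_cython
import example_original

if __name__ == '__main__':
    times = 10000     # number of timed calls per implementation
    add_times = 100   # argument passed to test(): sums 0..99

    # original (pure Python)
    # perf_counter() is used instead of time(): it has finer resolution,
    # which matters when a single call takes well under a microsecond.
    original_total_elapse = 0.0
    for i in range(times):
        start_time = time.perf_counter()
        example_original.test(add_times)
        original_total_elapse += time.perf_counter() - start_time

    # cython (compiled extension)
    cython_total_elapse = 0.0
    for i in range(times):
        start_time = time.perf_counter()
        example_cython.test(add_times)
        cython_total_elapse += time.perf_counter() - start_time

    print("Cython is {}x faster.".format(original_total_elapse / cython_total_elapse))
```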
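The hand-rolled timing loop can also be written with the standard library's `timeit` module, which handles the repetition for you. The sketch below is only an illustration: `speedup` is a hypothetical helper, and because the compiled `example_cython` module may not be available, a closed-form `formula_sum` stands in for the faster implementation. Swap in `example_original.test` and `example_cython.test` to compare the real pair.

```python
import timeit


def loop_sum(x):
    """Same logic as example_original.test: sum of 0..x-1 with a loop."""
    y = 0
    for i in range(x):
        y += i
    return y


def formula_sum(x):
    """Stand-in for the faster implementation (assumption, not the Cython build)."""
    return x * (x - 1) // 2


def speedup(baseline, candidate, arg, number=10_000, repeat=5):
    """Return how many times faster candidate(arg) runs than baseline(arg).

    Each function is called `number` times per run, for `repeat` runs;
    the fastest run is used because it is the least disturbed by other load.
    """
    base = min(timeit.repeat(lambda: baseline(arg), number=number, repeat=repeat))
    cand = min(timeit.repeat(lambda: candidate(arg), number=number, repeat=repeat))
    return base / cand


if __name__ == "__main__":
    # Check that both implementations agree before timing them.
    assert loop_sum(100) == formula_sum(100)
    print("Candidate is {:.1f}x faster.".format(speedup(loop_sum, formula_sum, 100)))
```

Taking the minimum over several runs tends to give steadier numbers than summing individual calls, since interference from other processes only ever makes a run slower.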