Linux和Windows之间的numpy性能差异
问题内容:
我试图 sklearn.decomposition.TruncatedSVD()
在2台不同的计算机上运行,并了解性能差异。
计算机1 (Windows 7,物理计算机)
OS Name Microsoft Windows 7 Professional
System Type x64-based PC
Processor Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 3401 Mhz, 4 Core(s),
8 Logical Installed Physical Memory (RAM) 8.00 GB
Total Physical Memory 7.89 GB
计算机2 (Debian,在亚马逊云上)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
width: 64 bits
capabilities: ldt16 vsyscall32
*-core
description: Motherboard
physical id: 0
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits
计算机3 (Windows 2008R2,在亚马逊云上)
OS Name Microsoft Windows Server 2008 R2 Datacenter
Version 6.1.7601 Service Pack 1 Build 7601
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2500 Mhz,
4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) 30.0 GB
两台计算机都运行Python 3.2并具有相同的sklearn,numpy,scipy版本
我跑 cProfile
如下:
print(vectors.shape)
>>> (7500, 2042)
_decomp = TruncatedSVD(n_components=680, random_state=1)
global _o
_o = _decomp
cProfile.runctx('_o.fit_transform(vectors)', globals(), locals(), sort=1)
计算机1输出
>>> 833 function calls in 1.710 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.767 0.767 0.782 0.782 decomp_svd.py:15(svd)
1 0.249 0.249 0.249 0.249 {method 'enable' of '_lsprof.Profiler' objects}
1 0.183 0.183 0.183 0.183 {method 'normal' of 'mtrand.RandomState' objects}
6 0.174 0.029 0.174 0.029 {built-in method csr_matvecs}
6 0.123 0.021 0.123 0.021 {built-in method csc_matvecs}
2 0.110 0.055 0.110 0.055 decomp_qr.py:14(safecall)
1 0.035 0.035 0.035 0.035 {built-in method dot}
1 0.020 0.020 0.589 0.589 extmath.py:185(randomized_range_finder)
2 0.018 0.009 0.019 0.010 function_base.py:532(asarray_chkfinite)
24 0.014 0.001 0.014 0.001 {method 'ravel' of 'numpy.ndarray' objects}
1 0.007 0.007 0.009 0.009 twodim_base.py:427(triu)
1 0.004 0.004 1.710 1.710 extmath.py:232(randomized_svd)
计算机2输出
>>> 858 function calls in 40.145 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2 32.116 16.058 32.116 16.058 {built-in method dot}
1 6.148 6.148 6.156 6.156 decomp_svd.py:15(svd)
2 0.561 0.281 0.561 0.281 decomp_qr.py:14(safecall)
6 0.561 0.093 0.561 0.093 {built-in method csr_matvecs}
1 0.337 0.337 0.337 0.337 {method 'normal' of 'mtrand.RandomState' objects}
6 0.202 0.034 0.202 0.034 {built-in method csc_matvecs}
1 0.052 0.052 1.633 1.633 extmath.py:183(randomized_range_finder)
1 0.045 0.045 0.054 0.054 _methods.py:73(_var)
1 0.023 0.023 0.023 0.023 {method 'argmax' of 'numpy.ndarray' objects}
1 0.023 0.023 0.046 0.046 extmath.py:531(svd_flip)
1 0.016 0.016 40.145 40.145 <string>:1(<module>)
24 0.011 0.000 0.011 0.000 {method 'ravel' of 'numpy.ndarray' objects}
6 0.009 0.002 0.009 0.002 {method 'reduce' of 'numpy.ufunc' objects}
2 0.008 0.004 0.009 0.004 function_base.py:532(asarray_chkfinite)
计算机3输出
>>> 858 function calls in 2.223 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.956 0.956 0.972 0.972 decomp_svd.py:15(svd)
2 0.306 0.153 0.306 0.153 {built-in method dot}
1 0.274 0.274 0.274 0.274 {method 'normal' of 'mtrand.RandomState' objects}
6 0.205 0.034 0.205 0.034 {built-in method csr_matvecs}
6 0.151 0.025 0.151 0.025 {built-in method csc_matvecs}
2 0.133 0.067 0.133 0.067 decomp_qr.py:14(safecall)
1 0.032 0.032 0.043 0.043 _methods.py:73(_var)
1 0.030 0.030 0.030 0.030 {method 'argmax' of 'numpy.ndarray' objects}
24 0.026 0.001 0.026 0.001 {method 'ravel' of 'numpy.ndarray' objects}
2 0.019 0.010 0.020 0.010 function_base.py:532(asarray_chkfinite)
1 0.019 0.019 0.773 0.773 extmath.py:183(randomized_range_finder)
1 0.019 0.019 0.049 0.049 extmath.py:531(svd_flip)
请注意 {内置方法点} 差异,从0.035s /调用到16.058s /调用, 慢了450倍!!
------+---------+---------+---------+---------+---------------------------------------
ncalls| tottime | percall | cumtime | percall | filename:lineno(function) HARDWARE
------+---------+---------+---------+---------+---------------------------------------
1 | 0.035 | 0.035 | 0.035 | 0.035 | {built-in method dot} Computer 1
2 | 32.116 | 16.058 | 32.116 | 16.058 | {built-in method dot} Computer 2
2 | 0.306 | 0.153 | 0.306 | 0.153 | {built-in method dot} Computer 3
我知道应该有性能差异,但是我应该那么高吗?
有什么方法可以进一步调试此性能问题?
编辑
我测试了一台新计算机,即计算机3,其硬件类似于计算机2,但操作系统不同
对于 {内置方法点} ,结果为0.153秒/调用, 仍然比Linux快100倍!
编辑2
计算机1 numpy的配置
>>> np.__config__.show()
lapack_opt_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_opt_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
openblas_info:
NOT AVAILABLE
lapack_mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
计算机2 numpy配置
>>> np.__config__.show()
lapack_info:
NOT AVAILABLE
lapack_opt_info:
NOT AVAILABLE
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
atlas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
lapack_src_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
问题答案:
{built-in method dot}
是np.dot
函数,它是围绕矩阵矩阵,矩阵向量和向量向量乘法的CBLAS例程的NumPy包装器。您的Windows计算机使用经过严格调整的Intel
MKL
版本的CBLAS。Linux机器正在使用缓慢的旧参考实现。
如果您安装ATLAS或OpenBLAS(均可通过Linux软件包管理器安装),或者实际上是Intel
MKL,则可能会看到大量的加速。尝试sudo apt-get install libatlas- dev
,再次检查NumPy配置,以查看它是否拾取了ATLAS,然后再次进行测量。
确定正确的CBLAS库后,您可能需要重新编译scikit-
learn。大多数算法只是将NumPy用于其线性代数需求,但是某些算法(尤其是k均值)直接使用CBLAS。
操作系统与此无关。