Ada Hsu 的胡思亂想 (Ada Hsu's Random Musings): 2019/04

April 28, 2019

Enabling Intel CPU Instruction Set Extensions in TensorFlow

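Yesterday in class our teacher demonstrated a sample program that uses Keras with TensorFlow as the backend to train an image recognition model on the 60,000 digit images of the MNIST dataset. The program builds 2 hidden layers with Keras and trains with a batch size of 100 for 20 epochs. The Dense layers reported by Keras are as follows: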

Layer (type)	Output Shape	Param #
dense_1 (Dense)	(None, 689)	540865
dense_2 (Dense)	(None, 689)	475410
dense_3 (Dense)	(None, 689)	475410
dense_4 (Dense)	(None, 10)	6900
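The Param # column can be checked by hand: a Dense layer has inputs × units weights plus one bias per unit. The sketch below assumes the images are flattened to 784 inputs (28 × 28 pixels). The summary does not state this, but it is the input size that matches the 540865 parameters of dense_1. The function name is only illustrative.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# Assumption: 28x28 MNIST images flattened to 784 inputs.
layer_sizes = [784, 689, 689, 689, 10]

param_counts = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

for index, count in enumerate(param_counts, start=1):
    print(f"dense_{index}: {count}")
print("total:", sum(param_counts))
```

The four counts match the summary, so the network has roughly 1.5 million trainable parameters in total.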

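I first noticed the message below in the Jupyter console. It shows that TensorFlow is not actually using the CPU's instruction set extensions. A pip search then turned up a promising-looking package, intel-tensorflow. Since I had just enabled Keras support for the AMD Radeon Pro 560X on macOS through PlaidML, I decided to see how these combinations affect Keras training.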

I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Conclusion first

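You do not necessarily need to spend money on powerful GPU hardware. On most computers, installing the intel-tensorflow package is enough to cut training time substantially; in my test it went from about 212 seconds to about 114 seconds.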

Additional notes:

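I asked a classmate to test on Windows; their impression was that training is equally slow with or without intel-tensorflow.
In my own tests, TensorFlow with all CPU extensions enabled was faster than the AMD Radeon Pro 560X GPU version.
However, the article "Evaluating PlaidML and GPU Support for Deep Learning on a Windows 10 Notebook" suggests the GPU has a clear advantage. I could not run its sample program, which fails with the following message: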
```
ValueError: `steps_per_epoch=None` is only valid for a generator based on the `keras.utils.Sequence` class. Please specify `steps_per_epoch` or use the `keras.utils.Sequence` class.
```
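The error message itself names the missing argument: with a plain generator, Keras cannot tell how many batches make up one epoch. Whether supplying it is enough to make that article's code run is untested. As a minimal sketch, the value to pass is the number of batches per epoch, computed here for the 60000 samples and batch size of 100 used in this post (the function name is illustrative).

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover the data set once."""
    return math.ceil(n_samples / batch_size)


steps = steps_per_epoch(60000, 100)
print(steps)  # 600
```

The result would then be passed as the `steps_per_epoch` argument of the generator-based training call.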

Additional notes, 2019-08-06:

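As of tensorflow 1.14.0 the Intel optimizations appear to have been merged into the official release. The intel-tensorflow package on pip is stuck at 1.13.1, and the startup message in 1.14 lists only AVX2 and FMA as not enabled.
I also found that with PlaidML as the Keras backend on a Mac, Apple's own Metal API cuts the training time from 94 seconds to 68 seconds compared with OpenCL, roughly 28% less.
There is currently no custom macOS build with all CPU extensions enabled that matches tensorflow 1.14.0, so building it yourself is probably the only option.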

Installing the various TensorFlow builds

Generic standard build

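Unless you have configured something special, this is the TensorFlow build you get, namely the one installed by the command below. At run time it detects which instruction set extensions the CPU offers and prints them to the console.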

pip install tensorflow

TensorFlow's message about CPU instruction set extensions looks like this:

I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Intel-optimized build

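As the name suggests, this is a build tuned by Intel. It mainly enables three Intel CPU instruction set extensions: SSE4.1, SSE4.2, and AVX. The package metadata does not list tensorflow as a dependency, but I suspect it does depend on the official tensorflow package.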

$ pip show intel-tensorflow
Name: intel-tensorflow
Version: 0.0.1
Summary: Intel Optimized Tensorflow with MKL
Home-page: https://github.com/IntelAI
Author: Intel Tensorflow optimization team
Author-email: [email protected]
License: UNKNOWN
Location: /usr/local/anaconda3/envs/ai/lib/python3.6/site-packages
Requires: pip
Required-by:

After installing this package, TensorFlow's message about CPU instruction set extensions changes to this:

I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
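To make the difference between the two messages explicit, here is a small sketch that pulls the instruction set names out of each log line and compares them. The regular expression and function name are my own and assume the message keeps the wording shown in this post.

```python
import re

# Everything after "not compiled to use:" is a space-separated list of names.
PATTERN = re.compile(r"not compiled to use:\s*(.+)$")


def unused_extensions(log_line: str) -> list:
    """Return the instruction sets listed in a cpu_feature_guard message."""
    match = PATTERN.search(log_line.strip())
    return match.group(1).split() if match else []


stock = unused_extensions(
    "I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports "
    "instructions that this TensorFlow binary was not compiled to use: "
    "SSE4.1 SSE4.2 AVX AVX2 FMA"
)
intel = unused_extensions(
    "I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports "
    "instructions that this TensorFlow binary was not compiled to use: AVX2 FMA"
)

# Extensions that no longer appear in the warning after installing intel-tensorflow.
newly_enabled = [ext for ext in stock if ext not in intel]
print(newly_enabled)  # ['SSE4.1', 'SSE4.2', 'AVX']
```

The difference is exactly SSE4.1, SSE4.2, and AVX; AVX2 and FMA are still reported as unused.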

Custom builds

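Even the Intel-optimized TensorFlow does not enable every CPU instruction set extension, so some generous people publish prebuilt TensorFlow packages on GitHub that are compiled with the full set of extensions, such as tensorflow-build and tensorflow-windows-wheel.
Installation only takes the pip command below, but finding the right package file is tedious.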

pip install --ignore-installed --upgrade "full path to the downloaded .whl file"
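Part of what makes picking the right file tedious is that the Python version, ABI, and platform are encoded in the wheel's file name. The sketch below splits a name into those parts following the wheel naming convention from PEP 427. The file name used here is hypothetical and only for illustration; the custom builds on GitHub may also encode CPU features in directory names, which this does not handle.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel file name into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError(f"unexpected wheel name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


# Hypothetical file name, for illustration only.
tags = parse_wheel_name("tensorflow-1.13.1-cp36-cp36m-macosx_10_13_x86_64.whl")
print(tags.python_tag, tags.platform_tag)
```

For the Python 3.6 environment shown in the `pip show` output earlier, the Python tag to look for would be `cp36`.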

Training results under each configuration

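Each configuration below was trained on the scenario described at the start of this post. Before every run I shut down Jupyter (because packages had to be installed), and the figures are the result of running Restart & Run All on the Jupyter kernel.
Hardware: MacBook Pro (15-inch, 2018), 2.6 GHz Intel Core i7, Radeon Pro 560X discrete GPU with 4 GB.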
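This post does not say how the elapsed time was captured. One simple way to take such a measurement inside a notebook is a wall-clock timer around the training call. This is a sketch under that assumption, with a placeholder workload standing in for the actual training, and with names of my own choosing.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label: str, results: dict):
    """Record the wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with stopwatch("training", timings):
    # Placeholder workload; in the notebook this would be the training call.
    sum(i * i for i in range(100_000))

print(f"training took {timings['training']:.3f} s")
```

Restarting the kernel before each run, as done here, keeps earlier runs from affecting the timing.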

Backend	Time (s)	CPU usage	Fan state
tensorflow (1.13.1)	211.579	1100% ↑	full speed
intel-tensorflow (1.13.1)	113.840	800% ↓	low speed
Custom build (1.13.1)	97.193	700% ↑	low speed
PlaidML w/ AMD Radeon Pro 560X	91.528	250% ↓	full speed
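To put the timings in relative terms, the short sketch below divides the stock tensorflow time by each of the others. The numbers are copied from the table; the dictionary keys are just labels.

```python
# Training times in seconds, copied from the table above.
seconds = {
    "tensorflow (1.13.1)": 211.579,
    "intel-tensorflow (1.13.1)": 113.840,
    "custom build (1.13.1)": 97.193,
    "PlaidML w/ AMD Radeon Pro 560X": 91.528,
}

baseline = seconds["tensorflow (1.13.1)"]
speedups = {name: baseline / value for name, value in seconds.items()}

for name, factor in speedups.items():
    saved = 1 - seconds[name] / baseline
    print(f"{name}: {factor:.2f}x the baseline speed, {saved:.0%} less time")
```

On this single run, intel-tensorflow is a little under twice as fast as the stock build, and the custom build and PlaidML are both a little over twice as fast.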

The test code is in keras_MNIST.ipynb. If you want to verify the results yourself, remember to switch the backend back to tensorflow.
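How the notebook selects its backend is not shown here. One common way with multi-backend Keras is the `KERAS_BACKEND` environment variable, which has to be set before `keras` is imported. The helper below is a sketch under that assumption, and its name is my own.

```python
import os


def select_keras_backend(name: str) -> None:
    """Choose the Keras backend; must run before `import keras`."""
    os.environ["KERAS_BACKEND"] = name


# "plaidml.keras.backend" selects PlaidML; "tensorflow" restores the default.
select_keras_backend("tensorflow")
print(os.environ["KERAS_BACKEND"])
```

In a notebook, this belongs in the first cell, followed by a kernel restart whenever the backend changes.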
