TensorFlow Profiler: Notes on the Pitfalls

As everyone knows, when an AI model goes into production we care not only about its accuracy but also about its performance.

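When developing C++ or C# applications in an IDE such as VSCode, we turn on debug mode and use visual tools to find the application's hot spots and bottlenecks. This process is usually called profiling. On Linux, tools such as valgrind make the same kind of analysis convenient.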

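For TensorFlow models we want to do a similar analysis. Fortunately, Google provides the TensorFlow Profiler together with an official tutorial that we can follow. Other people have also written blog posts on how to use the TensorFlow Profiler.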

Google has also published an introductory article on Zhihu.

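The catch is that if you profile on your own machine, you have to install all of this yourself, and installing it means running into a few pitfalls.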

_Even on Google Colab, some of these pitfalls are unavoidable._

Package versions

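First, for stability, the Docker images at work all use TensorFlow 2.10.0, even though newer versions exist. To match that setup, we install the same version locally, and tensorboard needs to match the tensorflow version as well (preferably exactly 2.10.0; I initially installed tensorflow 2.10 with pip without pinning the patch version and ended up with 2.10.1, then dutifully pinned it to 2.10.0).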
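A quick way to confirm what actually got installed is to ask the package metadata. The following is a minimal sketch, not part of any official tooling: the helper names (`installed_version`, `release_tuple`, `matches_pin`, `report`) are made up for this post, and the pin `2.10.0` is simply the version used above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Turn '2.10.1' into (2, 10, 1); a non-numeric suffix such as 'rc0' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def matches_pin(version, pin):
    """True only if `version` is installed and is exactly the pinned release."""
    return version is not None and release_tuple(version) == release_tuple(pin)


def report(pin="2.10.0", packages=("tensorflow", "tensorboard")):
    """Map each package name to (installed version, whether it matches the pin)."""
    result = {}
    for name in packages:
        version = installed_version(name)
        result[name] = (version, matches_pin(version, pin))
    return result
```

Calling `report()` in the training environment shows at a glance whether tensorflow or tensorboard has drifted to 2.10.1 or is missing altogether.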

Next comes the version of the tensorboard_plugin_profile package. At first I simply followed the official tutorial and installed it with

pip install tensorboard_plugin_profile

Training dumped its trace to file as expected, but when I started TensorBoard and opened the profile page in the browser, I got a page with the following content:

**No profile data was found.**

If you have a model running on CPU, GPU, or Google Cloud TPU, you may be able to use the above button to capture a profile.

If you're a CPU or GPU user, please use the IP address option. You may want to check out the [tutorial](https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb) on how to start a TensorFlow profiler server and profile a Keras model on a GPU.

If you're a TPU user, please use the TPU name option and you may want to check out the [tutorial](https://cloud.google.com/tpu/docs/cloud-tpu-tools) on how to interpreting the profiling results.

If you think profiling is done properly, please see the page of [Google Cloud TPU Troubleshooting and FAQ](https://cloud.google.com/tpu/docs/troubleshooting) and consider filing an issue on GitHub.

I was baffled when I first saw this, so I searched around and found several GitHub issues.

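The fixes suggested there mostly amount to rearranging the log directory structure, and one even suggests manually downloading and overwriting events.out.tfevents.TIMESTAMP.USER.profile-empty, which is a monkey-patch-level fix. I tried them and none of them worked.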
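Before touching the directory structure, it helps to check whether the trace files exist at all. The sketch below assumes the layout I observed with this setup, namely `<logdir>/plugins/profile/<run>/<host>.xplane.pb`; treat that layout as an assumption rather than a guarantee for every TensorFlow version. The function name `find_profile_runs` is made up for this post.

```python
from pathlib import Path


def find_profile_runs(logdir):
    """Map each run folder under <logdir>/plugins/profile to its *.xplane.pb files."""
    profile_root = Path(logdir) / "plugins" / "profile"
    runs = {}
    if not profile_root.is_dir():
        # Nothing was dumped here, so the plugin has nothing to show.
        return runs
    for run_dir in sorted(p for p in profile_root.iterdir() if p.is_dir()):
        runs[run_dir.name] = sorted(f.name for f in run_dir.glob("*.xplane.pb"))
    return runs
```

If this returns at least one run with a non-empty file list, the capture side is probably fine and the problem lies in how TensorBoard reads the data, which is what turned out to be the case here.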

Then it occurred to me that TensorBoard is essentially a browser-server application, so I looked at the server side and found two log lines:

W tensorflow/core/profiler/convert/xplane_to_tools_data.cc:231] Can not find tool: tool_names. Please update to the latest version of Tensorflow.
2024-04-17 14:31:57.295365: W tensorflow/core/profiler/convert/xplane_to_tools_data.cc:231] Can not find tool: tool_names. Please update to the latest version of Tensorflow.

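My guess was that tool_names is a new field introduced by a later version of the TensorFlow Profiler, one that the older TensorFlow and TensorBoard cannot handle. Refreshing the browser and watching closely, I also noticed a loading progress bar that gets about a third of the way before "No profile data was found" appears, which to some extent corroborates the guess. Checking with pip showed that the installed tensorboard_plugin_profile was version 2.15.1.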

So I tried installing version 2.10.0:

pip install tensorboard_plugin_profile==2.10.0

It turns out that version does not exist:

ERROR: Could not find a version that satisfies the requirement tensorboard_plugin_profile==2.10.0 (from versions: 2.2.0a1, 2.2.0a2, 2.2.0a3, 2.2.0a4, 2.2.0a5, 2.2.0a6, 2.2.0rc0, 2.2.0, 2.3.0rc0, 2.3.0, 2.4.0, 2.5.0, 2.8.0, 2.11.1, 2.11.2, 2.13.0, 2.13.1, 2.14.0, 2.15.0, 2.15.1)
ERROR: No matching distribution found for tensorboard_plugin_profile==2.10.0

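My guess is that plugin-style packages like this are released less often than tensorflow and tensorboard, so the release after tensorboard_plugin_profile 2.8.0 may simply have jumped to 2.11.1. We therefore install tensorboard_plugin_profile 2.8.0, the newest release that is not newer than our TensorFlow:

pip install tensorboard_plugin_profile==2.8.0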
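The choice of 2.8.0 can be reproduced mechanically from the version list in pip's error message. This is only a sketch of the heuristic I used (take the newest final release that is not newer than the installed TensorFlow); it is my own rule of thumb that happened to work here, not a documented compatibility rule, and the function names are made up for this post.

```python
# Versions copied from the pip error output for tensorboard_plugin_profile.
AVAILABLE = [
    "2.2.0a1", "2.2.0a2", "2.2.0a3", "2.2.0a4", "2.2.0a5", "2.2.0a6",
    "2.2.0rc0", "2.2.0", "2.3.0rc0", "2.3.0", "2.4.0", "2.5.0", "2.8.0",
    "2.11.1", "2.11.2", "2.13.0", "2.13.1", "2.14.0", "2.15.0", "2.15.1",
]


def parse_release(version):
    """Return (major, minor, patch) for a final release, or None for pre-releases."""
    pieces = version.split(".")
    if len(pieces) != 3 or not all(p.isdigit() for p in pieces):
        return None
    return tuple(int(p) for p in pieces)


def newest_not_after(available, target):
    """Pick the newest final release in `available` that is <= `target`."""
    limit = parse_release(target)
    if limit is None:
        raise ValueError(f"target must look like X.Y.Z, got {target!r}")
    candidates = []
    for version in available:
        release = parse_release(version)
        if release is not None and release <= limit:
            candidates.append((release, version))
    if not candidates:
        return None
    # max() compares the numeric tuples first, so 2.8.0 beats 2.5.0.
    return max(candidates)[1]
```

With the list above, `newest_not_after(AVAILABLE, "2.10.0")` selects the same 2.8.0 release that I ended up installing by hand.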

After recapturing the training trace, TensorBoard displayed the profile information correctly.
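For completeness, here is roughly what the capture step can look like when done programmatically. It is a minimal sketch using `tf.profiler.experimental.start` and `tf.profiler.experimental.stop`; `capture_profile` and `train_step` are hypothetical names standing in for your own training loop, and my actual training script may differ from this.

```python
def capture_profile(train_step, num_steps, logdir):
    """Run `train_step(step)` num_steps times while the profiler records to logdir.

    `train_step` is a placeholder for one iteration of your own training loop.
    """
    # Imported lazily so this helper can be defined on a machine without TensorFlow.
    import tensorflow as tf

    tf.profiler.experimental.start(logdir)
    try:
        for step in range(num_steps):
            train_step(step)
    finally:
        # Always stop the profiler, even if a step raises, so the trace is
        # flushed to disk and the profiler session is released.
        tf.profiler.experimental.stop()
```

Afterwards, point TensorBoard at the same directory with `tensorboard --logdir <logdir>` and open the profile page.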

Missing CUPTI DLL

While capturing the training trace, you will see

CUPTI error: CUPTI could not be loaded or symbol could not be found.

The fix mainly comes down to three things:

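Copy the DLL: copy CUDA\v11.6\extras\CUPTI\lib64\cupti64_2022.1.1.dll into the CUDA\v11.6\bin directory
Rename it: newer CUPTI DLLs follow a different naming scheme, so the copy has to be renamed by hand to the old-style name cupti64_112.dll. Note in particular that although my local CUDA version is already 11.6, tensorflow only recognizes the file name cupti64_112.dll, so use exactly that name
Re-check the environment variables: make sure both the CUDA bin directory and the CUPTI directory are on the PATH environment variable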
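The three steps above can be scripted. The sketch below is one way to do it, under the assumption that your CUDA install has the same layout as mine; `cuda_root` is whatever directory contains `extras` and `bin` on your machine, the default file names are the ones from my setup, and both function names are made up for this post.

```python
import os
import shutil
from pathlib import Path


def install_cupti_alias(cuda_root,
                        source_name="cupti64_2022.1.1.dll",
                        alias_name="cupti64_112.dll"):
    """Copy the CUPTI DLL into <cuda_root>/bin under the name TensorFlow looks for."""
    cuda_root = Path(cuda_root)
    source = cuda_root / "extras" / "CUPTI" / "lib64" / source_name
    target = cuda_root / "bin" / alias_name
    if not source.is_file():
        raise FileNotFoundError(f"CUPTI DLL not found: {source}")
    # Copy rather than move, so the original DLL stays where CUDA installed it.
    shutil.copy2(source, target)
    return target


def missing_from_path(directories, path_value, sep=os.pathsep):
    """Return the directories that do not appear in the given PATH string."""
    entries = {
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(sep)
        if entry
    }
    return [
        str(d) for d in directories
        if os.path.normcase(os.path.normpath(str(d))) not in entries
    ]
```

Pass `os.environ["PATH"]` as `path_value` to check the current shell. An empty list from `missing_from_path` means both directories are already on PATH.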