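AI Center Installation Guide
Last updated Jun 6, 2024
AI Center standalone troubleshooting
This section provides troubleshooting information for AI Center in a standalone environment.
The following sections are specific to AI Center.
Important: The AI Center standalone installation uses the same installer as Automation Suite. Some pages in the AI Center standalone troubleshooting section redirect to the corresponding pages in Automation Suite. In those cases, the steps are identical for both products, and nothing in the procedure is specific to AI Center.
Make sure to follow the procedure that fits your needs.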
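In some cases, when the AI Center installation takes more than an hour (usually during an offline installation), the initial Orchestrator token provided in the input.json file expires, and AI Center fails to register with Identity Server. Follow these steps to recover.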
- Log in to https://alm.<LB DNS> with the admin username. To get the password, run the following command:
kubectl -n argocd get secret argocd-admin-password -o jsonpath='{.data.password}' | base64 -d
- Go to ArgoCD and click the aicenter tile.
- Click App Details, then go to the Manifest tab.
- On the Manifest tab, click Edit.
- Update the accessToken field on the Manifest tab with a new identity token, then click Save.
The sync starts automatically and runs to completion.
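Before editing the manifest, it can help to confirm that the token has actually expired. The following is a minimal sketch, assuming the token is a JWT with a numeric exp claim, which this guide does not state. The function names jwt_exp and jwt_is_expired are hypothetical helpers, not part of any UiPath tooling.

```shell
# Assumption: the token is a JWT (header.payload.signature) whose payload
# carries a numeric "exp" claim in seconds since the Unix epoch.

# Print the exp claim of the JWT passed as $1 (prints nothing if absent).
jwt_exp() {
  # Take the payload segment and convert base64url to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d | sed -n 's/.*"exp":[ ]*\([0-9]*\).*/\1/p'
}

# Return 0 if the token's exp claim is in the past or cannot be read.
jwt_is_expired() {
  exp=$(jwt_exp "$1")
  [ -z "$exp" ] || [ "$exp" -le "$(date +%s)" ]
}

# Example (the token value is a placeholder):
#   jwt_is_expired "$ACCESS_TOKEN" && echo "token expired, request a new one"
```

If the function reports an expired token, continue with the steps above to replace the accessToken value.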
The following error message may appear when installing standalone AI Center:
curl: (92) HTTP/2 stream 0 was not closed cleanly: HTTP_1_1_REQUIRED (err 13)
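If there is a problem with the databases, you can recreate them from scratch directly after installation.
To do so, run SQL commands that drop each database and create it again, as follows: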
USE [master]
ALTER DATABASE [AutomationSuite_AICenter] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [AutomationSuite_AICenter]
CREATE DATABASE [AutomationSuite_AICenter]
GO
This issue may occur during the Fabric installation. The installer may fail with an error similar to the following.
appproject.argoproj.io/fabric created
configmap/argocd-cm configured
[INFO] [2021-09-02T09:21:15+0000]: Checking if ArgoCD password was reset, looking for secrets/argocd-admin-password.
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:16+0000]: Secret not found, trying to log in with initial password...1/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:36+0000]: Secret not found, trying to log in with initial password...2/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:21:56+0000]: Secret not found, trying to log in with initial password...3/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:16+0000]: Secret not found, trying to log in with initial password...4/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:36+0000]: Secret not found, trying to log in with initial password...5/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:22:56+0000]: Secret not found, trying to log in with initial password...6/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:17+0000]: Secret not found, trying to log in with initial password...7/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:37+0000]: Secret not found, trying to log in with initial password...8/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:23:57+0000]: Secret not found, trying to log in with initial password...9/10
FATA[0000] dial tcp: lookup remusr-sf on 168.63.129.16:53: no such host
[INFO] [2021-09-02T09:24:17+0000]: Secret not found, trying to log in with initial password...10/10
[ERROR][2021-09-02T09:24:37+0000]: Failed to log in
Check all required subdomains and make sure they are configured correctly and are routable, as follows:
getent ahosts automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts alm.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts registry.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts monitoring.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
getent ahosts objectstore.automationsuite.mycompany.com | awk '{print $1}' | sort | uniq
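Important: Replace automationsuite.mycompany.com with your cluster FQDN.
If the above commands do not return a routable IP address, the subdomains required by AI Center are not configured correctly.
Note:
This error occurs when the DNS is not public.
You need to add a Private DNS zone (for Azure) or Route 53 (for AWS).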
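The per-subdomain checks can also be run in a single loop. The following is a minimal sketch that reuses the same getent lookup for each required hostname. The function names subdomain_list and check_subdomains are hypothetical, and the prefix list is taken from the commands in this section.

```shell
# Print the hostnames checked in this section for the cluster FQDN in $1.
subdomain_list() {
  fqdn=$1
  printf '%s\n' "$fqdn"
  for prefix in alm registry monitoring objectstore; do
    printf '%s.%s\n' "$prefix" "$fqdn"
  done
}

# Resolve each hostname and flag the ones that return no address.
# Returns non-zero if at least one hostname does not resolve.
check_subdomains() {
  failed=0
  for host in $(subdomain_list "$1"); do
    ips=$(getent ahosts "$host" | awk '{print $1}' | sort -u | tr '\n' ' ')
    if [ -n "$ips" ]; then
      printf '%s -> %s\n' "$host" "$ips"
    else
      printf '%s -> NOT RESOLVED\n' "$host"
      failed=1
    fi
  done
  return "$failed"
}

# Example (replace the FQDN with your own):
#   check_subdomains automationsuite.mycompany.com
```

A hostname that resolves is not necessarily routable from the node, so the returned addresses still need to be compared against the expected load balancer IP.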
If the above commands return the correct IP addresses, follow these steps.
- Delete the ArgoCD namespace by running the following commands:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
kubectl delete namespace argocd
- Run the following command to verify:
kubectl get namespace
The output of this command should not include the ArgoCD namespace.
Note: After deleting the ArgoCD namespace, resume the installation.
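The verification step can be scripted rather than checked by eye. The following is a minimal sketch, assuming the listing is produced with kubectl get namespace -o name, which prints one namespace/<name> entry per line. The function name namespace_absent is hypothetical.

```shell
# Read a `kubectl get namespace -o name` style listing from stdin and
# return 0 only when the namespace named in $1 is NOT present.
namespace_absent() {
  ! grep -qxF "namespace/$1"
}

# Example, run on a node where kubectl is configured as shown above:
#   kubectl get namespace -o name | namespace_absent argocd \
#     && echo "argocd namespace removed, resume the installation"
```

Namespace deletion can take some time to finish, so the check may need to be repeated until it succeeds.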
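For issues related to accessing AI Center, make sure to follow the steps in the following section:
Note: If you are using self-signed certificates, you also need to visit the https://objectstore.${CONFIG_CLUSTER_FQDN} URL once in each browser you intend to use, so that the browser can interact with the storage.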