|
- 2016
Data Collection Method for Failure Prediction on Exascale Supercomputers
Keywords: supercomputer; failure prediction; data collection method; data aggregation
Abstract: Targeting future exascale supercomputers, this paper proposes FPDC, a failure prediction data collection framework that comprehensively collects state data related to compute node failures. An adaptive multi-layer grouped data aggregation method addresses the problem of aggregation overhead growing excessively as system scale increases. Implementation and testing on the TH-1A supercomputer show that the framework has low overhead and good scalability, and can meet the data collection needs of failure prediction on future large-scale systems.
|