Abstract:
Log data records rich information and is of high practical value, but in today's big-data era the sharp growth in data volume poses challenges for log processing. To effectively address the bottlenecks of massive log data processing, this paper integrates the Hadoop and Storm distributed frameworks to build a distributed real-time log processing system that combines real-time and offline computation. The system architecture consists of a data service layer, a business logic layer, and a Web presentation layer. In the data service layer, Flume collects log data in real time, while Kafka buffers the real-time log stream and HBase provides persistent storage of system data. In the business logic layer, Storm performs real-time analysis of the log stream, and Hadoop's MapReduce computing engine, combined with data mining techniques, performs offline analysis of massive historical log data; the offline results in turn support and inform the real-time analysis. The Web presentation layer displays the log data and the analysis results. Experimental results show that the system effectively handles log data collection and storage, real-time analysis of streaming log data, and offline analysis of historical log data, successfully combining the respective strengths of Hadoop and Storm and providing a new technical reference for building log data collection and analysis systems.