Real-Time Computing at Praise: A Journey from Storm to Flink
Introduction
Praise, a business services company, has been practicing real-time computing for five years. In this article, we share our journey from Storm to Flink: how real-time computing at Praise developed from the initial stage to the current platform stage, the challenges we faced along the way, and the solutions we implemented.
The Initial Stage (2014-2017)
In the initial stage, we had no overall plan for real-time computing and no tools for task management, monitoring, or alerting. Users submitted jobs to the cluster directly from the command line, which made it difficult to meet availability requirements. We did, however, accumulate a large number of internal real-time computing scenarios during this period.
2.1.1 Storm Debut
In early 2014, we first applied Storm internally, which let us decouple real-time processing from business logic. Storm applications listened to MySQL binlog events, performed the real-time computation, and wrote the results to Redis caches or online systems. Other business teams recognized the same pattern in their own work, and we gradually began to support a large number of business scenarios.
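The pattern described above can be sketched as follows: a listener consumes MySQL binlog change events and keeps a cache in sync with derived results. This is a minimal illustration only; `BinlogEvent`, `CacheStore`, and `handle_event` are hypothetical names, and in the real system a Storm topology consumed the binlog and wrote to Redis.

```python
from dataclasses import dataclass

@dataclass
class BinlogEvent:
    table: str  # source table the change came from
    op: str     # "insert", "update", or "delete"
    row: dict   # row image after the change

class CacheStore:
    """Stands in for Redis: keeps the latest computed value per key."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)

def handle_event(event: BinlogEvent, cache: CacheStore) -> None:
    # The online system only writes MySQL; this listener derives the
    # cached view from the change stream, decoupled from business code.
    key = f"{event.table}:{event.row['id']}"
    if event.op in ("insert", "update"):
        cache.set(key, event.row)
    elif event.op == "delete":
        cache.delete(key)
```

Because the listener sits outside the write path, adding a new cached view never requires touching the online business code.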
2.1.2 Introduced Spark Streaming
At the end of 2016, we introduced Spark Streaming for its higher throughput and performance compared to Storm. We first used it to ingest business system logs and tracking (buried-point) logs, and business teams also began to adopt Spark Streaming. However, operations and maintenance remained difficult, mainly in business management, resource management, and alert monitoring.
2.1.3 Summary
Our initial architecture lacked business management, resource management, and alert monitoring tools, which led to availability problems and inefficient development. We had accumulated a large number of internal real-time computing scenarios, but without a unified real-time computing platform it was difficult to manage real-time computing end to end.
The Platform Stage (2018-Present)
In response to the challenges of the initial stage, we built a real-time computing platform covering business management, resource management, and alert monitoring. We started the project in 2018, first supporting Spark Streaming tasks and migrating all existing Spark Streaming tasks onto the new platform.
2.2.1 Build Real-Time Computing Platform
The platform provides the following components:
- Business management: recording task metadata and associating each task with its owner and business line through dedicated interfaces.
- Task-level monitoring: automatically restarting failed tasks and alerting based on user-defined delay/throughput indicators and overall traffic trends.
- Cluster planning: building a dedicated YARN cluster for real-time applications so that real-time and offline tasks do not affect each other.
- Cluster failover: ensuring that when a cluster fails, tasks can easily be migrated to another cluster.
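The task-level monitoring component above can be sketched as a simple control loop: tasks report their processing delay, and when a task breaches its user-defined threshold the monitor raises an alert and pulls the task back up. `TaskMonitor` and its method names are illustrative, not the platform's actual API.

```python
class TaskMonitor:
    def __init__(self):
        self.thresholds = {}  # task name -> max allowed delay in seconds
        self.restarts = []    # tasks restarted ("pulled up") by the monitor
        self.alerts = []      # alert messages for on-call engineers

    def register(self, task: str, max_delay_s: float) -> None:
        """Register a task with its user-defined delay threshold."""
        self.thresholds[task] = max_delay_s

    def report_delay(self, task: str, delay_s: float) -> None:
        """Called with each task's observed processing delay."""
        limit = self.thresholds.get(task)
        if limit is not None and delay_s > limit:
            self.alerts.append(f"{task}: delay {delay_s}s exceeds {limit}s")
            self.restart(task)

    def restart(self, task: str) -> None:
        # In the real platform this would resubmit the job to YARN.
        self.restarts.append(task)
```

In production the same loop would also consume throughput indicators and dashboard traffic trends, but the threshold-check-then-restart shape stays the same.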
2.2.2 New Challenge
After completing the above components, we turned to a new challenge: the efficiency of business-side development. Learning the real-time computing framework SDK took users about half a day, applying for resources for a real-time task took a few hours, and developing and testing a task took about 1-3 days. Code review and testing also surfaced a variety of recurring issues.
2.2.2.1 Real-Time Task of SQL
To address the challenges faced in real-time task development, we decided to introduce real-time SQL. We planned to complete the following features:
- Kafka-based stream sources for real-time SQL tasks
- An HBase-based sink for writing SQL task results
- Support for UDFs (user-defined functions)
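The data flow of such a SQL task can be sketched as: read records from a Kafka topic, apply a user-defined function, and upsert results by rowkey into an HBase-style sink. The mock source and sink and the example UDF below are hypothetical; the real platform compiles SQL into streaming jobs.

```python
def run_sql_task(source_records, udf, sink):
    """source_records stands in for a Kafka topic; sink for an HBase table."""
    for record in source_records:
        rowkey, value = udf(record)
        sink[rowkey] = value  # HBase-style upsert keyed by rowkey

# Example UDF: bucket an order's amount into a coarse tier.
def amount_tier(record):
    tier = "high" if record["amount"] >= 100 else "low"
    return record["user_id"], {"tier": tier}
```

Expressing the same pipeline as SQL plus a registered UDF removes the need for each business team to learn the framework SDK, which is the efficiency gain the section above describes.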
2.2.2.2 Introduction of Real-Time OLAP Engine
We observed that many real-time business applications needed UV- and PV-style statistics across different dimensions, with relatively fixed query patterns. We therefore decided to introduce a real-time OLAP engine supporting both updates and queries. After studying Kudu and Druid, we chose Druid for its lower operation and maintenance cost, better integration with our current technology stack, query performance, and coverage of our follow-up scenarios.
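The fixed query pattern mentioned above reduces to counting per dimension value: PV (page views) is a plain event count, and UV (unique visitors) is a distinct count of users. A minimal sketch, with illustrative field names, of what the OLAP engine computes for such queries:

```python
from collections import defaultdict

def uv_pv_by_dimension(events, dimension):
    """Compute PV (event count) and UV (distinct users) per dimension value."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for e in events:
        key = e[dimension]
        pv[key] += 1                      # every event counts toward PV
        visitors[key].add(e["user_id"])   # distinct users count toward UV
    uv = {k: len(v) for k, v in visitors.items()}
    return dict(pv), uv
```

In Druid this shape maps naturally onto rollup at ingestion time plus timeseries/topN queries, which is why a fixed pattern made a dedicated OLAP engine attractive.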
Conclusion
Our journey from Storm to Flink has been challenging but rewarding. We addressed the problems of the initial stage by building a real-time computing platform covering business management, resource management, and alert monitoring, and we introduced real-time SQL and an OLAP engine to simplify development and improve efficiency. Going forward, we plan to extend SQL tasks to cover more business scenarios (targeting 70% coverage) and to keep raising development efficiency on the business side.
Future Plan
- Extend SQL tasks to cover more business scenarios (targeting 70% coverage)
- Reuse stream data across tasks, the highest-ROI measure for improving efficiency
- Begin building a real-time data warehouse
About the Author
HE Fei joined the Praise Big Data team in July 2017 and has been responsible for landing the real-time computing platform and for HBase-based storage components.