A Conversation with Cheng Li Yafeng: Unlocking the Potential of Spark in Big Data Fusion

As the IT industry continues to grapple with the complexities of big data, two processing tools have emerged as frontrunners: Hadoop and Spark. While Hadoop has long been a stalwart in the field, Spark's innovative memory-based computing model has garnered significant attention. In this interview, Cheng Li Yafeng, senior manager of Ctrip's big data platform, shares his insights on the complementary strengths of Spark and Hadoop, and offers guidance on identifying users' real needs when choosing between them.

A Brief Background

Cheng Li Yafeng has been working in the IT and Internet field since 2002, with a focus on web conferencing, IPTV, security gateways, game architecture, search engines, and recommendation engines. After joining Ctrip, he shifted his focus to big data, where he currently oversees the operation and development of the company's underlying data infrastructure platform.

Ctrip’s Big Data Journey

At present, Ctrip's big data platform comprises a 200-node cluster, holding 3PB of data and executing over 30,000 jobs daily. The platform supports various business lines, including large-scale log and metrics processing, recommendation engines, web crawlers, user behavior log analysis, BI reporting, risk control, search engines, machine learning, and monitoring and alerting.

The DI Team

With a team of six members, including Li Yafeng, the DI team is responsible for supporting business lines of widely varying data sizes. To improve efficiency, the team employs a DevOps approach, in which team members not only develop the platform but also handle its operations and maintenance. This demands deep expertise and a strong understanding of the underlying technology.

The Role of Spark

Spark's biggest advantage lies in its speed, achieved by keeping intermediate data in memory rather than writing it to disk between stages. However, this speed comes at the cost of higher memory and resource consumption. Li Yafeng emphasizes that Spark and Hadoop should be viewed as complementary technologies, rather than mutually exclusive ones.

The Future of Hadoop and Spark

Li Yafeng believes that Hadoop will not be replaced by Spark; rather, the two will coexist as complementary technologies. Spark's memory-based computing approach will continue to evolve, and how widely it is adopted will depend on the specific needs of users.

The Challenges of Big Data

As data volumes continue to grow, the challenge of analyzing that data and extracting value from it becomes increasingly complex. Li Yafeng notes that data will grow far faster than the business itself, making it essential to develop new technologies and approaches for turning data into value.

Quantitative Management and Big Data

Li Yafeng emphasizes the importance of quantitative and digital management, both of which rely on the collection and analysis of data. Data, he notes, is fact, and decisions should be grounded in data-driven insights rather than intuition.

Technology Options

In addition to HDFS, HBase, MapReduce, Hive, Spark, and Storm, Li Yafeng mentions Presto, a newer SQL query engine open-sourced by Facebook, as well as emerging Google technologies such as Caffeine, Pregel, and Dremel. He believes that technology choices should be driven by the specific needs of the business, rather than by head-to-head comparisons between individual technologies.

Conclusion

Li Yafeng's insights offer valuable guidance on the complementary strengths of Spark and Hadoop, as well as the challenges and opportunities of big data. As the industry continues to evolve, it is clear that Spark and Hadoop will coexist as complementary technologies, each with its own strengths and weaknesses. By understanding users' needs and the specific challenges their data poses, businesses can unlock the potential of these technologies and create real value from their data.