Master-Slave Delay in MongoDB: A Real-World Investigation
Introduction
In this article, we will delve into a real-world scenario where a customer experienced a master-slave delay in their MongoDB cluster, causing significant business disruptions. We will guide you through the investigative process, highlighting the challenges we faced and the steps we took to resolve the issue. Our goal is to provide a comprehensive understanding of the root cause and offer practical advice on how to avoid similar pitfalls in the future.
Background
In early September, one of our customers reported an issue with their private cloud Teambition system. The customer had built a wrapper layer for their internal business on top of the system, and this wrapper was seeing inconsistent data between two consecutive interface calls (referred to below as interface A and interface B). As a result, the customer's upper-layer wrapper business stopped working properly, while the Teambition system itself remained unaffected. After receiving the feedback, we walked through the wrapper's business logic together with the customer.
Temporary Coping Mechanism
Because the upper-layer business was highly important, our primary goal was to restore service availability. We considered several options, including:
- Removing the SecondaryPreferred read preference to ensure normal calls to interface A.
- Adjusting the logic of interface B to add a layer of protection between the upper-layer caller and interfaces A and B.
- Modifying the customer’s upper-layer wrapper service.
We chose the first option, removing SecondaryPreferred, as the contingency plan. This decision was based on the fact that MongoDB is not a strictly read/write-separated system, and the load on the cluster is not completely isolated between members. With SecondaryPreferred removed, the requests for interface B were sent to the Primary.
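As a rough illustration of this workaround, the sketch below removes the `readPreference` option from a MongoDB connection string, so that reads fall back to the default of primary. It assumes the read preference was configured in the connection string, which the article does not state; the hosts, database name, and replica set name are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_read_preference(uri: str) -> str:
    """Return the URI without a readPreference option (reads then default to primary)."""
    parts = urlsplit(uri)
    # Connection-string option names are case-insensitive, so compare in lowercase.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != "readpreference"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical connection string; hosts and names are placeholders.
before = (
    "mongodb://db1.example:27017,db2.example:27017/app"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)
after = drop_read_preference(before)
print(after)
```

If the read preference is set in application code or per query instead, the same change would be made there rather than in the URI.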
Investigation and Root Cause Analysis
After we removed SecondaryPreferred, the upper-layer business returned to normal. We also pulled the monitoring data from the customer’s MongoDB cluster. Reviewing it, we observed the following:
- Traffic was stable, with no surge
- QPS was in the low range
- Read and write connection counts were stable
- Resident memory and eviction showed no sharp spikes
- Cache activity was stable, with no spikes
- Master-slave delay was unstable, sometimes soaring to 5s
- Slow queries existed on local.oplog.rs
We then retraced the failure, starting with the most obvious symptom: the master-slave delay. Both the output of db.printSlaveReplicationInfo() and the monitoring charts showed that replication lag in the customer's cluster was unstable and soared to 5s from time to time.
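To make the lag measurement concrete, here is a minimal sketch that computes per-secondary lag from a document shaped like the output of the `replSetGetStatus` command, comparing each secondary's `optimeDate` with the primary's. The sample data is made up for illustration and is not the customer's; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def secondary_lag_seconds(status: dict) -> dict:
    """Map each secondary's name to how many seconds it trails the primary."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Made-up sample in the shape of replSetGetStatus output.
now = datetime(2000, 1, 1, 12, 0, 0)
sample = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=5)},
        {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": now},
    ]
}
lags = secondary_lag_seconds(sample)
print(lags)
```

Sampling this value periodically, rather than once, is what reveals the kind of intermittent spikes described above.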
Analysis of the Slow Query
We focused on the slow queries against local.oplog.rs and found that their key characteristics were as follows:
- Type: getmore
- The number in the statement identifies the cursor
We determined that this type of statement does not belong to a normal business scenario and was not triggered by interface A or B. We then reviewed the whole service system for any scenario that needs to tail the oplog and would therefore query oplog.rs, and traced the queries to an application that synchronizes data from MongoDB to Elasticsearch.
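The filtering step described above can be sketched as follows. The sketch assumes the slow operations are available as documents shaped like `system.profile` entries (with `op`, `ns`, `millis`, and `cursorid` fields); the article does not say which tool the team actually used, and the sample entries and the `orders` collection name are invented.

```python
from collections import defaultdict


def oplog_getmore_by_cursor(entries, min_millis=100):
    """Group slow getmore operations on local.oplog.rs by cursor ID."""
    grouped = defaultdict(list)
    for entry in entries:
        if (
            entry.get("op") == "getmore"
            and entry.get("ns") == "local.oplog.rs"
            and entry.get("millis", 0) >= min_millis
        ):
            grouped[entry.get("cursorid")].append(entry["millis"])
    return dict(grouped)


# Invented sample entries in the shape of profiler documents.
profile_entries = [
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 850, "cursorid": 111},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 920, "cursorid": 222},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 40, "cursorid": 111},
    {"op": "query", "ns": "app.orders", "millis": 300},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 1300, "cursorid": 111},
]
slow = oplog_getmore_by_cursor(profile_entries)
print(slow)
```

Many distinct cursor IDs all issuing slow getmore operations against the oplog would point at a tailing client rather than at ordinary business queries.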
Verification and Summary
Having reached this preliminary conclusion, we set out to verify it. We verified that the cursors consumed a large amount of additional resources and put pressure on the WiredTiger cache and RAM, which in turn caused the master-slave delay. We also confirmed that repeated restarts of the service caused cursors to accumulate on the primary.
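One way to watch for this kind of accumulation is to compare the open-cursor counters that `serverStatus` reports under `metrics.cursor` over time. The sketch below is an illustration under that assumption, not the verification procedure the team used; the sample documents are invented and the helper names are hypothetical.

```python
def cursor_summary(server_status: dict) -> dict:
    """Extract the open-cursor counters from a serverStatus-shaped document."""
    cursor = server_status["metrics"]["cursor"]
    return {
        "total": cursor["open"]["total"],
        "no_timeout": cursor["open"]["noTimeout"],
        "pinned": cursor["open"]["pinned"],
        "timed_out": cursor["timedOut"],
    }


def no_timeout_growth(samples) -> int:
    """Change in open noTimeout cursors between the first and last sample."""
    summaries = [cursor_summary(s) for s in samples]
    return summaries[-1]["no_timeout"] - summaries[0]["no_timeout"]


def make_sample(total, no_timeout):
    # Invented helper that builds a minimal serverStatus-shaped document.
    return {"metrics": {"cursor": {
        "open": {"total": total, "noTimeout": no_timeout, "pinned": 0},
        "timedOut": 0,
    }}}


# Invented samples, imagined as taken before and after several service restarts.
samples = [make_sample(12, 2), make_sample(19, 9), make_sample(31, 21)]
growth = no_timeout_growth(samples)
print(growth)
```

A noTimeout count that keeps rising while the number of client processes stays constant suggests cursors are being abandoned rather than closed.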
Conclusion
In this article, we have demonstrated the importance of proper use of the cursor in MongoDB. We have highlighted the consequences of improper use of the cursor, including cursor accumulation, memory consumption, and delayed master-slave replication. We have also provided practical advice on how to avoid similar pitfalls in the future, including:
- Avoiding noCursorTimeout when using cursors
- Setting a reasonable timeout for the cursor
- Controlling the application's retry mechanism after a timeout
- Monitoring the cluster environment
- Avoiding direct operations on collections in the local database
- Using change streams for needs similar to tailing the oplog
- Paying close attention to monitoring indicators at all times
By following these best practices, you can avoid the master-slave delay in MongoDB and ensure the smooth operation of your database.
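The advice on cursor timeouts and controlled retries can be sketched as a bounded retry loop that always closes the cursor before trying again. This is a generic pattern under stated assumptions, not code from the affected service: `open_cursor` and `consume` are hypothetical callables, `FakeCursor` exists only for the demonstration, and `TimeoutError` stands in for whatever timeout exception the driver in use raises.

```python
import time


def read_with_retry(open_cursor, consume, max_attempts=3, base_delay=0.5,
                    sleep=time.sleep):
    """Consume a cursor with bounded retries, closing the cursor on every path."""
    for attempt in range(1, max_attempts + 1):
        cursor = open_cursor()
        try:
            return consume(cursor)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
        finally:
            cursor.close()  # release the server-side cursor before any retry
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


class FakeCursor:
    """Stand-in cursor that only counts how often it is opened and closed."""
    opened = 0
    closed = 0

    def __init__(self):
        FakeCursor.opened += 1

    def close(self):
        FakeCursor.closed += 1


calls = {"n": 0}


def flaky_consume(cursor):
    # Simulated consumer: times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []
result = read_with_retry(FakeCursor, flaky_consume, sleep=delays.append)
print(result, delays, FakeCursor.opened, FakeCursor.closed)
```

The point of the design is that every opened cursor is matched by a close, so retries and restarts cannot leave cursors behind on the server.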