Master-Slave Delay in MongoDB: A Real-World Investigation

heera · November 15, 2025, 7:35am

Introduction

In this article, we will delve into a real-world scenario where a customer experienced a master-slave delay in their MongoDB cluster, causing significant business disruptions. We will guide you through the investigative process, highlighting the challenges we faced and the steps we took to resolve the issue. Our goal is to provide a comprehensive understanding of the root cause and offer practical advice on how to avoid similar pitfalls in the future.

Background

In early September, one of our customers reported a problem with their private cloud Teambition deployment. The customer had wrapped our interfaces in their own internal business layer, and that layer was seeing inconsistent data between two consecutive interface calls. As a result, the customer's upper-layer business stopped working properly, while the Teambition system itself remained unaffected. After receiving the report, we walked through the wrapper's business logic together with the customer.

Temporary Coping Mechanism

Because the customer's upper-layer business was critical, our first priority was to restore service availability. We considered several options:

Option 1: Removing the SecondaryPreferred read preference to ensure normal calls to interface A.
Option 2: Adjusting the logic of interface B to add a layer of protection between the upper-layer caller and interfaces A and B.
Option 3: Modifying the customer's upper-layer wrapper service.

We chose Option 1, removing SecondaryPreferred, as the contingency plan. The decision rested on the fact that a MongoDB replica set is not a strict read/write-splitting system: load on the cluster is never completely isolated between members. With SecondaryPreferred removed, requests from interface B were sent to the Primary.
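As a rough illustration of what removing SecondaryPreferred can look like at the connection-string level, the sketch below rewrites the standard readPreference URI option to primary. The helper name with_read_preference, the host names, the database name, and the replica set name are assumptions for illustration only; the original investigation does not say how the read preference was actually configured.

```python
from urllib.parse import parse_qsl, urlencode


def with_read_preference(uri: str, mode: str) -> str:
    """Return `uri` with its readPreference option set to `mode`."""
    base, _, query = uri.partition("?")
    # Option names in MongoDB connection strings are case-insensitive,
    # so drop any existing readPreference regardless of its casing.
    options = [(k, v) for k, v in parse_qsl(query) if k.lower() != "readpreference"]
    options.append(("readPreference", mode))
    return base + "?" + urlencode(options)


# Hypothetical connection string, not taken from the customer's environment.
old_uri = (
    "mongodb://node1:27017,node2:27017/teambition"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)
new_uri = with_read_preference(old_uri, "primary")
print(new_uri)
```

With primary as the mode, every read goes to the Primary, which trades some read capacity for reads that always see the latest acknowledged writes.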

Investigation and Root Cause Analysis

After removing SecondaryPreferred, the upper-layer business returned to normal. We also pulled the monitoring data from the customer's MongoDB cluster. Upon reviewing it, we observed the following:

Network traffic stable, with no surge
QPS in the low range
Read and write connections stable
No sharp spike in resident memory or eviction
Cache activity stable, with no spike
Master-slave delay unstable, sometimes soaring to 5s
Slow queries present on local.oplog.rs

We then retraced the failure, starting with the most obvious symptom: the master-slave delay. Both the output of db.printSlaveReplicationInfo() (renamed db.printSecondaryReplicationInfo() in newer shells) and the monitoring charts showed that the replication delay on the customer's secondaries was unstable, soaring to 5s from time to time.
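The delay itself is just the gap between the last operation applied on the Primary and on each secondary. The sketch below computes it from a dictionary shaped like the output of the replSetGetStatus command (members, stateStr, and optimeDate are fields of that output). The function name secondary_lag_seconds and all sample values are made up for illustration.

```python
from datetime import datetime, timedelta


def secondary_lag_seconds(status):
    """Map each SECONDARY member name to its lag behind the PRIMARY, in seconds."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Made-up snapshot: one secondary is 5 seconds behind, the other is caught up.
now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "node1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=5)},
        {"name": "node3:27017", "stateStr": "SECONDARY", "optimeDate": now},
    ]
}
lag = secondary_lag_seconds(sample_status)
print(lag)
```

Sampling this value on a schedule, rather than once, is what makes an intermittent spike like the one described here visible.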

Analysis of the Slow Query

We focused on the slow queries against local.oplog.rs and found that they shared the following key characteristics:

Type: getmore
Carries the numeric ID of an existing cursor

We determined that statements of this kind do not come from a normal business scenario and were not triggered by interface A or B. Reviewing the whole service system for anything that needs to trace the oplog, and therefore query oplog.rs, we traced the queries to an application that synchronizes data from MongoDB to Elasticsearch.
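One way to narrow down the source of such statements is to group the slow getmore operations on local.oplog.rs by the client application that issued them. The sketch below assumes entries shaped like database profiler documents (op, ns, millis, appName, and cursorid are profiler fields). The function name, the 100 ms threshold, and the sample entries, including the application name es-sync, are hypothetical.

```python
from collections import defaultdict


def oplog_getmores_by_app(entries, min_millis=100):
    """Count distinct cursors behind slow getmore operations on local.oplog.rs, per application."""
    cursors = defaultdict(set)
    for entry in entries:
        if (
            entry.get("op") == "getmore"
            and entry.get("ns") == "local.oplog.rs"
            and entry.get("millis", 0) >= min_millis
        ):
            cursors[entry.get("appName", "unknown")].add(entry.get("cursorid"))
    return {app: len(ids) for app, ids in cursors.items()}


# Made-up profiler-style entries for illustration.
sample_entries = [
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 450, "appName": "es-sync", "cursorid": 101},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 520, "appName": "es-sync", "cursorid": 102},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 480, "appName": "es-sync", "cursorid": 101},
    {"op": "query", "ns": "teambition.tasks", "millis": 300, "appName": "api", "cursorid": 7},
    {"op": "getmore", "ns": "local.oplog.rs", "millis": 20, "appName": "es-sync", "cursorid": 103},
]
by_app = oplog_getmores_by_app(sample_entries)
print(by_app)
```

A single application holding many distinct oplog cursors is the pattern that points away from ordinary business traffic.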

Verification and Summary

With this preliminary conclusion in hand, we set out to verify it. We verified that the cursors consumed substantial extra resources and put the WiredTiger (wt) cache and RAM under pressure, which in turn delayed replication from the Primary to the secondaries. We also confirmed that repeated restarts of the service caused cursors to accumulate on the Primary.
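Cursor accumulation of this kind can be checked by comparing two serverStatus snapshots taken before and after a restart of the suspect service. The sketch below reads the metrics.cursor.open counters from dictionaries shaped like serverStatus output. The function name and the numbers are made up for illustration and are not the customer's actual figures.

```python
def open_cursor_growth(before, after):
    """Return how much the open-cursor counters grew between two serverStatus snapshots."""
    b = before["metrics"]["cursor"]["open"]
    a = after["metrics"]["cursor"]["open"]
    return {
        "noTimeout": a["noTimeout"] - b["noTimeout"],
        "total": a["total"] - b["total"],
    }


# Made-up snapshots: four extra noTimeout cursors remain after a service restart.
snapshot_before = {"metrics": {"cursor": {"open": {"noTimeout": 2, "pinned": 0, "total": 5}}}}
snapshot_after = {"metrics": {"cursor": {"open": {"noTimeout": 6, "pinned": 0, "total": 9}}}}
growth = open_cursor_growth(snapshot_before, snapshot_after)
print(growth)
```

If noTimeout keeps growing with every restart and never falls back, the old cursors are being abandoned rather than closed.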

Conclusion

In this article, we have demonstrated the importance of proper use of the cursor in MongoDB. We have highlighted the consequences of improper use of the cursor, including cursor accumulation, memory consumption, and delayed master-slave replication. We have also provided practical advice on how to avoid similar pitfalls in the future, including:

Avoiding cursors opened with noCursorTimeout
Setting a reasonable timeout for the cursor
Having the application control its retry mechanism after a cursor times out
Monitoring the cluster environment
Avoiding direct operations on collections in the local database
Using change streams for needs that would otherwise require tracing the oplog
Paying close attention to monitoring indicators at all times
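The timeout and retry items above can be combined: let cursors expire normally, and have the application reopen from the last position it processed, with a bounded number of attempts. The sketch below is driver-agnostic and runs without a database. CursorExpired, consume_with_retry, and make_source are hypothetical names; in a real application CursorExpired would be replaced by whatever error the driver raises for an expired cursor.

```python
import time


class CursorExpired(Exception):
    """Hypothetical stand-in for the driver error raised when a cursor has timed out."""


def consume_with_retry(open_cursor, handle, max_retries=3, backoff_s=0.0):
    """Process documents, reopening the cursor after a timeout from the last seen position.

    open_cursor(resume_after) must return an iterator of (position, document)
    pairs located strictly after `resume_after` (None means from the start).
    """
    last_pos = None
    retries = 0
    while True:
        try:
            for pos, doc in open_cursor(last_pos):
                handle(doc)
                last_pos = pos
                retries = 0  # progress was made, so reset the failure count
            return last_pos
        except CursorExpired:
            retries += 1
            if retries > max_retries:
                raise  # give up instead of reopening cursors forever
            time.sleep(backoff_s * retries)


def make_source(docs, fail_at):
    """Build a fake cursor factory that expires once at each index in `fail_at`."""
    failures = set(fail_at)

    def open_cursor(resume_after):
        start = 0 if resume_after is None else resume_after + 1
        for i in range(start, len(docs)):
            if i in failures:
                failures.discard(i)
                raise CursorExpired()
            yield i, docs[i]

    return open_cursor


seen = []
last = consume_with_retry(make_source(["a", "b", "c", "d"], fail_at=[2]), seen.append)
print(seen, last)
```

Because each reopened cursor is short-lived and the retry count is bounded, a restarting or misbehaving consumer cannot leave an ever-growing set of cursors on the Primary.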

By following these best practices, you can avoid the master-slave delay in MongoDB and ensure the smooth operation of your database.