Interview question: if you need to create 100,000 (10w) rows of test data, how do you insert them into the database without any duplicates?
SQL-related questions come up often in recent interviews. Creating test data in a database is also a common task in everyday work: both stress testing and performance testing need test data prepared in the database first. So how do you generate a large amount of test data in batches?
Since I use Python often, my first thought was to use Python to generate the SQL, and then execute the SQL to insert the data into the database.
Language: python 3.6
First, here is the INSERT SQL statement; each id must be unique. The following executes a single insert statement:
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('1', '', 'test123', '2019-12-17');
100,000 statements are too many and the execution would take too long, so first use Python to generate 10,000 and measure the execution time.
1. We need to generate many INSERT statements. Here I use Python to write a short script that generates the SQL as text.
- Use %s as a placeholder for the field value that needs to change; if several values need to change, use one %s for each of them. With the table I designed here, an insert succeeds as long as the id is different
- Use a for loop and add 1 to the id on each iteration so that ids never repeat; otherwise duplicate ids will fail to be written when inserting into the database
- Open the file in mode "a" to append to it
- End each SQL statement with a semicolon
- Append a \n line break after each statement that is written
# python3
# Author: Shanghai-yo
for i in range(10000):
    a = "INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('%s','','test123', '2019-12-17');" % str(i + 1)
    with open("a.txt", "a") as fp:
        fp.write(a + "\n")
Run the Python code and it generates an a.txt file locally; open it and part of the generated data looks like this:
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('1','','test123', '2019-12-17');
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('2','','test123', '2019-12-17');
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('3','','test123', '2019-12-17');
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('4','','test123', '2019-12-17');
......
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('10000','','test123', '2019-12-17');
If the id is a mobile phone number, how do you generate 100,000 different phone numbers?
They can be generated from a number range that starts with the first 3 digits of a mobile number, for example numbers beginning with 186: start from the initial value 18600000000 and add 1 each time, up to 18600099999. The range 18600000000-18600099999 covers exactly 100,000 phone numbers.
After changing id to mobile phone number, modify the code as follows
# python3
# Author: Shanghai-yo
for i in range(10000):
    a = "INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('%s','','test123', '2019-12-17');" % str(i + 18600000000)
    with open("a.txt", "a") as fp:
        fp.write(a + "\n")
Based on the previous script, just change str(i + 1) to str(i + 18600000000) to generate phone numbers:
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('18600000000','','test123', '2019-12-17');
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('18600000001','','test123', '2019-12-17');
INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('18600000002','','test123', '2019-12-17');
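Sequential numbers are the simplest way to guarantee uniqueness. If you also want the test numbers to look less regular, one alternative (my own sketch, not part of the original script) is random.sample over the same range, which still guarantees that no number repeats. The rest of the article sticks with the sequential version.

# python3
import random

# Draw 10,000 distinct numbers from the 186 range; random.sample never
# returns duplicates, so every generated phone number is unique.
phone_numbers = random.sample(range(18600000000, 18600100000), 10000)

with open("a.txt", "a") as fp:
    for phone in phone_numbers:
        a = "INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) VALUES ('%s','','test123', '2019-12-17');" % phone
        fp.write(a + "\n")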
Copy the generated text and paste the multiple INSERT INTO statements into the Navicat client to execute them all at once.
It took about 5 minutes to finish executing, which means 100,000 rows would take about 50 minutes. That is too slow; with even more data it would take far too long, which is not the effect we want!
Since executing statements one at a time takes too long, the next optimization is to insert the data in batches with a single statement: write only one INSERT INTO and send all the rows to the database at once, which is much faster.
The idea is to splice the SQL yourself: insert into table values (),(),(),() inserts all the rows in one go.
Such a batch either succeeds completely or not at all: if there is a syntax problem in the spliced SQL, nothing is written.
Note:
- When splicing the SQL, separate the value groups with English (half-width) commas
- The values must correspond one-to-one with the fields of the table
- Make sure the last row does not end with a comma; change it to a semicolon
# python3
# Author: Shanghai-yo
insert_sql = "INSERT INTO `apps`.`apiapp_card` VALUES "
with open("b.txt", "a") as fp:
    fp.write(insert_sql + "\n")

for i in range(10000):
    a = "('%s','','test123', '2019-12-17')," % str(i + 10001)
    with open("b.txt", "a") as fp:
        fp.write(a + "\n")
After it finishes, copy the content of the b.txt file. Note that the trailing comma after the last row must be changed to a semicolon, otherwise a syntax error is reported.
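If you don't want to fix the trailing comma by hand every time, here is a minimal sketch (assuming b.txt was generated by the script above) that rewrites the file so the last row ends with a semicolon:

# python3
# Minimal sketch: replace the trailing comma of the last row with a semicolon.
# Assumes b.txt was produced by the generation script above.
with open("b.txt", "r") as fp:
    content = fp.read().rstrip()

if content.endswith(","):
    content = content[:-1] + ";"

with open("b.txt", "w") as fp:
    fp.write(content + "\n")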
Part of the data content is shown below
INSERT INTO `apps`.`apiapp_card` VALUES
('10001','','test123', '2019-12-17'),
('10002','','test123', '2019-12-17'),
......
('20000','','test123', '2019-12-17');
Copy the generated INSERT INTO to the Navicat client for execution
After it finished, the test result showed that the 10,000 rows took only 0.217 seconds, a clear speed improvement.
Next, how long does it take for 100,000 rows?
# python3
# Author: Shanghai-yo
insert_sql = "INSERT INTO `apps`.`apiapp_card` VALUES "
with open("b.txt", "a") as fp:
    fp.write(insert_sql + "\n")

for i in range(100000):
    a = "('%s','','test123', '2019-12-17')," % str(i + 100000)
    with open("b.txt", "a") as fp:
        fp.write(a + "\n")
The data generated after execution using the python script is as follows
INSERT INTO `apps`.`apiapp_card` VALUES
('100000','','test123', '2019-12-17'),
('100001','','test123', '2019-12-17'),
......
('199999','','test123', '2019-12-17');
Inserting this into MySQL directly now raises an error: Err 1153 - Got a packet bigger than 'max_allowed_packet' bytes
Reason for the error: MySQL limits the size of a single packet it will accept, and the SQL string spliced from 100,000 rows exceeds the range allowed by max_allowed_packet.
Solution: increase the max_allowed_packet value of the MySQL server.
First, run the following command in Navicat to check the current maximum allowed packet size, max_allowed_packet:
show global variables like 'max_allowed_packet';
You can see the value is 4194304 bytes, i.e. a limit of about 4 M; the SQL string we spliced for 100,000 rows is larger than this allowed range, which is why it is rejected.
This value cannot be modified directly in the Navicat client; you need to log in to MySQL and change it from the command line.
My MySQL here runs in Docker, so I first need to enter the container and then log in to MySQL.
The operation steps are as follows:
[root@VM_0_2_centos ~]# docker exec -it 934b30a6dc36 /bin/bash
root@934b30a6dc36:/# mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 303822
Server version: 5.7.27 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show global variables like 'max_allowed_packet';
+--------------------+---------+
| Variable_name      | Value   |
+--------------------+---------+
| max_allowed_packet | 4194304 |
+--------------------+---------+
1 row in set (0.00 sec)

mysql> set global max_allowed_packet=419430400;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like 'max_allowed_packet';
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| max_allowed_packet | 419430400 |
+--------------------+-----------+
1 row in set (0.00 sec)

mysql>
As you can see from the above query results, it has taken effect.
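For completeness, the same temporary change can also be issued from a Python script instead of the MySQL command line, provided the account has the privilege to set global variables. This is my own sketch, not part of the original steps, and the connection settings are placeholders:

# python3
import pymysql

# Sketch: raise max_allowed_packet on the running server from Python.
# Requires a privileged account (e.g. root); the values below are placeholders.
db = pymysql.connect(host="192.168.1.x", user="root", password="123456", port=3306)
cursor = db.cursor()
cursor.execute("set global max_allowed_packet=419430400")
cursor.execute("show global variables like 'max_allowed_packet'")
print(cursor.fetchone())  # prints the variable name and its new value
db.close()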
Re-execute the 100,000-row SQL from before; looking at the run result, it takes about 11 seconds.
Affected rows: 100000
Time: 11.678s
This method only takes effect temporarily: once MySQL is restarted, the value reverts to its previous setting.
For a permanent change, you need to modify the my.cnf configuration file.
Add the following line under the [mysqld] section (modify the value if the line already exists):
-max_allowed_packet=40M
The value here can be given in units of M. After the modification, MySQL must be restarted for it to take effect.
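Alternatively, if you would rather not change the server configuration at all, you can keep each statement small enough to fit under the default 4 M limit by splitting the rows across several INSERT statements. The following is only a sketch of that idea (the c.txt filename and the 10,000-rows-per-statement chunk size are my own choices, not from the original steps):

# python3
# Sketch: write several smaller INSERT statements instead of one huge one,
# so each statement stays well under the default max_allowed_packet.
insert_sql = "INSERT INTO `apps`.`apiapp_card` VALUES "
chunk_size = 10000  # rows per INSERT statement (assumed value)

with open("c.txt", "w") as fp:
    for start in range(100000, 200000, chunk_size):
        rows = ["('%s','','test123', '2019-12-17')" % i
                for i in range(start, start + chunk_size)]
        fp.write(insert_sql + ",\n".join(rows) + ";\n")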
How long will it take to execute directly with python without using the Navicat client?
First encapsulate a method for connecting to MySQL, then splice the SQL statement to execute; when splicing, pay attention to the last character and change the trailing comma to a semicolon.
Take the current timestamp before executing the SQL and again after it finishes; the difference between the two is the execution time, in seconds.
The reference code for executing the MySQL statement with Python is as follows:
import pymysql

'''
# python3
Author: Shanghai-yo
pip install PyMySQL==0.9.3
'''

dbinfo = {
    "host": "192.168.1.x",
    "user": "root",
    "password": "123456",
    "port": 3306}


class DbConnect():
    def __init__(self, db_cof, database=""):
        self.db_cof = db_cof
        # Open database connection
        self.db = pymysql.connect(database=database,
                                  cursorclass=pymysql.cursors.DictCursor,
                                  **db_cof)
        # Use the cursor() method to get an operation cursor
        self.cursor = self.db.cursor()

    def select(self, sql):
        # SQL query
        # e.g. sql = "SELECT * FROM EMPLOYEE WHERE INCOME > %s" % (1000)
        self.cursor.execute(sql)
        results = self.cursor.fetchall()
        return results

    def execute(self, sql):
        # SQL delete, insert, update statements
        # e.g. sql = "DELETE FROM EMPLOYEE WHERE AGE > %s" % (20)
        try:
            # Execute the SQL statement
            self.cursor.execute(sql)
            # Commit the changes
            self.db.commit()
        except Exception:
            # Roll back when an error occurs
            self.db.rollback()

    def close(self):
        # Close the connection
        self.db.close()


if __name__ == '__main__':
    import time
    insert_sql = "INSERT INTO `apps`.`apiapp_card` VALUES "
    insert_values = "".join(["('%s','','test123', '2019-12-17'),\n" % str(i + 100000) for i in range(100000)])
    # Splice the sql: drop the trailing ",\n" and terminate with a semicolon
    sql = insert_sql + insert_values[:-2] + ";"
    # print(sql)
    # Execute the sql and time it
    time1 = time.time()
    db = DbConnect(dbinfo, database="apps")
    db.execute(sql)
    db.close()
    time2 = time.time()
    print("Total time taken: %s" % (time2 - time1))
Result of running it with Python:
Total time taken: 1.0816256999969482
The result was beyond my expectations: 100,000 rows actually took only about 1 second!
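For reference, pymysql can also take care of the batching itself: Cursor.executemany() with a parameterized INSERT groups the rows into multi-row statements under the hood. Below is a minimal sketch of that approach, reusing the dbinfo settings from the script above; I have not benchmarked it against the figures in this article.

# python3
import time
import pymysql

# Sketch: let pymysql batch the rows via executemany() and a parameterized INSERT.
dbinfo = {"host": "192.168.1.x", "user": "root", "password": "123456", "port": 3306}

db = pymysql.connect(database="apps", **dbinfo)
cursor = db.cursor()
sql = ("INSERT INTO `apps`.`apiapp_card` (`id`, `card_id`, `card_user`, `add_time`) "
       "VALUES (%s, %s, %s, %s)")
rows = [(str(i + 100000), "", "test123", "2019-12-17") for i in range(100000)]

time1 = time.time()
cursor.executemany(sql, rows)
db.commit()
db.close()
print("Total time taken: %s" % (time.time() - time1))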