The final task for the crawler group is a big-data analysis of Jianshu. I have done this once before, and that post got a decent number of reads. Jianshu recently closed a funding round and has changed in a few ways since then, so this is a good opportunity to run the analysis again.
This part has not changed: Jianshu does not expose a URL that lists its users, so we can only start from the topic URLs, which are still the "popular" and "city" topic categories.
This part is the new idea. Previously, I crawled the authors of each topic's articles and then crawled those authors' followers, and that gave me the full set of users to crawl. This time, I crawl the topic administrators' URLs and treat the administrators as first-level users. The administrator list is loaded asynchronously, and the asynchronous request URL for the home page differs from the one used by the other topics (you will see this when you inspect the network requests).
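As a rough illustration of paging through an asynchronously loaded list, here is a minimal sketch. It assumes the list is served page by page and that an empty page marks the end. `fetch_page` is a hypothetical callable standing in for the real request you find in the network panel; the actual Jianshu URL and response format are not shown here and are not assumed.

```python
from typing import Callable, Iterator, List


def iter_async_pages(fetch_page: Callable[[int], List[str]],
                     max_pages: int = 100) -> Iterator[str]:
    """Yield every item from a paginated, asynchronously loaded list.

    fetch_page(page) is a hypothetical function: it should request one page
    of the asynchronous endpoint and return the administrator identifiers
    found on it, or an empty list when there are no more pages.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # assumption: an empty page means the list is exhausted
            break
        yield from items


if __name__ == "__main__":
    # Fake data in place of real network responses (illustration only).
    fake_pages = {1: ["admin_a", "admin_b"], 2: ["admin_c"]}
    for admin in iter_async_pages(lambda p: fake_pages.get(p, [])):
        print(admin)
```

Passing the fetch function in as a parameter keeps the paging logic the same for the home page and for the other topics, even though their request URLs differ.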
We can think of it this way. Administrators generally have a lot of followers, and most of those followers are ordinary onlookers like us. Fellow writers, on the other hand, tend to show up among the users the administrator follows. Crawling in both directions (followers and following) should therefore reach most users, although some users still cannot be reached.
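The two-way idea can be sketched as a breadth-first traversal that expands each user through both the follower list and the following list. This is only a sketch under assumptions: `get_followers` and `get_following` are hypothetical callables that would wrap the real page requests, and `max_users` is an illustrative safety cap, not something from the original crawl.

```python
from collections import deque
from typing import Callable, Iterable, Set


def crawl_users(seeds: Iterable[str],
                get_followers: Callable[[str], Iterable[str]],
                get_following: Callable[[str], Iterable[str]],
                max_users: int = 10000) -> Set[str]:
    """Collect users reachable from the seeds via followers or following."""
    seen: Set[str] = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    while queue:
        user = queue.popleft()
        # Expand in both directions: who follows this user, and whom they follow.
        neighbours = list(get_followers(user)) + list(get_following(user))
        for other in neighbours:
            if other in seen:  # each user is recorded and expanded only once
                continue
            seen.add(other)
            queue.append(other)
            if len(seen) >= max_users:
                return seen
    return seen


if __name__ == "__main__":
    # Toy graph in place of real requests (illustration only).
    followers = {"admin": ["reader1", "reader2"], "writer": ["reader3"]}
    following = {"admin": ["writer"], "reader1": ["admin"]}
    users = crawl_users(["admin"],
                        lambda u: followers.get(u, []),
                        lambda u: following.get(u, []))
    print(sorted(users))
```

A user who neither follows nor is followed by anyone reachable from the administrators never enters the queue, which is why some users cannot be reached this way.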
This method is much faster than crawling articles, and it produces far fewer duplicate records (when crawling articles, a user who has posted several articles shows up several times). The disadvantage is that the data may be incomplete.