Python crawler crawls real estate information step by step

Garfield_Liang, columnist at the Python Chinese Community.

Jianshu address:

This article is mostly about sharing my approach to analyzing web pages. I have been writing crawlers for about a year, and once the basic code became second nature, I found that the most interesting part of writing a crawler is not the programming but studying how a page loads its content behind the scenes, that is, the analysis process. Unless there are special performance requirements, the programming itself is usually the easy part.

Take XX Fangwang in Shenzhen as an example. Its homepage is very simple: enter a district and you can find the corresponding second-hand or new houses. This article walks through the analysis process behind my XX Fangwang crawler.

Note: this article uses Chrome for the load analysis; if you use another browser, the details may differ.

First thought

First, step out of the programmer mindset and think like a user, or even a product manager: when browsing this site, how would you view the second-hand housing situation across the whole city? One way: enter a district on the homepage, search, and then page through the results.

Homepage search results for Nanshan District

As the figure above shows, changing the parameter after keyword gives you the second-hand housing data for a different district. In code, you only need to hand-write a list of the districts, then loop over it, changing the keyword parameter each time, so as to work through one district at a time and crawl the listing links inside. This approach does work, and Shenzhen does not have many districts; I tried it and it is feasible.
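As an illustrative sketch, that district loop might look like the following. The URL pattern, host name and the keyword parameter name are placeholders, not the site's real endpoint; read the real ones off your browser's address bar:

```python
from urllib.parse import quote

# Placeholder search URL -- substitute the real site's address and its
# actual keyword parameter as seen in the browser's address bar.
BASE_URL = "https://shenzhen.example.com/sale?keyword={kw}"

# Hand-written list of Shenzhen districts to loop over.
DISTRICTS = ["南山", "福田", "罗湖", "宝安", "龙岗", "龙华", "盐田"]

def district_urls(districts):
    """Build one search URL per district by swapping the keyword parameter."""
    return [BASE_URL.format(kw=quote(d)) for d in districts]

for url in district_urls(DISTRICTS):
    print(url)  # each URL would then be fetched and its listing links crawled
```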

What I actually want to say

The method above works, but it is not the one I want to recommend. Go back to the homepage: next to the search bar there is a map-based house search. Click into it and you can see houses in every district of Shenzhen. If the crawler can start from here, things get much easier.

Location of the map-based house search

Second-hand housing in all areas of Shenzhen

Notice that the right side of the screenshot lists links to all the second-hand houses; our task is to download the data behind every one of them. The first step is to view the page source (Ctrl+U). Copy a few keywords from the list on the right and look for them in the source: searching for Mission Hills with Ctrl+F turns up nothing, and a few more keywords fail the same way. Yet through Inspect Element (Ctrl+Shift+I) those keywords can be located in the DOM. So we can tentatively conclude that the list on the right is loaded via JS, which still needs to be confirmed.

Search results for the keyword Mission Hills in the page source

Search results for the keyword Mission Hills in the page elements

Next, try locating elements that sit above Mission Hills in the DOM, such as the class no-data-wrap bounce-inup dn, which does appear in the source. Comparing the two contexts carefully, the content under that node differs greatly between the source and the rendered page. Take roomList as a keyword and keep searching.

Position of no-data-wrap bounce-inup dn in Inspect Element

Position of no-data-wrap bounce-inup dn in the page source

In Inspect Element you can see that the content loaded under roomList is exactly the list of houses we need, and none of it appears in the page source. Searching the source for roomList, it shows up only inside a script, which confirms that the roomList content is loaded via JS:

Where the roomList appears in the source code

The task now becomes finding where this roomList comes from. Since it is loaded via JS, open the Network tab of the console, refresh the page to watch every resource load, and type roomList into the filter; one request shows up:

Search results for roomList

Click it and look at the downloaded content in the Response: this is exactly what we were looking for! It contains the total page count (roomPageSize). The eight-digit numbers are obviously the id of each house, each page loads a fixed number of them, and under each id sit the house's latitude and longitude, floor plan, area, orientation and so on. (A reminder for anyone who wants to build heat maps: the coordinates here are Baidu coordinates; if you later visualize with Google Maps, Gaode or raw GPS, you will need to convert them.)

The contents of roomList
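On that coordinate caveat: the conversion is not in the article, but a widely circulated approximation for turning Baidu (BD-09) coordinates into GCJ-02 (the system Gaode uses) looks like this. Treat it as a sketch accurate to a few metres; the official transform is not public, and going all the way to WGS-84/GPS needs a further step:

```python
import math

def bd09_to_gcj02(bd_lon, bd_lat):
    """Approximate BD-09 -> GCJ-02 conversion (commonly circulated formula)."""
    x = bd_lon - 0.0065
    y = bd_lat - 0.006
    z = math.sqrt(x * x + y * y) - 0.00002 * math.sin(y * math.pi * 3000 / 180)
    theta = math.atan2(y, x) - 0.000003 * math.cos(x * math.pi * 3000 / 180)
    return z * math.cos(theta), z * math.sin(theta)
```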

Having found the content, the next step is to look at its Headers to see how the request is made.

  • Request URL is the link being accessed, and Request Method shows the request is a Post;
  • The request headers (Headers) include Host, Origin, Referer, User-Agent and so on;
  • The request carries three query parameters, shown directly on the URL: the current page number (currentPage), the page size (pageSize), and s (this s was puzzling at first, but it changed with every request; it turned out to be a timestamp, the floating-point number of seconds elapsed since the 1970 epoch);
  • In addition, a Post request can send data to the server; the data sent here includes the latitude/longitude bounds, gardenId (which I later found is the id of each residential complex) and zoom (the map magnification level: the larger the number, the more zoomed-in the map).

Headers, first page

Headers, second page
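For illustration, the s parameter can be generated with time.time(), which returns exactly those floating-point seconds since the epoch; the other two parameter names follow the Headers screenshots, while the page size value here is just an example:

```python
import time

# "s" is a Unix timestamp: floating-point seconds since 1970-01-01,
# regenerated on every request -- which is why it differed each time.
params = {
    "currentPage": 1,
    "pageSize": 20,  # example value; use the size seen in the real request
    "s": time.time(),
}
```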

By this point the whole page is basically clear, and we know how our crawler should be written.

Start writing the code

With the logic sorted out, the code is easy to write. First access the API with a Post request and extract roomPageSize, the maximum page number, from the Response with a regular expression; then crawl the content of each page and output the information.

The first part loads the libraries: we need requests, bs4, re and time (time is used to generate the timestamp):

from bs4 import BeautifulSoup
import requests, re, time

The second part downloads the data via Post by setting sensible post data and headers. The payload includes the latitude/longitude bounds shown on the map (to get them, drag the map on the XX Fangwang page to a suitable position, then read the coordinates off the console Headers at that moment), and the headers carry the basic request information (which also provides some protection against anti-scraping measures):
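A minimal sketch of that Post request follows. The URL and host are placeholders, and the payload field names (the latitude/longitude bounds, gardenId, zoom) and parameter names (currentPage, pageSize, s) are assumptions based on the Headers panels described above; read the real names and values off your own console session:

```python
import time
import requests

API_URL = "https://shenzhen.example.com/roomList"  # placeholder endpoint

# Basic request information; a realistic User-Agent helps avoid trivial blocks.
headers = {
    "Host": "shenzhen.example.com",
    "Origin": "https://shenzhen.example.com",
    "Referer": "https://shenzhen.example.com/map",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Map viewport bounds read off the console Headers after dragging the map.
payload = {
    "minLatitude": 22.45, "maxLatitude": 22.60,
    "minLongitude": 113.85, "maxLongitude": 114.15,
    "gardenId": "",  # id of a residential complex; empty for the full viewport
    "zoom": 13,      # map magnification; larger means more zoomed in
}

def fetch_page(page, page_size=20):
    """Post one page of the room list; 's' is the timestamp parameter."""
    params = {"currentPage": page, "pageSize": page_size, "s": time.time()}
    resp = requests.post(API_URL, params=params, data=payload,
                         headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.text
```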

Once the first page is downloaded, first use a regular expression to get the maximum page number; then grab the content we actually need by combining BeautifulSoup's get and find methods with re:
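As a sketch, extracting roomPageSize with a regular expression might look like this. The key name comes from the JSON response we saw in the console, but the sample string below is made up for illustration:

```python
import re

def max_pages(response_text):
    """Pull the total page count (roomPageSize) out of the raw response text."""
    m = re.search(r'"roomPageSize"\s*:\s*(\d+)', response_text)
    return int(m.group(1)) if m else 0

# A made-up fragment shaped like the real response.
sample = '{"roomPageSize": 47, "roomList": [{"id": 88021364}]}'
print(max_pages(sample))  # -> 47
```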

Here is a sample of the output in the console:

Final effect

Finally, this article has walked through my complete thought process for analyzing the XX Fangwang crawler.
