How I Web Scraped 10,000 Property Listings of Hyderabad City

Vijay Vankayalapati
4 min read · Sep 14, 2021

In this post I will briefly walk you through how I scraped 10,000 property listings of Hyderabad, India’s fourth largest city (by population).

I scraped the data from Magicbricks, and this is what a typical search results page of their website looks like:

Search Results Page

First, we import the necessary libraries and execute the following code, which returns an HTML page stored in html_soup, a BeautifulSoup object.
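A minimal sketch of that step, assuming the requests library; the search URL here is a placeholder, so copy the real one from your browser's address bar:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical search-results URL for Hyderabad; the real query string
# may differ, so copy it from your own browser session
url = ("https://www.magicbricks.com/property-for-sale/"
       "residential-real-estate?cityName=Hyderabad")

# A browser-like User-Agent helps avoid getting a blocked or empty page
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
html_soup = BeautifulSoup(response.text, "html.parser")
```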

Using prettify() we ‘pretty-print’ html_soup for easier reading of the elements (compared to inspecting them in the browser by right-clicking).
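For example:

```python
# Indented, human-readable dump of the parsed page
print(html_soup.prettify())
```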

From analysing the printed page, it is clear that information about individual listings is stored in three containers with distinctive classes:

  1. Container: ‘span’, class=‘domcache js-domcache-srpgtm’
  2. Container: ‘div’, class=‘m-srp-card__summary js-collapse__content’
  3. Container: ‘span’, class=‘hidden’

We use the find_all() method to find the containers and then store them in the following three variables:
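Something along these lines (the variable names are my own illustration):

```python
# One result set per container type, using the classes identified above
gtm_containers = html_soup.find_all("span", class_="domcache js-domcache-srpgtm")
summary_containers = html_soup.find_all("div", class_="m-srp-card__summary js-collapse__content")
hidden_containers = html_soup.find_all("span", class_="hidden")
```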

Let’s find out how many containers of interest are present on a single page.
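A quick count of each result set:

```python
print(len(gtm_containers))      # e.g. 22
print(len(summary_containers))  # e.g. 22
print(len(hidden_containers))   # e.g. 23 — one more than the others
```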

You can see that the count of the third type of container is higher than the other two. Through careful inspection, I figured out that this means the index of the third container for a given listing will be different (+1 in this case, as you can see below).

Printing out the three containers of the 1st listing:
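For instance:

```python
# Note the +1 offset on the third container type
print(gtm_containers[0])
print(summary_containers[0])
print(hidden_containers[1])
```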

Notice that for the first two containers the index is 0, while for the third container it is 1. I also noticed that some pages have the same number of containers for all three types; in that case the index is the same across all three containers for a given listing. This inconsistency makes it difficult to collect data while looping through multiple pages.

The problem can be overcome by collecting data from the first two containers in one file and using another file for the third container. These two files can be merged later on a common column, ‘id’ (for the 1st listing above, id=”domcache_srp_54364457" can be found in both the first and third containers).
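A rough sketch of pulling that common id out of a container; which attribute holds the id in the third container is an assumption on my part:

```python
# The listing id sits in the 'id' attribute of the first container,
# e.g. id="domcache_srp_54364457"
first_id = gtm_containers[0].get("id")            # 'domcache_srp_54364457'

# The same id also appears in the third container (attribute assumed)
third_id = hidden_containers[1].get("id")

# Stripping the prefix leaves a bare property id to merge on later
prop_id = first_id.replace("domcache_srp_", "")   # '54364457'
```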

Infinite Scroll Page:

On the first page we have data about 22 listings. Since the search results page uses infinite scroll, there is no button that takes us to the next page. In this scenario we need to get the URLs of the subsequent pages the following way:

Right click on your browser > Inspect Element > Network tab > XHR tab > propertySearch.html > Headers

You can copy the URL of page 2 from the ‘Headers’ section after clicking “propertySearch.html”. Clicking on the other “propertySearch.html” entries in the list gives us the URLs of the following pages. Now we need to find the pattern in these URLs.

On careful examination, we see the following pattern in the URLs (truncated URLs shown below):

  • ‘groupstart’ value increases by 30 for every page
  • ‘page’ value increases by 1
  • In some cases, the ‘maxoffset’ value might differ from page 3 onwards. But I noticed that reusing the ‘maxoffset’ value from the page 2 URL doesn’t change the results, and vice versa.

We create the new URLs the following way:
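A sketch under those assumptions; everything in the query string other than ‘page’ and ‘groupstart’ is copied verbatim from the page-2 URL, and the path shown here is a placeholder:

```python
# Skeleton of the XHR URL from the Network tab; only 'page' and
# 'groupstart' change from one page to the next (path assumed)
base_url = ("https://www.magicbricks.com/mbsrp/propertySearch.html"
            "?page={page}&groupstart={groupstart}")

# page 2 starts at groupstart=30, page 3 at 60, and so on;
# the upper bound just needs to cover ~10,000 listings
urls = [base_url.format(page=p, groupstart=(p - 1) * 30) for p in range(2, 500)]
```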

Finally, we use the following script to get our data. The scraped data is saved in two CSV files, which can be merged later on the property-ID columns (‘prop_id’ and ‘prop_id2’).
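The full script lives in the repo; a condensed sketch of its shape, with illustrative column names and my own guess at where the id attribute sits, looks like this:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
rows_main, rows_hidden = [], []   # containers 1+2 -> file 1; container 3 -> file 2

for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")

    # File 1: the first two container types line up index-for-index
    for gtm, summary in zip(
            soup.find_all("span", class_="domcache js-domcache-srpgtm"),
            soup.find_all("div", class_="m-srp-card__summary js-collapse__content")):
        rows_main.append({
            "prop_id": gtm.get("id", "").replace("domcache_srp_", ""),
            "summary": summary.get_text(" ", strip=True),
        })

    # File 2: hidden spans, keyed by the same id so indices don't matter
    for hidden in soup.find_all("span", class_="hidden"):
        span_id = hidden.get("id", "")            # attribute assumed
        if span_id.startswith("domcache_srp_"):   # skip unrelated hidden spans
            rows_hidden.append({
                "prop_id2": span_id.replace("domcache_srp_", ""),
                "details": hidden.get_text(" ", strip=True),
            })

    time.sleep(1)  # be polite to the server

for path, rows in [("listings_main.csv", rows_main),
                   ("listings_hidden.csv", rows_hidden)]:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```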

Please note that I collected some features from both the first and third containers so that if a value is missing in one container, we can populate it with the corresponding value from the other.
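Merging the two files is then straightforward with pandas (file and column names as above):

```python
import pandas as pd

main_df = pd.read_csv("listings_main.csv")
hidden_df = pd.read_csv("listings_hidden.csv")

# Join on the shared property id; features collected from both
# containers can then backfill each other's gaps with fillna()
merged = main_df.merge(hidden_df, left_on="prop_id", right_on="prop_id2")
merged.to_csv("hyderabad_listings.csv", index=False)
```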

The full code can be found on my GitHub.
