In this article, I explain how you can:
If one is available, it is easier (and recommended) to use an API (as with the Twitter or Reddit APIs). However, most websites don't have public APIs, or, if they do, they provide only very limited public access. In such cases, web scraping might be necessary.
This was the case for me when I wanted to study rental data in Switzerland. So, I decided to develop my own web crawler to get the data (for analysis, not for any business purposes) from Immoscout24.ch. Below I explain this web crawler. You can find the corresponding Jupyter notebook here.
The first step is to download the HTML pages. For instance, for the city of Lausanne, this is how you could do it:
import requests
first_page = requests.get('https://www.immoscout24.ch/en/flat/rent/city-lausanne')
This snippet downloads the first page of the rental data of the city of Lausanne.
Use the html2text script, written by Aaron Swartz, to convert the HTML page to a text file. This script strips out all HTML syntax:
import html2text
c = html2text.html2text(first_page.text)
You can print(c)
to see what it looks like. You can get meta information, such as the total number of pages, from the first page and then continue downloading all the pages as explained in this Jupyter notebook. I have saved all the pages as *.txt
files.
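As a sketch of that pagination step, the total page count can be pulled out of the first page's HTML with a regular expression (the `pn` query parameter in the pagination links is an assumption about the page markup):

```python
import re

def total_pages(html):
    """Return the largest page number found in pagination links.

    Assumes pagination links carry a 'pn' query parameter, e.g. '?pn=10'.
    """
    nums = [int(n) for n in re.findall(r'[?&]pn=(\d+)', html)]
    return max(nums) if nums else 1

# The remaining pages could then be fetched with, e.g.:
# for p in range(2, total_pages(first_page.text) + 1):
#     requests.get(base_url + '?pn=%d' % p)
```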
Split each txt file using regular expressions so that you get each apartment listing separately. I call the result single_apt_listing
, which is simply a string containing one single apartment listing. Here is an example:
### [3 rooms, 97 m²](/en/d/flat-rent-
lausanne/5071748?s=2&t=1&l=2023&ct=609&ci=220&pn=10)
## «Ruchonnet 24 - Appt 3 pièces au rez-de-chaussée»
Av. Ruchonnet 24, 1003 Lausanne, VD
Close
### CHF 2,499.—
Bookmark
Then I use regex patterns
in this script to extract information such as the number of rooms, area, etc. from the string above and convert it to a Pandas dataframe. This step is a bit tricky, as not all single apartment listings look as nice as the one above, so the script has to cover all the different cases without outputting wrong data.
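A minimal sketch of this extraction on the listing above (the patterns below are illustrative assumptions, not the exact regexes used in the notebook):

```python
import re

# A single apartment listing string, as scraped above
single_apt_listing = """### [3 rooms, 97 m²](/en/d/flat-rent-lausanne/5071748?s=2)
## «Ruchonnet 24 - Appt 3 pièces au rez-de-chaussée»
Av. Ruchonnet 24, 1003 Lausanne, VD
### CHF 2,499.—"""

def parse_listing(text):
    """Pull rooms, area, rent, and zip code out of one listing string."""
    rooms = re.search(r'([\d.]+) rooms?', text)
    area = re.search(r'(\d+) m²', text)
    rent = re.search(r'CHF ([\d,]+)', text)
    zipc = re.search(r'\b(\d{4}) Lausanne', text)
    return {
        'NumRooms': float(rooms.group(1)) if rooms else None,
        'SurfaceArea': float(area.group(1)) if area else None,
        'Rent': float(rent.group(1).replace(',', '')) if rent else None,
        'ZipCode': int(zipc.group(1)) if zipc else None,
    }

record = parse_listing(single_apt_listing)
```

Each field falls back to `None` when its pattern is missing, which is how the messier listings can be handled without emitting wrong data.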
Finally, I concatenate all the single-apartment dataframes into one Pandas dataframe, which is convenient for data analysis.
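For instance, with hypothetical parsed records, the concatenation looks like this:

```python
import pandas as pd

# Hypothetical one-row dataframes, one per parsed listing
frames = [
    pd.DataFrame([{'NumRooms': 3.0, 'SurfaceArea': 97.0, 'Rent': 2499.0}]),
    pd.DataFrame([{'NumRooms': 1.5, 'SurfaceArea': 21.0, 'Rent': 970.0}]),
]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat(frames, ignore_index=True)
```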
All these steps are integrated into a class ImmoScout
, which can be used as follows:
from ImmoScout import ImmoScout
import pandas as pd
# instantiate the ImmoScout class by setting the city name and listing type
lausanne = ImmoScout(city_name = 'lausanne', list_type = 'rent')
# if some data is already downloaded set 'in_path' to the already saved csv file,
# otherwise just set the output path, i.e., out_path to save the converted data
lausanne.to_csv(in_path = '', out_path='../data/lausanne.csv')
# convert the csv to pandas data frame
df = pd.read_csv('../data/lausanne.csv')
Once you have the rental data in a Pandas dataframe, you can run the usual data analysis pipeline. That is, you start by preprocessing the data (handling missing data, outliers, etc.). For the analysis, you can add interesting new features such as rent per room, rent per area, and the zip code of each apartment. These are all done in this notebook. Perhaps the trickiest part of the pipeline for this example is spotting and handling the outliers (which are indeed mostly due to wrong inputs from the users). Here are the first five rows of the resulting dataframe:
 | index | Id | SurfaceArea | NumRooms | Type | Address | Description | Rent | Bookmark | Link | RentPerArea | RentPerRoom | AreaPerRoom | ZipCode
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 4695812 | 41.0 | 2.0 | flat | Av. de Cour 65, 1007 Lausanne, VD | Appartement de 2 pièces au 5ème étage | NaN | New | /en/d/flat-rent-lausanne/4695812?s=2&t=1&l=202... | NaN | NaN | 20.500000 | 1007 |
1 | 1 | 5204125 | 21.0 | 1.5 | studio | Route Aloys-Fauquez 122, 1018 Lausanne, VD | Studio proche de toutes les commodités | 970.0 | New | /en/d/studio-rent-lausanne/5204125?s=2&t=1&l=2... | 46.190476 | 646.666667 | 14.000000 | 1018 |
2 | 2 | 5201717 | 95.0 | 4.0 | flat | Av. de Morges 39, 1004 Lausanne, VD | Joli appartement - centre ville proche commodités | 2085.0 | NaN | /en/d/flat-rent-lausanne/5201717?s=2&t=1&l=202... | 21.947368 | 521.250000 | 23.750000 | 1004 |
3 | 3 | 5201713 | 29.0 | 1.5 | flat | Ch. du Devin 57, 1012 Lausanne, VD | Quartier de Chailly, spacieux 1.5 pièce | 910.0 | NaN | /en/d/flat-rent-lausanne/5201713?s=2&t=1&l=202... | 31.379310 | 606.666667 | 19.333333 | 1012 |
4 | 4 | 5195729 | 11.0 | NaN | single-room | Chemin de Montolivet 19, 1006 Lausanne, VD | 1006 Lausanne - Montolivet 19 - Chambre meublé... | 560.0 | NaN | /en/d/single-room-rent-lausanne/5195729?s=2&t=... | 50.909091 | NaN | NaN | 1006 |
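The derived features mentioned above can be sketched as follows (the toy data and the outlier bounds are illustrative assumptions):

```python
import pandas as pd

# Toy data standing in for the scraped dataframe
df = pd.DataFrame({
    'Rent': [2499.0, 970.0, 2085.0],
    'NumRooms': [3.0, 1.5, 4.0],
    'SurfaceArea': [97.0, 21.0, 95.0],
})

# New features: rent per room, rent per m², area per room
df['RentPerRoom'] = df['Rent'] / df['NumRooms']
df['RentPerArea'] = df['Rent'] / df['SurfaceArea']
df['AreaPerRoom'] = df['SurfaceArea'] / df['NumRooms']

# Drop implausible rows (likely wrong user inputs); the bounds are assumptions
df = df[df['RentPerArea'].between(5, 150)]
```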
Let's say you are interested in the distribution of rental prices as a function of zip code. Then you could use the groupby()
method of Pandas on the above dataframe as follows:
zipVsRentMean = df[['ZipCode', 'RentPerArea', 'RentPerRoom', 'AreaPerRoom', 'SurfaceArea']].groupby(['ZipCode'], as_index = False).mean()
Here is zipVsRentMean:
 | ZipCode | RentPerArea | RentPerRoom | AreaPerRoom | SurfaceArea
---|---|---|---|---|---|
0 | 1000 | 28.458524 | 729.675634 | 25.975658 | 131.923077 |
1 | 1003 | 33.847367 | 956.716565 | 29.864104 | 77.057471 |
2 | 1004 | 29.348282 | 697.231981 | 25.349916 | 65.488971 |
3 | 1005 | 29.793665 | 718.242015 | 24.671657 | 73.903226 |
4 | 1006 | 32.542634 | 789.203604 | 26.219773 | 61.670213 |
5 | 1007 | 32.612209 | 774.724010 | 26.186800 | 59.460000 |
6 | 1010 | 28.715491 | 714.434524 | 26.260718 | 76.546512 |
7 | 1012 | 30.620128 | 789.353729 | 27.079760 | 69.865385 |
8 | 1015 | 23.448276 | 616.428571 | 24.857143 | 87.000000 |
9 | 1018 | 29.406521 | 668.998769 | 24.128802 | 57.859504 |
For the moment, I will drop the rows for zip code 1015, as there are not enough data points for it.
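Filtering that zip code out is a one-liner (sketched here on toy data):

```python
import pandas as pd

# Toy version of the zipVsRentMean table
zipVsRentMean = pd.DataFrame({'ZipCode': [1000, 1015, 1018],
                              'RentPerRoom': [729.7, 616.4, 669.0]})

# Keep every zip code except the under-sampled 1015
zipVsRentMean = zipVsRentMean[zipVsRentMean['ZipCode'] != 1015].reset_index(drop=True)
```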
Next, we would like to show the results of the zip code table above on a map. To this end, we first need to be able to read maps in Python. Maps are usually available in the shapefile format (*.shp). Let's first download this shapefile map, and then I will discuss how to read it in Python.
Download Switzerland's zip-code shapefiles from Swiss opendata. I have downloaded PLZO_SHP_LV95 from here. Extract the folder, and note the path where you saved the zip-code shapefile (called PLZO_PLZ.shp
). I put it in my data
folder.
Okay, now you have the shapefile. How would you read and manipulate it in Python? Luckily, GeoPandas, a powerful Python library for geospatial data processing and analysis, can read shapefiles directly into a geopandas dataframe:
import geopandas as gpd
gdf = gpd.read_file('../data/PLZO_SHP_LV95/PLZO_PLZ.shp')
The Coordinate Reference System (CRS) in which the data is expressed can be found via gdf.crs
. I convert it to a more common CRS with the following command:
gdf = gdf.to_crs(epsg=4326)
Here are the first two rows of the geopandas dataframe gdf:
 | UUID | OS_UUID | STATUS | INAEND | PLZ | ZUSZIFF | geometry
---|---|---|---|---|---|---|---|
0 | {0072F991-E46D-447E-A3BE-75467DB57FFC} | {281807DC-9C0B-4364-9A55-0E8956876194} | real | nein | 3920 | 0 | POLYGON ((7.575852862870608 45.98819190048114,... |
1 | {C3D3316F-1DFE-468E-BFC5-5C2308743B4C} | {F065D58C-3F88-46EF-9AA0-DA0A96766333} | real | nein | 3864 | 0 | POLYGON ((8.114458487426123 46.5465644007604, ... |
The geometry
column defines the shape of each polygon. Since we are only looking at data for the city of Lausanne, I extract the Lausanne rows from gdf
(note that gdf
covers the whole of Switzerland):
lausanne = [1000, 1003, 1004, 1005, 1006, 1007, 1010, 1012, 1015, 1018]
gdf_laus = gdf[gdf['PLZ'].isin(lausanne)]
Now you can plot the zip-code map of Lausanne with the following code:
gdf_laus.plot()
This results in the following figure:
While geopandas
can plot such minimal maps, I would like an interactive choropleth map (where you can hover over the map to see the rental figures) that also looks a bit nicer than this one. To create such a map, I decided to use the Altair library.
First off, let's merge the gdf_laus
dataframe, which only contains geographical data, with the zipVsRentMean
Pandas dataframe, which includes the rental data for each zip code in Lausanne:
gdf_laus = gdf_laus.merge(zipVsRentMean, left_on='PLZ', right_on='ZipCode')
This simply adds the columns of zipVsRentMean
to the right of gdf_laus
. Okay, now we have a geopandas dataframe gdf_laus
that includes both the rental data and the geographical information of Lausanne. Next, we want to visualize this on an interactive choropleth map, for which I use the Altair library.
For the gdf_laus
data to be readable by the Altair library, we need some preprocessing:
import json
import altair as alt
json_laus = json.loads(gdf_laus.to_json())
alt_laus = alt.Data(values = json_laus['features'])
alt_laus
now has the data in a form that Altair can read:
alt_rentPerRoom = alt.Chart(alt_laus).mark_geoshape(
stroke = 'white'
).encode(
latitude = 'properties.y:Q',
longitude = 'properties.x:Q',
color = 'properties.RentPerRoom:Q'
).properties(
width = 400,
height = 500
)
alt_rentPerRoom # to display the map
Here is the result of the execution of this code:
We are almost there, but we need to display the zip codes on the map too. To do this, I first get the centroid of each zip-code area with the following commands:
gdf_laus['x'] = gdf_laus['geometry'].centroid.x
gdf_laus['y'] = gdf_laus['geometry'].centroid.y
Next, plot the text labels at these centroids with Altair:
text = alt.Chart(alt_laus).mark_text(
).encode(
longitude = 'properties.x:Q',
latitude = 'properties.y:Q',
text = 'properties.ZipCode:Q',
)
Finally, combine the two maps:
chart = alt_rentPerRoom + text
chart
Here is the result. You can hover over the map to see the average rent per room in each zip-code area. The same analysis can be done for any other variable of interest (e.g., rent per surface area).
Head to this notebook for the Python code summarized above.