IBM Data Science Course Capstone Project

Jan 22, 2021

Introduction

I have taken my first steps towards acquiring data science skills by completing the IBM Data Science Professional Certificate course on Coursera. The last module of this course is a capstone project, which is about applying the data science toolset to a real-life problem and demonstrating that the learned skills create value. Here I present a summary of my project and its findings. The analysis was performed in Python. If you are interested in the details, you can find a more detailed report and the Jupyter notebook at the end of the post.

Business Problem

For this project, I chose a hypothetical business problem.

A successful owner of multiple mid to high-end restaurants decided to open a new restaurant in Budapest, Hungary. Having visited the city many times in recent years, he couldn’t disregard the big boom in gastronomy. He is keen on opening a new unit, which will focus on European and Asian fusion cuisine.

Taking into account the price level at which the restaurant will operate, the intent is to find an optimal location in an area where gastronomy is booming and which is easily accessible for both tourists and wealthier local residents.

The assumption behind the analysis is that we can use unsupervised machine learning to group the districts into clusters, which will give us a list of areas to consider for the restaurant. The intent is for the restaurant to be situated close to one of the gastronomical centres and tourist hotspots.

Data

To perform this analysis, we will need the following data:

List of the districts of Budapest
Geo-coordinates of the districts in Budapest
Top venues of districts

The list of districts will be obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_districts_in_Budapest).

The geo-coordinates of the districts will be obtained with the help of the geocoder tool in the notebook.

The top venues data will be obtained from Foursquare through its API.

Methodology

On a high level, we do the following. After tidying up and exploring the data, we apply the K-means machine learning technique to create clusters of districts. We use the silhouette score to choose the optimal number of clusters.

If you are interested in the details, please have a look at the report.

First, let’s create a table that contains the list of districts in Budapest with their respective geo-coordinates. For this exercise, the geocoder Python library was used. When done, the table looks like this.

Districts of Budapest with geo coordinates

Please note that in this analysis we work with 17 districts, though there are altogether 23 districts in Budapest.
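As a rough illustration of how such a district table could be built, here is a minimal sketch. The function names (geocode_districts, arcgis_lookup), the query format and the choice of the ArcGIS provider inside the geocoder package are my assumptions for this example, not necessarily what the notebook does.

```python
import pandas as pd


def geocode_districts(names, lookup):
    """Build a table of district names with latitude and longitude.

    `lookup` is any callable that takes a query string and returns a
    (latitude, longitude) pair, or None when nothing is found.
    """
    rows = []
    for name in names:
        coords = lookup(f"{name}, Budapest, Hungary")  # assumed query format
        lat, lon = coords if coords else (None, None)
        rows.append({"District": name, "Latitude": lat, "Longitude": lon})
    return pd.DataFrame(rows, columns=["District", "Latitude", "Longitude"])


def arcgis_lookup(query):
    """Look up coordinates with the geocoder package (needs network access)."""
    import geocoder  # imported lazily so the rest of the sketch runs without it

    result = geocoder.arcgis(query)  # the provider is an assumption
    return tuple(result.latlng) if result.latlng else None
```

Calling geocode_districts(district_names, arcgis_lookup) would then return one row per district. Passing the lookup as a parameter keeps the table-building logic easy to check without an internet connection.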

In the next step, we collect venues for each district and see which venues are the most common. The venue data was collected from Foursquare via its API. After collecting the data and organising it into a pandas dataframe, we have a table that looks like this (only a portion of the whole table is shown).

List of the most common venues for each district
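To give an idea of how a "most common venues" table can be derived, here is a small sketch on made-up data. The column names, the helper most_common_venues and the toy venues are assumptions for illustration; the real input comes from Foursquare, and the notebook may compute the ranking differently.

```python
import pandas as pd

# Toy data, made up for illustration; the real table comes from Foursquare.
venues = pd.DataFrame(
    {
        "District": ["A", "A", "A", "A", "B", "B", "B"],
        "Venue Category": [
            "Coffee Shop", "Coffee Shop", "Pub", "Hotel",
            "Supermarket", "Supermarket", "Bus Stop",
        ],
    }
)


def most_common_venues(venues, top_n=3):
    """Return one row per district listing its top_n most frequent categories."""
    counts = (
        venues.groupby(["District", "Venue Category"])
        .size()
        .reset_index(name="Count")
        # Most frequent first; ties are broken alphabetically by category.
        .sort_values(
            ["District", "Count", "Venue Category"],
            ascending=[True, False, True],
        )
    )
    rows = []
    for district, group in counts.groupby("District"):
        top = group["Venue Category"].head(top_n).tolist()
        top += [None] * (top_n - len(top))  # pad districts with few categories
        rows.append([district] + top)
    columns = ["District"] + [f"Most common venue #{i + 1}" for i in range(top_n)]
    return pd.DataFrame(rows, columns=columns)


top_venues = most_common_venues(venues, top_n=2)
print(top_venues)
```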

A required step before we can run the clustering algorithm is one-hot encoding, which converts the categorical values (here, the venue categories) into dummy variables so they can be used for machine learning.
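The encoding step can be sketched with pandas as follows. The data and column names are made up, and averaging the dummies per district is an assumption on my part about how the encoded venues are turned into one feature row per district.

```python
import pandas as pd

# Toy data, made up for illustration.
venues = pd.DataFrame(
    {
        "District": ["A", "A", "A", "A", "B", "B", "B"],
        "Venue Category": [
            "Coffee Shop", "Coffee Shop", "Pub", "Hotel",
            "Supermarket", "Supermarket", "Bus Stop",
        ],
    }
)

# One dummy column per venue category (1.0 where the venue belongs to it).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# Average the dummies per district: each value is the share of that
# district's venues falling into the category.
district_profile = onehot.groupby("District").mean()
print(district_profile.round(2))
```

Each row of district_profile sums to 1, so districts with many venues and districts with few venues end up on the same scale before clustering.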

For the clustering process, K-means was used, which is an unsupervised machine learning algorithm. It requires the number of clusters to be set as a parameter. To identify the optimal value for this parameter, the silhouette score was used, which pointed to 4 as the best number of clusters.

The silhouette score for different numbers of clusters
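Choosing the number of clusters by silhouette score could look roughly like the sketch below, using scikit-learn. The feature matrix here is synthetic and deliberately built with four well-separated groups, so it only demonstrates the selection procedure and does not reproduce the result of the analysis. The helper name silhouette_by_k and the range of k values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the district feature matrix (17 rows, like the
# 17 districts in the post): four well-separated groups in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
sizes = [5, 4, 4, 4]
X = np.vstack(
    [c + rng.normal(scale=0.3, size=(n, 2)) for c, n in zip(centres, sizes)]
)


def silhouette_by_k(X, k_values, random_state=0):
    """Fit K-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 9))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
for k, score in scores.items():
    print(k, round(float(score), 3))
print("best k:", best_k)
```

The silhouette score ranges from -1 to 1, and higher values mean that points sit closer to their own cluster than to the nearest other cluster.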

Then K-means was run with 4 clusters, which gave us the following clustering.

Cluster labels for the districts based on K-means clustering
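For readers who want to see the mechanics, fitting K-means with 4 clusters and attaching the labels to the district table might look like this. The district profiles below are invented, and the label numbers that K-means assigns are arbitrary, so they will not match the labels in the post.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up district profiles (share of venues per category), for illustration only.
district_profile = pd.DataFrame(
    {
        "Coffee Shop": [0.5, 0.45, 0.0, 0.05, 0.1, 0.0],
        "Supermarket": [0.0, 0.05, 0.6, 0.55, 0.1, 0.1],
        "Bus Stop": [0.1, 0.1, 0.1, 0.1, 0.7, 0.1],
        "Restaurant": [0.4, 0.4, 0.3, 0.3, 0.1, 0.8],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Fit K-means with the chosen number of clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(district_profile)

# Attach the cluster label to each district.
clustered = district_profile.copy()
clustered.insert(0, "Cluster label", kmeans.labels_)
print(clustered["Cluster label"])

# Average profile per cluster, useful for interpreting what each cluster is about.
print(clustered.groupby("Cluster label").mean().round(2))
```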

We can visualise this data with Folium.

Clusters of districts in Budapest

Results

Looking at the cluster data, we can see that cluster 2 (cluster label 1) is the one we are most interested in.

Cluster 1 (Cluster label 0) is an outer district where top gastronomy is not really represented (supermarkets and fast food are at the top).

Cluster 2 (Cluster label 1) is the biggest cluster, and this is where we see lots of gastronomy-related venues (coffee shop, pizza place, Thai restaurant, beer bar, pub, modern European restaurant, etc.).

Cluster 3 (Cluster label 2) contains districts where public transport is rated at the top, followed by parks and playgrounds. These are mainly residential areas with family houses, not the vibrant, lively part of the city.

Cluster 4 (Cluster label 3) contains only one district. Here the restaurant category is at the top, but the venues behind it are mostly public transport.

Recommendation

Based on what we learned about the clusters, we can advise the restaurant owner to consider the districts in cluster 2 as potential locations for the new restaurant. These are the districts where gastronomy is well represented and hotels are also frequent. They satisfy the two original criteria: the location should be in a gastronomical centre and easily accessible for tourists.

Conclusion

This post discussed the process of answering a hypothetical but realistic business problem. The analysis was based on the data science toolset and relied heavily on Python and Python libraries such as pandas, scikit-learn and Folium, to name a few. The output of the analysis provided a solid basis for the recommendation on the business problem in question.

I hope you found the analysis interesting and that it makes you want to dive deeper into this field too. I can strongly recommend the IBM Data Science course, as I greatly enjoyed learning from it.

You can find the more detailed report here: https://github.com/autitya/Coursera_Capstone/blob/main/capstone_week2.pdf

And the Jupyter notebook with the code here: https://github.com/autitya/Coursera_Capstone/blob/main/capstone_week2.ipynb