My First Data Science Project: Which City Do You Choose, Seoul or Tokyo?

Adryanra
6 min read · Jan 11, 2021

I am enrolled in a course, and this is my first project in data science. It took me about 3 hours to finish. If you have any pointers or tips, I will gladly accept them: please write them in the comments and I will read them.

Japan and South Korea are among the top countries to visit for a holiday, and they are great places to live too. Let's find the most common venues in their capital cities, Tokyo and Seoul.

Both cities are great, and each has its own unique character. Maybe you are moving to one of them but don't know which district suits you, or you are planning a holiday and want a hotel with some venues within a 5-minute walk. So let's get started.

The first thing I did was find the list of districts in each city. Luckily, I found both of them on Wikipedia.

Seoul: https://en.wikipedia.org/wiki/List_of_districts_of_Seoul

Tokyo: https://en.wikipedia.org/wiki/Tokyo#:~:text=Notable%20districts%20of%20Tokyo%20include%20Chiyoda%20(the%20site,and%20Shibuya%20(a%20commercial,%20cultural%20and%20business%20hub).

With pandas I can read the table from the page and easily filter it; BeautifulSoup4 would work too. Here's the code I used to get the data.
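A minimal sketch of this step (not necessarily the exact code used in this project) might look like the following. It assumes `pandas.read_html` can parse the page and that the district table has a column called `Name`; the helper names and that column name are illustrative assumptions.

```python
import pandas as pd

SEOUL_URL = "https://en.wikipedia.org/wiki/List_of_districts_of_Seoul"


def pick_table(tables, required_column):
    """Return the first table that contains `required_column`."""
    for table in tables:
        if required_column in table.columns:
            return table.reset_index(drop=True)
    raise ValueError(f"no table with a column named {required_column!r}")


def load_district_table(url, required_column="Name"):
    """Read every HTML table on the page and keep the district table."""
    # read_html needs an HTML parser such as lxml or BeautifulSoup4 installed.
    tables = pd.read_html(url)
    return pick_table(tables, required_column)


# Usage (needs network access; the column name "Name" is an assumption):
# seoul = load_district_table(SEOUL_URL)
```

Some sites reject the default request that `read_html` sends, so you may need to download the HTML yourself first and pass the text in instead of the URL.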

And here’s the result:

I will use the Foursquare API later to find the nearby venues in each district. To query Foursquare we need the coordinates of each district, so I used geopy to look them up.

I use a loop to get the coordinates of each district and append them to the lists ‘Lat’ and ‘Lng’. The data from these lists can then be added to the pandas table. This is the code and the result:

Code to Add to Table
The Result
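The loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the function name, the `Name` column, and the `"<district>, <city>"` query format are all hypothetical. The geocoder is passed in as a function so the loop itself does not depend on network access.

```python
import pandas as pd


def add_coordinates(df, geocode, city, name_column="Name"):
    """Append 'Lat' and 'Lng' columns by geocoding '<district>, <city>'."""
    lat, lng = [], []
    for district in df[name_column]:
        location = geocode(f"{district}, {city}")
        # Geocoders return None when nothing matches, so leave a gap instead of failing.
        lat.append(location.latitude if location is not None else None)
        lng.append(location.longitude if location is not None else None)
    result = df.copy()
    result["Lat"] = lat
    result["Lng"] = lng
    return result


# Usage with geopy (needs network access; the user_agent string is a placeholder):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="district-explorer")
# seoul = add_coordinates(seoul, geolocator.geocode, city="Seoul")
```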

After that, I used the same method to find the coordinates of each district in Tokyo.

Plot It on a Map

I will use folium to plot the districts on a map, labeling each district at the coordinates we got before. This is the code to plot the map.

Code For Plot in Folium Map
Map With District Position in Seoul
Map With District Position in Tokyo
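A rough sketch of such a folium map is below. It assumes the table has `Name`, `Lat`, and `Lng` columns and centers the map on the mean of the district coordinates; the function names and the zoom level are illustrative choices, not taken from the original code.

```python
def map_center(df):
    """Use the mean of the district coordinates as the map center."""
    return [df["Lat"].mean(), df["Lng"].mean()]


def plot_districts(df, name_column="Name", zoom_start=11):
    """Return a folium map with one labeled marker per district."""
    import folium  # imported here so map_center works without folium installed

    district_map = folium.Map(location=map_center(df), zoom_start=zoom_start)
    # Skip districts whose coordinates could not be found.
    for _, row in df.dropna(subset=["Lat", "Lng"]).iterrows():
        folium.Marker(
            location=[row["Lat"], row["Lng"]],
            popup=str(row[name_column]),
        ).add_to(district_map)
    return district_map


# Usage: plot_districts(seoul).save("seoul_districts.html")
```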

I now have the district coordinates and have plotted them on a map to help visualize them. Next we need to connect to the Foursquare API to get data on the nearby venues.

Foursquare

This is the code I used to find nearby venues for each district. Don't forget to replace ‘client id’ and ‘client secret’ with your own Foursquare API credentials.

Code for Request Nearby Venue in Each District
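As an illustration only, a request loop of this kind could look like the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its response layout (`response` → `groups` → `items` → `venue`); newer Foursquare accounts may need a different API. The 500 m radius, the limit, the version date, and all function and column names are assumptions chosen for the example.

```python
import pandas as pd

# Legacy Foursquare v2 endpoint (an assumption; it may not be available to new accounts).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def parse_venues(district, payload):
    """Flatten one 'explore' response into one row per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "District": district,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0]["name"] if categories else None,
            })
    return rows


def get_nearby_venues(df, client_id, client_secret, radius=500, limit=100,
                      version="20210101", name_column="Name"):
    """Query the venues around every district and collect them in one table."""
    import requests  # third-party; only needed for the live request

    rows = []
    for name, lat, lng in zip(df[name_column], df["Lat"], df["Lng"]):
        params = {
            "client_id": client_id,
            "client_secret": client_secret,
            "v": version,        # API version date, YYYYMMDD
            "ll": f"{lat},{lng}",
            "radius": radius,    # meters
            "limit": limit,
        }
        response = requests.get(EXPLORE_URL, params=params, timeout=10)
        response.raise_for_status()
        rows.extend(parse_venues(name, response.json()))
    return pd.DataFrame(rows)
```

Keeping the parsing separate from the request makes it easy to check the parsing against a saved response without spending API calls.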

I passed in the district data and ran the code.

These are some of the results.

Result of Nearby Venue in Each District of Seoul
Result of Nearby Venue in Each District of Tokyo

I will use a machine learning method (K-Means) to cluster the districts so we can group them by shared characteristics. But ‘Venue Category’ is a categorical column, so we need to convert it to a numeric format first. get_dummies is one way to transform it. Here’s the code and result:

Get Dummies Code
Result of Dummies
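A small sketch of the one-hot step with `pandas.get_dummies`, assuming the venue table has `District` and `Venue Category` columns (the function name is hypothetical):

```python
import pandas as pd


def one_hot_venues(venues):
    """One-hot encode 'Venue Category' and keep the district next to it."""
    onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
    # Put the district back as the first column so rows can be grouped later.
    onehot.insert(0, "District", venues["District"])
    return onehot


# Tiny made-up example, not real project data:
example = pd.DataFrame({
    "District": ["Gangnam", "Gangnam", "Jongno"],
    "Venue Category": ["Cafe", "Hotel", "Cafe"],
})
print(one_hot_venues(example))
```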

This turns each category value into its own column, set to 1 if the row had that value in the original column and 0 if it did not. Now we group the rows by district and take the mean, which gives each category's proportion of the venues in that district.

The Code
Some of the result
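Because the columns hold only 0 and 1, the per-district mean is the same as the share of that district's venues in each category. A minimal sketch, with a hypothetical function name and made-up data:

```python
import pandas as pd


def category_share(onehot):
    """Mean of the 0/1 columns per district = share of venues in each category."""
    return onehot.groupby("District").mean().reset_index()


# Made-up one-hot table: district A has one cafe and one hotel, B has one cafe.
onehot_example = pd.DataFrame({
    "District": ["A", "A", "B"],
    "Cafe": [1, 0, 1],
    "Hotel": [0, 1, 0],
})
print(category_share(onehot_example))
```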

After that, we sort the categories within each district and find its ten most common venues. This is the code and some of the results:

Code To Find Top 10 Venue
Some Result of Top 10 In Each District of Seoul
Some Result of Top 10 in Each District of Tokyo
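One way to sketch this ranking step is shown below. It assumes a table with a `District` column followed by one share column per category; the function name and the output column names are illustrative.

```python
import pandas as pd


def most_common_venues(grouped, top_n=10):
    """For every district, list its categories from most to least frequent."""
    shares = grouped.set_index("District")
    top_n = min(top_n, shares.shape[1])  # cannot rank more categories than exist
    rows = []
    for district, row in shares.iterrows():
        ranked = row.sort_values(ascending=False).index[:top_n]
        rows.append([district, *ranked])
    columns = ["District"] + [f"Most Common Venue {i}" for i in range(1, top_n + 1)]
    return pd.DataFrame(rows, columns=columns)


# Made-up shares for two districts:
grouped_example = pd.DataFrame({
    "District": ["A", "B"],
    "Cafe": [0.5, 0.1],
    "Hotel": [0.3, 0.2],
    "Park": [0.2, 0.7],
})
print(most_common_venues(grouped_example, top_n=2))
```

Categories with equal shares may come out in either order, since the sort does not break ties in any particular way.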

We have the data, so let's cluster it to find each city's characteristics. I will use the K-Means method to find the clusters and will try to label each cluster later.

K-Means Code
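A minimal K-Means sketch with scikit-learn, assuming the per-district share table from the grouping step. The number of clusters is not stated in this post, so `n_clusters=3` is only a placeholder, and the function name is hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_districts(grouped, n_clusters=3, random_state=0):
    """Fit K-Means on the category shares and return one label per district."""
    features = grouped.drop(columns="District")
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)
    index = pd.Index(grouped["District"], name="District")
    return pd.Series(labels, index=index, name="Cluster Labels")


# Made-up shares: A and B are cafe-heavy, C and D are store-heavy.
shares_example = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "Cafe": [0.9, 0.8, 0.1, 0.0],
    "Convenience Store": [0.1, 0.2, 0.9, 1.0],
})
print(cluster_districts(shares_example, n_clusters=2))
```

The label numbers themselves are arbitrary; only which districts share a label carries meaning.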

Then merge the result into the table we created before with this code:

Code for Merge
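The merge could be sketched like this, assuming the district table uses a `Name` column and the top-venues table uses `District` (both column names and the function name are assumptions). A left merge keeps districts for which no venues were returned; their cluster label is simply missing.

```python
import pandas as pd


def attach_clusters(districts, top_venues, labels, name_column="Name"):
    """Join coordinates, cluster labels and top venues into one table."""
    labeled = top_venues.copy()
    # `labels` must be a list or array in the same row order as `top_venues`.
    labeled.insert(0, "Cluster Labels", list(labels))
    merged = districts.merge(
        labeled, left_on=name_column, right_on="District", how="left"
    )
    return merged.drop(columns="District")


# Made-up tables: district C has no venue rows, so it gets no cluster.
districts_example = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Lat": [37.5, 37.6, 37.7],
    "Lng": [127.0, 127.1, 127.2],
})
top_example = pd.DataFrame({
    "District": ["A", "B"],
    "Most Common Venue 1": ["Cafe", "Park"],
})
print(attach_clusters(districts_example, top_example, labels=[1, 0]))
```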

And here are some of the results:

Tokyo Table
Seoul Table

Let's plot it on the map again to help us visualize it. Here's the code and the result:

Code to Map The Result
Map of Seoul
Map of Tokyo
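A rough sketch of a cluster-colored map, assuming a merged table with `Name`, `Lat`, `Lng`, and `Cluster Labels` columns. The palette, marker size, and function names are arbitrary choices for the example.

```python
def cluster_color(label, palette=("red", "blue", "green", "purple", "orange")):
    """Pick a marker color for a cluster label, cycling if there are many clusters."""
    return palette[int(label) % len(palette)]


def plot_clusters(merged, name_column="Name", zoom_start=11):
    """Return a folium map with one colored circle per clustered district."""
    import folium  # imported here so cluster_color works without folium installed

    # Districts without coordinates or without a cluster cannot be drawn.
    rows = merged.dropna(subset=["Lat", "Lng", "Cluster Labels"])
    cluster_map = folium.Map(
        location=[rows["Lat"].mean(), rows["Lng"].mean()], zoom_start=zoom_start
    )
    for _, row in rows.iterrows():
        color = cluster_color(row["Cluster Labels"])
        folium.CircleMarker(
            location=[row["Lat"], row["Lng"]],
            radius=6,
            popup=f"{row[name_column]} (cluster {int(row['Cluster Labels'])})",
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map
```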

I gave each cluster a different color. Because I placed the labels at the district coordinates, it is hard to see the area of each district and its cluster. Later, I will try using geospatial data to draw the boundary of each district and show its top venue in a popup, so the map is more informative.

The Cluster of Each City

K-Means is an unsupervised machine learning method: it groups the districts into clusters and gives us the result. I will try to improve the method and the data so it can give more accurate results.

This is the result of K-Means. I am still thinking about how to label each cluster, because K-Means gives us the clusters without telling us what they mean. So the labels are temporary, and maybe you can give me some tips or pointers about them.

Tokyo Cluster

Seoul Cluster

Result

Now that you know some characteristics of each district, you can choose the one you would prefer to live in. They all have their own unique characteristics. I also found something else beyond the clusters.

Result of Tokyo

In Tokyo, you will find that Convenience Store is the most common venue in almost every district. You can find one within a 5-minute walk of wherever you stay, and visiting different convenience stores is an experience in itself. While in Seoul:

Result of Seoul

You will find that Seoul has a lot of cafes and restaurants with a unique vibe, which makes for a good experience.

Thank you for reading it…

Alvin
