P.S.: The banner image was obtained here. Credit goes to its creators.

Context

Peru is one of the best culinary destinations in the world. The country has diverse climates and ecological zones in which a wide variety of crops has been developed, so it has many natural and unique ingredients. Peruvian food is a cuisine of opposites: hot and cold on the same plate, acidic tastes melding with the starchy, robust and delicate at the same time. This balance occurs because traditional Peruvian food relies on spices and bold flavors, ranging from the crisp and clean to the heavy and deep. Each flavor counters or tames the other. While many people see Peru as a land of cloud-topped mountains and ruins of ancient civilizations, Peru’s true treasure is its rich culinary heritage. Ingredients and cooking techniques from Africa, Europe, and East Asia come together in a delightful melange that is utterly unique the world over. But what kind of food do Peruvians eat? And which restaurants should you visit? 😊

Content

This Kaggle dataset contains information scraped from GooglePlaces and Tripadvisor using Selenium, Requests, BeautifulSoup and Rvest. More information about the web scraping is available in this GitHub repository. The dataset contains many (at the moment, not all) of the restaurant reviews in Lima, Peru, written between 2010 and 2021. In total there are more than 8,791 restaurants and more than 1,160,666 reviews, described by 20 features of diverse types: geospatial, text, date, categorical and numeric!

The dataset has two general sections. The first is Restaurants, which contains general and geospatial information. The second is Reviews, which contains the interactions between users and restaurants; from these it is possible to see how satisfied a client was with a service. A possible third section, Users, may be added within two months.

The collection methodology is explained below:

-The sample: The scraped reviews are the most recent reviews from as many restaurants as possible in the province of Lima.

-Set of items: On one side, the users; on the other, the restaurants.

-Set of variables: There are two general tables. See the information below.

The following diagram and tables summarise everything.

Table 1: Restaurants

Variable	Description
Id	Id of the restaurant
Name	Name of the restaurant
Tag	The category of the restaurant
x, y	Geospatial coordinates giving the exact location of the restaurant
District	District where the restaurant is located
Direction	Address of the restaurant
Stars	Mean star rating of the restaurant over all time
N_reviews	Number of reviews of the restaurant over all time
Min_Price	Minimum price on the restaurant's menu
Max_Price	Maximum price on the restaurant's menu
Platform	Platform from which the information was downloaded

Table 2: Reviews

Variable	Description
Id_review	Id of the review
Id_nick	Id of the user. It can be used to obtain the profile link
Date	Date when the review was written
Service	Id of the restaurant. Connection with Table 1
Review	Content of the review. It describes the satisfaction of the user
Title	Title of the review. Only available for Tripadvisor
Score	Rating given in the review
Likes	Number of votes on the review
Platform	Platform from which the information was downloaded
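As a minimal sketch of how the two tables connect, the snippet below joins each review to its restaurant through Service (Table 2) and Id (Table 1). The rows are toy values invented for illustration, and the helper names read_rows and join_reviews_to_restaurants are hypothetical; only the column names come from the tables above. It also assumes that Id is unique across platforms, which the description does not state.

```python
import csv
import io

# Toy rows invented for illustration; only the column names come from Tables 1 and 2.
RESTAURANTS_CSV = """Id,Name,District,Platform
r1,Restaurant A,Miraflores,Tripadvisor
r2,Restaurant B,Barranco,GooglePlaces
"""

REVIEWS_CSV = """Id_review,Service,Score,Platform
v1,r1,5,Tripadvisor
v2,r1,3,Tripadvisor
v3,r2,4,GooglePlaces
v4,r9,2,GooglePlaces
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_reviews_to_restaurants(reviews, restaurants):
    """Attach Name and District to each review via Service -> Id."""
    # Assumption: Id is unique in the restaurants table.
    by_id = {restaurant["Id"]: restaurant for restaurant in restaurants}
    joined = []
    for review in reviews:
        restaurant = by_id.get(review["Service"])
        if restaurant is None:
            continue  # the review points to a restaurant missing from Table 1
        joined.append({**review,
                       "Name": restaurant["Name"],
                       "District": restaurant["District"]})
    return joined


joined = join_reviews_to_restaurants(read_rows(REVIEWS_CSV),
                                     read_rows(RESTAURANTS_CSV))
for row in joined:
    print(row["Id_review"], row["Name"], row["District"], row["Score"])
```

Reviews whose Service has no match in Table 1 are dropped here; keeping them with empty restaurant fields would be an equally reasonable choice.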

Auxiliary information related to sentiment and emotion is also available. These probabilities were obtained with a Spanish NrcLexicon; however, the results are not good. They are still useful as a reference, and you can propose a fine-tuning here. In addition, there is the probability of a review receiving a specific number of stars, which was obtained with a simple logistic regression. I have also included the information about the Spanish NrcLexicon and the Geospatial Borders. The authors and more information can be found there and there.

Table 3: Models

Variable	Description
Id_review	Id of the review. Connection with Table 2
Positive	Probability that the review shows positive sentiment
Negative	Probability that the review shows negative sentiment
Anger	Probability that the review shows the anger emotion
Anticipation	Probability that the review shows the anticipation emotion
Disgust	Probability that the review shows the disgust emotion
Fear	Probability that the review shows the fear emotion
Joy	Probability that the review shows the joy emotion
Sadness	Probability that the review shows the sadness emotion
Surprise	Probability that the review shows the surprise emotion
Stars_1	Probability that the review gets 1 star
Stars_2	Probability that the review gets 2 stars
Stars_3	Probability that the review gets 3 stars
Stars_4	Probability that the review gets 4 stars
Stars_5	Probability that the review gets 5 stars
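One way to read a row of Table 3 is to take the column with the highest probability in each group. The sketch below does this for the star columns and the emotion columns. The example row holds made-up numbers, and the helper names most_likely and predicted_stars are hypothetical; only the column names come from Table 3.

```python
STAR_COLUMNS = ["Stars_1", "Stars_2", "Stars_3", "Stars_4", "Stars_5"]
EMOTION_COLUMNS = ["Anger", "Anticipation", "Disgust", "Fear",
                   "Joy", "Sadness", "Surprise"]


def most_likely(row, columns):
    """Return the column name with the highest probability in the row."""
    return max(columns, key=lambda column: float(row[column]))


def predicted_stars(row):
    """Turn the most probable Stars_k column into the integer k."""
    return int(most_likely(row, STAR_COLUMNS).split("_")[1])


# Made-up probabilities, stored as strings the way a CSV reader returns them.
example_row = {
    "Id_review": "v1",
    "Stars_1": "0.05", "Stars_2": "0.05", "Stars_3": "0.10",
    "Stars_4": "0.50", "Stars_5": "0.30",
    "Anger": "0.02", "Anticipation": "0.20", "Disgust": "0.01",
    "Fear": "0.01", "Joy": "0.60", "Sadness": "0.03", "Surprise": "0.13",
}

print(predicted_stars(example_row))
print(most_likely(example_row, EMOTION_COLUMNS))
```

Comparing predicted_stars with the Score column of Table 2 would give a quick check of how far the logistic regression is from the real ratings.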

The entity relationship diagram:

[Figure: entity relationship diagram (ERD)]

Usage

Text classification: The main task for this type of dataset. Vectorize the reviews and train a predictive model. Identify the strong and weak points of each restaurant.
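A minimal sketch of this idea, using only the Python standard library: reviews are turned into bags of words and classified with a small multinomial Naive Bayes with add-one smoothing. This is a swapped-in illustration, not the method used to build the dataset. The training sentences are invented, and the "positive" and "negative" labels are an assumption (they could, for example, be derived from Score).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters, including Spanish accents."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                score += math.log((counts[word] + 1)
                                  / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training reviews; the labels are hypothetical.
train_texts = [
    "la comida estuvo deliciosa y la atención excelente",
    "excelente ceviche muy rico",
    "la comida estuvo fría y la atención pésima",
    "muy mala atención y demoraron mucho",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("ceviche delicioso y excelente servicio"))
print(model.predict("atención pésima y comida fría"))
```

With the real data, a library such as scikit-learn would be the more practical choice; the pure-Python version is only meant to make the vectorize-then-predict steps visible.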

Find patterns: Compare districts (or restaurants) over time. What are the common words in reviews of an excellent restaurant? Why are these restaurants better?
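The two questions above can be approached with simple aggregations. The sketch below computes the mean Score per District and the most frequent words in highly rated reviews. It assumes the reviews have already been joined with their restaurant's District; the rows, the stopword list and the function names are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Invented rows: reviews already joined with the District of their restaurant.
rows = [
    {"District": "Miraflores", "Score": "5", "Review": "ceviche fresco y delicioso"},
    {"District": "Miraflores", "Score": "4", "Review": "ceviche muy bueno"},
    {"District": "Barranco", "Score": "2", "Review": "servicio lento"},
    {"District": "Barranco", "Score": "5", "Review": "ceviche delicioso"},
]

STOPWORDS = {"y", "muy"}  # tiny hypothetical stopword list


def mean_score_by_district(rows):
    """Average the Score of all reviews in each district."""
    totals = defaultdict(lambda: [0.0, 0])  # district -> [sum, count]
    for row in rows:
        entry = totals[row["District"]]
        entry[0] += float(row["Score"])
        entry[1] += 1
    return {district: total / n for district, (total, n) in totals.items()}


def top_words(rows, min_score, k, stopwords=STOPWORDS):
    """Most common non-stopword tokens in reviews scoring at least min_score."""
    counts = Counter()
    for row in rows:
        if float(row["Score"]) >= min_score:
            words = re.findall(r"\w+", row["Review"].lower())
            counts.update(w for w in words if w not in stopwords)
    return counts.most_common(k)


print(mean_score_by_district(rows))
print(top_words(rows, min_score=4, k=2))
```

Grouping by the year of Date in addition to District would extend the same aggregation to comparisons over time.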

Dimensionality reduction: Detect similarities and then cluster the reviews.
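The similarity and clustering part can be sketched as follows: each review becomes a bag-of-words vector, reviews are compared with cosine similarity, and a greedy threshold rule groups them. The greedy grouping is a deliberately simple stand-in for a proper clustering algorithm, and no dimensionality reduction is performed here. The review texts, the threshold and the function names are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase word tokens of a review."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def greedy_clusters(texts, threshold):
    """Put each text in the first cluster whose first member is similar enough."""
    vectors = [bag_of_words(text) for text in texts]
    clusters = []  # lists of indices; the first index acts as representative
    for i, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, start a new one
    return clusters


# Invented review texts.
reviews = [
    "ceviche fresco y rico",
    "servicio lento y caro",
    "ceviche rico y fresco",
    "servicio muy lento",
]

print(greedy_clusters(reviews, threshold=0.5))
```

The result depends on the order of the reviews and on the threshold, which is one reason to prefer an established clustering method on the full dataset.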

Acknowledgements

Thanks to Kaggle and its community. In general, thanks to the learners and teachers in machine learning, deep learning, natural language processing and computer vision.

Inspiration

Natural language processing is a great tool. One application that interests me is detecting bullying messages on any social network. I know that many notebooks and papers already exist, but I'd like to build a bot that detects all possible cases, and surely they exist!
