Data description
P.S.: The banner image was obtained here. Credit goes to its authors.
Context
Peru is one of the best culinary destinations in the world. The country has diverse climates and ecological floors, where a wide variety of crops have been developed, so it has many natural and unique ingredients. As a result, Peruvian food is a cuisine of opposites: hot and cold on the same plate, acidic tastes melding with the starchy, robust and delicate at the same time. This balance occurs because traditional Peruvian food relies on spices and bold flavors, ranging from the crisp and clean to the heavy and deep. Each flavor counters or tames the other. While many people see Peru as a land of cloud-topped mountains and ruins of ancient civilizations, Peru’s true treasure is its rich culinary heritage. Ingredients and cooking techniques from Africa, Europe, and East Asia come together in a delightful melange that is utterly unique the world over. But what kind of food do Peruvians eat? And which restaurants should you visit? 😊
Content
This Kaggle dataset contains information scraped from Google Places and Tripadvisor using Selenium, Requests, BeautifulSoup and rvest. More information about the web scraping is available in this GitHub repository. The dataset contains many (but at the moment not all) restaurant reviews in Lima, Peru, written between 2010 and 2021. In total there are more than 8,791 restaurants and more than 1,160,666 reviews, with 20 highly diverse features: geospatial, text, date, categorical and numeric!
The dataset has two general sections. The first is Restaurants, which contains general and geospatial information. The second is Reviews, which contains the interactions between users and restaurants; these make it possible to see how satisfied a client was with a service. There is a possible third section, Users; this information may be added in two months.
The collection methodology is explained below:
-The sample: The scraped reviews are the most recent reviews for all possible restaurants in the province of Lima.
-Set of items: On the one hand, the users; on the other, the restaurants.
-Set of variables: There are two general tables. See the information below.
The following diagram and tables summarise everything.
Table 1: Restaurants
Variable | Description |
---|---|
Id | Id of the restaurant |
Name | Name of the restaurant |
Tag | The category of the restaurant |
x, y | Geospatial information: the exact location of the restaurant |
District | District where the restaurant is located |
Direction | Address of the restaurant |
Stars | Mean star rating of the restaurant over all time |
N_reviews | Number of reviews of the restaurant over all time |
Min_Price | Minimum price on the restaurant's menu |
Max_Price | Maximum price on the restaurant's menu |
Platform | Platform where the information was downloaded |
Table 2: Reviews
Variable | Description |
---|---|
Id_review | Id of the review |
Id_nick | Id of the user. It can be used to obtain the profile link |
Date | Date when the review was written |
Service | Id of the restaurant. Connection with Table 1 |
Review | Content of the review. This describes the satisfaction of the user |
Title | Title of the review. Only available on Tripadvisor |
Score | Score (rating) given in the review |
Likes | Number of votes the review received |
Platform | Platform where the information was downloaded |
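The link between the two tables (Service in Table 2 points to Id in Table 1) can be illustrated with a minimal sketch. The rows below are hypothetical sample data, not taken from the dataset, and the helper name `mean_score_by_restaurant` is an assumption introduced only for this example.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names follow Tables 1 and 2.
restaurants = [
    {"Id": 1, "Name": "Restaurant A", "District": "Miraflores"},
    {"Id": 2, "Name": "Restaurant B", "District": "Barranco"},
]
reviews = [
    {"Id_review": 10, "Service": 1, "Score": 5},
    {"Id_review": 11, "Service": 1, "Score": 3},
    {"Id_review": 12, "Service": 2, "Score": 4},
]

def mean_score_by_restaurant(restaurants, reviews):
    """Join reviews to restaurants on Service == Id and average the Score."""
    names = {r["Id"]: r["Name"] for r in restaurants}
    scores = defaultdict(list)
    for review in reviews:
        # Skip reviews whose restaurant is missing from Table 1.
        if review["Service"] in names:
            scores[review["Service"]].append(review["Score"])
    return {names[rid]: sum(vals) / len(vals) for rid, vals in scores.items()}

mean_scores = mean_score_by_restaurant(restaurants, reviews)
print(mean_scores)
```

With the real files, the same join would probably be done with a dataframe library; the plain-dictionary version only shows which columns are matched.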
There is also auxiliary information related to sentiment and emotion. These probabilities were obtained with a Spanish NrcLexicon; however, the results are not good. They are only a reference, and you can propose a fine-tuning here. In addition, there is the probability of a review getting a specific number of stars; this was obtained with a simple logistic regression. I have also included the information about the Spanish NrcLexicon and the geospatial borders. The authors and more information can be found there and there.
Table 3: Models
Variable | Description |
---|---|
Id_review | Id of the review. Connection with Table 2 |
Positive | Probability that the review shows positive sentiment |
Negative | Probability that the review shows negative sentiment |
Anger | Probability that the review shows the anger emotion |
Anticipation | Probability that the review shows the anticipation emotion |
Disgust | Probability that the review shows the disgust emotion |
Fear | Probability that the review shows the fear emotion |
Joy | Probability that the review shows the joy emotion |
Sadness | Probability that the review shows the sadness emotion |
Surprise | Probability that the review shows the surprise emotion |
Stars_1 | Probability that the review gets 1 star |
Stars_2 | Probability that the review gets 2 stars |
Stars_3 | Probability that the review gets 3 stars |
Stars_4 | Probability that the review gets 4 stars |
Stars_5 | Probability that the review gets 5 stars |
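One way to read the Stars_1 to Stars_5 columns is to turn them into a single predicted rating. The sketch below is only an illustration: the row is made up, and the helper names `predicted_stars` and `expected_stars` are assumptions, not part of the dataset.

```python
# Hypothetical row from Table 3; the probabilities are invented for illustration.
model_row = {
    "Id_review": 10,
    "Stars_1": 0.05, "Stars_2": 0.05, "Stars_3": 0.10,
    "Stars_4": 0.30, "Stars_5": 0.50,
}

def predicted_stars(row):
    """Return the star value (1 to 5) with the highest probability."""
    return max(range(1, 6), key=lambda s: row[f"Stars_{s}"])

def expected_stars(row):
    """Return the probability-weighted average star value."""
    return sum(s * row[f"Stars_{s}"] for s in range(1, 6))

print(predicted_stars(model_row))
print(round(expected_stars(model_row), 2))
```

Comparing either value against the Score column of Table 2 (joined on Id_review) would give a rough idea of how well the simple logistic regression fits.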
The entity relationship diagram!
Usage
Text classification: The main topic for this type of dataset. Vectorize the reviews and define a predictive model. Identify the strong and weak points of each restaurant.
Find patterns: Compare districts (or restaurants) over time. What are the common words in reviews of an excellent restaurant? Why are these restaurants better?
Dimensionality reduction: Detect similarities and then cluster the reviews.
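As a starting point for the text classification idea, here is a minimal sketch of a bag-of-words multinomial naive Bayes classifier written with the standard library only. It is not the method used to build Table 3. The example reviews are made up, the assumption that reviews are in Spanish comes from the use of a Spanish lexicon, and the labels stand in for a hypothetical threshold on the Score column.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def train_naive_bayes(texts, labels):
    """Count words per label for a multinomial naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up Spanish reviews; labels stand in for a hypothetical Score threshold.
texts = [
    "La comida estuvo deliciosa y la atención excelente",
    "Excelente ceviche, muy rico",
    "Pésima atención y comida fría",
    "Muy mala experiencia, comida fría",
]
labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(texts, labels)
print(predict(model, "ceviche delicioso, excelente"))
print(predict(model, "atención mala y fría"))
```

On the full dataset, a library vectorizer and a stronger model would likely be preferable; the per-label word counts in this sketch can also be inspected directly to see which words are common in well-rated reviews.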
Acknowledgements
Thanks to Kaggle and its community. In general, thanks to the learners and teachers in machine learning, deep learning, natural language processing and computer vision.
Inspiration
Natural language processing is a great tool. One application I'm interested in is detecting bullying messages on any social network. I know that many notebooks and papers exist, but I'd like to build a bot that detects all possible cases, and such cases surely exist!
