The following is the data validation report that the seller chose to provide:
Data description
Description
Dyadic data arise in many contexts. In social networks, users are linked to a variety of items, and these links define interactions. On the TripAdvisor platform, users are linked to restaurants through the reviews they post. The information in these interactions yields valuable insights for forecasting and supports tasks related to recommender systems, sentiment analysis, text-based personalisation and text summarisation, among others. However, public TripAdvisor datasets are scarce and there are no well-known benchmarks for model assessment. We present six new TripAdvisor datasets covering the restaurants of six different cities: London, New York, New Delhi, Paris, Barcelona and Madrid.
Important notice
If you use these data, please cite the datasets using the associated Zenodo DOI.
If you use these data, please also cite the related paper, currently under submission (arXiv preprint): Botana, Iñigo López-Riobóo, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas, and Amparo Alonso-Betanzos. "Explain and Conquer: Personalised Text-based Reviews to Achieve Transparency." arXiv preprint arXiv:2205.01759 (2022).
Please note that these datasets are released under a CC BY-NC 4.0 International license. You must NOT use the material for commercial purposes.
Dataset image by extravigator.com
Features
We collected only the reviews written in English for the restaurants of each city. The tabular data consists of six CSV files, one per city, containing numerical, categorical and text features:
- parse_count: numerical (integer), sequential number of the review as extracted by the web scraper (auto-incremented)
- author_id: categorical (string), unique, incremental and anonymous identifier of the user (UID_XXXXXXXXXX)
- restaurant_name: categorical (string), name of the restaurant matching the review
- rating_review: numerical (integer), review score in the range 1-5
- sample: categorical (string), indicating “positive” sample for scores [4-5] and “negative” for scores [1-3]
- review_id: categorical (string), unique internal identifier of the review (review_XXXXXXXXX)
- title_review: text, review title
- review_preview: text, preview of the review, as truncated by the website when the text is very long
- review_full: text, complete review
- date: timestamp, publication date of the review in the format (day, month, year)
- city: categorical (string), city of the restaurant that the review was written for
- url_restaurant: text, URL of the restaurant
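The relationship between rating_review and sample described above can be checked with a short script. The following is a minimal sketch using only the Python standard library. The column names come from the feature list, but the rows in SAMPLE_CSV are synthetic placeholders (not taken from the datasets), and the helper names expected_sample and find_label_mismatches are hypothetical.

```python
import csv
import io

# Synthetic rows for illustration only; column names follow the feature list.
SAMPLE_CSV = """parse_count,author_id,restaurant_name,rating_review,sample,review_id
1,UID_0,Example Bistro,5,positive,review_1
2,UID_1,Example Bistro,3,negative,review_2
3,UID_0,Sample Tavern,4,positive,review_3
"""


def expected_sample(rating: int) -> str:
    """Map a 1-5 review score to the documented 'sample' label."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    # Scores 4-5 are "positive"; scores 1-3 are "negative".
    return "positive" if rating >= 4 else "negative"


def find_label_mismatches(csv_file) -> list:
    """Return the review_id of every row whose label disagrees with its score."""
    mismatches = []
    for row in csv.DictReader(csv_file):
        rating = int(row["rating_review"])
        if row["sample"].strip().lower() != expected_sample(rating):
            mismatches.append(row["review_id"])
    return mismatches


mismatches = find_label_mismatches(io.StringIO(SAMPLE_CSV))
print(mismatches)
```

To run the same check on one of the real files, pass an open file handle (for example, one opened with newline="" and an explicit encoding) in place of the io.StringIO object.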
Additional information
This research has been financially supported in part by the Spanish Government [grant number PID2019-109238GB-C22]; by the Xunta de Galicia [grant number ED431G 2019/01 - Research Center on Information and Communication Technologies (CITIC)]; and by European Union ERDF Funds. Special recognition goes to the Spanish Ministerio de Universidades for the predoctoral FPU funds [grant number FPU19/01457].
For the data collection, we designed our own web scraper, combining the Scrapy Python framework with the Selenium WebDriver testing tool. Participants' data have been anonymized. We added the field "author_id" as the incremental, unique and anonymous identifier of each user (UID_XXXXXXXXXX).
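An incremental, unique and anonymous identifier of this kind could be assigned as in the sketch below. This is an illustration and not the authors' actual procedure, which is not described here. The ten-digit zero padding (inferred from the UID_XXXXXXXXXX placeholder), the starting value of 1, and the names make_anonymizer and anonymize are all assumptions.

```python
from itertools import count


def make_anonymizer(width: int = 10):
    """Return a function that maps raw user handles to stable UID_ identifiers.

    The width of 10 digits is an assumption based on the UID_XXXXXXXXXX pattern.
    """
    counter = count(1)  # assumed starting value
    assigned = {}       # raw handle -> anonymous identifier

    def anonymize(raw_handle: str) -> str:
        # The first time a handle is seen it gets the next number;
        # later occurrences reuse the same identifier.
        if raw_handle not in assigned:
            assigned[raw_handle] = f"UID_{next(counter):0{width}d}"
        return assigned[raw_handle]

    return anonymize


anonymize = make_anonymizer()
ids = [anonymize(handle) for handle in ["alice", "bob", "alice"]]
print(ids)
```

Because the mapping lives only in memory, the original handles cannot be recovered from the identifiers once the mapping is discarded.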
