Data description
Context
The spread of COVID-19 worried all of us 😷. In that sense, a zombie pandemic has always been a popular theme. It is certainly a horrible way for our existence to end, so these stories were very violent and centered on characters trying to survive. That's great; however, in this century many projects have added another facet: the social and psychological consequences for the characters in that world.
That's how we got here. The Last of Us is a masterpiece of the video game industry, and many experts, critics and websites agree. Its story is based precisely on that hopeless, post-apocalyptic situation. One strong point was its exploration of this type of event. Another point, no less important, was the gameplay and the interactions. The game won many awards and was perhaps a pioneer in its category 🙌. You can look for the reasons behind its success in the section reviews_g1 and then derive insights for future similar games.
The following year a DLC was released: Left Behind. It's a prologue to the events of the original game, with Ellie as the main character. It helps the player understand the character and her actions better. The game was well received. You can analyze it in the section reviews_lb and identify the reviews about Ellie and her friendship. 😄
Finally, The Last of Us Part II (the reason I wanted to create this dataset). It shows sharply opposing reviews 🤔. It's striking to see such high divergence. Personally, I like this game too: it has incredible graphics and is very realistic. But I understand the other point of view; you surely know some of the reasons, such as the inconsistency in character decisions or the changes from the trailers. There are other reasons too. You can analyze them in depth in the section reviews_g2 and, if possible, propose a predictive model. In this case you can start here.
Now a series will be released. All of us hope it'll be a success 🎉🎉
Content
This Kaggle dataset contains information scraped from Metacritic using Scrapy and BeautifulSoup. More information about the web scraping is in this GitHub repository. The dataset contains 3 main sections: The Last of Us Part II, The Last of Us, and The Last of Us: Left Behind, each of which contains two types of files: users and critics.
The collection methodology is explained below:
- The sample: The scraped reviews are the most recommended reviews. In one case it was possible to download all reviews, but in the other cases it was not (it is technically possible, but it is not good to abuse web scraping on a website). However, the retrieved information is sufficient for further analysis. Across the 6 files there are a total of 40000 observations and 8 variables. Have fun!
- Set of items: The game's users and/or fans of the sequel (or critics). Some may be bots, but that is just a hypothesis. Also, user reviews far outnumber critic reviews.
- Set of variables: All user data contains the following variables.
Variable | Description |
---|---|
Id | The nickname of the game user. It is a unique value |
Review | The text of the user's review |
Type_review | Some reviews are long or contain spoilers. Those are marked expanded; the rest are normal. |
Views | Number of views of the review |
Votes | Number of votes the review received |
Date | Date when the review was published |
Language | Language used in the review |
Score | Score given by the user. The target |
The critic data contains only Id, Review, Date and Score.
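The variable table above can be mirrored in a small typed record. This is only a minimal sketch: it assumes the user files are CSVs whose header row uses the variable names exactly as listed, and the sample rows, the date format and the numeric types are hypothetical, not taken from the real files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class UserReview:
    """One row of a user-review file, following the variable table."""
    id: str
    review: str
    type_review: str  # "expanded" or "normal"
    views: int
    votes: int
    date: str         # kept as text; the real date format is not documented
    language: str
    score: float      # the target variable


def parse_user_reviews(handle):
    """Read rows from an open CSV handle into UserReview records."""
    reader = csv.DictReader(handle)
    return [
        UserReview(
            id=row["Id"],
            review=row["Review"],
            type_review=row["Type_review"],
            views=int(row["Views"]),
            votes=int(row["Votes"]),
            date=row["Date"],
            language=row["Language"],
            score=float(row["Score"]),
        )
        for row in reader
    ]


# Hypothetical two-row sample, only to show the expected shape.
SAMPLE = """Id,Review,Type_review,Views,Votes,Date,Language,Score
player_one,"Great story, great gameplay.",normal,12,10,2020-06-19,English,10
player_two,"Long review with spoilers...",expanded,30,4,2020-06-20,English,3
"""

reviews = parse_user_reviews(io.StringIO(SAMPLE))
```

For the critic files, the same idea would apply with only the Id, Review, Date and Score fields.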
An update: I created new files. These are the files whose names end in u. They are duplicates of the originals; I only added two new variables:
Variable | Description |
---|---|
Platform | The set now contains information about PS3 and PS4 reviews |
Split | For the modeling and the tasks. |
P.S. 1: Please check out the tasks. If you are interested, please contribute a notebook 😊. If the dataset is not enough and you think more variables are needed, please let me know in the discussions.
P.S. 2: The Id is no longer unique in tables with the Platform variable. It is a gamer ID, and the same gamer can write a review on both platforms.
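Since the Id can repeat once Platform is present, it may help to check which key actually identifies a row before joining or deduplicating. The sketch below uses hypothetical rows and a hypothetical helper, `duplicated_keys`. Whether the pair (Id, Platform) is unique in the real files is an assumption to verify, not something the dataset guarantees.

```python
from collections import Counter

# Hypothetical rows: gamer_a reviewed the game on both platforms.
rows = [
    {"Id": "gamer_a", "Platform": "ps3"},
    {"Id": "gamer_a", "Platform": "ps4"},
    {"Id": "gamer_b", "Platform": "ps4"},
]


def duplicated_keys(rows, key_fields):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


dup_by_id = duplicated_keys(rows, ["Id"])                # Id alone repeats
dup_by_pair = duplicated_keys(rows, ["Id", "Platform"])  # the pair does not, in this sample
```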
Usage
- Text classification: The main topic for this type of dataset. Vectorize the reviews and define a predictive model.
- Identify strong and weak points of the game.
- Compare the games: Which is preferred? On what points? Why is one game better than another?
- Dimensionality reduction: Detect similar words and then cluster the reviews.
P.S.: Important. Use discretion. Some reviews are disrespectful, violent and difficult to read 😅. And obviously they contain spoilers.
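As an illustration of the "vectorize the reviews and define a predictive model" idea, here is a minimal sketch using a bag-of-words multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The technique is my choice for the example, not something the dataset prescribes. The training texts are invented, and turning the numeric Score into "positive"/"negative" labels is an assumption made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(texts, labels):
    """Count words per label (bag of words) and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented mini training set; real labels would be derived from Score.
train_texts = [
    "amazing graphics and a great story",
    "great gameplay, a masterpiece",
    "terrible pacing and a boring story",
    "boring and terrible character decisions",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(train_texts, train_labels)
```

On the real data, the Split variable could decide which rows go into training and which are held out, and a library vectorizer and classifier would likely replace this hand-rolled version.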
Acknowledgements
Thanks to Kaggle and its community. In general, thanks to the learners and teachers in machine learning, deep learning and computer vision.
Inspiration
Natural language processing is a great tool. One application I'm interested in is detecting bullying messages on any social network. I know that many notebooks and papers exist, but I'd like to build a bot that detects all possible cases, and such cases surely exist!
