困困

Android apps metadata (50.000 apps)

computer science, mobile and wireless, programming, software

16.66MB

Data ID: D17220739584317216

Published: 2024/07/27


Data description

Context

I had a dataset of about 202 million Android smartphone logs (from more than 14.000 users) at my disposal, which we had to contextualize for academic research purposes. Since the database contained the name of each app as registered on the Android phone (e.g. com.nianticlabs.pokemongo), it was relatively easy to build a scraper to collect some additional info on the apps (e.g. genre of app, permissions of app, etc.). In total, I scraped metadata on more than 50.000 apps.

This dataset differs from other available app datasets (on Kaggle) in two ways:

  1. The scraper collected data from five different platforms, not just Google Play. This reduced the negative impact that legacy versions and discarded apps had on the number of missing values in the final dataset. The scraper followed a sequential scraping strategy, meaning that it started its search on the Play Store and only moved on to the other platforms, one after another, if the app was not available on the Google platform. All app categories are harmonized, with the Google Play categories functioning as the gold standard.

  2. I performed an extensive automated and manual quality check of the data obtained from these repositories (see the 'Content' paragraph). Although some of these checks are relatively automated (e.g. fuzzy matching), the most laborious check involved ranking the apps by popularity among the users in the database and looking for inconsistencies. For example, both legitimate sport apps (e.g. Strava) and sport games (e.g. FIFA) are categorized in Google Play as 'sports'. For this reason, I created an additional 'sport games' category. Another example is the creation of a separate dating-app category, as these apps are officially categorized as "lifestyle" (or sometimes "social") apps, which is not only inconsistent but above all vague. The new category column is the end result of this manual check.
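The sequential strategy from point 1 can be sketched roughly as follows. This is only an illustration in Python, not the original scraper (which, as described further down, was written in R with rvest). The fetcher functions are hypothetical stand-ins backed by made-up dictionaries, so nothing is actually downloaded.

```python
from typing import Callable, Optional

# A fetcher takes a package name and returns a metadata dict, or None if the
# repository has no page for that app. Real fetchers would do HTTP requests
# and HTML parsing; these are hypothetical stubs for illustration only.
Fetcher = Callable[[str], Optional[dict]]

NOT_FOUND = "not found in databases"


def scrape_sequentially(package: str, sources: list[tuple[str, Fetcher]]) -> dict:
    """Try each repository in order and stop at the first usable hit."""
    for name, fetch in sources:
        metadata = fetch(package)
        # The app category is treated as the minimum requirement for a hit.
        if metadata and metadata.get("category"):
            return {"package": package, "source": name, **metadata}
    # No repository knew the app: keep the row, but mark it as missing.
    return {"package": package, "source": None, "category": NOT_FOUND}


# Made-up repository contents (assumptions, not real scraped values).
fake_google_play = {"com.example.current": {"category": "communication"}}
fake_apkmonk = {"com.example.legacy": {"category": "tools"}}

sources = [
    ("Google Play", fake_google_play.get),
    ("apkmonk", fake_apkmonk.get),
]

for pkg in ["com.example.current", "com.example.legacy", "com.example.gone"]:
    print(scrape_sequentially(pkg, sources))
```

Keeping the sources in an ordered list makes the priority explicit: adding or reordering a repository is a one-line change.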

Given both the sequential scraping strategy and the multiple data quality checks performed, this is probably one of the most valid and extensive Android app datasets out there.
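To give an idea of how the harmonized and manually corrected category column could be derived, here is a minimal Python sketch. The mapping table, the override table and the package names in it are assumptions chosen for illustration; they are not taken from the dataset itself.

```python
# Hypothetical mapping from repository-specific labels to Google Play labels.
HARMONIZE = {
    "social networking": "social",
    "messaging": "communication",
}

# Hypothetical manual overrides, keyed by package name. These mimic the kind
# of corrections described above ('sport games', 'dating').
MANUAL_OVERRIDES = {
    "com.example.footballgame": "sport games",
    "com.example.datingapp": "dating",
}


def final_category(package: str, scraped_category: str) -> str:
    """Return the manual override if one exists, else the harmonized label."""
    if package in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[package]
    label = scraped_category.strip().lower()
    # Labels without a mapping entry are assumed to follow Google Play already.
    return HARMONIZE.get(label, label)


print(final_category("com.example.datingapp", "Lifestyle"))
print(final_category("com.example.runtracker", "Sports"))
print(final_category("com.example.chat", "Messaging"))
```

Applying the overrides last means a manual decision always wins over whatever a repository reported.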

Obviously, some variables were not available on some of the platforms. Here's a quick overview of the variables, including an indication of whether this specific parameter was available on the platform:

Around 2000 apps were not found in any repository, but are still included in the dataset (indicated by the "not found in databases" string in multiple columns).
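If you want to separate these apps from the rest, something along these lines should work. It is a sketch only: the column names and the two sample rows are made up, so adapt them to the actual header of the .csv file.

```python
import csv
import io

SENTINEL = "not found in databases"


def split_found(rows):
    """Split rows into (found, missing) based on the sentinel string."""
    found, missing = [], []
    for row in rows:
        # A row counts as missing if any of its columns holds the sentinel.
        if SENTINEL in row.values():
            missing.append(row)
        else:
            found.append(row)
    return found, missing


# Made-up sample with hypothetical column names, standing in for the real file.
sample = io.StringIO(
    "package,category,genre\n"
    "com.example.known,game,adventure\n"
    "com.example.unknown,not found in databases,not found in databases\n"
)

found, missing = split_found(csv.DictReader(sample))
print(len(found), "found,", len(missing), "missing")
```

For the real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`, where `path` points to your local copy.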

The .csv file is the original file format of the dataset, but since dealing with csv files is probably a major cause of anger fits among data analysts around the globe, I also included an Excel version of the file just in case.

If you would like to use this dataset for your own research, but you're afraid the reviewers will question the performed 'manual check', just cite one (or both) of these papers:

Boghe, K., De Grove, F., Herrewijn, L., & De Marez, L. (2020). Scraping application data from the web: Addressing the temporality of online repositories when working with trace data. Extended abstract presented at the 70th International Communication Association Conference.

Boghe, K., Herrewijn, L., De Grove, F., Van Gaeveren, K., & De Marez, L. (2020). Exploring the effect of in-game purchases on mobile game use with smartphone trace data. Media and Communication, 8(3). doi: 10.17645/mac.v8i3.3007

Citing these two references will probably (and hopefully) serve as some kind of previous validation/'vetting' for your reviewers.

Content

Since I wrote an extended abstract based on my experience with writing the scraper, I'll just shamelessly copy/paste a couple of paragraphs from said abstract to provide some additional info here.

"One of the main objectives of our scraper was to deal with the inherent temporality of web data and app marketplaces. Not only do apps gradually disappear from repositories, but subtle name changes and the existence of legacy versions complicate matters further. While Google Play serves as the gold standard for Android applications, the information value of this repository diminishes rapidly as the age of the historical data increases. For this reason, we go beyond Google’s Play Store and additionally use alternative repositories. Such repositories are often less well-maintained and thus contain information on legacy versions and deprecated Google Play apps.

In order to collect metadata for these apps, we used a sequential scraping strategy. Using the rvest package (Wickham, 2019), the script fetched information from five different online sources in a successive manner. The order was determined by the level of detail provided by the source, although the availability of the app category (e.g. social, communication, game) was a prerequisite for all platforms to be eligible for inclusion in our sample. Given these selection criteria, app pages were scraped from the following repositories (in order): (1) Google Play, (2) apkmonk, (3) APK Support, (4) APKsHub and (5) APKPure.

Despite the sequential scraping strategy, several alterations to the originally scraped data were necessary to ensure the consistency and validity of the data. First, app categories were harmonized across the different repositories, as most platforms adopt the Google Play categorisation. Second, we performed a fuzzy matching procedure to detect previously undetected legacy versions, regional versions or certain sub-processes of already detected apps. To accomplish this, we calculated a similarity matrix (based on the Jaro metric) between apps that were present versus not present in the online repositories. Subsequently, we coupled each missing app with its closest match. Although one could opt for an automated matching procedure, we preferred to manually check all matches with a Jaro similarity of at least 0.80. Apps matched through this approach adopted all metadata from their related app.

Next, we performed what we call basematching. This technique follows a similar procedure to fuzzy matching, but only considers the root name of the application (e.g. 'com.ea.tetrisfree_row' and 'com.ea.gp.starwarsbfcompanion' both refer to EA games). This allowed us to infer the appropriate application category.

Next, we employed two procedures that implied significant manual labour. First, we used the temporary results of the app scraper on a subset of around 100 million smartphone logs to produce a summary table of the most popular apps by category. We investigated the top 100 most popular apps for each category, including the missing applications. This procedure allowed us to identify the most prevalent and impactful inconsistencies in app categorisation, which yielded several significant findings.

All in all, the web scraper described here solves some issues stemming from the inherent temporality of web data. Although a large majority of the apps in our sample (75%) were successfully scraped from the preferred Google Play platform, we had to rely on less reliable and less information-rich repositories for almost 1 out of 5 (18%) apps. Using these and other matching strategies, we reduced the number of apps with missing data from 25% (only using Google Play) to less than 6%."
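For readers who want a feel for the fuzzy matching and basematching steps quoted above, here is a small Python sketch. It uses the textbook definition of the Jaro similarity and flags candidates at the 0.80 threshold for manual review. The example package lists are made up, and treating the first two dot-separated segments as the "root name" is my own assumption for this illustration; the original R code may have done this differently.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def closest_match(missing: str, known: list[str]) -> tuple[str, float]:
    """Couple a missing app with its most similar known app."""
    best = max(known, key=lambda candidate: jaro(missing, candidate))
    return best, jaro(missing, best)


def root_name(package: str, depth: int = 2) -> str:
    """Root of a package name; the default depth of 2 is an assumption."""
    return ".".join(package.split(".")[:depth])


# Made-up list of apps that were found in the repositories.
known_apps = ["com.ea.tetrisfree", "com.example.chat"]

best, score = closest_match("com.ea.tetrisfree_row", known_apps)
if score >= 0.80:
    print(f"review manually: com.ea.tetrisfree_row ~ {best} ({score:.2f})")

print(root_name("com.ea.gp.starwarsbfcompanion"))
```

In practice an existing string-distance library would be the more sensible choice; the function is spelled out here only to show what the metric does.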

Acknowledgements

I'd like to thank the imec-mict-UGent research group for providing me with the impressive dataset of smartphone logs, which was the starting point for this little project. Special thanks to Kyle Van Gaeveren, who was always patient whenever I requested a data export from the log database. Check out their Mobile DNA project on the Google Play Store right here.
