凤凤

GitHub Dataset

earth and nature, computer science, exploratory data analysis, data visualization, classification, clustering, regression


Sold: 0
75.72MB

Dataset ID: D17222346477700995

Published: 2024/07/29

The following is the data verification report the seller chose to provide:

Data description

Two versions of the dataset are available.

Version 1 Link

This dataset is a collection of 1052 GitHub repositories, with columns such as each repository's primary language, fork count, open pull requests, and issue count.

While working on a repository recommendation project, I curated this data by scraping more than 18000 repositories and keeping only those with at least one open issue, so that a user can be recommended a repository to which they can contribute.

Columns
repositories - the name of the repository (format: github_username/repository_name)
stars_count - star count of the repository
forks_count - fork count of the repository
issues_count - open issues in the repository
pull_requests - open pull requests in the repository
contributors - contributors to the project so far
language - primary language used in the project
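The filtering step described above (keeping only repositories with at least one open issue) can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: it assumes the CSV header matches the column names listed above and that issues_count holds an integer. The function name and the sample rows are made up for the example.

```python
import csv
import io


def repos_with_open_issues(csv_text):
    """Return the rows whose issues_count is at least 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["issues_count"]) >= 1]


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = """repositories,stars_count,forks_count,issues_count,pull_requests,contributors,language
alice/example-a,10,2,3,1,4,Python
bob/example-b,5,0,0,0,1,Go
"""

kept = repos_with_open_issues(SAMPLE)
print([row["repositories"] for row in kept])
```

With the real file, the same function could be given the file's text content instead of the sample string.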

Version 2 Link

I found a JSON dataset on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file.

This is a much larger dataset, covering 2917951 repositories.

Columns
name - the name of the repository
stars_count - star count of the repository
forks_count - fork count of the repository
watchers - watchers of the repository
pull_requests - pull requests made in the repository
primary_language - the primary language of the repository
languages_used - list of all the languages used in the repository
commit_count - commits made in the repository
created_at - date and time when the repository was created
license - license assigned to the repository

Note: The data reflects the state of each repository at the time it was scraped, so later updates to the actual repositories are not reflected here.
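The JSON-to-CSV preprocessing mentioned for Version 2 is not shown in the description, so the following is only a sketch of what such a conversion might look like. It assumes the JSON is a list of flat objects whose keys already match the column names above, which may not be true of the original Kaggle file. The function name json_to_csv and the sample record are hypothetical.

```python
import csv
import io
import json

# Column order taken from the Version 2 column list.
COLUMNS = [
    "name", "stars_count", "forks_count", "watchers", "pull_requests",
    "primary_language", "languages_used", "commit_count", "created_at",
    "license",
]


def json_to_csv(json_text, out):
    """Write a JSON list of repository records to `out` as CSV."""
    records = json.loads(json_text)
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for rec in records:
        # Missing keys become empty cells; extra keys are dropped.
        row = {col: rec.get(col, "") for col in COLUMNS}
        # Store the language list as a JSON string so it fits in one cell.
        if isinstance(row["languages_used"], list):
            row["languages_used"] = json.dumps(row["languages_used"])
        writer.writerow(row)


# Made-up record for illustration only; not taken from the dataset.
sample_json = '[{"name": "example-repo", "stars_count": 1, "languages_used": ["Go", "C"]}]'
buffer = io.StringIO()
json_to_csv(sample_json, buffer)
print(buffer.getvalue())
```

Serializing languages_used as a JSON string is one design choice among several; it lets a reader recover the list with json.loads after loading the CSV.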
