老下头

MBAL: 10 millions crypto address label dataset

data cleaning, classification, ensembling, bigquery, currencies and foreign exchange, english

947.53MB

Data ID: D17174993458318894

Published: 2024/06/04

The following is the data validation report that the seller chose to provide:

Data description

This dataset accompanies the article "MBAL: A Dataset of 10 Million Annotated Crypto Addresses with Categories and Entities on Leading Blockchain Networks" and includes the released dataset together with sample data for the experiments conducted.

The dataset comprises six files, covering three sections, described as follows:

Section 1: The publicly released dataset

  • dataset_10m_ads.csv This file contains labeled data for 10 million addresses, with six columns explained below:
    column_name   description
    chain         The blockchain network of the address, with five possible values: bitcoin_mainnet, ethereum_mainnet, bnb_chain_mainnet, polygon_mainnet, avalanche_c_chain
    address       The cryptocurrency address
    categories    The category of the address, as enumerated in the article, with 62 possible values. An address may belong to multiple categories
    entity        The entity associated with the address, which may be unique or empty
    source        The source of the data, with three possible values: ground_truth, heuristic, external
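As a minimal sketch of how this file might be read, the following uses Python's standard csv module on a tiny in-memory sample. The sample rows are made up, and the assumption that multiple categories are separated by a comma inside the categories field is not confirmed by the description above; CATEGORY_SEP is a hypothetical name that should be adjusted after inspecting the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical separator for multi-valued `categories`; verify against the real file.
CATEGORY_SEP = ","

# Illustrative rows only; addresses, categories and entities are invented.
SAMPLE = """chain,address,categories,entity,source
bitcoin_mainnet,addr_1,"exchange,hot_wallet",ExampleExchange,ground_truth
ethereum_mainnet,addr_2,defi,,heuristic
polygon_mainnet,addr_3,exchange,,external
"""

def read_labels(handle):
    """Yield one dict per row, with `categories` split into a list."""
    for row in csv.DictReader(handle):
        parts = row["categories"].split(CATEGORY_SEP)
        row["categories"] = [p.strip() for p in parts if p.strip()]
        row["entity"] = row["entity"] or None  # empty entity becomes None
        yield row

def count_categories(rows):
    """Count addresses per category; a multi-category address counts once per category."""
    counts = Counter()
    for row in rows:
        counts.update(row["categories"])
    return counts

rows = list(read_labels(io.StringIO(SAMPLE)))
category_counts = count_categories(rows)
```

For the real file, the same read_labels function could be given a handle from open("dataset_10m_ads.csv", newline=""), processing rows one at a time rather than building a list.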

Section 2: Sample data for Experiment 1 (COMPARATIVE EXPERIMENT BETWEEN MBAL AND BABD-13)

Experiment 1 focuses on addresses on the Bitcoin mainnet. The three files below share the same columns, mainly address, category, and 144 other feature fields. With these sample data, Experiment 1 as described in the article can be fully replicated.

The method of constructing a training/test set from the sample data is shown in this figure. We merged and de-duplicated the positive samples of the two datasets, then randomly selected 50,000 of them as the positive samples of the test set. Negative samples were constructed in the same way, giving a final test set of 100,000 samples. The training set consists of the sample data that remain after the test-set samples are removed. In the figure, white indicates the duplicate data, yellow indicates the test data, and light yellow indicates the training data.
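The construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the function name build_split, the fixed seed, and the choice to keep a separate training remainder per source dataset are all hypothetical, and the toy address strings stand in for real samples.

```python
import random

def build_split(mbal_samples, babd_samples, n_test, seed=0):
    """Merge and de-duplicate two sample pools, draw a shared test set,
    and return what remains of each pool as its training set."""
    # Sorting the merged set makes the random draw reproducible for a given seed.
    merged = sorted(set(mbal_samples) | set(babd_samples))
    rng = random.Random(seed)
    test = set(rng.sample(merged, n_test))
    # Training data: each pool's samples minus anything chosen for the test set.
    train_mbal = [a for a in mbal_samples if a not in test]
    train_babd = [a for a in babd_samples if a not in test]
    return test, train_mbal, train_babd

# Toy positive samples with an overlap between the two pools (addr_40..addr_59).
mbal_pos = [f"addr_{i}" for i in range(0, 60)]
babd_pos = [f"addr_{i}" for i in range(40, 100)]
test_pos, train_mbal_pos, train_babd_pos = build_split(mbal_pos, babd_pos, n_test=20)
```

The same function would be called a second time on the negative samples, and the two test sets combined, to mirror the 50,000 positive plus 50,000 negative layout described above.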

  • exp1_bitcoin_sample_test_dd.csv Public test samples for Experiment 1.
  • exp1_bitcoin_sample_train_mbal_dd.csv Training samples from the MBAL dataset for Experiment 1.
  • exp1_bitcoin_sample_train_babd_dd.csv Training samples from the BABD dataset for Experiment 1.
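Since the three files are stated to share the same columns, a quick header comparison can confirm this before training. The sketch below assumes each CSV starts with a header row, which the description does not state explicitly; the column names f1 and f2 are placeholders for the real feature fields.

```python
import csv
import io

def read_header(handle):
    """Return the first row of a CSV stream as a list of column names."""
    return next(csv.reader(handle))

def headers_match(handles):
    """True if every CSV stream has the same header row, in the same order."""
    headers = [read_header(h) for h in handles]
    return all(h == headers[0] for h in headers)

# In-memory stand-ins for the real files; contents are invented.
test_csv = "address,category,f1,f2\na,1,0.1,0.2\n"
train_csv = "address,category,f1,f2\nb,0,0.3,0.4\n"
other_csv = "address,category,f1\nc,0,0.5\n"

same = headers_match([io.StringIO(test_csv), io.StringIO(train_csv)])
different = headers_match([io.StringIO(test_csv), io.StringIO(other_csv)])
```

With the real data, the handles would come from opening the three exp1_bitcoin_sample_*.csv files with newline="".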

Section 3: Sample data for Experiment 2 (EXPERIMENT ON SPECIFIC CATEGORIES)

Experiment 2 focuses on addresses on the Ethereum mainnet. The columns in these files are consistent, mainly address, category, and 207 other feature fields. With these sample data, Experiment 2 as described in the article can be fully replicated.

Sample Dataset Construction: When analyzing the Ethereum categories, we selected the transaction data from 2022 and constructed the sample dataset in the same way as for Experiment 1. In total, we obtained 55,571,103 addresses and the corresponding 591,892,912 transactions.

Training/Test Set Construction: We constructed the training/test sets for the category-specific experiments using the same methodology as for Experiment 1, but at a larger scale: 4,749,952 training samples and 1,000,000 test samples (500,000 positive and 500,000 negative).

  • exp2_ethereum_sample_test_mbal_dd.csv Test samples from the MBAL dataset for Experiment 2.
  • exp2_ethereum_sample_train_mbal_dd.csv Training samples from the MBAL dataset for Experiment 2.

About categories
