The following is the data validation report that the seller chose to provide:
Data description
This dataset was published with the article "MBAL: A Dataset of 10 Million Annotated Crypto Addresses with Categories and Entities on Leading Blockchain Networks". It includes the released dataset and the sample data used in the article's experiments.
The dataset comprises six files, covering three sections, described as follows:
Section 1: The publicly released dataset
- dataset_10m_ads.csv This file contains labeled data for 10 million addresses, with six columns explained below:
| column_name | description |
| --- | --- |
| chain | The blockchain network of the address, with five possible values: bitcoin_mainnet, ethereum_mainnet, bnb_chain_mainnet, polygon_mainnet, avalanche_c_chain |
| address | The cryptocurrency address |
| categories | The category of the address, as enumerated in the article, with 62 possible values. An address may belong to multiple categories |
| entity | The entity associated with the address, which may be unique or empty |
| source | The source of the data, with three possible values: ground_truth, heuristic, external |
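As a rough illustration of how the columns above might be consumed, the following sketch counts rows per chain, source, and category and flags rows whose `chain` or `source` is not one of the documented values. It rests on several assumptions that the description does not confirm: that the CSV has a header row using the column names listed above, and that multiple categories are joined by a separator (the `category_sep` parameter, set to `;` here, is purely hypothetical). The sample rows, category names, and entity name are made up for the demonstration.

```python
import csv
import io
from collections import Counter

# Values documented in the column description above.
KNOWN_CHAINS = {
    "bitcoin_mainnet", "ethereum_mainnet", "bnb_chain_mainnet",
    "polygon_mainnet", "avalanche_c_chain",
}
KNOWN_SOURCES = {"ground_truth", "heuristic", "external"}


def summarize(rows, category_sep=";"):
    """Count rows per chain, source, and category; collect unexpected rows.

    category_sep is an ASSUMPTION: the description does not say how several
    categories are encoded inside the `categories` column.
    """
    per_chain, per_source, per_category = Counter(), Counter(), Counter()
    unexpected = []
    for row in rows:
        chain, source = row["chain"], row["source"]
        if chain not in KNOWN_CHAINS or source not in KNOWN_SOURCES:
            unexpected.append(row["address"])
        per_chain[chain] += 1
        per_source[source] += 1
        for cat in row["categories"].split(category_sep):
            if cat.strip():
                per_category[cat.strip()] += 1
    return per_chain, per_source, per_category, unexpected


# Tiny in-memory stand-in for dataset_10m_ads.csv (made-up rows, not real data).
SAMPLE = io.StringIO(
    "chain,address,categories,entity,source\n"
    "bitcoin_mainnet,addr_a,exchange;wallet,ExampleCorp,ground_truth\n"
    "ethereum_mainnet,addr_b,defi,,heuristic\n"
)
chains, sources, categories, unexpected = summarize(csv.DictReader(SAMPLE))
print(chains, sources, categories, unexpected)
```

For the real file, the same function could be fed `csv.DictReader(open("dataset_10m_ads.csv", newline=""))`. Because rows are processed one at a time, the 10 million rows never need to be held in memory at once.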
Section 2: Sample data for Experiment 1 (COMPARATIVE EXPERIMENT BETWEEN MBAL AND BABD-13)
Experiment 1 focuses on addresses on the Bitcoin mainnet. The three files below share the same columns, mainly address, category, and 144 other feature fields. With these sample data, Experiment 1 as described in the article can be fully replicated.
The accompanying figure shows how the training and test sets are constructed from the sample data. We merged and deduplicated the positive samples of the two datasets, then randomly selected 50,000 of them as the positive samples of the test set. Negative samples were constructed in the same way, giving a test set of 100,000 samples. The training set consists of the sample data that remains after the samples in the test set are removed. In the figure, white indicates duplicate data, yellow indicates test data, and light yellow indicates training data.
- exp1_bitcoin_sample_test_dd.csv Public test samples for Experiment 1.
- exp1_bitcoin_sample_train_mbal_dd.csv Training samples from the MBAL dataset for Experiment 1.
- exp1_bitcoin_sample_train_babd_dd.csv Training samples from the BABD dataset for Experiment 1.
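The split procedure described for Experiment 1 can be sketched roughly as follows. This is not the authors' code: the function name `build_split`, its parameters, the fixed seed, and the toy address names are all hypothetical. What makes a sample positive or negative is defined in the article and is simply taken as input here. The sketch merges and deduplicates each class across the two datasets, draws `n_per_class` addresses per class for the shared test set (50,000 in the article), and keeps each dataset's remaining samples as its training set.

```python
import random


def build_split(pos_a, pos_b, neg_a, neg_b, n_per_class, seed=0):
    """Return (test, train_a, train_b) as lists of (address, label) pairs.

    pos_*/neg_* are iterables of addresses from the two datasets.
    Label 1 marks a positive sample and 0 a negative one.
    """
    rng = random.Random(seed)
    # Merge and deduplicate each class across the two datasets.
    pos_pool = sorted(set(pos_a) | set(pos_b))
    neg_pool = sorted(set(neg_a) | set(neg_b))
    # Draw the shared test set, the same number from each class.
    test_pos = set(rng.sample(pos_pool, n_per_class))
    test_neg = set(rng.sample(neg_pool, n_per_class))
    test = [(a, 1) for a in sorted(test_pos)] + [(a, 0) for a in sorted(test_neg)]
    held_out = test_pos | test_neg

    def remainder(pos, neg):
        # Whatever was not drawn for the test set stays in the training set.
        return ([(a, 1) for a in sorted(set(pos) - held_out)]
                + [(a, 0) for a in sorted(set(neg) - held_out)])

    return test, remainder(pos_a, neg_a), remainder(pos_b, neg_b)


# Toy data: the two datasets overlap on p3..p5 and n3..n5.
pos_mbal = [f"p{i}" for i in range(0, 6)]
pos_babd = [f"p{i}" for i in range(3, 9)]
neg_mbal = [f"n{i}" for i in range(0, 6)]
neg_babd = [f"n{i}" for i in range(3, 9)]
test, train_mbal, train_babd = build_split(
    pos_mbal, pos_babd, neg_mbal, neg_babd, n_per_class=3
)
print(len(test), len(train_mbal), len(train_babd))
```

The sketch does not handle an address that is labeled positive in one dataset and negative in the other. How the article resolves such conflicts is not stated in this description.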
Section 3: Sample data for Experiment 2 (EXPERIMENT ON SPECIFIC CATEGORIES)
Experiment 2 focuses on addresses on the Ethereum mainnet. The files below share the same columns, mainly address, category, and 207 other feature fields. With these sample data, Experiment 2 as described in the article can be fully replicated.
Sample Dataset Construction: To analyze the Ethereum categories, we used transaction data from 2022 and built the sample dataset in the same way as for Experiment 1. In total, we obtained 55,571,103 addresses and the corresponding 591,892,912 transactions.
Training/Test Set Construction: We constructed the training and test sets for the specific-category experiments using the same methodology as for Experiment 1, but at a larger scale: 4,749,952 training samples and 1,000,000 test samples (500,000 positive and 500,000 negative).
- exp2_ethereum_sample_test_mbal_dd.csv Test samples from the MBAL dataset for Experiment 2.
- exp2_ethereum_sample_train_mbal_dd.csv Training samples from the MBAL dataset for Experiment 2.
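Before re-running Experiment 2, it may be useful to check the two files against the figures quoted above. The sketch below streams each CSV once, counts rows, tallies the values of the `category` column, and measures the address overlap between training and test data. It assumes that both files have a header row with `address` and `category` columns and that the files hold the full split, neither of which the description confirms. The function names, the `feature_1` column, and the label values in the demonstration are hypothetical.

```python
import csv
import io
from collections import Counter

EXPECTED_TRAIN_ROWS = 4_749_952  # stated in the description above
EXPECTED_TEST_ROWS = 1_000_000   # stated in the description above


def scan(handle):
    """Stream a CSV once; return (row count, address set, category counts)."""
    addresses, categories, n_rows = set(), Counter(), 0
    for row in csv.DictReader(handle):
        n_rows += 1
        addresses.add(row["address"])
        categories[row["category"]] += 1
    return n_rows, addresses, categories


def check_split(train_handle, test_handle):
    """Collect facts about a train/test pair for comparison with the text."""
    n_train, train_addresses, _ = scan(train_handle)
    n_test, test_addresses, test_categories = scan(test_handle)
    return {
        "train_rows": n_train,
        "test_rows": n_test,
        "overlap": len(train_addresses & test_addresses),
        "test_category_counts": dict(test_categories),
    }


def matches_description(report):
    """True if row counts equal the quoted figures and nothing overlaps."""
    return (report["train_rows"] == EXPECTED_TRAIN_ROWS
            and report["test_rows"] == EXPECTED_TEST_ROWS
            and report["overlap"] == 0)


# In-memory stand-ins for the two exp2 files (made-up rows, not real data).
TRAIN = io.StringIO(
    "address,category,feature_1\n0xaaa,label_x,0.1\n0xbbb,label_y,0.2\n"
)
TEST = io.StringIO(
    "address,category,feature_1\n0xccc,label_x,0.3\n0xddd,label_y,0.4\n"
)
report = check_split(TRAIN, TEST)
print(report, matches_description(report))
```

With the real files, the two handles would come from `open(..., newline="")` on `exp2_ethereum_sample_train_mbal_dd.csv` and `exp2_ethereum_sample_test_mbal_dd.csv`. Only the address sets are kept in memory, not the feature columns.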
About categories
