悠悠

Product Clustering, Matching & Classification

earth and nature, business, classification, clustering, multiclass classification, ratings and reviews, e-commerce services

￥2

21.74MB

Data ID: D17222484634660292

Published: 2024/07/29

Introduction

The continuous growth of the e-commerce industry has made product retrieval an increasingly important problem. As more enterprises move their activities to the Web, the volume and diversity of product-related information grow quickly. These factors make it difficult for users to identify and compare the features of the products they want. Clustering, classification, and product matching algorithms can help organize product-related information and, consequently, improve retrieval effectiveness.

This repository is designed to provide multiple datasets which are suitable for such algorithms. Each dataset is accompanied by its corresponding ground truth file that can be used for evaluation purposes.
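As a rough illustration of how a ground truth file might be used for evaluation, the sketch below computes pairwise precision, recall, and F1 between a predicted clustering and the ground truth clusters. This is one common way to score clustering and matching output, not necessarily the measure used by the dataset authors; the function names and the toy assignments are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(assignment):
    """Return the set of unordered product-ID pairs that share a cluster."""
    by_cluster = {}
    for product_id, cluster_id in assignment.items():
        by_cluster.setdefault(cluster_id, []).append(product_id)
    pairs = set()
    for members in by_cluster.values():
        # Sorting makes each pair canonical, so (1, 2) and (2, 1) coincide.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F1 of `predicted` against `truth`."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (made up): product ID -> cluster ID.
truth = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
predicted = {1: "x", 2: "x", 3: "y", 4: "y", 5: "z"}

precision, recall, f1 = pairwise_scores(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In practice, the `truth` mapping would be built from the Product ID and Cluster ID fields of a dataset, and `predicted` from the output of the algorithm under test.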

Content

This repository includes 18 real-world datasets from different product categories, acquired from two online product comparison platforms: PriceRunner and Skroutz. In particular, we partially crawled these two platforms and constructed 8 datasets from each one. Each of these 16 datasets represents a specific product category. The categories were selected according to two criteria: i) to study how the same methods perform on similar products provided by different vendors, and ii) to examine the effectiveness of the algorithms on products from diverse categories. For this reason, we included products from both identical and different categories. Moreover, we created one aggregate dataset per platform that combines all the products from all 8 categories, which brings the total to 18. These two aggregate datasets make it possible to examine performance on heterogeneous data.

The datasets are provided in standard CSV and XML formats. Each CSV/XML entry includes the following pieces of information:

Product ID
Product Title as it appears in the respective product comparison platform (but in lower case and with punctuation removed)
Vendor ID: this is the ID of the electronic store that provides (sells) the product. Vendor ID can be used for refinement purposes, such as the verification algorithm that we developed in [1].
Cluster ID: this is the ID of the cluster that the product belongs to. Useful for entity matching and clustering tasks.
Cluster Label: The title of the aforementioned cluster.
Category ID: this is the ID of the category that the product belongs to. It is meaningful mainly in the two aggregate datasets that contain products from multiple categories. Useful for classification and categorization tasks.
Category Label: The title of the aforementioned category.
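The fields listed above could be read and grouped along the following lines. This is a minimal sketch: the column names, the header row, and the sample rows are assumptions made for illustration (the actual CSV files may name or order the columns differently), and the sample is parsed from an in-memory string rather than a real file.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are assumed, not taken from the files.
SAMPLE = """\
product_id,title,vendor_id,cluster_id,cluster_label,category_id,category_label
1,acme phone x 64gb black,7,10,Acme Phone X 64GB,100,Mobile Phones
2,acme phone x 64 gb blk,9,10,Acme Phone X 64GB,100,Mobile Phones
3,acme phone y 128gb,7,11,Acme Phone Y 128GB,100,Mobile Phones
4,contoso fridge 300l white,3,20,Contoso Fridge 300L,200,Fridges
"""


def load_products(handle):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


def group_titles(rows, key):
    """Group product titles by the value of `key` (e.g. cluster or category ID)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["title"])
    return dict(groups)


rows = load_products(io.StringIO(SAMPLE))
clusters = group_titles(rows, "cluster_id")      # ground truth for matching
categories = group_titles(rows, "category_id")   # ground truth for classification

for cluster_id, titles in clusters.items():
    print(cluster_id, titles)
```

To read an actual dataset, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, after checking the real header names.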

Licence

The datasets are licensed under the GNU General Public License (GPL 2.0) and can be used by anybody. Nevertheless, if they are used for research purposes, the researchers are kindly requested to include the following articles in the reference list of their published papers:

[1] L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, "A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles", Artificial Intelligence Review (Springer), pp. 1-44, 2020.

[2] L. Akritidis, P. Bozanis, "Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations", In Proceedings of the 14th IEEE International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1-10, 2018.

[3] L. Akritidis, A. Fevgas, P. Bozanis, "Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles", In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.
