淡然若水

Random sample of Common Crawl domains from 2021

text mining, tabular, text, binary classification, automl

1

Sold: 0
35.37MB

Data ID: D17220551333206450

Published: 2024/07/27

The following is the data verification report that the seller chose to provide:

Data description

Context

The Common Crawl project has fascinated me ever since I learned about it. It provides data in a large number of formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting detection, malicious URL detection, and just about anything interesting that can be done with domain names.

Content

I sampled 1% of the domains from the Common Crawl Index dataset, which is available on AWS in Parquet format. You can read more about how I extracted this dataset at https://harshsinghal.dev/create-a-url-dataset-for-nlp/
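As a rough illustration of what drawing a 1% sample of domains can look like, here is a minimal sketch in plain Python. It is not the extraction method from the linked post: the hash-based selection rule and the names `keep_domain` and `sample_domains` are assumptions made for this example. It assumes the domain names have already been read out of the index into an iterable of strings.

```python
import hashlib
from typing import Iterable, Iterator


def keep_domain(domain: str, rate: float = 0.01) -> bool:
    """Decide deterministically whether a domain falls into the sample.

    The domain is hashed to a number in [0, 1); it is kept when that
    number is below `rate`. (Illustrative rule, not the author's method.)
    """
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_domains(domains: Iterable[str], rate: float = 0.01) -> Iterator[str]:
    """Yield each distinct domain at most once, keeping roughly `rate` of them."""
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain in seen:
            continue
        seen.add(domain)
        if keep_domain(domain, rate):
            yield domain


if __name__ == "__main__":
    # Synthetic stand-in for domains read from the index.
    fake_domains = (f"site{i}.example" for i in range(100_000))
    sample = list(sample_domains(fake_domains))
    print(len(sample), "domains kept")
```

Hashing instead of calling a random number generator means the same domain is always either in or out of the sample, so the result is reproducible across runs and across files processed separately.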

Acknowledgements

Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please see their Terms of Use on that site.

Inspiration

My interests are in working with string similarity functions, and I continue to look for scalable ways of applying them. I wrote about using a Postgres extension to compute string distances, with Common Crawl URL domains as the input dataset (you can read more at https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).
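To make the idea of a string distance between domain names concrete, here is a minimal sketch of Levenshtein edit distance in plain Python. It is an illustration only: the linked post uses a Postgres extension, and the choice of Levenshtein distance and the names `levenshtein` and `near_matches` are assumptions made for this example.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def near_matches(target: str, candidates: Iterable[str], max_distance: int = 1) -> List[str]:
    """List candidates that differ from `target` by at most `max_distance` edits."""
    return [
        c for c in candidates
        if c != target and levenshtein(target, c) <= max_distance
    ]


if __name__ == "__main__":
    print(near_matches("example.com", ["examp1e.com", "example.org", "example.com"]))
```

A pairwise loop like this is quadratic in the number of domains, which is one reason a database extension or an index-backed approach becomes attractive at Common Crawl scale.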

I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
