Random sample of Common Crawl domains from 2021

淡然若水

Tags: text mining, tabular, text, binary classification, automl

￥1

Sold: 0

35.37MB

Dataset ID: D17220551333206450

Published: 2024/07/27

Data description

Context

The Common Crawl project has fascinated me ever since I learned about it. It provides data in a large number of formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting detection, malicious URL detection, and just about anything interesting that can be done with domain names.

Content

I sampled 1% of the domains from the Common Crawl Index dataset, which is available on AWS in Parquet format. You can read more about how I extracted this dataset at https://harshsinghal.dev/create-a-url-dataset-for-nlp/
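As a rough illustration of what a 1% domain sample can look like in code, here is a minimal sketch that keeps a domain when a hash of its name lands in the chosen bucket. This is an assumption for illustration only, not necessarily the method used to build this dataset (the linked post describes that). The function names `keep_domain` and `sample_domains` are hypothetical, and the input is any iterable of domain strings rather than the Parquet index itself.

```python
import hashlib


def keep_domain(domain: str, percent: int = 1) -> bool:
    """Return True if the domain falls into the sampled hash bucket.

    Hashing the name (instead of calling random()) makes the sample
    reproducible: the same domain is always kept or always dropped.
    """
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


def sample_domains(domains, percent: int = 1):
    """Yield each distinct domain that falls into the sample, once."""
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()
        if domain and domain not in seen and keep_domain(domain, percent):
            seen.add(domain)
            yield domain


if __name__ == "__main__":
    # Synthetic names stand in for domains read from the index.
    candidates = (f"site{i}.example" for i in range(20000))
    sample = list(sample_domains(candidates, percent=1))
    print(len(sample), "domains kept out of 20000")
```

Because the decision depends only on the domain name, the same sample comes out regardless of input order or how the input is split across files, which is convenient when the source data is partitioned.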

Acknowledgements

Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

Inspiration

My interests are in working with string similarity functions, and I continue to look for scalable ways of doing this. I wrote about using a Postgres extension to compute string distances, with Common Crawl URL domains as the input dataset (you can read more at https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).
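To make the idea of string distance between domains concrete, here is a small pure-Python sketch of Levenshtein edit distance, together with a helper that lists domains within a few edits of a target name. It is an illustration under assumptions, not the Postgres-based approach from the linked post. The helper name `near_matches` and the sample domains are hypothetical, and the pairwise loop is not meant to scale to the full dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def near_matches(target: str, domains, max_distance: int = 2):
    """Return (domain, distance) pairs close to target, nearest first."""
    hits = []
    for domain in domains:
        distance = levenshtein(target, domain)
        if 0 < distance <= max_distance:  # skip the exact match itself
            hits.append((domain, distance))
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    sample = ["example.com", "examp1e.com", "exmaple.com", "other.org"]
    for domain, distance in near_matches("example.com", sample):
        print(domain, distance)
```

Plain Levenshtein distance counts a swap of two adjacent characters as two edits, so a lookalike such as `exmaple.com` sits at distance 2 from `example.com`. A variant that treats transpositions as a single edit may suit typosquatting work better.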

I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
