4367x PII Label-Specific Essays (by 7b Models)

global, education, transformers, token classification, english

2

Sold: 0
19.06MB

Data ID: D17222478226329948

Published: 2024/07/29

The following is the data verification report that the seller chose to provide:

Data description

Evaluation of my dataset with my .915 baseline:

F5 score = .690 - Recall = .692, Precision = .639
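
As a quick sanity check on the numbers above, here is a minimal sketch of the F-beta formula, assuming "F5" means the standard F-beta score with beta = 5 (recall weighted far more heavily than precision). The function name `fbeta` is hypothetical and not taken from the linked notebook.

```python
def fbeta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the evaluation line above.
score = fbeta(precision=0.639, recall=0.692)
print(f"F5 = {score:.3f}")
```

With beta = 5 the score sits very close to recall, which is why .690 is nearly identical to the reported recall of .692.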

Distribution of data:

  • 843x Address (ca. 500 US)
  • 496x Names (Incl. Middle Names, Pronunciation or Nicknames)
  • 537x Userid
  • 704x Username (Incl. Name)
  • 531x Phone
  • 755x Email (Incl. Name)
  • 501x URL
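
The per-label counts add up to the 4367 essays in the title, which suggests (as an assumption, not something the description states) that each essay targets one label. A small sketch of that arithmetic, with the share of each label:

```python
# Counts copied from the distribution list above.
counts = {
    "Address": 843,
    "Names": 496,
    "Userid": 537,
    "Username": 704,
    "Phone": 531,
    "Email": 755,
    "URL": 501,
}

total = sum(counts.values())
# Percentage share of each label, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share}%")
```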

See linked notebook for generation.


Remarks on labels:

EMAIL:

  1. Email is always based on the name, but with random domains
  2. The prompt also asked the models to write about their favourite book; they heavily favour “To Kill a Mockingbird”

PHONE:

  1. Generated from multiple countries for diversity
  2. Labelling of phone numbers should only include the full number (not parts of it)

ADDRESSES:

  1. From multiple countries for diversity
  2. For US addresses, state abbreviations are mapped to the full state name, so these are labelled as well
  3. An address is only labelled as such if it starts with either of the first two words of the full address (e.g., if the house number is missing from a US address, it is still labelled)
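
The third rule can be sketched as a simple check. This is one reading of the rule, not the code from the linked notebook: the function name `starts_like_address`, the whitespace tokenisation, and the punctuation stripping are all assumptions, and the address is made up.

```python
def starts_like_address(candidate: str, full_address: str) -> bool:
    """True if the candidate begins with the first or second word of the full address."""
    def norm(word: str) -> str:
        # Assumption: compare case-insensitively and ignore trailing punctuation.
        return word.strip(",.").lower()

    full_words = full_address.split()
    cand_words = candidate.split()
    if not full_words or not cand_words:
        return False
    allowed_starts = {norm(w) for w in full_words[:2]}
    return norm(cand_words[0]) in allowed_starts


full = "742 Evergreen Terrace, Springfield, Oregon"  # made-up example address
print(starts_like_address("742 Evergreen Terrace", full))           # full start
print(starts_like_address("Evergreen Terrace, Springfield", full))  # house number missing
print(starts_like_address("Springfield, Oregon", full))             # starts too late
```

Under this reading, the first two candidates would be labelled and the third would not, which matches the "house number missing" example in the rule.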

NAMES:

  1. Middle names are sometimes generated, separated with either " " or "-"
  2. Pronunciations and nicknames were generated and labelled
  3. However, “t’oma”, as in “my name Thomas is derived from the Aramaic word ‘t’oma’”, was not tagged. Let me know if this is wrong. Such cases are relatively easy to identify in the names dataset by searching for “derived from”

URL:

  1. Short domains, full websites and full URIs

USERID:

  1. Mostly randomly generated string and number combinations, not modelled on other formats
  2. Can usually be augmented easily by replacing the userid
  3. Userid is sometimes split into parts in the text; these splits are not labelled (not sure if this is right)
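
The augmentation mentioned in point 2 could look roughly like this. It is a minimal sketch under several assumptions: essays are stored as parallel `tokens` and `labels` lists, each userid is a single token, and the label string is `"B-USERID"`. None of these are confirmed by the description, and the helper names are hypothetical.

```python
import random
import string


def random_userid(rng: random.Random, length: int = 10) -> str:
    """Random uppercase/digit string, mimicking a format-free userid."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def augment_userids(tokens, labels, rng, target="B-USERID"):
    """Return a copy of tokens with every userid token swapped for a new random one.

    The same original userid always maps to the same replacement, so repeated
    mentions stay consistent. Labels are unchanged because the swap is one
    token for one token.
    """
    replacements = {}
    out = []
    for tok, lab in zip(tokens, labels):
        if lab == target:
            if tok not in replacements:
                replacements[tok] = random_userid(rng)
            out.append(replacements[tok])
        else:
            out.append(tok)
    return out


tokens = ["My", "id", "is", "AB12CD34", ",", "again", "AB12CD34", "."]
labels = ["O", "O", "O", "B-USERID", "O", "O", "B-USERID", "O"]
augmented = augment_userids(tokens, labels, random.Random(0))
print(augmented)
```

Split userids (point 3) are not handled here, since those parts are unlabelled in the dataset.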

USERNAMES:

  1. Either generated based on the name OR animal+birthyear OR colour+fruit
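
To illustrate the three username patterns, here is a small sketch. It is not the generation code (the essays were written by 7b models per the title); the word lists, the birth-year range, and the function name `make_username` are made-up assumptions, useful mainly for building test fixtures or augmentation data.

```python
import random

# Hypothetical word lists, for illustration only.
ANIMALS = ["otter", "falcon", "panda"]
COLOURS = ["teal", "amber", "violet"]
FRUITS = ["mango", "plum", "kiwi"]


def make_username(name: str, rng: random.Random) -> str:
    """Pick one of the three patterns: name-based, animal+birthyear, colour+fruit."""
    scheme = rng.choice(["name", "animal_birthyear", "colour_fruit"])
    if scheme == "name":
        # Lowercase the name and drop the spaces.
        return "".join(name.lower().split())
    if scheme == "animal_birthyear":
        # Assumed birth-year range; the source does not specify one.
        return f"{rng.choice(ANIMALS)}{rng.randint(1990, 2006)}"
    return f"{rng.choice(COLOURS)}{rng.choice(FRUITS)}"


rng = random.Random(42)
for _ in range(5):
    print(make_username("Ada Lovelace", rng))
```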