The following is the data validation report that the seller chose to provide:
Data description
Context
I gathered this dataset as part of my work for the Information Retrieval and Text Mining course at the Faculty of Mathematics and Computer Science, University of Bucharest.
Content
The data is composed of four sources. The initial data was forwarded from Sparktech's 2018 Textract Hackathon. This was enhanced with data from three other Kaggle datasets: 150K Lyrics Labeled with Spotify Valence, dataset lyrics musics and AZLyrics song lyrics.
Apart from the original Sparktech data, the other datasets did not provide a Genre feature. To deal with the missing Genre labels, I built a labeling function using the spotipy library, which uses the Spotify API to retrieve the genre of an Artist. Please note that the Spotify API returns a list of genres for one artist, so I considered the most common genre to be that artist's dominant genre.
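The genre lookup described here might look roughly like the sketch below. It is not the original labeling function: the names `dominant_genre` and `artist_genre` are hypothetical, and the exact rule for picking the "most common" genre is an assumption (a plain frequency count, with ties going to the genre listed first). The client `sp` is assumed to be a `spotipy.Spotify` instance created with valid API credentials.

```python
from collections import Counter


def dominant_genre(genres):
    """Return the most frequent genre in a list, or None if the list is empty.

    Assumption: ties are broken by first occurrence, so a list of distinct
    genres simply yields the first one.
    """
    if not genres:
        return None
    return Counter(genres).most_common(1)[0][0]


def artist_genre(sp, artist_name):
    """Look up an artist through the Spotify search endpoint and pick one genre.

    `sp` is expected to behave like a spotipy.Spotify client, for example:
        import spotipy
        from spotipy.oauth2 import SpotifyClientCredentials
        sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    """
    result = sp.search(q=f"artist:{artist_name}", type="artist", limit=1)
    items = result["artists"]["items"]
    if not items:
        # No matching artist: leave the Genre label empty.
        return None
    return dominant_genre(items[0]["genres"])
```

Passing the client in as an argument keeps the function easy to try with a stub object before spending real API calls.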
Additionally, the AZLyrics data was badly encoded: the column delimiter character, the comma, was also used as a verse delimiter in the Lyrics column. Fortunately, the dataset comes with two URL columns that conveniently separate the Artist, Song and Lyrics columns, so with a bit of regex magic I was able to extract the useful data using https:// as a delimiter.
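As an illustration of the idea, here is a minimal sketch of splitting one raw line on `https://`. The column order (Artist, artist URL, Song, song URL, Lyrics) is an assumption, as are the function name `parse_azlyrics_line` and the sample row; the real file may be laid out differently.

```python
import re


def parse_azlyrics_line(line):
    """Recover artist, song and lyrics from one raw line.

    Assumed (hypothetical) layout:
        Artist,https://<artist-url>,Song,https://<song-url>,Lyrics with, commas
    Returns None when the line does not contain two URLs.
    """
    # Split on the first two URL prefixes only, so a link inside the lyrics
    # stays part of the lyrics chunk.
    parts = re.split(r"https://", line, maxsplit=2)
    if len(parts) != 3:
        return None

    artist = parts[0].rstrip(",").strip()
    # parts[1] looks like "<artist-url>,Song," (URLs are assumed comma-free).
    _artist_url, _, song = parts[1].rstrip(",").partition(",")
    # parts[2] looks like "<song-url>,Lyrics..."; keep every later comma.
    _song_url, _, lyrics = parts[2].partition(",")

    return {"artist": artist, "song": song.strip(), "lyrics": lyrics.strip()}


# Made-up sample row, for illustration only.
row = ("Some Artist,https://example.com/a/artist.html,Some Song,"
       "https://example.com/lyrics/song.html,First verse, second verse, third verse")
parsed = parse_azlyrics_line(row)
```

Because only the first comma after each URL is treated as a column boundary, the commas that separate verses survive in the `lyrics` value.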
As a final note, I used Nakatani Shuyo's langdetect library to automatically label the lyrics with a language. In total, the lyrics come in 34 languages.
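A language-labeling step along these lines could be sketched as follows, assuming the Python port of langdetect is installed. The function name `label_language`, the `"unknown"` fallback label and the optional `detect` argument are all hypothetical choices made for this example, not part of the original pipeline.

```python
def label_language(text, detect=None):
    """Return a language code for `text`, or "unknown" if detection fails.

    `detect` can be any callable mapping a string to a language code. When it
    is omitted, langdetect's own `detect` is imported on first use.
    """
    if not text or not text.strip():
        # Empty lyrics carry no features to detect from.
        return "unknown"
    if detect is None:
        from langdetect import DetectorFactory, detect
        # langdetect is non-deterministic by default; fixing the seed makes
        # repeated runs give the same label.
        DetectorFactory.seed = 0
    try:
        return detect(text)
    except Exception:
        # langdetect raises an error on text without usable features,
        # such as strings made only of digits or punctuation.
        return "unknown"
```

Applied to a pandas column, this would be something like `df["Lyrics"].apply(label_language)`, where the column name is again an assumption.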
Acknowledgements
I am grateful to the Kaggle users edenbd, Italo Marcelo and Albert Suarez, as well as to the Sparktech team who gathered the original data and to my professor who provided it for the project.
Inspiration
In case you stumble across this dataset in the wild, I encourage you to try the Genre classification task on it and different feature engineering approaches. I am excited to see how inventive you can get!
