Context

I gathered this dataset as part of my work for the Information Retrieval and Text Mining course at the Faculty of Mathematics and Computer Science, University of Bucharest.

Content

The data is composed of four sources. The initial data was forwarded from Sparktech's 2018 Textract Hackathon. This was enhanced with data from three other Kaggle datasets: 150K Lyrics Labeled with Spotify Valence, dataset lyrics musics, and AZLyrics song lyrics.

Apart from the original Sparktech data, the other datasets did not provide a Genre feature. To deal with the missing Genre labels, I built a labeling function using the spotipy library, which uses the Spotify API to retrieve the genre of an Artist. Please note that the Spotify API returns a list of genres for one artist, so I considered the most common genre to be that artist's dominant genre.
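For illustration, here is a minimal sketch of what such a labeling function could look like. It is not the exact code used to build the dataset. The function names are hypothetical, and the reading of "most common genre" (the word that occurs most often across the genre strings returned for an artist) is an assumption, since the description above does not spell it out. The spotipy calls expect the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables to be set.

```python
from collections import Counter


def dominant_genre(genres):
    """Pick one label from a list of Spotify genre strings.

    Assumption: "most common" means the word that occurs most often across
    the genre strings, e.g. ["dance pop", "pop", "pop rock"] gives "pop".
    """
    words = Counter(word for genre in genres for word in genre.lower().split())
    if not words:
        return None  # the artist has no genres listed
    return words.most_common(1)[0][0]


def fetch_artist_genres(sp, artist_name):
    """Return the genre list of the first Spotify search hit for an artist."""
    result = sp.search(q=f"artist:{artist_name}", type="artist", limit=1)
    items = result["artists"]["items"]
    return items[0]["genres"] if items else []


def label_artist(artist_name):
    """Look up an artist on Spotify and return a single genre label."""
    # Imported lazily so dominant_genre can be used without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    return dominant_genre(fetch_artist_genres(sp, artist_name))
```

Artists that Spotify cannot find, or that have an empty genre list, come back as None and would need to be dropped or labeled by hand.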

Additionally, the AZLyrics data was badly formatted: the column delimiter, the comma, was also used as a verse delimiter in the Lyrics column. Fortunately, the dataset comes with two URL columns that conveniently separate the Artist, Song and Lyrics columns, so with a bit of regex magic I was able to extract the useful data using https:// as a delimiter.
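The idea can be sketched as follows. This is an illustrative reconstruction, not the original script. It assumes each row has the column order artist, artist URL, song, song URL, lyrics, and that the URLs themselves contain no commas. The pattern and function name are hypothetical.

```python
import re

# Anchor on the two "https://" URL columns. Everything after the second URL
# is treated as lyrics, commas included.
ROW_PATTERN = re.compile(
    r"^(?P<artist>.*?),(?P<artist_url>https://[^,]+),"
    r"(?P<song>.*?),(?P<song_url>https://[^,]+),(?P<lyrics>.*)$",
    re.DOTALL,
)


def parse_row(line):
    """Split one raw row into artist, song and lyrics, or return None."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None  # the row does not follow the assumed layout
    return {
        "artist": match.group("artist").strip(),
        "song": match.group("song").strip(),
        "lyrics": match.group("lyrics").strip(),
    }


# Made-up example row, for illustration only.
row = (
    "Some Artist,https://example.com/a/someartist.html,"
    "Some Song,https://example.com/lyrics/someartist/somesong.html,"
    "first verse, second verse, third verse"
)
parsed = parse_row(row)
```

Because the lyrics group is the last one and matches greedily, the commas inside the lyrics no longer split the row into extra columns.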

Finally, I used Nakatani Shuyo's langdetect library to automatically label the lyrics with a language. In total, the lyrics come in 34 languages.
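A minimal sketch of this labeling step, assuming the Python port of langdetect, could look like the block below. The function name and the "unknown" fallback label are my own choices for illustration and are not part of the dataset.

```python
def label_language(text, unknown="unknown"):
    """Return a language code such as "en" for the given lyrics."""
    if not text or not text.strip():
        return unknown  # nothing to detect on empty lyrics

    # Imported lazily so the empty-text check works without langdetect installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes the
    # labels reproducible between runs.
    DetectorFactory.seed = 0
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features, e.g. only digits.
        return unknown
```

Short or mixed-language lyrics can be mislabeled, so it may be worth spot-checking the rarer languages before relying on the labels.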

Acknowledgements

I am grateful to the Kaggle users edenbd, Italo Marcelo and Albert Suarez, as well as to the Sparktech team who gathered the original data and to my professor who provided it for the project.

Inspiration

In case you stumble across this dataset in the wild, I encourage you to try the Genre classification task on it and to experiment with different feature engineering approaches. I am excited to see how inventive you can get!
