Project 3: Music Genre Classification based on Lyrics (Multiclass Classification)
In this project I built a model to classify the genre of a song based on its lyrics. The model was trained on popular Brazilian music styles: Axé, Funk, MPB, Pagode, Samba and Sertanejo. I collected the lyrics from the LETRAS.MUS.BR website with web scrapers, one per genre, written with the Python package BeautifulSoup. Almost 1k lyrics were collected for each genre. The web crawler code is available here.
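As a rough illustration of the scraping step, here is a minimal sketch of pulling lyrics out of a page with BeautifulSoup. The HTML snippet, the `div.lyric` selector and the `extract_lyrics` helper are all hypothetical, chosen only for this example; the real markup of the site and the actual crawler code may look quite different.

```python
from bs4 import BeautifulSoup

# Made-up page fragment; the real site's markup is not shown in this write-up.
SAMPLE_HTML = """
<html><body>
  <h1 class="song-title">Exemplo de Samba</h1>
  <div class="lyric">
    <p>Primeira linha<br/>Segunda linha</p>
    <p>Terceira linha</p>
  </div>
</body></html>
"""


def extract_lyrics(html: str, selector: str = "div.lyric") -> str:
    """Return the lyric text found under `selector`, one verse line per line."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(selector)  # hypothetical selector
    if container is None:
        return ""
    # Join every text fragment with a newline and drop surrounding whitespace.
    return container.get_text(separator="\n", strip=True)


lyrics = extract_lyrics(SAMPLE_HTML)
print(lyrics)
```

In a real crawler the HTML would come from an HTTP request for each song page, and the selector would have to be adapted per site.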
Genre Distribution:

First I followed these steps for Data Preparation:
- Text cleaning and normalization: removing HTML tags and special characters and converting all words to lowercase.
- Splitting the data into a training set (66%) and a test set (33%).
- Target label encoding, using sklearn LabelEncoder.
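The three preparation steps above could be sketched roughly as follows. The `clean_lyrics` helper, the regular expressions, the toy lyrics, the stratified split and the `random_state` value are assumptions made for illustration, not the exact code used in the project.

```python
import html
import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def clean_lyrics(raw: str) -> str:
    """Strip HTML tags and special characters, then lowercase (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", raw)    # drop HTML tags
    text = html.unescape(text)             # decode entities such as &amp;
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation, keep accented letters
    return re.sub(r"\s+", " ", text).strip()


# Toy data (made up): 3 genres with 4 lyrics each.
raw_lyrics = [f"<p>Letra número {i}!</p>" for i in range(12)]
genres = ["Samba"] * 4 + ["Sertanejo"] * 4 + ["MPB"] * 4

texts = [clean_lyrics(t) for t in raw_lyrics]

# 66% / 33% split; stratifying by genre is an assumption.
X_train, X_test, g_train, g_test = train_test_split(
    texts, genres, test_size=0.33, stratify=genres, random_state=42
)

# Fit the encoder on the training labels, then reuse it for the test labels.
encoder = LabelEncoder()
y_train = encoder.fit_transform(g_train)
y_test = encoder.transform(g_test)
```

Fitting the encoder only on the training labels keeps the mapping independent of the test set, and `encoder.inverse_transform` recovers the genre names from predictions.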
Then I built 4 different models:
- Bag-of-Words (TF-IDF): a TF-IDF vectorizer, feature selection and a penalized Logistic Regression, using RandomizedSearch to tune the hyperparameters.
- Continuous Skipgram (Pre-Trained - Brazilian Portuguese): a pre-trained Continuous Skipgram word embedding with a penalized Logistic Regression, using RandomizedSearch to tune the hyperparameters.
- Continuous Skipgram (Task-Specific): a Continuous Skipgram word embedding trained on the collected lyrics with the Gensim package, with a penalized Logistic Regression, using RandomizedSearch to tune the hyperparameters.
- BERT (Pre-Trained - Brazilian Portuguese): the BERT model pre-trained on Portuguese, fine-tuned with the HuggingFace class BertForSequenceClassification.
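To make the first of these models more concrete, here is a minimal sketch of a TF-IDF pipeline tuned with scikit-learn's `RandomizedSearchCV`. The chi-squared `SelectKBest` selector, the search ranges, the number of iterations and the toy corpus are assumptions for illustration; the write-up does not state which feature selection method or search space was actually used.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 3 genres, 4 cleaned lyrics each.
texts = [
    "samba no pé na avenida do carnaval",
    "roda de samba com pandeiro e cavaquinho",
    "o samba nasceu no morro da cidade",
    "batuque e samba até o sol raiar",
    "saudade da fazenda e do meu cavalo",
    "viola caipira chora na estrada de terra",
    "meu amor foi embora da fazenda ontem",
    "chapéu e bota na festa do rodeio",
    "baile funk batidão na favela hoje",
    "o grave bate forte no baile todo",
    "funk na pista e o paredão tremendo",
    "batidão do funk toca a noite inteira",
]
labels = ["Samba"] * 4 + ["Sertanejo"] * 4 + ["Funk"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),               # assumed selection method
    ("clf", LogisticRegression(max_iter=1000)),  # L2 penalty is the default
])

# Assumed search space: n-gram range, number of kept features, and
# the inverse regularization strength C of the penalty.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "select__k": [5, 10, "all"],
    "clf__C": loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=5, cv=2, random_state=0
)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["roda de samba no carnaval"]))
```

Wrapping the vectorizer and selector in the same `Pipeline` as the classifier means they are refit inside every cross-validation fold, so no vocabulary or feature scores leak from the held-out part.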
Results:
Model | AUC Train | AUC Test |
---|---|---|
Bag of Words - TFIDF | 0.79 | 0.56 |
Skipgram - Pre-Trained | 0.46 | 0.43 |
Skipgram - Task-Specific | 0.51 | 0.48 |
BERT - Pre-Trained (BERTimbau) | 0.84 | 0.64 |
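For reference, a multiclass AUC and a confusion matrix can be computed from predicted class probabilities as in the sketch below. The one-vs-rest macro averaging is an assumption, since the write-up does not say how the AUC was aggregated over the genres, and the labels and probabilities are made-up toy values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy example with 3 classes; each row of y_prob sums to 1.
y_true = np.array([0, 1, 2, 0, 1, 2])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],  # a class-0 example mistaken for class 1
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])

# One-vs-rest AUC, averaged with equal weight per class (assumed scheme).
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")

# Hard predictions for the confusion matrix: most probable class per row.
y_pred = y_prob.argmax(axis=1)
cm = confusion_matrix(y_true, y_pred)

print(f"AUC (OvR, macro): {auc:.2f}")
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so off-diagonal cells show which genres get confused with each other.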
- Confusion Matrix (BERT)

The Colab Notebook of this study is available here.
The data sets are available here.