1- Apply the following pre-processing steps to the texts:
* Remove all words that contain numbers;
* Convert words to lowercase;
* Remove punctuation;
* Tokenize the texts into words, build a dictionary of the n unique
tokens, and convert each text into an n-dimensional vector of word
counts.
Next, find the 10 most frequent words in the text base (see the sketch
below).
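One way to implement step 1 is sketched below in plain Python. The corpus name `texts` and its two example strings are hypothetical placeholders; substitute your own text base. The filter for words containing numbers runs after tokenization, since it needs word boundaries to decide what counts as a word.

```python
# Minimal sketch of step 1, assuming the corpus is a list of strings.
import re
import string
from collections import Counter

texts = ["Example text 1 with 2 numbers.", "Another example text."]  # placeholder corpus

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop words containing digits."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if not re.search(r"\d", t)]

tokenized = [preprocess(t) for t in texts]

# Unique dictionary with n tokens, in a fixed order.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# n-dimensional count vector for each text.
def to_vector(tokens):
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[index[tok]] += 1
    return vec

vectors = [to_vector(doc) for doc in tokenized]

# 10 most frequent words across the whole text base.
counts = Counter(tok for doc in tokenized for tok in doc)
print(counts.most_common(10))
```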
2- Apply the following pre-processing steps to the texts processed in the previous question (a sketch for the whole step follows item d):
* Remove stopwords;
* Perform POS tagging;
* Perform stemming;
a) Display the results for a few sample texts.
b) Find the 10 most frequent words and compare them with the 10 most
frequent words from the previous question.
c) Repeat item b) using the stemmed tokens.
d) Find the most frequent parts of speech.
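The sketch below continues from the previous one (it reuses `tokenized`). NLTK is one reasonable choice here, not something the exercise prescribes; the stopword list and tagger assume English, and the resource names passed to `nltk.download` may differ slightly across NLTK versions.

```python
# Minimal sketch of step 2 with NLTK, reusing `tokenized` from step 1.
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")                   # stopword lists
nltk.download("averaged_perceptron_tagger")  # POS tagger model

stop = set(stopwords.words("english"))       # swap for your corpus language
stemmer = PorterStemmer()

filtered = [[t for t in doc if t not in stop] for doc in tokenized]
tagged = [nltk.pos_tag(doc) for doc in filtered]            # Penn Treebank tags
stemmed = [[stemmer.stem(t) for t in doc] for doc in filtered]

# a) Display the results for a few texts.
for doc, tags, stems in list(zip(filtered, tagged, stemmed))[:3]:
    print(doc, tags, stems, sep="\n")

# b) 10 most frequent words after stopword removal, for comparison with step 1.
print(Counter(t for doc in filtered for t in doc).most_common(10))

# c) The same count over the stemmed tokens.
print(Counter(t for doc in stemmed for t in doc).most_common(10))

# d) Most frequent parts of speech.
print(Counter(tag for doc in tagged for _, tag in doc).most_common(10))
```

Note that tagging lowercased, punctuation-free tokens is what the step order implies, though it costs the tagger some accuracy compared with tagging the raw sentences.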