Home / Expert Answers / Computer Science / in-this-homework-assignment-you-will-work-with-text-documents-and-perform-various-tasks-related-to-pa824

(Solved): In this homework assignment, you will work with text documents and perform various tasks related to ...





student submitted image, transcription available below
student submitted image, transcription available below
student submitted image, transcription available below
In this homework assignment, you will work with text documents and perform various tasks related to machine learning including data representation, clustering and classification. You are provided with a csv file containing text descriptions of courses and corresponding subject areas (target/label). In this assignment, you will explore this dataset and find groups of similar points/courses in the collection. Based on this data also, you will build a classifier that takes text as input (or a text file) and predicts the subject/topic. You will need to transform the text files into a representation suitable for processing. You can load the dataset using pandas. Then split it into a training and test set. Because we need to work with numeric data, we need to convert the data into a numeric representation. Features can be extracted using TF-IDF. TF-IDF is one effective way for representing text/documents. You can read more about TF-IDF here. Embeddings are also useful for this purpose and can capture deep semantics but you can use TF-IDF to represent the input feature vectors. Luckily, you do not have to compute the feature values of TF-IDF yourself, scikit learn has a built-in implementation of it. The following sample of code can be useful. inport pandas as pd import numpy an np from sklearn.feature_extraction.text import rfidfVectorizer from sklearn, Iinear_mode1, neighbors import KNeighborsclassifier from aklearn,mode1_selection import train_test_split, cross_val_score vectorizer TfidfVectorizer () vectorizer, folassifier in this part using the as example classifier.fit(X_train, train) X_test = vectorizer, transform ( 'Here' a a tent text: and computing'] ) prediction classifier,predict (X_test) print (predictions) 1. Perform clustering on the dataset provided using -means clustering (or any variant of your choice). Experiment with -values ranging from 2 to 10. Which value gives you the best result? How do you evaluate the clustering results? 2. Given two text files, compute the similarity between them using a. cosine similarity and b. another similarity metric of your choice (you need to specify it in comments). c. Why is it a good metric to use for computing similarity for text data? 3. Given a text description (or input file), compare it with the available data and retrieve the top results with highest similarity, where is entered by the user. 4. Build a classifier to classify a given input; i.e., predict the subject. You may use KNN or another classifier of your choice. Specify the classifier you implement. Given an input description, output the model prediction of the subject. Test the classifier accuracy using 10 -fold cross validation and report the results. Extra credit (optional). Create a simple GUI for the computing solution you built in the previous part. This could be a stand alone application window or a webpage. It should have the following functionalities: 1. Compute the similarity between two input texts (or text files). 2. Given a text description, output the model prediction of the subject. 3. Compute the similarity between a new input and the available data. Retrieve the top results with highest similarity, where is entered by the user. A sample of the layout expected is provided. Improvements are welcome. Using HW4 assignment link on moodle, submit a compressed folder containing one PDF and three python files: answers pdf containing your answers to and 4 above. clustering.py containing the code corresponding to item 1 above compare.py containing the code corresponding to items 2 findsimilar.py containing the code corresponding to items 3 classification.py containing the code corresponding to item 4 above name the folder with your numeric ID number. Submit your extra credit solution using the link specified for the extra credit part on moodle.


We have an Answer from Expert

View Expert Answer

Expert Answer



We have an Answer from Expert

Buy This Answer $5

Place Order

We Provide Services Across The Globe