Exercise 11-2:
Question: Use the d311.csv file as a data source and perform the following steps:
1. Open the file to understand its structure and identify column names.
2. Create a subdirectory RDD/Ex2 in HDFS and upload the d311.csv file to that subdirectory. Start the Spark Shell.
3. Check whether the file has a header row. If the first row is a header, remove it.
4. Create an RDD from the d311.csv file and display its first 10 elements. Provide a screenshot of the results. Use the count action to return the number of elements in the RDD.
5. Create a new RDD that keeps only the Agency, City, and Descriptor columns.
6. Display the first few elements of the new RDD. Provide a screenshot of the result.
7. Create a new RDD that keeps only City and Descriptor, restricted to records whose Descriptor contains the word "Sidewalk". Provide a screenshot of the result.
8. Save the RDD from step 7 back to the cluster. Open another terminal and verify that the results are stored there. Provide a screenshot of the result.
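
The per-record logic behind steps 3, 5, and 7 can be sketched in plain Python before running it on the cluster. This is a minimal illustration, not the expected solution. It assumes the header uses exactly the names "Agency", "City", and "Descriptor", that no quoted field contains a line break, and that the match on "Sidewalk" is case-sensitive. The sample rows and the output directory name are hypothetical, and the PySpark calls appear only in comments, on the assumption that the PySpark shell is used rather than the Scala one.

```python
import csv

def parse_line(line):
    """Split one CSV line into fields, honoring commas inside quotes."""
    return next(csv.reader([line]))

def column_indexes(header_line, names):
    """Look up the position of each wanted column in the header row."""
    header = [h.strip() for h in parse_line(header_line)]
    return [header.index(name) for name in names]

def project(fields, indexes):
    """Keep only the fields at the given positions, in that order."""
    return tuple(fields[i] for i in indexes)

def mentions_sidewalk(descriptor):
    """Case-sensitive check for the word used in step 7."""
    return "Sidewalk" in descriptor

# Hypothetical rows standing in for d311.csv; the real column order may differ.
sample = [
    "Unique Key,Agency,Descriptor,City",
    '1,DOT,"Sidewalk Violation, Broken",BROOKLYN',
    "2,NYPD,Loud Music/Party,BRONX",
]

header, rows = sample[0], sample[1:]          # step 3: drop the header row
idx = column_indexes(header, ["Agency", "City", "Descriptor"])
triples = [project(parse_line(r), idx) for r in rows]            # step 5
sidewalk = [(city, desc) for _, city, desc in triples
            if mentions_sidewalk(desc)]                          # step 7
print(sidewalk)

# In the PySpark shell the same helpers could be reused roughly like this
# (paths follow step 2; "sidewalk_out" is a made-up directory name):
#   lines    = sc.textFile("RDD/Ex2/d311.csv")
#   header   = lines.first()
#   data     = lines.filter(lambda l: l != header)
#   idx      = column_indexes(header, ["Agency", "City", "Descriptor"])
#   triples  = data.map(lambda l: project(parse_line(l), idx))
#   sidewalk = triples.filter(lambda t: mentions_sidewalk(t[2])) \
#                     .map(lambda t: (t[1], t[2]))
#   sidewalk.saveAsTextFile("RDD/Ex2/sidewalk_out")
```

Looking columns up by name rather than hard-coding positions keeps the sketch independent of the actual column order, which step 1 asks you to confirm by opening the file.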