Appendix D.

FAQ

1. Versions of tools (Scala, Python, Spark…).


Install the same tool versions as described in Appendix B to avoid version-related errors.
• Java 8
• Scala 2.12.4
• Maven 3.5.2
• Python 3.6 (if you use PyCharm, you need to install Python 3.6 or a higher version)
• Spark 2.2.1
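
A quick way to confirm the Python requirement is to ask the interpreter for its own version. The sketch below is only an illustration: the names REQUIRED_PYTHON and python_is_recent_enough are hypothetical helpers, not part of the lab material, and the check covers Python only (Java, Scala, Maven and Spark have to be checked with their own tools).

```python
import sys

# Minimum Python version taken from the list above (Python 3.6).
REQUIRED_PYTHON = (3, 6)


def python_is_recent_enough(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True if the (major, minor) version is at least `required`."""
    return tuple(version_info[:2]) >= tuple(required)


if __name__ == "__main__":
    status = "OK" if python_is_recent_enough() else "too old"
    print("Python", sys.version.split()[0], status)
```

Run it with the interpreter that PyCharm or your terminal actually uses, since several Python versions may be installed side by side.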

2. Can I use Windows to set up all of the environments needed for this lab?
Yes, but it is more difficult.

3. I cannot install VirtualBox on Mac.

VirtualBox does not support MacBooks with M1 or M2 chips. Although a developer preview of
VirtualBox for these Macs has been released, it is intended only for beta testing and is
unstable, with many unexplained errors. Instead, install stand-alone Spark or connect to the
server as described in Appendix E.

4. Can I use DataFrames or other methods as the solution?


No, you need to implement the labs with RDDs.

5. The Ubuntu image already has Python 3.5. Can I upgrade it to Python 3.6?

No, you need to keep Python 3.5 and install Python 3.6 alongside it.

6. The order of the output of my word count code is different from Appendix C.

This does not matter. The word count code does not sort the result, so the order of the output
depends on which worker finishes its task first.
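
To check that two outputs differ only in order, the lines can be compared as unordered collections. The following is a minimal sketch, assuming each output record sits on its own line; the function name same_records is hypothetical, and the exact record format of Appendix C is not assumed.

```python
def same_records(actual_lines, expected_lines):
    """Compare two outputs as unordered collections of non-empty lines."""

    def normalize(lines):
        # Strip surrounding whitespace, drop blank lines, then sort so that
        # the original order no longer matters (duplicates are preserved).
        return sorted(line.strip() for line in lines if line.strip())

    return normalize(actual_lines) == normalize(expected_lines)


# Example usage with two small in-memory outputs.
mine = ["('spark', 2)", "('rdd', 1)"]
reference = ["('rdd', 1)\n", "('spark', 2)\n"]
print(same_records(mine, reference))
```

If a fixed order is wanted in the Spark output itself, sorting the pair RDD (for example with sortByKey()) before saving is one option, although the lab does not require it.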

7. There are several files rather than one in my output folder.


Please refer to https://mungingdata.com/apache-spark/output-one-file-csv-parquet/
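
Spark writes one part file per partition, which is why the output folder contains several files. As a sketch of one way to combine them after the job has finished, the helper below concatenates the part files of a local output folder into a single file. The name merge_part_files and both paths are hypothetical, and the code assumes the output is plain text on the local file system, not on HDFS.

```python
from pathlib import Path


def merge_part_files(output_dir, merged_path):
    """Concatenate Spark part files (part-*) from output_dir into one file.

    Returns the number of part files that were merged.
    """
    # Marker files such as _SUCCESS and hidden .crc files do not match part-*.
    parts = sorted(p for p in Path(output_dir).glob("part-*") if p.is_file())
    with open(merged_path, "w", encoding="utf-8") as merged:
        for part in parts:
            merged.write(part.read_text(encoding="utf-8"))
    return len(parts)
```

An alternative inside Spark is to reduce the RDD to a single partition with coalesce(1) before saving, which produces one part file but removes parallelism from the write.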

8. I cannot copy files from VirtualBox to Windows/Mac.

You need to configure the shared folder first. Please refer to
https://helpdeskgeek.com/virtualization/virtualbox-share-folder-host-guest/

9. Notes on processing the dataset.
• Please use the parameter “allowBackslashEscapingAnyCharacter=True” when
reading a JSON file with the read.json function. (Important)
• If the “brand” field of a record is missing or empty, drop that record.
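
The two notes above can be sketched together as follows. This is an illustration under stated assumptions, not the reference solution: the function names has_brand and load_records_with_brand are hypothetical, spark is assumed to be an existing SparkSession, path is whatever input file the lab provides, and treating non-string or whitespace-only brand values as empty is an assumption of this sketch. The records are converted to an RDD of dictionaries right after reading, so that the rest of the work can be done with RDD operations.

```python
def has_brand(record):
    """Return True if the record (a dict) has a non-empty 'brand' value."""
    brand = record.get("brand")
    # Assumption: brand is a string; whitespace-only strings count as empty.
    return isinstance(brand, str) and brand.strip() != ""


def load_records_with_brand(spark, path):
    """Read a JSON file and return an RDD of dicts that have a brand.

    `spark` is an existing SparkSession; `path` is the input file location.
    """
    # Option named in the note above: accept a backslash before any character.
    df = spark.read.json(path, allowBackslashEscapingAnyCharacter=True)
    # Continue with RDD operations: one dict per record, bad brands dropped.
    return df.rdd.map(lambda row: row.asDict()).filter(has_brand)
```

Keeping the predicate as a plain function makes it easy to check on a few hand-written records before running it on the full dataset.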

10. The virtual machine sometimes runs quite slowly and may pause.
You can allocate more CPU cores and memory to the virtual machine.

11. I ran the program in Appendix C but I do not see any output file.
You may have saved the output to HDFS rather than to the local file system.
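
A path without a scheme is resolved against the default file system of the Hadoop configuration, which may be HDFS on a machine where HDFS is set up. One way to make the destination explicit is to pass a file:// URI as the output location. The sketch below builds such a URI; the name local_output_uri is a hypothetical helper, and it assumes Spark runs on a single machine, so that the local disk is the one you are looking at.

```python
from pathlib import Path


def local_output_uri(path):
    """Return a file:// URI for a local path, for use as an output location."""
    # resolve() makes the path absolute; as_uri() adds the file:// scheme.
    return Path(path).resolve().as_uri()


# Example: the URI could be passed to rdd.saveAsTextFile(...).
print(local_output_uri("output"))
```

If the output did go to HDFS, it can be listed with the hdfs dfs -ls command instead.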

12. Which IDE should I use?


Choose the IDE you are familiar with. If you are new to CS and are using an M1 or M2 Mac, you
may want to use VS Code to connect to the remote SoC cluster. Otherwise, it is more convenient
to implement and debug with PyCharm.

13. How can I get a high mark?

Grading of Lab 1 is based on the result and the documentation of the code. The program for Lab
1 is straightforward, and you should get the result within 1-2 minutes. If we cannot get the
result within 5 minutes, a penalty will be applied.

14. Is code documentation sufficient, or do we need to have a README file?


Either is acceptable. Remember to document the important steps.

15. If you are using Jupyter Notebook, remember to convert it to a Python file
before submitting.
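
The usual tool for this conversion is jupyter nbconvert with the --to script option. For illustration only, the sketch below shows what the conversion amounts to: it reads the notebook's JSON and writes its code cells to a .py file. The name notebook_to_script is hypothetical, and the sketch assumes a notebook in the standard .ipynb JSON format without IPython magics (lines starting with % or !), which would not be valid Python.

```python
import json


def notebook_to_script(notebook_path, script_path):
    """Write the code cells of a .ipynb file to a plain .py file.

    Returns the number of code cells written.
    """
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of strings.
        chunks.append(source if isinstance(source, str) else "".join(source))
    with open(script_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(chunks) + "\n")
    return len(chunks)
```

Whichever route you take, run the resulting Python file once before submitting to confirm that it still produces the expected output.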

16. If you have any questions, please post them on Canvas so that other students
can also see them.

17. Please review the instructions for Lab 1 and the Appendices carefully before
submitting.