This tutorial is a work in progress for practising PySpark using Jupyter notebooks. Although I have provided explanations for some of the basic concepts of Spark, these explanations should by no means be construed as complete. I will add more as and when I find the time to revisit the work already done. I'd appreciate feedback from people following this repo so that I can make the necessary corrections.
Pre-requisites:
Python, Jupyter Notebook, and the basics of distributed computing (a theoretical understanding should be enough)
An understanding of the Hadoop ecosystem is NOT required to understand Spark. Occasionally, terms like HBase and HDFS will pop up. I have included an explanation wherever I felt it was absolutely necessary for the reader to understand the concept. If an explanation isn't included, it means the concept can be understood without knowing the term in question. Likewise, knowledge of MapReduce is optional for learning Spark. That said, I have included a chapter explaining MapReduce through a word count example, which, by the way, is the "Hello World" program of the big data world. A minimal PySpark sketch of it is shown below.
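For a flavour of what that chapter covers, here is a minimal word count sketch in PySpark. The input file name ("input.txt") and the local master URL are placeholders for illustration, not files from this repo:

```python
# Word count: the "Hello World" of the big data world.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")  # run locally, using all cores

counts = (sc.textFile("input.txt")               # placeholder input file
            .flatMap(lambda line: line.split())  # split each line into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs -- the "map" phase
            .reduceByKey(lambda a, b: a + b))    # sum counts per word -- the "reduce" phase

for word, count in counts.collect():
    print(word, count)

sc.stop()
```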
Thanks
Slides from the Coursera lecture, included in this repository.
A big thanks to this tutorial, which was created with the same intent. It helped me a lot to understand the concepts by giving me something to build upon.
Thanks to the numerous Quora users who explain technical jargon in the most lucid terms.
Spark Installation Notes
I followed this link to install Spark, with the following difference:
Instead of using the Anaconda distribution for Python, I went ahead with the installation that comes with Ubuntu. I am not a huge fan of Anaconda and prefer to install Python libraries as and when required.
I installed Spark 2.2.0. Note that this version of Spark does not work with Oracle Java 9; it works with Java 8. While installing Java, you'll be prompted to install version 9. DO NOT install version 9. It took me quite some time to figure this out. :-)
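To confirm the setup, a small sanity check along these lines can help (my own sketch, not part of the linked installation guide):

```python
# Sanity check: verify the Java runtime and run a trivial Spark job.
import subprocess

from pyspark import SparkContext

# "java -version" writes to stderr; the output should mention "1.8" (Java 8).
result = subprocess.run(["java", "-version"], stderr=subprocess.PIPE)
print(result.stderr.decode())

# With Java 9 on the PATH, Spark 2.2.0 fails at SparkContext startup.
sc = SparkContext("local", "InstallCheck")
print("Spark version:", sc.version)        # expect 2.2.0
print(sc.parallelize(range(100)).count())  # trivial job; expect 100
sc.stop()
```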