
abhisheksaurabh1985/spark-for-noobs-by-a-noob: Jupyter notebooks for learning PySpark


Open-source project name:

abhisheksaurabh1985/spark-for-noobs-by-a-noob

Open-source project URL:

https://github.com/abhisheksaurabh1985/spark-for-noobs-by-a-noob

Open-source programming language:

Jupyter Notebook 100.0%

Open-source project introduction:

Jupyter Notebooks to Practice Spark

Introduction

This tutorial is a work in progress for practising PySpark using Jupyter notebooks. Although I've provided explanations of some of the basic concepts of Spark, these explanations should by no means be construed as complete. I will add more explanations as and when I get the time to revisit the work already done. I'd appreciate it if people following this repo could give me feedback so that I can make the necessary corrections.

Pre-requisites:

  • Python, Jupyter Notebook, and the basics of distributed computing (a theoretical understanding should be enough).
  • An understanding of the Hadoop ecosystem is NOT required to understand Spark. Occasionally, terms like HBase and HDFS might pop up. I have included an explanation wherever I felt it was absolutely necessary for the reader to understand the concept. If an explanation isn't included, it means the concept can be understood even without knowing the keyword in question. Likewise, knowledge of MapReduce is optional for learning Spark. That said, I have included a chapter explaining MapReduce through a word count example, which, by the way, is the "Hello World" program of the big data world (a minimal sketch follows this list).
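For readers who have not seen it before, here is a minimal PySpark word count in the spirit of that chapter. This is my own sketch, not taken from the notebooks; the input path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# "input.txt" is a placeholder path; any plain-text file will do.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))
sc.stop()
```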

Thanks

  • Slides from the Coursera lecture, included in the repository.
  • A big thanks to this tutorial, which was created with the same intent. It helped me a lot in understanding the concepts and gave me something to build this tutorial upon.
  • Thanks to the numerous Quora users who explain technical jargon in the most lucid terms.

Spark Installation Notes

I followed this link to install Spark, with the following differences:

  • Instead of using the Anaconda distribution for Python, I went with the installation that comes with Ubuntu. I am not a huge fan of Anaconda and prefer to install Python libraries as and when they are required.
  • I installed Spark 2.2.0. Note that this version of Spark does not work with Oracle Java 9; it works with Java 8. While installing Java, you'll be prompted to install version 9. DO NOT install 9. It took me quite some time to figure this out. :-) A small sanity check follows this list.
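As a quick way to confirm the setup, here is a small sanity-check snippet of my own (not part of the linked installation guide). It assumes pyspark is importable from your Python installation; the app name is arbitrary.

```python
from pyspark import SparkContext

# Start a local Spark context; "install-check" is just an arbitrary app name.
sc = SparkContext("local[*]", "install-check")

print(sc.version)                        # expect something like "2.2.0"
print(sc.parallelize(range(10)).sum())   # expect 45 if the JVM side is working

sc.stop()
```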

Table of Contents

Topic | Notebook | Content Description
RDD: definition and creation | 01_rdd_definition_and_creation.ipynb | Definition of an RDD; types of operations; the parallelize and textFile methods for creating an RDD
RDD basic operations, part I | 02_rdd_basic_operations.ipynb | Immutability and lazy evaluation; examples of a few basic transformations and actions
WordCount in Spark | 03_wordcount_mapreduce.ipynb | Spark transformations explained through a word count example
JOIN in Spark | 04_join_in_spark.ipynb | Simple and advanced joins through a Coursera assignment
Handling Parquet files | 05_parquet_file_basics.ipynb | Introduction to Parquet and column-oriented data storage; example of reading a Parquet file
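As a taste of what the first two notebooks cover, here is a rough sketch of my own (with placeholder paths) of the two ways to create an RDD and of transformations being lazy until an action is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics-sketch")

# Create an RDD from an in-memory collection.
nums = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD from a text file ("data.txt" is a placeholder path).
lines = sc.textFile("data.txt")

# Transformations are lazy: map returns a new RDD without running anything.
squares = nums.map(lambda x: x * x)

# Actions trigger the actual computation.
print(squares.collect())   # [1, 4, 9, 16, 25]
print(lines.count())       # number of lines in data.txt

sc.stop()
```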


