[PySpark] Big Data Fundamentals with PySpark (1)


Spark

  • Big Data terminology
  • Spark Modes of Deployment

Big Data terminology

  • Clustered Computing - Collection of resources from multiple machines
  • Parallel Computing - Simultaneous computation
  • Distributed Computing - Collection of nodes (networked computers) that run in parallel (see the sketch after this list)
  • Batch Processing - Breaking a job into small pieces and running them on individual machines
  • Real-time Processing - Immediate processing of data as it arrives
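
To make the parallel and distributed terms concrete, here is a minimal, self-contained sketch (the app name, thread count, and data are illustrative, not from the original notes). It spreads a small dataset over several partitions that Spark can process in parallel; in local mode the "nodes" are simply worker threads on one machine.

from pyspark import SparkContext

# local[4] runs Spark on a single machine with 4 worker threads
sc = SparkContext('local[4]', 'terminology-demo')

# Split the data into 4 partitions that can be processed in parallel
rdd = sc.parallelize(range(1, 9), numSlices=4)
print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # e.g. [[1, 2], [3, 4], [5, 6], [7, 8]]

sc.stop()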

Spark Modes of Deployment

  • Local Mode - Single machine such as a laptop
    • convenient for testing, debugging, and demonstration
  • Cluster Mode - Set of pre-defined machines
    • for production
  • Workflow (Local Mode ➡️ Cluster mode)
    • No code change necessary (see the sketch below)
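
A minimal sketch of that workflow (the script and app names are illustrative): the script never hardcodes a master, so the same file runs unchanged in local mode and on a cluster; the master is chosen at submit time.

# deploy_demo.py
from pyspark.sql import SparkSession

# No master is hardcoded, so the code needs no change between modes
spark = SparkSession.builder.appName('deploy-demo').getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())
spark.stop()

# Local mode (e.g. on a laptop):
#   spark-submit --master local[*] deploy_demo.py
# Cluster mode (e.g. against a YARN cluster):
#   spark-submit --master yarn deploy_demo.py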

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create a SparkContext in local mode, then a SparkSession on top of it
sc = SparkContext('local', 'lernen2-1')
spark = SparkSession.builder.getOrCreate()

print(sc)
print(spark)
print(sc.master)

# Output:
# <SparkContext master=local appName=lernen2-1>
# <pyspark.sql.session.SparkSession object at 0x7f0ea9577e50>
# local
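
Since the session was created after the context, getOrCreate() should reuse the existing SparkContext rather than start a new one; a quick check (assuming the sc and spark objects above):

# The SparkSession wraps the same SparkContext created earlier
print(spark.sparkContext is sc)  # True
print(spark.version)             # the Spark version string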
numb = range(1, 100)

# Load the Python range into PySpark as an RDD
spark_data = sc.parallelize(numb)

print(spark_data)
print(type(spark_data))

# Output:
# PythonRDD[1] at RDD at PythonRDD.scala:53
# <class 'pyspark.rdd.PipelinedRDD'>
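
parallelize() only distributes the data; nothing is computed until an action is called on the RDD. A short follow-up using the spark_data RDD above (the actions shown are standard RDD methods):

# Actions trigger the actual distributed computation
print(spark_data.count())                     # 99 elements in range(1, 100)
print(spark_data.take(5))                     # [1, 2, 3, 4, 5]
print(spark_data.reduce(lambda x, y: x + y))  # 4950, the sum of 1..99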