[PySpark] Introduction to PySpark(1)



PySpark

  • Spark
  • SparkContext
  • SparkSession
  • Spark & Pandas

Spark

  • Spark must be connected to a cluster before it can be used
  • The cluster is hosted on a remote machine that is connected to all of the other nodes
  • One computer, called the master, manages splitting up the data and the computations
  • The master is connected to the rest of the computers in the cluster, which are called workers
  • The master sends the workers data and computations to run, and the workers send their results back to the master (a minimal connection sketch follows this list)
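
A minimal sketch of what that connection looks like in code; the cluster URL mentioned in the comment is a hypothetical placeholder, and running locally simply uses a 'local' master instead:

from pyspark import SparkContext

# 'local[*]' runs Spark on this machine using all available cores;
# a real cluster would be addressed by its master URL, e.g. 'spark://<master-host>:7077' (placeholder)
sc = SparkContext('local[*]', 'cluster-demo')
print(sc.master)               # local[*]
print(sc.defaultParallelism)   # how many tasks can run in parallel
sc.stop()                      # release the connection so a new SparkContext can be created later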

SparkContext

  • The connection to the cluster
  • SparkContext is a class whose optional arguments specify the attributes of the cluster
  • SparkConf() - an object that can be created to hold all of these attributes (a short sketch follows the example below)
from pyspark import SparkContext

# 'local' is the master URL and 'test' is the application name
sc = SparkContext('local', 'test')
print(sc)          # the SparkContext object itself
print(sc.version)  # version of Spark running on the cluster
<SparkContext master=local appName=test>
3.2.0
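
A minimal sketch of building the same attributes with SparkConf; the keys shown ('spark.master', 'spark.app.name') are the ones set by setMaster and setAppName:

from pyspark import SparkConf

# bundle the cluster attributes in a SparkConf object instead of passing them one by one
conf = SparkConf().setMaster('local').setAppName('test')
print(conf.get('spark.master'))    # local
print(conf.get('spark.app.name'))  # test

# the conf object can then be handed to the constructor: SparkContext(conf=conf)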

SparkSession

  • Needed to use DataFrames, which are more optimized than RDDs (Resilient Distributed Datasets)
  • SparkContext can be thought of as the connection to the cluster and SparkSession as the interface to that connection
  • SparkSession.builder.getOrCreate() - recommended, since creating multiple SparkSessions and SparkContexts can cause issues
  • catalog - lists all the data inside the cluster
from pyspark.sql import SparkSession

# returns the existing SparkSession if one is already running, otherwise creates a new one
spark = SparkSession.builder.getOrCreate()
print(spark)

# show names of all the tables in the cluster as a list
spark.catalog.listTables()
<pyspark.sql.session.SparkSession object at 0x7fb98450aaf0>
[]
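
A small sketch of how the two objects relate; this assumes the sc and spark objects created above are still active in the same session:

# the SparkSession wraps a SparkContext; both refer to the same cluster connection
print(spark.sparkContext)           # <SparkContext master=local appName=test>
print(spark.sparkContext.appName)   # test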

Spark & Pandas

# load the flights dataset (first 3 rows only) into a pandas DataFrame
import pandas as pd
pd_flight = pd.read_csv("flights.csv").head(3)
  • Pandas to Spark
# convert the pandas DataFrame to a Spark DataFrame (not yet registered in the catalog)
spark_temp = spark.createDataFrame(pd_flight)
print(spark.catalog.listTables())   # still empty

# register spark_temp in the catalog as a temporary view named "flights"
spark_temp.createOrReplaceTempView("flights")
print(spark.catalog.listTables())
[]
[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
  • Spark to Pandas
# aggregate in Spark, then pull the (small) result back into pandas with toPandas()
flights_count = spark.sql("SELECT origin, dest, COUNT(*) FROM flights GROUP BY origin, dest")
pd_flights_count = flights_count.toPandas()
pd_flights_count.head(3)
  origin dest  count(1)
0    SEA  LAX         1
1    SEA  SFO         1
2    SEA  HNL         1