[PySpark] Introduction to PySpark(1)
PySpark
- Spark
- SparkContext
- SparkSession
- Spark & Pandas
Spark
- Spark should be connected to a cluster in order to be used
- This cluster will be hosted on a remote machine that’s connected to all other nodes
- There is one computer, the master that manages splitting up the data and the computations
- The master is connected to the rest of the computers in the cluster, which are called worker
- The master sends the workers data and calculations to run, and they send their results back to the master
SparkContext
- connection to the cluster
- SparkContext is a class whcih contains optional arguments that specifies the attributes of the cluster
SparkConf()
- An object that can be created whcih holds all the attributes
from pyspark import SparkContext
sc = SparkContext('local', 'test')
print(sc)
print(sc.version)
<SparkContext master=local appName=test>
3.2.0
SparkSession
- Needed to use DataFrame which is more optimized than RDD(Resilient Distributed Dataset
- SparkContext can be thought as a connection to the cluster and SparkSession as the interface with that connection.
SparkSession.builder.getOrCreate()
- recommended since multiple SparkSessions and SparkContexts can cause issuescatalog
- lists all the data inside the cluster
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark)
# show names of all the tables in the cluster as a list
spark.catalog.listTables()
<pyspark.sql.session.SparkSession object at 0x7fb98450aaf0>
[]
Spark & Pandas
# dataset
import pandas as pd
pd_flight = pd.read_csv("flights.csv").head(3)
- Pandas to Spark
spark_temp = spark.createDataFrame(pd_flight)
print(spark.catalog.listTables())
# add spark_temp in the catalog as "flights"
spark_temp.createOrReplaceTempView("flights")
print(spark.catalog.listTables())
[]
[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
- Spark to Pandas
flights_count = spark.sql("select origin, dest, COUNT(*) from flights GROUP BY origin, dest")
pd_flights_count = flights_count.toPandas()
pd_flights_count.head(3)
origin | dest | count(1) | |
---|---|---|---|
0 | SEA | LAX | 1 |
1 | SEA | SFO | 1 |
2 | SEA | HNL | 1 |