Spark Diagnostics and UDFs

By default, RDDs are recomputed every time you run an action on them. This can be expensive if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
To tell Spark to cache an object in memory, use persist() or cache():
cache() — shortcut for the default storage level (memory only)
persist() — customizable to memory and/or disk
Caution: collect() pulls the entire dataset back to the driver, which can exhaust driver memory on large datasets.
Spark Application UI shows important facts about your Spark job:
Adapted from AWS Glue Spark UI docs and Spark UI docs






Input: a single row of data, with one or more columns used
Work: some processing written in Python that handles the input using plain Python syntax. No PySpark needed!
Output: a value with a declared return type
Problem: make a new column with ages for adults only
+-------+----------------+
|room_id|     guests_ages|
+-------+----------------+
|      1|    [18, 19, 17]|
|      2|     [25, 27, 5]|
|      3|  [34, 38, 8, 7]|
|      4|[56, 49, 18, 17]|
+-------+----------------+
Adapted from UDFs in Spark
from pyspark.sql.functions import udf, col

@udf("array<integer>")
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))
# alternatively, with explicit type annotation
from pyspark.sql.types import IntegerType, ArrayType

@udf(returnType=ArrayType(IntegerType()))
def filter_adults(elements):
    return list(filter(lambda x: x >= 18, elements))

+-------+----------------+------------+
|room_id|     guests_ages| adults_ages|
+-------+----------------+------------+
|      1|    [18, 19, 17]|    [18, 19]|
|      2|     [25, 27, 5]|    [25, 27]|
|      3|  [34, 38, 8, 7]|    [34, 38]|
|      4|[56, 49, 18, 17]|[56, 49, 18]|
+-------+----------------+------------+
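The output above is produced by adding the column with withColumn(). The UDF's core logic can also be written and tested as a plain Python function first (the name `adults_only` is illustrative):

```python
def adults_only(ages):
    # the UDF's core logic, runnable and testable without Spark
    return [a for a in ages if a >= 18]

# applied in Spark as a new column:
# df = df.withColumn("adults_ages", filter_adults(col("guests_ages")))
```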
When possible, prefer built-in Spark functions — they’re optimized and run on the JVM without serialization overhead.
Separate function definition form — lets you test the function locally first:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# define the function — can be tested without Spark
def squared(s):
    return s * s

# wrap in udf and define the output type
squared_udf = udf(squared, LongType())

# execute the udf
df = spark.table("test")
df.select("id", squared_udf("id").alias("id_squared")).show()

Single function definition form using decorator:
Register the function with Spark SQL:
Tip
Consider all the corner cases — where could the data be null or an unexpected value? Use Python control structures to handle them.
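For example, a SQL NULL arrives in a Python UDF as None, so arithmetic on it raises a TypeError unless guarded (a null-safe variant of the squared example; the name is illustrative):

```python
def squared_safe(s):
    # Python UDFs receive SQL NULL as None; guard before doing arithmetic
    if s is None:
        return None
    return s * s
```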

Costs of Python UDFs: each row must be serialized from the JVM to a Python worker process and back, and the UDF is a black box to the Catalyst optimizer, so Spark cannot optimize around it.
Other ways to make Spark jobs faster (source):
From PySpark docs:
Pandas UDFs are user-defined functions that are executed by Spark using Apache Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using pandas_udf as a decorator or to wrap the function.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show()
# +--------------+
# |to_upper(name)|
# +--------------+
# |      JOHN DOE|
# +--------------+

@pandas_udf("first string, last string")
def split_expand(s: pd.Series) -> pd.DataFrame:
    return s.str.split(expand=True)

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_expand("name")).show()
# +------------------+
# |split_expand(name)|
# +------------------+
# |       [John, Doe]|
# +------------------+

Vectorizing scalar operations — same-size input and output Series:
Regular UDF form:
Pandas UDF form — faster vectorized form (Spark 3.0+ syntax):
Note
PandasUDFType.SCALAR was deprecated in Spark 3.0. Use Python type annotations (pd.Series → pd.Series) instead.
Split-apply-combine using Pandas syntax (Spark 3.0+ syntax):
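A sketch of the grouped-map pattern (adapted from the PySpark docs' mean-subtraction example): the function itself is pure pandas and testable locally; the Spark call assumes a DataFrame `df` with columns `id` and `v`:

```python
import pandas as pd

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # per-group transform: center v on the group mean
    return pdf.assign(v=pdf.v - pdf.v.mean())

# In Spark, applied per group defined by the groupby clause:
# df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```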
Note
PandasUDFType.GROUPED_MAP was deprecated in Spark 3.0. Use applyInPandas() with type annotations instead.
Comparing Scalar Pandas UDFs and Grouped Map:
Input: pandas.Series (Scalar) vs. pandas.DataFrame (Grouped Map)
Output: pandas.Series (Scalar) vs. pandas.DataFrame (Grouped Map)
Grouping semantics: none (Scalar) vs. defined by the groupby clause (Grouped Map)
Output size: same as input (Scalar) vs. arbitrary (Grouped Map)
Note
Use Scalar Pandas UDFs for row-wise vectorized operations.
Use Grouped Map when you need split-apply-combine with arbitrary output sizes.