Spark ML and Spark NLP
Gather and collect data
Clean and inspect data
Perform feature engineering
Split data into train/test
Evaluate and compare models



Spark has two machine learning packages:
- org.apache.spark.ml (high-level API): built on top of DataFrames; allows construction of ML pipelines
- org.apache.spark.mllib (predates DataFrames): the original API, built on top of RDDs
As in scikit-learn, the key abstractions are:
- DataFrame
- Transformer
- Estimator
- Pipeline
- Parameter

Transformers take DataFrames as input, and return a new DataFrame as output.
Transformers do not learn any parameters from the data — they simply apply rule-based transformations to either prepare data for model training or generate predictions using a trained model.
Transformers are run using the .transform() method.

Common transformers include:
- StringIndexer
- OneHotEncoder — can act on multiple columns at once
- Normalizer
- StandardScaler
- Tokenizer
- StopWordsRemover
- Word2Vec
- PCA
- Bucketizer
- QuantileDiscretizer

MLlib needs a single, numeric features column as input. Each row in this column contains a vector of data points corresponding to the set of features used for prediction.
Use the VectorAssembler transformer to create a single vector column from a list of columns.
All categorical data must be numeric for machine learning.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")
output = assembler.transform(dataset)
output.select("features", "clicked").show(truncate=False)

Estimators learn (or "fit") parameters from your DataFrame via the .fit() method, and return a model, which is a Transformer.


Pipelines combine multiple steps into a single workflow. The Pipeline constructor takes an array of pipeline stages via its stages argument.

Cleaner Code: No need to manually keep track of training and validation data at each step.
Fewer Bugs: Fewer opportunities to misapply a step or forget a preprocessing step.
Easier to Productionize: Pipelines help transition a model from prototype to deployed.
More Options for Model Validation: Cross-validation and other techniques apply easily to Pipelines.
Using the HMP Dataset — sensor data for human motion prediction.
The StringIndexer is an estimator with both fit and transform methods.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='class', outputCol='classIndex')
indexed = indexer.fit(df).transform(df)
indexed.show(5)
# +---+---+---+-----------+--------------------+----------+
# | x| y| z| class| source|classIndex|
# +---+---+---+-----------+--------------------+----------+
# | 29| 39| 51|Drink_glass|Accelerometer-201...| 2.0|
# ...

Unlike StringIndexer, OneHotEncoder was a pure transformer with only a transform method in Spark 2.x; since Spark 3.0 it is an estimator and must be fit first:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCol='classIndex', outputCol='categoryVec')
encoded = encoder.fit(indexed).transform(indexed)
encoded.show(5, False)
# +---+---+---+-----------+...+----------+--------------+
# |x |y |z |class |...|classIndex|categoryVec |
# +---+---+---+-----------+...+----------+--------------+
# |29 |39 |51 |Drink_glass|...| 2.0|(13,[2],[1.0])|

The Pipeline constructor takes an array of stages in the right sequence:
Feature selection using easy R-like formulas:
from pyspark.ml.feature import RFormula
dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])
formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")
output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

Pick categorical variables most dependent on the response variable:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["id", "features", "clicked"])
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
result.show()

Hyperparameter search with ParamGridBuilder:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

# TrainValidationSplit tries all parameter combinations
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           trainRatio=0.8)  # 80% train, 20% validation
# Run and find best parameters
model = tvs.fit(train)
# Make predictions on test data
model.transform(test) \
    .select("features", "label", "prediction") \
    .show()

CrossValidator is more rigorous than TrainValidationSplit: each parameter combination is evaluated on k separate train/test splits.
- TrainValidationSplit: evaluates each combination on a single train/validation split (cheaper)
- CrossValidator: evaluates each combination on k folds (more reliable, but k times the cost)
Built into PySpark:
String SQL functions: F.length(col), F.substring(str, pos, len), F.trim(col), F.upper(col), …
ML transformers for text:
- Tokenizer() — split text into tokens
- StopWordsRemover() — remove common words
- Word2Vec() — word embeddings
- CountVectorizer() — term frequency vectors

from pyspark.ml.feature import StopWordsRemover
df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
remover = StopWordsRemover(stopWords=["b"])
remover.setInputCol("text")
remover.setOutputCol("words")
remover.transform(df).head().words # ['a', 'c']
# Multiple columns
df2 = spark.createDataFrame([(["a", "b", "c"], ["a", "b"])], ["text1", "text2"])
remover2 = StopWordsRemover(stopWords=["b"])
remover2.setInputCols(["text1", "text2"]).setOutputCols(["words1", "words2"])
remover2.transform(df2).show()
# +---------+------+------+------+
# | text1| text2|words1|words2|
# +---------+------+------+------+
# |[a, b, c]|[a, b]|[a, c]| [a]|
# +---------+------+------+------+
Example: find A’s within a certain distance of a Y
# within 2 → 0
X X X X Y X X X A
# within 2 → 1
X X A X Y X X X A
# within 2 → 2
X A X A Y A X X A
# within 4 → 3
A X A X Y X X X A
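A minimal pure-Python sketch of this windowed count (the helper name count_as_near_y is mine, not from the course):

```python
def count_as_near_y(tokens, distance):
    """Count 'A' tokens within `distance` positions of any 'Y' token."""
    y_positions = [i for i, t in enumerate(tokens) if t == "Y"]
    return sum(
        1 for i, t in enumerate(tokens)
        if t == "A" and any(abs(i - y) <= distance for y in y_positions)
    )

print(count_as_near_y("X X X X Y X X X A".split(), 2))  # 0
print(count_as_near_y("X X A X Y X X X A".split(), 2))  # 1
print(count_as_near_y("X A X A Y A X X A".split(), 2))  # 2
print(count_as_near_y("A X A X Y X X X A".split(), 4))  # 3
```

In Spark this logic would run per row, e.g. inside a UDF applied to a tokenized column.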
Counting the Y’s in the text that have an A near enough to them can be done after tokenizing with Tokenizer() and StopWordsRemover().

Which NLP libraries have the most features?

Just because it is scalable does not mean it lacks features!


Why is Spark NLP faster?
And it scales across a cluster!
Reusing the Spark ML Pipeline
Reusing NLP Functionality

Spark NLP annotators follow the same pattern:
- AnnotatorApproach: has a fit() method to make an Annotator Model/Transformer
- AnnotatorModel: has a transform() method only
- Pretrained models are loaded with the pretrained() method

Q: Do transformer ML methods ever remove columns in a Spark DataFrame?
No!! They only add columns.


A PretrainedPipeline can be run with PretrainedPipeline.transform() (on a DataFrame) or PretrainedPipeline.annotate() (on plain strings).

from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")
word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "normal"]) \
    .setOutputCol("embeddings")

import sparknlp
from sparknlp.base import DocumentAssembler
data = spark.createDataFrame([
    ["Spark NLP is an open-source text processing library."]
]).toDF("text")
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
result = documentAssembler.transform(data)
result.select("document").show(truncate=False)
# +----------------------------------------------------------------------------------------------+
# |document |
# +----------------------------------------------------------------------------------------------+
# |[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
# +----------------------------------------------------------------------------------------------+

Spark NLP integrates with HuggingFace Transformers models:
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer
from sparknlp.annotator import DistilBertForSequenceClassification
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
# Save tokenizer and model locally
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(f'./{MODEL_NAME}_tokenizer/')
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
model.save_pretrained(f'./{MODEL_NAME}', saved_model=True)
# Load into Spark NLP
sequenceClassifier = DistilBertForSequenceClassification \
    .loadSavedModel(f'{MODEL_NAME}/saved_model/1', spark) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setCaseSensitive(True)