content – DATS 6450.13

About today’s class

This week we cover machine learning at scale using Spark’s MLlib library and natural language processing using the JohnSnowLabs Spark NLP library.

MLlib mirrors scikit-learn’s design (Transformers, Estimators, Pipelines, and Parameters) but executes across a distributed cluster. We’ll do a full feature engineering walkthrough (StringIndexer → OneHotEncoder → VectorAssembler → Normalizer → Pipeline), then train and tune models. After that we shift to text analytics: tokenization, stop-word removal, and building NLP pipelines using Spark NLP’s annotator model.

Readings

Readings for this lecture (to be completed before class):

Learning Spark, 2nd Edition by Damji et al. (2020). Chapter 10: Machine Learning with MLlib. (Free PDF from Databricks)
Spark NLP Code Concepts (JohnSnowLabs documentation)

Slides

The slides for this week are available online.

Lab

This week’s lab will use Spark ML to build a complete feature engineering and model training pipeline on a real dataset.

Lab link: here

Assignment

Details for this week’s assignment will be announced in class and posted to the course website.