I have been teaching IST 718: Advanced Information Analytics for some time now. Most of the students who take this class are from the iSchool, with occasional students from Maxwell and other schools at Syracuse University.
This course is a broad introduction to modern techniques in data science including elastic net regression, random forests, gradient boosting, and deep learning. It emphasizes a statistical learning point of view, and a careful examination of generalization error, model interpretability, feature engineering, and the bias-variance tradeoff.
The tool of choice is Apache Spark on Hadoop’s HDFS. The environment we use is Databricks Community Edition. We are exploring the possibility of building an in-house cluster with YARN, Jupyter notebook, and Spark all running on Kubernetes.
The prerequisites for this course are a basic knowledge of discrete mathematics, calculus, probability, and Python.
We use the following books:
- Python for Data Analysis (PFDA), 2nd Edition, by Wes McKinney
- An Introduction to Statistical Learning with Applications in R (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Spark: The Definitive Guide (STDG) by B. Chambers and M. Zaharia, upcoming (expected 2018)
- Deep Learning (DL) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville