IST 718: Big Data Analytics
This is an advanced course. There seem to be no official prerequisites
in Syracuse University’s catalog system for taking this class.
Most students have already taken IST 687 - Introduction to Data Science,
which is a nice introduction to the field. However, students will be
expected to know programming in Python or R and have
some background in linear algebra, calculus, probability, and statistics as well. This means
that even if you register for the class, you might not have the necessary
background to fully take advantage of what this class has to offer.
If you are in doubt, take the following test, which you should be able to solve relatively
easily:
Preliminary test
In the past, I have suggested students go through the following courses to grasp the basic math required to be a good data scientist:
- Linear algebra: MIT OCW’s Linear Algebra course, which is free. The first couple of lectures cover most of what you need.
- Calculus: MIT OCW’s Calculus course, which is also free. I would recommend Parts A and B for IST 718.
- Probability and statistics: I would recommend the first chapters of DeGroot and Schervish’s book “Probability and Statistics”.
- Programming: There are plenty of resources online about programming. For programming in Python, I would recommend Jake VanderPlas’s “Python Data Science Handbook”.
Goal
This course is a broad introduction to modern techniques in data science, including elastic net regularized regression, random forests, gradient boosting, and deep learning. It emphasizes a statistical learning point of view and a careful examination of generalization error, model interpretability, feature engineering, and the bias-variance tradeoff.
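To give a flavor of the first technique listed above, here is a minimal sketch of the elastic net objective in plain Python. It assumes the common parameterization in which `lam` sets the overall penalty strength and `alpha` mixes the L1 (lasso) and squared L2 (ridge) terms. The function names and the toy data are illustrative only and are not taken from the course materials.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Return lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha = 1 gives a pure lasso (L1) penalty; alpha = 0 gives pure ridge (L2).
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def elastic_net_objective(X, y, weights, lam, alpha):
    """Squared-error loss, scaled by 1 / (2n), plus the elastic net penalty.

    X is a list of rows (no intercept column is added), y is a list of targets.
    """
    n = len(y)
    residuals = [
        yi - sum(w * xij for w, xij in zip(weights, row))
        for row, yi in zip(X, y)
    ]
    loss = sum(r * r for r in residuals) / (2.0 * n)
    return loss + elastic_net_penalty(weights, lam, alpha)


# Toy example (made-up numbers): these weights fit the data exactly,
# so the objective reduces to the penalty term alone.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -2.0]
w = [1.0, -2.0]
print(elastic_net_objective(X, y, w, lam=0.5, alpha=0.5))
```

Raising `lam` shrinks the weights more aggressively, while moving `alpha` toward 1 favors sparse solutions. This is one concrete place where the bias-variance tradeoff shows up.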
Tools
The tool of choice is Apache Spark on Hadoop’s HDFS. The environment we use is Databricks Community Edition, which runs a highly customized version of the Jupyter Notebook.
Prerequisites
The prerequisites for this course are a basic knowledge of discrete mathematics, calculus, probability, and Python.
We use the following books:
- Python for Data Analysis (PFDA), 2nd Edition, by Wes McKinney
- An Introduction to Statistical Learning with Applications in R (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Spark: The Definitive Guide (STDG), Upcoming (expected 2018), by B. Chambers and M. Zaharia
- Deep Learning (DL) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville