Course Description
This course is an introduction to statistical inference, broadly construed as the process of drawing conclusions from data and quantifying the uncertainty about those conclusions. The goal is to introduce the basic ideas of statistical learning and predictive modeling from statistical, theoretical and computational perspectives, together with applications to real data. Topics cover the major schools of thought that influence modern scientific practice, including classical frequentist methods, machine learning and Bayesian inference. The course aims to provide an applied overview of some classical linear approaches such as Linear Regression, Logistic Regression and Linear Discriminant Analysis, as well as some non-linear methods such as K-Means Clustering, K-Nearest Neighbors, Generalized Additive Models, Decision Trees, Boosting, Bagging and Support Vector Machines.
Prerequisites/Corequisites
Knowledge of basic multivariate calculus, statistical inference, and linear algebra is expected. Students should be comfortable with the following concepts: probability distribution functions, expectations, conditional distributions, likelihood functions, random samples, estimators and linear regression models. This course will make extensive use of the statistical software R and build on knowledge of introductory probability and statistics, as well as multiple regression. If you have any doubt about your preparation for this course, feel free to chat with me on the first day.
Textbook and Materials
- Required textbook: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Available online
- Recommended textbook: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman. The required text is a simplified version of this book. Please refer to it if you want to go into more depth on any topic covered in the course text, particularly from a theoretical perspective. Available online
- Statistical Software: this class will primarily use the open-source statistical software R. R has several advantages: in addition to supporting all of the statistical learning methods that will be covered, it is widely used by research statisticians.
- Go to https://www.r-project.org to download R for free
- Downloading RStudio and getting familiar with it is strongly recommended. It is free and runs on Windows, Mac and Linux operating systems.
- Make sure to install the ISLR package, which includes the datasets used in the course book.
- An excellent introduction to R is the book Using R for Introductory Statistics by J. Verzani. The book is freely available here
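
For the ISLR package mentioned in the list above, the installation step might look like the following in an R session. This is a minimal sketch that assumes a working internet connection and access to CRAN; the choice of the Auto dataset for the final check is only an illustration, not something the course prescribes.

```r
# Install the ISLR package from CRAN if it is not already installed.
# (Assumes the default CRAN mirror is reachable.)
if (!requireNamespace("ISLR", quietly = TRUE)) {
  install.packages("ISLR")
}

# Load the package into the current session.
library(ISLR)

# List the datasets that ship with the package.
data(package = "ISLR")

# Look at the first few rows of one dataset as a quick check
# that the installation worked (Auto is used here as an example).
head(Auto)
```

Installation is needed only once per R installation, while `library(ISLR)` has to be run in each new session before the datasets can be used.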