Description
Data mining and statistical learning methods use a variety of computational tools to understand large, complex datasets. In some cases, the focus is on building models that predict a quantitative or qualitative output from a collection of inputs. In others, the goal is simply to find relationships and structure in data with no specific output variable. This course takes an applied approach to understanding the methodology, motivation, assumptions, strengths, and weaknesses of the most widely applicable methods in this field.
Syllabus
This course covers methodology, major software tools, and applications in data mining. By introducing the principal ideas in statistical learning, the course helps students understand the conceptual underpinnings of data mining methods. It focuses more on using existing software packages (mainly in R) than on having students develop the algorithms themselves. Students will be required to work on projects to practice applying the existing software. The topics include statistical learning; resampling methods; linear regression; variable selection; regression shrinkage; dimension reduction; non-linear methods; logistic regression; discriminant analysis; nearest-neighbors; decision trees; bagging; boosting; support vector machines; principal components analysis; and clustering.
Prerequisites
- STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; and stepwise, piecewise, and logistic regression.
- Basics of probability, expectation, and conditional distributions. Review the Basic Statistical Concepts notes on the STAT online site.
- Matrix algebra and multivariate calculus will be beneficial but are not required. Review the Matrix Algebra Review notes on the STAT online site.
- The examples in the course use R, and students will complete weekly R Labs to apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming experience in R or exposure to a programming language such as MATLAB or Python will be useful. Introductions to R are available at Statistical R Tutorial and CRAN R Project Intro Manual.
Textbooks
Required: An Introduction to Statistical Learning, with applications in R (2013), G. James, D. Witten, T. Hastie, R. Tibshirani (Springer).
Recommended Reading
- The Elements of Statistical Learning, 2nd edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Pattern Recognition and Machine Learning by C. M. Bishop
- All of Statistics: A Concise Course in Statistical Inference by L. Wasserman
- Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
- Principles of Data Mining by D. J. Hand, H. Mannila, and P. Smyth
- Pattern Recognition and Neural Networks by B. Ripley
Other Resources
- Learning Online Orientation
- Obtaining Statistical Software
- Datasets taken from the UCI Machine Learning Repository:
  - Iris: iris.data, source (including data set information)
- Datasets taken from An Introduction to Statistical Learning:
  - Auto.data
  - Advertising.data
  - Credit.data
- Other datasets:
  - smsa.data
  - prostate.data
STAT 508: Applied Data Mining & Statistical Learning
- Type: Online Courses
- Provider: OPEN.ED@PSU