Understand and be able to use Ensemble Models for Classification and Regression.
Understand and be able to use Vote. It uses a majority or average vote from the predictions of the inner learners. The inner learners can be a variety of types.
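As a minimal sketch of the same idea outside RapidMiner, scikit-learn's VotingClassifier combines heterogeneous inner learners by majority vote (the dataset and learner choices here are illustrative, not prescribed by the exam guide):

```python
# Majority voting over heterogeneous inner learners, analogous to the Vote operator.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
vote = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=500)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
], voting="hard")  # "hard" = majority vote on predicted class labels
vote.fit(X, y)
print(vote.score(X, y))
```

Setting `voting="soft"` instead averages predicted class probabilities, which corresponds to the average-vote behavior mentioned above.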
Understand and be able to use Bagging. Bootstrap Aggregating (Bagging) uses the same model type repeatedly by training it on different data samples.
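A minimal sketch of Bagging, assuming scikit-learn as a stand-in for the RapidMiner operator: the same base learner is trained repeatedly on bootstrap samples of the rows.

```python
# Bagging: one model type, trained repeatedly on bootstrap samples of the data.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(
    DecisionTreeClassifier(),  # the repeated inner learner
    n_estimators=25,
    bootstrap=True,            # sample training rows with replacement
    random_state=0,
)
bag.fit(X, y)
print(bag.score(X, y))
```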
Understand and be able to use Random Forest. It is like a specialized version of Bagging. It uses only Decision Trees as inner learners with different data samples, and only a subset of attributes is available for any given split.
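The same two ingredients show up as parameters in a typical library implementation; this sketch uses scikit-learn's RandomForestClassifier, where `max_features` controls the attribute subset available at each split:

```python
# Random Forest: bagged decision trees plus a random attribute subset per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # only sqrt(n_attributes) candidates at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```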
Understand and be able to use Boosting. It gradually improves estimations with each successive model. Common methods are Gradient Boosted Trees and AdaBoost.
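Both common methods have direct scikit-learn analogues; this sketch fits each on the same illustrative data (settings here are arbitrary defaults, not recommendations):

```python
# Boosting: each successive model corrects the errors of the ones before it.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
gbt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(gbt.score(X, y), ada.score(X, y))
```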
Understand and be able to use Support Vector Machine Models for Classification and Regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space.
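As a brief sketch, scikit-learn's SVC with an RBF kernel illustrates the hyperplane idea: the kernel implicitly maps points into a high-dimensional space where a separating hyperplane is found.

```python
# SVM classification: separating hyperplane in a kernel-induced feature space.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print(svm.score(X, y), len(svm.support_vectors_))
```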
Understand and be able to use Deep Learning Models for Classification and Regression. It is based on a multi-layer feed-forward artificial neural network.
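A minimal analogue of that architecture is scikit-learn's MLPClassifier, a feed-forward network; the two-hidden-layer shape here is illustrative only:

```python
# A small multi-layer feed-forward network (two hidden layers of 20 units each).
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
net = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000,
                    random_state=0).fit(X, y)
print(net.score(X, y))
```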
Understand how Cross Validation works, and what each of the operator output ports provide. Cross Validation is a very powerful tool for validation because it provides a distribution of performances rather than a single point estimate. It can play a worthwhile role in building better models.
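The "distribution rather than a point estimate" idea can be seen directly in a sketch using scikit-learn, where 10-fold cross validation returns ten performance values:

```python
# 10-fold cross validation: ten held-out performance estimates,
# summarized as a mean and a standard deviation rather than one number.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())
```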
Be able to perform automated Parameter Optimization.
Understand Optimize Parameters (Grid) in detail. It is a common technique that is easy to understand and implement.
Be able to use Optimize Parameters (Evolutionary). It involves a more advanced method that can often get a better solution faster than a grid approach, particularly in large or unbounded parameter spaces.
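The core evolutionary idea, mutate a candidate and keep improvements, can be shown with a toy sketch; the objective function below is a hypothetical stand-in for cross-validated model performance, not anything RapidMiner computes:

```python
# Toy evolutionary search over one numeric parameter: mutate, evaluate, select.
import random

random.seed(0)

def fitness(x):
    # Hypothetical stand-in for cross-validated performance; best at x = 3.7.
    return -(x - 3.7) ** 2

best = random.uniform(0, 10)          # random initial candidate
for _ in range(200):
    child = best + random.gauss(0, 0.5)  # mutation
    if fitness(child) > fitness(best):   # selection: keep only improvements
        best = child
print(round(best, 2))
```

Unlike a grid, the search is not confined to a fixed list of values, which is one reason it can work well in large or unbounded parameter spaces.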
Understand Model Selection considerations and tradeoffs. These tradeoffs change not only from one business problem to another, but also over CRISP-DM iterations as the project matures. New projects may place a higher demand on fast, understandable models, then shift toward better model performance as they mature.
Know when Understandability of the model is important. This is often the case when recommendations or decisions need to be explained or justified. It varies with the business objective and with the audience that needs to understand the model.
Know how to evaluate Model Performance or predictive power. It can be measured many ways. Accuracy, AUC, and squared correlation are a few of the most common methods. Costs can be a powerful way to handle situations where not all errors or correct predictions have the same cost or value; it can also help translate model performance into business value.
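A small sketch of those ideas, using scikit-learn metrics on made-up predictions (the cost values below are illustrative assumptions):

```python
# Accuracy and AUC on toy predictions, plus a simple misclassification cost.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
scores = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]  # predicted confidences for class 1

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, scores)

# Assumed costs: a false negative costs 10, a false positive costs 1.
cost = sum(10 if t == 1 and p == 0 else (1 if t == 0 and p == 1 else 0)
           for t, p in zip(y_true, y_pred))
print(acc, auc, cost)
```

Here the single false negative dominates the total cost even though accuracy treats both error types alike, which is exactly why cost-based evaluation can translate model performance into business value.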
Know when Runtime or computational performance is important. It can be different for training and scoring. It is also different depending on the business objective and audience.
Know when and how to perform Feature Engineering. It can lead to better models in terms of Understandability, Model Performance, and Runtime. It can involve both generating features that are more useful for model performance, and eliminating features that are less useful. This can result in simpler models based on fewer attributes.
Be able to handle Dates. They may not be usable without transformation, and are often transformed into one or more numeric attributes. The Date to Numerical operator is flexible, but care is needed with the time unit and relative to parameters. Date to Nominal can also be useful in special cases.
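The transformation itself is simple to sketch in plain Python, analogous to what Date to Numerical produces: a raw timestamp plus derived numeric features such as month and day of week.

```python
# Turning one date into several numeric attributes.
from datetime import datetime, timezone

d = datetime(2023, 5, 17, tzinfo=timezone.utc)
epoch_seconds = d.timestamp()   # seconds since the Unix epoch
month = d.month                 # 1..12
day_of_week = d.weekday()       # 0 = Monday
print(epoch_seconds, month, day_of_week)
```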
Understand how and when to use the Nominal to Numerical operator, and the different coding methods. This operator uses dummy (one-hot) coding by default, and this or effect coding is usually a good choice. It also offers integer coding, which should be avoided in most situations.
Understand how and when to Remove Attributes. They are often removed with Select Attributes, Remove Useless Attributes, and Remove Correlated Attributes.
Understand Forward Selection in detail. It provides an intuitive way to automatically select attributes.
Understand Backward Elimination in detail. It provides an intuitive way to automatically select attributes and often slightly outperforms Forward Selection.
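Both directions can be sketched with scikit-learn's SequentialFeatureSelector; forward selection starts with no attributes and adds the most helpful one at a time, while backward elimination starts with all attributes and drops the least helpful:

```python
# Forward selection vs. backward elimination, each keeping 2 of 4 attributes.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=500),
    n_features_to_select=2, direction="forward", cv=3,
).fit(X, y)
backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=500),
    n_features_to_select=2, direction="backward", cv=3,
).fit(X, y)
print(forward.get_support(), backward.get_support())
```

The two directions can select different subsets, since each only ever evaluates one attribute change at a time from its own starting point.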
Be able to use Optimize Selection (Evolutionary). It takes an advanced approach and commonly outperforms both Forward and Backward selection. With the right selection schemes, it can be used for Multi-Objective Feature Selection.
Know how and when to use Automatic Feature Engineering. This operator can generate attributes based on common transformations and perform feature selection like Optimize Selection (Evolutionary). However, the parameters are trimmed down to the most commonly recommended settings.
Understand Time Series and Forecasting. It can be considered a type of supervised learning and regression, but it has enough special situations and techniques to stand as an important subject on its own. In this case, the label that we are trying to predict is primarily dependent on its own past values rather than on other attributes.
Be able to perform Transformation and Feature Extraction with Time Series.
Be able to perform and understand Time Series Decomposition. It is used to identify how the values trend over large time frames, and the seasonality effect over smaller time frames. This can help to interpret forecast models, and generate features for forecast models.
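A simplified additive decomposition can be sketched in pandas on a synthetic monthly series: a centered rolling mean estimates the trend, and per-month averages of the detrended values estimate the seasonality (real work would use a dedicated decomposition routine).

```python
# Additive decomposition: series = trend + seasonal + residual.
import numpy as np
import pandas as pd

n = 48  # four years of monthly data (synthetic)
t = np.arange(n)
series = pd.Series(10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12))

trend = series.rolling(window=12, center=True).mean()  # smooths out the season
detrended = series - trend
seasonal = detrended.groupby(t % 12).transform("mean")  # average per month
residual = series - trend - seasonal
print(residual.dropna().abs().max())
```

On this clean synthetic series the residual is essentially zero, confirming the trend and seasonal components captured everything.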
Be able to understand and use ARIMA and Holt-Winters. They are common forecasting techniques. The ARIMA operator takes a single time-series attribute and parameters for autoregression (p), differencing (d), and moving average (q). These parameters control the forecast model fit, including different trends and seasons. For creating new predictions, use Apply Forecast.
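To make the Holt-Winters family concrete, here is a minimal plain-Python sketch of Holt's double exponential smoothing (the level-plus-trend part; full Holt-Winters adds a seasonal component, and real work would use a library implementation):

```python
# Holt's double exponential smoothing: track a level and a trend,
# then forecast by extrapolating the trend.
def holt_forecast(series, alpha=0.5, beta=0.5, horizon=3):
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)  # smooth the level
        trend = beta * (level - last_level) + (1 - beta) * trend  # smooth the trend
    return [level + (h + 1) * trend for h in range(horizon)]

# A perfectly linear series is forecast exactly.
print(holt_forecast([10, 12, 14, 16, 18], horizon=2))  # → [20.0, 22.0]
```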
Know how and when to use Windowing. It generates new features of past values; the new features become attributes that will be used for prediction. Then you use the regression model type of your choice. This is a powerful and flexible technique that can allow you to handle complex situations where past values provide useful information, but other attributes also provide useful information.
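The windowing transformation itself is easy to sketch with pandas: each past value is shifted into its own lag column, and the unshifted value becomes the label for a regression learner.

```python
# Windowing: turn past values into attribute columns for ordinary regression.
import pandas as pd

s = pd.Series([5, 7, 9, 11, 13, 15], name="value")
window = pd.DataFrame({
    "lag_3": s.shift(3),
    "lag_2": s.shift(2),
    "lag_1": s.shift(1),
    "label": s,            # the value to predict (horizon = 1)
}).dropna()                # drop rows without a full window of history
print(window)
```

Other attributes (prices, weather, promotions) can simply be added as extra columns alongside the lags, which is what makes the technique so flexible.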
Understand how to integrate R and Python models. It is easy if the code is already available for both training and scoring.
Be able to connect one Execute Python or Execute R operator for training and another for scoring. Whether scoring is done for new predictions or for validation, the R or Python model can be stored and passed through processes just like any other model.
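The train/score split can be sketched in plain Python: the training step serializes the fitted model, and a separate scoring step deserializes it and predicts, mirroring how a model object is passed between operators (the serialized bytes here stand in for the model flowing through the process).

```python
# Training script: fit a model and serialize it.
import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)
blob = pickle.dumps(model)   # the "model" that leaves the training operator

# Scoring script: deserialize the model and apply it to (here, the same) data.
restored = pickle.loads(blob)
preds = restored.predict(X[:5])
print(preds)
```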