Understand the basics of Analytics, Machine Learning, and Data Science.
Know ways of categorizing Analytics. One way is to separate Business Intelligence (BI) from Advanced Analytics.
Know ways of categorizing Machine Learning (ML). At a high level, machine learning can be categorized into Supervised and Unsupervised learning. Supervised learning is when there is a predefined label. Unsupervised learning is when there is no label or defined outcome.
Know what Classification is. It is Supervised Learning where the label is nominal. There are many algorithms that can handle this type of problem. Although it is not necessary to know the full details of how each algorithm is implemented, it is useful to understand a little about how they work, because that helps you understand when they will be useful.
Know what Regression is. The term is commonly used in the general sense to describe Supervised Learning where the label is numeric. Just as with classification, it’s important to understand a little about the algorithms and how to implement them.
Understand and be able to use common Machine Learning algorithms:
Understand and be able to use k-NN. It is a lazy learner based on comparing an unknown Example with the k training Examples which are the nearest neighbors of the unknown Example.
Understand and be able to use Naïve Bayes. It is a high-bias, low-variance classifier, and it can build a good model even with a small data set. It is simple to use and computationally inexpensive.
Understand and be able to use Decision Tree. It is a tree-like collection of nodes intended to produce either a decision on a value's affiliation to a class or an estimate of a numerical target value.
Understand and be able to use Linear Regression. It models the relationship between a scalar dependent variable and one or more explanatory variables by fitting a linear equation.
Understand and be able to use Logistic Regression. It is a form of linear model specialized for binominal (two-class) predictions. It uses the logit as the link function.
Understand and be able to use Generalized Linear Model. It is a generalization of Linear Regression and Logistic Regression, and it can be used for both regression and classification.
Understand and be able to use Neural Net. It is a feed-forward neural network trained by a back propagation algorithm (multi-layer perceptron).
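As one illustration of the algorithms listed above, k-NN is small enough to sketch in plain Python. This is a minimal sketch, not any tool's implementation: it assumes numeric attributes, Euclidean distance, and a simple majority vote, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (features, label) pairs with numeric feature tuples.
    """
    # Lazy learner: there is no training step, all work happens at prediction time.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two numeric attributes and a nominal label.
train = [
    ((1.0, 1.1), "low"),
    ((1.2, 0.9), "low"),
    ((0.8, 1.0), "low"),
    ((5.0, 5.2), "high"),
    ((5.1, 4.8), "high"),
    ((4.9, 5.0), "high"),
]

prediction = knn_predict(train, (1.1, 1.0), k=3)
print(prediction)
```

Because the whole training set is scanned for every prediction, this approach is cheap to "train" but can be slow to score on large data sets.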
Know what Scoring is and how to implement it. Model Scoring, scoring, or scoring the data refers to the application of a model to new data. This typically happens when the model is in production and new data arrives whose label has not yet been observed, so the model is used to make new predictions. The same techniques are used in model validation, except in that case there is an observed label that can be compared to the prediction.
Understand, be able to implement, and interpret the results of Split Validation. Split Validation, also called Hold-out Validation, is the simplest way to validate a predictive model. You reserve some of the data that has a known label, and do not use that data during model training. Then you use the model to make predictions on the hold-out (test) data and measure the success of the predictions. There are several ways to implement this, including the Split Data and Split Validation operators. Performance or error rates are then available for both the training data and the test data.
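The hold-out idea, and the scoring step described earlier, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the 70/30 split ratio is arbitrary, and the model is a one-variable least-squares line. The helper names `fit_line` and `rmse` are hypothetical.

```python
import math
import random


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, points):
    """Root mean squared error of the model's predictions on labeled points."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(points))


rng = random.Random(42)
# Synthetic labeled data (an assumption): y is roughly 2x + 1 plus noise.
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(50)]

# Hold out 30% of the labeled data; it is never seen during training.
rng.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

model = fit_line(train)
train_rmse = rmse(model, train)  # error on data the model was fitted to
test_rmse = rmse(model, test)    # error on the held-out data

# Scoring: apply the same model to new data whose label is not yet known.
new_x = [60, 75]
scores = [model[0] * x + model[1] for x in new_x]
print(train_rmse, test_rmse, scores)
```

A test error that is much larger than the training error is the usual sign that the model has overfitted the training data.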
Know when and how to inspect Correlations between attributes. They provide useful information for Data Understanding and Feature Selection. There are many ways to get this information with visualizations and models. The simplest way to get it is the Correlation Matrix operator.
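To make the idea concrete, here is a minimal sketch of a correlation matrix in plain Python. It assumes Pearson correlation is the measure of interest and uses hypothetical attribute names and values; it is not a description of how any particular operator computes its result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of attribute name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attributes, one list of values per attribute.
columns = {
    "temperature": [10, 15, 20, 25, 30],
    "ice_cream_sales": [20, 31, 39, 52, 60],
    "heating_cost": [90, 70, 50, 30, 10],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Pairs with a correlation close to 1 or -1 carry largely redundant information, which is why the matrix is useful for Feature Selection.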
Understand, be able to use, and interpret the results of Feature Importance or Weighting. It can be used with some, but not all, model types. There are different ways to weight the attributes, and care is needed to select a good technique for a given problem. Some of the most common choices are the Weight by Relief, Weight by Information Gain, and Weight by Correlation operators.
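As an example of one weighting scheme, information gain can be computed by hand for nominal attributes. The sketch below uses the textbook definition (label entropy minus the weighted entropy after splitting on the attribute) and a hypothetical toy data set; whether a given tool normalizes or otherwise adjusts these weights is not assumed here.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on a nominal attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two nominal attributes and a nominal label.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

weights = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
print(weights)
```

In this toy data `outlook` separates the label perfectly and so receives the full label entropy as its weight, while `windy` tells us nothing about the label and receives a weight of zero.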
Understand and be able to implement Clustering. It is an important aspect of unsupervised learning and there are many ways to approach it. One of the most common and versatile methods is k-Means. k-Means requires knowing the number of desired clusters, but it is often difficult to know the desired number of clusters in advance. It often makes sense to perform the clustering, and then visually inspect the results with scatter plots and other visualizations, then repeat with some different values of k or clustering techniques. The X-Means operator performs k-Means with many values of k and uses a heuristic to pick one with a good balance of precision and complexity. It can often provide a good starting point.
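The k-Means loop and the "try several values of k" workflow can be sketched as follows. This is a minimal sketch that assumes Euclidean distance, random initial centroids, and a fixed iteration count. The within-cluster sum of squares is used here only as a simple way to compare values of k; it is not the heuristic X-Means uses. The function names and the points are hypothetical.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Basic k-Means: returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


def within_cluster_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# Repeat the clustering for several values of k and compare the results.
wss_by_k = {}
for k in (1, 2, 3):
    centroids_k, assignments_k = k_means(points, k)
    wss_by_k[k] = within_cluster_ss(points, centroids_k, assignments_k)

centroids, assignments = k_means(points, 2)
print(wss_by_k, assignments)
```

The error always shrinks as k grows, so the useful signal is where the improvement levels off, which is also why a visual check of the clusters remains worthwhile.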
Be able to perform Association Analysis. It can be thought of as a three-step process:
Data Prep varies widely, because different item set methods accept a variety of data structures.
Identifying frequent item sets is commonly done with the FP-Growth operator.
The Create Association Rules operator is used for rule generation. It’s important to understand each of the different criteria including support, confidence, and lift.
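The three criteria named above have simple definitions that can be checked by hand. The sketch below computes them by brute force over a hypothetical list of transactions. It is not FP-Growth, which exists precisely to avoid this kind of repeated scanning on large data.

```python
# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(premise, conclusion, transactions):
    """Of the transactions containing the premise, the share that also contain the conclusion."""
    return support(premise | conclusion, transactions) / support(premise, transactions)


def lift(premise, conclusion, transactions):
    """Confidence relative to how often the conclusion occurs overall."""
    return confidence(premise, conclusion, transactions) / support(conclusion, transactions)


premise, conclusion = {"bread"}, {"milk"}
rule_support = support(premise | conclusion, transactions)   # 3 of 5 transactions
rule_confidence = confidence(premise, conclusion, transactions)
rule_lift = lift(premise, conclusion, transactions)
print(rule_support, rule_confidence, rule_lift)
```

For the rule bread → milk this gives a support of 0.6, a confidence of 0.75, and a lift just below 1 (0.9375), meaning that in this toy data buying bread does not make milk more likely than it already is.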
Be able to use Auto Model effectively. It puts together many capabilities for supervised learning, clustering, correlations, and feature engineering. It uses many best practices to help make sure that predictive models go through a correct validation process. Then the model can be used with the simulator, or for scoring with new data.