Be able to use and understand Regular Expressions. They are used frequently in RapidMiner, and are a critical part of many processes that are designed for complex situations. They are based on the Java implementation and many operators that use them have a specialized regular expression editor with testing capabilities.
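As a minimal sketch of the kind of test the regular expression editor performs, the pattern below selects attribute names by prefix and digit count. RapidMiner uses Java regex syntax, but the common constructs shown here behave the same in Python; the attribute names are invented for illustration.

```python
import re

# Hypothetical attribute names, e.g. as they might appear in the
# regular-expression option of Select Attributes.
attributes = ["sales_2021", "sales_2022", "cost_2021", "region"]

# Match any attribute starting with "sales_" followed by exactly four digits.
pattern = re.compile(r"^sales_\d{4}$")
selected = [a for a in attributes if pattern.match(a)]
print(selected)  # ['sales_2021', 'sales_2022']
```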
Be able to use and understand Macros. They act like global variables within a process and are used frequently, especially in processes designed for complex situations. They are very flexible and can be generated, set, and used in many different ways.
Understand how to implement Scripts. They are most often written in R, Python, or Groovy, though other scripting options are available. Scripts can be called with data, macros, and other objects, and can access other RapidMiner objects from the parent environment. Although scripting is rarely used, it is worthwhile to understand scripting capabilities.
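As a sketch of how a Python script plugs into a process, the Python Scripting extension calls a function named rm_main with the delivered data and uses its return value as the operator's output. In RapidMiner the argument is a pandas DataFrame; a plain list of dicts stands in here so the sketch has no external dependencies, and the column names are invented.

```python
# Sketch of the entry point the Execute Python operator expects: a function
# named rm_main that receives the input data and returns the output data.
def rm_main(data):
    # Add a derived column, standing in for real feature engineering.
    for row in data:
        row["double_value"] = row["value"] * 2
    return data

result = rm_main([{"value": 3}, {"value": 5}])
print(result)  # [{'value': 3, 'double_value': 6}, {'value': 5, 'double_value': 10}]
```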
Be able to use and understand Loops and Branches. They are common control structures in a RapidMiner Process. They are used to scale repetitive operations and make processes more general and robust. It’s important to be familiar with all of the core loop operators, the two branch operators, and Handle Exception.
Loop Attributes is one of the most common loop operators. It runs a subprocess once for each selected attribute. A macro can be used to access the relevant attribute.
Loop Examples is another common loop operator. It runs a subprocess once for each example in the provided Example Set, and the current example number or row can be accessed with a macro.
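The loop-with-macro pattern shared by Loop Attributes and Loop Examples can be sketched in plain Python: the subprocess body runs once per item, and a macro (here the variable example_index) identifies the current iteration. The example data is invented.

```python
# Conceptual sketch of Loop Examples: the subprocess runs once per example.
examples = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

results = []
for example_index, example in enumerate(examples, start=1):
    # Inside the loop, operators could reference the iteration macro,
    # much like %{example} in RapidMiner.
    results.append(f"run {example_index}: {example['id']}")
print(results)  # ['run 1: a', 'run 2: b', 'run 3: c']
```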
Collections are sets of like items that can be easily looped over.
Branching is commonly accomplished with the Branch operator for simple if-then-else splits, while Select Subprocess can be used to run one subprocess out of any number of subprocesses.
Know how to handle exceptions and errors. The Handle Exception operator always runs the Try subprocess and, in the event of an error, runs the Catch subprocess instead. It can be used to make sure that expected errors are handled gracefully. The Catch subprocess may log critical information, send alerts, use Throw Exception, perform other tasks, or silently swallow the error.
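The Try/Catch pattern maps directly onto try/except in code; this sketch mirrors it with an invented failing step, where the catch branch logs the problem instead of stopping the whole process.

```python
# Conceptual sketch of Handle Exception: the Try branch runs first; on error
# the Catch branch takes over (log, alert, re-raise, or silently swallow).
def try_subprocess():
    raise ValueError("expected failure, e.g. a missing file")

def catch_subprocess(error):
    # Log critical information and continue gracefully.
    return f"handled: {error}"

try:
    outcome = try_subprocess()
except ValueError as err:
    outcome = catch_subprocess(err)
print(outcome)  # handled: expected failure, e.g. a missing file
```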
Know how to implement logging. Logs can be created with the core logging operators, or manually by writing to the repository or a file with other operators. The most common method is to use the Log operator, which can write to the standard log or a file. The Log operator can directly log values or parameters from any of the other operators. It is often used with other operators like Provide Macro as Log Value.
Be able to implement the most common cleansing operations.
Know how to use the Sample operator. It offers different sampling methods, can balance data based on the label, and can sample by absolute count, relative fraction, or probability.
Be able to handle Missing Values. They can be handled by removing examples or attributes, or they can be replaced. The Replace Missing Values operator provides the most common techniques; however, Impute Missing Values is also available.
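One of the most common replacement strategies is mean imputation; this sketch uses None as a stand-in for RapidMiner's missing-value marker, with invented values.

```python
# Sketch of a common Replace Missing Values strategy: replace missing numeric
# values with the attribute's mean, computed over the observed values only.
values = [4.0, None, 6.0, None, 8.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)  # 6.0
filled = [mean if v is None else v for v in values]
print(filled)  # [4.0, 6.0, 6.0, 6.0, 8.0]
```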
Understand when and how to use the Normalize operator, including the different normalization techniques. It is a data transformation that is often completed as a part of the validated modeling process but can still be considered a common data cleansing task.
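Two of the standard normalization techniques can be sketched directly: Z-transformation (center to mean 0, scale to unit standard deviation) and range transformation to [0, 1]. The values are invented; this illustrates the arithmetic, not the operator's exact implementation details.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]

# Z-transformation: subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]

# Range transformation: rescale linearly into [0, 1].
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.333..., 0.666..., 1.0]
```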
Know when and how to use the different Discretize operators. Binning, or discretization, takes numeric attributes and generates nominal attributes in their place.
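Equal-width binning, the idea behind Discretize by Binning, can be sketched in a few lines: split the attribute's range into fixed-width intervals and replace each value with a nominal bin label. Values and label names are invented.

```python
# Sketch of equal-width binning: numeric values become nominal bin labels.
values = [1.0, 3.0, 5.0, 7.0, 9.0]
num_bins = 2

lo, hi = min(values), max(values)
width = (hi - lo) / num_bins

def bin_label(v):
    # Clamp the top edge so the maximum falls in the last bin.
    index = min(int((v - lo) / width), num_bins - 1)
    return f"range{index + 1}"

labels = [bin_label(v) for v in values]
print(labels)  # ['range1', 'range1', 'range2', 'range2', 'range2']
```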
Understand how to inspect and handle Columns with Many Values. If they are used in modeling they may lead to overfitting. They may be ID-like and inappropriate for modeling. They might be identified and then removed with the Select Attributes operator, or in one step with the Remove Useless Attributes operator. Alternatively, they may provide useful information after transformations are applied, possibly with the Generate Attributes operator.
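The ID-like check can be sketched as a distinct-value ratio: an attribute whose number of distinct values is close to the number of examples carries no generalizable signal. The column names, data, and 0.9 threshold are invented for illustration.

```python
# Sketch of spotting ID-like columns: flag attributes whose distinct-value
# count is nearly equal to the number of examples.
columns = {
    "customer_id": ["c1", "c2", "c3", "c4"],
    "segment": ["a", "a", "b", "b"],
}
num_examples = 4

id_like = [name for name, vals in columns.items()
           if len(set(vals)) / num_examples > 0.9]
print(id_like)  # ['customer_id']
```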
Be able to Remove Duplicate examples. They are sometimes valid, but often simply need to be removed with the Remove Duplicates operator.
Understand and be able to use different approaches for Outlier Detection. This can be a complex but important step. Sometimes the outlier points are of interest, sometimes they are a valid observation, and sometimes they need to be removed. There are many techniques to identify them depending on the situation. There are some core operators, and there are more in the Anomaly Detection Extension. Many of the techniques are sensitive to different scales and require normalized data. Some of the most common techniques are: Distances, Densities, Local Outlier Factor, Connectivity based Outlier Factor, and Histogram-based Outlier Score.
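A minimal distance-based score illustrates the family these operators belong to: rank each point by its distance to its k-th nearest neighbor, with larger distances suggesting outliers. The points are invented, and real operators such as Local Outlier Factor refine this idea considerably; note how scale-sensitive the raw distances are, which is why normalization usually comes first.

```python
# Sketch of a distance-based outlier score: distance to the k-th nearest
# neighbor. One point is placed far from the cluster on purpose.
points = [(1.0, 1.0), (1.5, 1.0), (1.2, 1.3), (8.0, 8.0)]
k = 2

def knn_distance(p, others, k):
    dists = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in others if q is not p)
    return dists[k - 1]

scores = [knn_distance(p, points, k) for p in points]
outlier = points[scores.index(max(scores))]
print(outlier)  # (8.0, 8.0)
```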
Know when and how to perform Dimensionality Reduction. It is usually used to reduce the number of attributes, decrease the overall size of the data, and improve computational efficiency; occasionally it can also improve model performance. The most common operators are Principal Component Analysis (PCA) and Singular Value Decomposition.
Understand and be able to use the Text Processing extension for common tasks. It has many tools for handling unstructured text. It’s important to understand how the Process Documents operators work with document objects in the nested subprocesses, and some of the common operators that work on documents including Tokenize, Filter Stopwords, Stemming, and Generate n-Grams.
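The document pipeline inside Process Documents can be sketched step by step: Tokenize splits raw text into words, Filter Stopwords drops common low-information words, and Generate n-Grams joins neighboring tokens. The sentence and the tiny stopword list are invented stand-ins.

```python
import re

document = "the quick brown fox jumps over the lazy dog"
stopwords = {"the", "over", "a", "an"}  # stand-in stopword list

# Tokenize: split into lowercase word tokens.
tokens = re.findall(r"[a-z]+", document.lower())
# Filter Stopwords: drop common low-information words.
tokens = [t for t in tokens if t not in stopwords]
# Generate n-Grams: join adjacent tokens into bigrams.
bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
print(bigrams)  # ['quick_brown', 'brown_fox', 'fox_jumps', 'jumps_lazy', 'lazy_dog']
```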
Understand and be able to use the Web Mining extension for common tasks. It provides a small but flexible set of tools for accessing text on web pages or web APIs. Some of the most important operators include Process Documents from Web, Read RSS Feed, and Enrich Data by Webservice.