Understand how to access data from RapidMiner.
Understand and be able to Import data. Importing creates a static copy of the data and stores it in the RapidMiner Repository as an ExampleSet. You can Import from File or Import from Database.
Know what Processes are and how to use them. They are important for any repeatable task. Most processes will load the data into memory, regardless of whether the data resides in the repository, database, file, or another source. The process may include retrieve operators that load ExampleSets or other RapidMiner objects from the repository. The process may also include read operators that read data directly from the source.
Understand when Wizards can be helpful. They are commonly used the first time you perform a task. There is a wizard for importing data, and there are also wizards in some of the operators you may use in a process like Read CSV and Read Excel.
Be able to bring data into Auto Model. It requires data that is already in the repository and ready for modeling. A common way to prepare for Auto Model is to use Turbo Prep.
Be able to use Turbo Prep. It can load data from the repository, access the import data wizard to add data to the repository and Turbo Prep, or accept data from a process. Turbo Prep can be used on its own for one-time tasks, or it can be used as part of building repeatable processes.
Understand and be able to perform Basic Data Transformations with process design and with Turbo Prep.
Know how to rename attributes. The Rename operator can be used to rename one or more attributes of an ExampleSet. The Rename by Replacing operator can be used in some situations to apply a renaming rule across many attributes.
Know how to perform Filtering. The Filter Examples operator selects which Examples of an ExampleSet are kept and which Examples are removed. Important settings include invert filter, and Match all or Match any. Filter Example Range works by row index instead of values.
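RapidMiner operators are configured in the GUI rather than written as code, so the following plain-Python sketch is only an analogy for the idea behind a renaming rule. The helper name `rename_by_replacing` and the representation of an ExampleSet as a list of dicts are assumptions made for illustration.

```python
import re

def rename_by_replacing(rows, pattern, replacement):
    """Apply one regex-based renaming rule to every attribute name."""
    return [
        {re.sub(pattern, replacement, name): value for name, value in row.items()}
        for row in rows
    ]

rows = [{"sales 2022": 10, "sales 2023": 12, "region": "north"}]

# Replace whitespace in all attribute names with an underscore.
renamed = rename_by_replacing(rows, r"\s+", "_")
print(renamed)
```

One rule covers every matching attribute, which is the situation where a rule-based rename saves effort over renaming attributes one at a time.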
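The interaction of Match all, Match any, and invert filter can be sketched in plain Python. This is an illustration of the logic only, not RapidMiner code; `filter_examples` and the list-of-dicts data model are hypothetical.

```python
def filter_examples(rows, conditions, match_all=True, invert=False):
    """Keep rows whose conditions match; optionally invert the result."""
    combine = all if match_all else any
    kept = []
    for row in rows:
        matches = combine(condition(row) for condition in conditions)
        if matches != invert:  # invert flips which rows are kept
            kept.append(row)
    return kept

rows = [
    {"age": 25, "city": "Boston"},
    {"age": 40, "city": "Boston"},
    {"age": 40, "city": "Denver"},
    {"age": 22, "city": "Denver"},
]
conditions = [lambda r: r["age"] > 30, lambda r: r["city"] == "Boston"]

both = filter_examples(rows, conditions)                                   # match all
either = filter_examples(rows, conditions, match_all=False)                # match any
neither = filter_examples(rows, conditions, match_all=False, invert=True)  # inverted

# Filtering by row position ignores the values entirely.
first_two = rows[0:2]
```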
Know how to perform Attribute Selection. The Select Attributes operator has several filter types that provide different methods for keeping or removing attributes. The three most common are single, subset, and regular expression. Key settings include invert selection and include special attributes. There are also other useful operators including Remove Useless Attributes and Remove Attribute Range.
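As a rough analogy, the single, subset, and regular expression filter types and the invert selection setting could be modeled in Python as below. The function `select_attributes` is hypothetical, special attributes are not modeled, and matching the regular expression against the whole attribute name is an assumption.

```python
import re

def select_attributes(rows, names=None, regex=None, invert=False):
    """Keep attributes chosen by name list or by regex; invert swaps keep/remove."""
    def selected(name):
        if names is not None:
            hit = name in names
        else:
            hit = re.fullmatch(regex, name) is not None
        return hit != invert
    return [{n: v for n, v in row.items() if selected(n)} for row in rows]

rows = [{"id": 1, "age": 30, "income": 50000, "q1": 3, "q2": 5}]

single = select_attributes(rows, names=["age"])
subset = select_attributes(rows, names=["age", "income"])
by_regex = select_attributes(rows, regex=r"q\d+")
inverted = select_attributes(rows, regex=r"q\d+", invert=True)
```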
Understand and be able to perform Type Conversion. There are many type conversion operators. Some of them, such as Numerical to Polynominal, are easy to use. Others require extra care. With Nominal to Numerical, the key setting is the coding type.
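To see why the coding type matters, compare two common ways of turning a nominal attribute into numbers: one integer per category, and one 0/1 attribute per category (dummy coding). The sketch below is plain Python for illustration; the helper names, the order in which integers are assigned, and the generated attribute names are assumptions and may differ from what the operator produces.

```python
def unique_integers(values):
    """Assign each distinct category an integer in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values]

def dummy_coding(name, values):
    """Create one 0/1 attribute per category."""
    categories = sorted(set(values))
    return [
        {f"{name} = {category}": int(value == category) for category in categories}
        for value in values
    ]

colors = ["red", "green", "red", "blue"]

as_integers = unique_integers(colors)
as_dummies = dummy_coding("color", colors)
```

Integer codes imply an ordering between categories that usually does not exist, while dummy coding avoids that at the cost of more attributes.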
Understand the use of Roles. There are some predefined attribute roles such as id and label that are very important and commonly used. The Set Role operator can be used to manually set attribute roles to predefined or custom values.
Know how to Generate Columns. The Generate Attributes operator is extremely versatile. It can overwrite existing attributes, or construct new attributes using mathematical expressions.
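The two uses named above, constructing a new attribute and overwriting an existing one, can be illustrated with a small Python sketch. It is an analogy only: `generate_attributes` is a hypothetical helper, and Python lambdas stand in for the operator's expression language.

```python
def generate_attributes(rows, definitions):
    """Evaluate each expression per row, adding or overwriting attributes."""
    result = []
    for row in rows:
        new_row = dict(row)
        for name, expression in definitions.items():
            new_row[name] = expression(new_row)
        result.append(new_row)
    return result

rows = [{"price": 10.0, "quantity": 3}, {"price": 4.0, "quantity": 5}]

generated = generate_attributes(rows, {
    "revenue": lambda r: r["price"] * r["quantity"],  # new attribute
    "price": lambda r: round(r["price"] * 1.1, 2),    # overwrites an existing one
})
```

In this sketch the definitions are evaluated in order, so `revenue` is computed from the original price before the price is overwritten.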
Understand and be able to use the Generate Aggregation operator. It is useful for applying a given aggregation method to a set of attributes.
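Aggregating across attributes means combining several values within each example into one new value. A minimal Python illustration of that idea follows; `generate_aggregation` is a hypothetical helper, not a RapidMiner API.

```python
def generate_aggregation(rows, new_name, attributes, func):
    """Add one attribute per row by aggregating over the listed attributes."""
    return [
        {**row, new_name: func([row[a] for a in attributes])}
        for row in rows
    ]

rows = [{"q1": 2, "q2": 4, "q3": 6}, {"q1": 1, "q2": 1, "q3": 1}]

totals = generate_aggregation(rows, "total", ["q1", "q2", "q3"], sum)
```

Note the direction: this works across the columns of each row, whereas grouping-style aggregation works down the rows of each column.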
Be able to perform Value Replacements. They are often accomplished with either of two operators. The Map operator takes a list of existing values and the new values that replace them. The Replace operator does not take a list of values, but is designed to find values based on a regular expression.
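The difference between the two approaches, an explicit list of values versus a pattern, can be sketched in Python. Both helpers are hypothetical stand-ins for illustration.

```python
import re

def map_values(values, mapping):
    """Replace values that appear in an explicit old -> new list."""
    return [mapping.get(value, value) for value in values]

def replace_values(values, pattern, replacement):
    """Replace whatever part of each value matches a regular expression."""
    return [re.sub(pattern, replacement, value) for value in values]

states = ["NY", "N.Y.", "CA", "Calif."]
mapped = map_values(states, {"N.Y.": "NY", "Calif.": "CA"})

phones = ["555-1234", "555 9876"]
cleaned = replace_values(phones, r"[\s-]", "")
```

A list works well when the set of bad values is small and known; a pattern works better when the variations follow a rule.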
Understand how to work with Multiple Data Sets with process design and Turbo Prep.
Be able to Join Data Sets. The Join operator supports inner, left, right, and outer joins. It can join on either the id attributes or a specified list of key attributes.
Be able to append rows. The Append operator is designed to work with ExampleSets that have an exact match of attributes.
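The four join types differ only in which keys survive. The Python sketch below illustrates this under simplifying assumptions: a single key attribute, unique keys within each data set, no name clashes among the other attributes, and `None` standing in for a missing value. The `join` helper is hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on one key attribute."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    left_attrs = [a for a in left[0] if a != key]
    right_attrs = [a for a in right[0] if a != key]

    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    elif how == "outer":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"unknown join type: {how}")

    result = []
    for k in keys:
        row = {key: k}
        for a in left_attrs:
            row[a] = left_by_key.get(k, {}).get(a)   # None when unmatched
        for a in right_attrs:
            row[a] = right_by_key.get(k, {}).get(a)  # None when unmatched
        result.append(row)
    return result

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "total": 30}, {"id": 3, "total": 5}]

inner = join(customers, orders, "id", "inner")
left_join = join(customers, orders, "id", "left")
right_join = join(customers, orders, "id", "right")
outer = join(customers, orders, "id", "outer")
```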
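The requirement that attributes match exactly can be made concrete with a short Python sketch. It is an analogy only; the `append` helper is hypothetical, assumes non-empty inputs, and compares attribute names but not types.

```python
def append(*example_sets):
    """Stack rows from several data sets that share the same attributes."""
    expected = set(example_sets[0][0])
    result = []
    for example_set in example_sets:
        for row in example_set:
            if set(row) != expected:
                raise ValueError(f"attributes {sorted(row)} do not match {sorted(expected)}")
            result.append(row)
    return result

january = [{"id": 1, "amount": 10}]
february = [{"id": 2, "amount": 20}, {"id": 3, "amount": 5}]

combined = append(january, february)
```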
Be able to use Set Operators. These include Set Minus, Intersect, and Union.
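The idea behind Set Minus and Intersect can be sketched in Python as below. This assumes examples are matched on an id attribute, which is a simplification made here for illustration; the helper names are hypothetical, and Union is left out because its handling of differing attributes is not described above.

```python
def set_minus(first, second, id_attr="id"):
    """Rows of the first set whose id does not occur in the second."""
    second_ids = {row[id_attr] for row in second}
    return [row for row in first if row[id_attr] not in second_ids]

def intersect(first, second, id_attr="id"):
    """Rows of the first set whose id also occurs in the second."""
    second_ids = {row[id_attr] for row in second}
    return [row for row in first if row[id_attr] in second_ids]

all_customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
contacted = [{"id": 2}, {"id": 3}]

not_contacted = set_minus(all_customers, contacted)
already_contacted = intersect(all_customers, contacted)
```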
Understand how and when to Pivot and Aggregate data with process design and Turbo Prep.
Be able to Aggregate Data. The Aggregate operator takes a list of aggregation attributes and their associated aggregation functions. Aggregations are computed for each group defined by the values of the group-by attributes. This is similar to aggregation in SQL.
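A grouped aggregation can be sketched in plain Python to show how the pieces fit together: group-by attributes define the groups, and each (attribute, function) pair yields one output column. The `aggregate` helper, the function names in `FUNCTIONS`, and the output column naming are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

FUNCTIONS = {"sum": sum, "average": mean, "count": len, "maximum": max, "minimum": min}

def aggregate(rows, group_by, aggregations):
    """Compute each (attribute, function) pair once per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[g] for g in group_by)].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for attribute, function in aggregations:
            values = [member[attribute] for member in members]
            out[f"{function}({attribute})"] = FUNCTIONS[function](values)
        result.append(out)
    return result

sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 30},
]

# Comparable to: SELECT region, SUM(amount), AVG(amount) FROM sales GROUP BY region
summary = aggregate(sales, ["region"], [("amount", "sum"), ("amount", "average")])
```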
Be able to Pivot or Unstack Data. The Pivot operator can do this and is often used to aggregate the data in the same step.
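Pivoting moves the values of one attribute into column headers, and aggregation is needed whenever several rows land in the same cell. The Python sketch below illustrates that combination; the `pivot` helper is hypothetical, and `None` marks cells with no data.

```python
from collections import defaultdict

def pivot(rows, group_by, column_attr, value_attr, func=sum):
    """Turn values of column_attr into columns, aggregating value_attr per cell."""
    cells = defaultdict(lambda: defaultdict(list))
    columns = []
    for row in rows:
        cells[row[group_by]][row[column_attr]].append(row[value_attr])
        if row[column_attr] not in columns:
            columns.append(row[column_attr])

    return [
        {group_by: group,
         **{c: (func(by_column[c]) if c in by_column else None) for c in columns}}
        for group, by_column in cells.items()
    ]

sales = [
    {"region": "north", "quarter": "Q1", "amount": 10},
    {"region": "north", "quarter": "Q1", "amount": 5},
    {"region": "north", "quarter": "Q2", "amount": 7},
    {"region": "south", "quarter": "Q1", "amount": 3},
]

wide = pivot(sales, "region", "quarter", "amount")
```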
Be able to De-Pivot, Unpivot, or Stack Data. The De-Pivot operator can do this. It uses regular expressions to determine which attributes should become rows.
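The role of the regular expression, picking which attributes are stacked into rows, can be shown with a small Python analogy. The `de_pivot` helper and its parameter names are hypothetical, and matching against the whole attribute name is an assumption.

```python
import re

def de_pivot(rows, attribute_regex, index_name, value_name):
    """Stack attributes matching the regex into (name, value) rows."""
    long_rows = []
    for row in rows:
        fixed = {n: v for n, v in row.items() if not re.fullmatch(attribute_regex, n)}
        for name, value in row.items():
            if re.fullmatch(attribute_regex, name):
                long_rows.append({**fixed, index_name: name, value_name: value})
    return long_rows

wide = [{"region": "north", "Q1": 15, "Q2": 7}]

long_format = de_pivot(wide, r"Q\d", "quarter", "amount")
```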
Be able to Transpose Data. The Transpose operator performs a simple transpose, much like a matrix transpose of the data.
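For readers who think in code, the matrix analogy looks like this in plain Python. It shows only the row and column swap; how attribute names and ids are carried over by the operator is not modeled.

```python
def transpose(table):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*table)]

table = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = transpose(table)
```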
Understand and be able to use Routines. There are several techniques to encapsulate and reuse functionality. They can help you build standardized processes that are easy to read and maintain.
Understand and be able to use Subprocesses. The Subprocess operator introduces a process within a process. The functionality within a subprocess is not shared with any other process.
Be able to Execute a process from another process. The Execute Process operator embeds a process from a repository location. This means the embedded process is maintained in one place but can be called from many processes.
Be able to create, use, and manage Building Blocks. They can be created for common process fragments or collections of operators that work as a unit. First the process fragment is collected into a single Subprocess, then the Subprocess is saved as a Building Block, which can be copied into other processes.
Know the basics of Text Processing in RapidMiner. The Text Vectorization operator can be used as a simple way to create attributes for each word using TF-IDF. This can turn unstructured text into attributes that are usable for modeling.
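To make the idea of one attribute per word concrete, here is a minimal TF-IDF sketch in plain Python. It uses one common weighting variant (term frequency times the natural log of inverse document frequency) with whitespace tokenization; the operator's exact formula and tokenization may differ, and the `tfidf` helper is hypothetical. It assumes every document contains at least one word.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {word: weight} dict per document over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        })
    return vectors

docs = ["good movie", "bad movie", "good good plot"]
vectors = tfidf(docs)
```

Each document becomes a row with the same set of numeric attributes, which is what makes the text usable as model input.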
Know how and when to use Turbo Prep and Process Design. Turbo Prep provides the ability to work directly on the data, and then CREATE PROCESS, or ADD TO PROCESS. That enables you to keep a repeatable process for the future. It’s worthwhile to learn to work in Turbo Prep because it’s expressive and often provides a quick way to work.