Operational engines

ETL

Knowage allows the upload of data from source systems according to a common ETL logic, as well as the monitoring of the data flows that continuously feed the data warehouse. To this end, Knowage provides the Knowage Talend Engine.

Important

Enterprise Edition only

Please note that in the Enterprise Edition, KnowageTalendEngine is shipped with KnowageBD and KnowageSI only.

KnowageTalendEngine

Knowage Talend Engine integrates the open source tool Talend Open Studio (TOS). TOS is a graphical designer for defining ETL flows. Each designed flow produces a standalone Java or Perl package. TOS is based on Eclipse and offers a complete environment including test, debug and support for documentation.

The integration between Talend and Knowage is twofold. TOS includes Knowage as a possible execution target for its jobs, while Knowage implements the standard context interface for communicating with Talend. Jobs can be executed directly from the Knowage web interface or scheduled.

Furthermore, the analytical model for monitoring ETL flows can be applied to the analysis of audit and monitoring data generated by a running job. Note that this is not a standard functionality of the Knowage suite, but it can be easily implemented within a Knowage project.

To create an ETL document, you should perform the following steps:

  • Job design (on Talend);
  • Job deploy;
  • Template building;
  • Analytical document building;
  • Job execution.

In the remainder of the section, we discuss in detail all steps by providing examples.

Job design

The job is designed directly using Talend.

Designing an ETL job requires selecting the proper components from the Talend tool palette and connecting them to obtain the logic of the ETL flow. Talend maps to appropriate metadata both the structure of any RDBMS and the structure of any possible flow (e.g., TXT, XLS, XML) acting as input or output in the designed ETL.

To design the ETL, several tools are available: from interaction with most RDBMS engines (both proprietary and open source) to support for different file formats; from logging and auditing features to support for several communication and transport protocols (FTP, POP, code, mail); from ETL process control management to data transformation functionalities.

Talend also supports data quality management. Furthermore, it enables the execution of external processes and can interact with other applications, e.g., open source CRM applications, OLAP and reporting tools.

The tMap tool allows the association of sources to targets according to defined rules. The main input is the source table in the data warehouse, while secondary (or lookup) inputs are dimensions to be linked to data. The output (target) is the data structure used for aggregation.
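To make the main/lookup distinction concrete, the following is a minimal plain-Java sketch of the kind of join and aggregation described above. It is not code generated by Talend and does not use any Talend API; the Sale type, the product ids and the family names are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LookupJoinSketch {

    /** One row of the main input: a sale that references a product by id (hypothetical type). */
    public static final class Sale {
        final int productId;
        final double amount;

        public Sale(int productId, double amount) {
            this.productId = productId;
            this.amount = amount;
        }
    }

    /**
     * Joins each sale with the lookup (product id -> product family) and
     * sums the amounts per family. Sales without a match in the lookup are skipped.
     */
    public static Map<String, Double> totalByFamily(List<Sale> sales,
                                                    Map<Integer, String> familyByProduct) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sale sale : sales) {
            String family = familyByProduct.get(sale.productId);
            if (family != null) {
                totals.merge(family, sale.amount, Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Lookup input: a small dimension table (hypothetical values)
        Map<Integer, String> familyByProduct = new HashMap<>();
        familyByProduct.put(1, "Food");
        familyByProduct.put(2, "Drink");

        // Main input: the rows to be linked to the dimension
        List<Sale> sales = Arrays.asList(
                new Sale(1, 10.0), new Sale(2, 5.0), new Sale(1, 2.5), new Sale(9, 1.0));

        System.out.println(totalByFamily(sales, familyByProduct));
    }
}
```

The sketch behaves like an inner join: the sale with product id 9 has no matching dimension row and does not contribute to any total.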

It is also possible to design parametric ETL jobs. We will see how to manage them in the next steps.

Once you have designed the ETL job, you should deploy it on Knowage Server. First of all, configure the connection properties to Knowage Server. Select Preferences > Talend > Import/export from within Talend. Then set the connection options as described below.

Table 10 Connection Settings.
Parameter          Value
Engine name        KnowageTalendEngine
Short description  Logical name that will be used by Talend
Host               Host name or IP address of the connection URL to Knowage
Port               Port of the connection URL to Knowage
Password           Password of the user that will perform the deploy

Once you have set the connection, you can right click on a job and select Deploy on Knowage. This will produce the Java code implementing the ETL and copy the corresponding jar file to resources/talend/RuntimeRepository/java/[Talend project name] on Knowage Server. It is possible to deploy multiple jobs at the same time. The exported code is consistent and self-contained. It may include libraries referenced by ETL operators and default values of job parameters, for each context defined in the job. On its side, Knowage manages Talend jobs from an internal repository at resources/talend/RuntimeRepository, under the installation root folder.
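After a deploy, it can be handy to verify that the jar actually landed in the repository. The sketch below lists the jar files under the project folder described above. It is only an illustration built on the standard java.nio.file API, not part of Knowage; the installation root passed on the command line and the default project name Foodmart are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeployedJobLister {

    /** Builds [root]/resources/talend/RuntimeRepository/java/[project]. */
    public static Path projectDir(Path installationRoot, String talendProject) {
        return installationRoot.resolve(
                Paths.get("resources", "talend", "RuntimeRepository", "java", talendProject));
    }

    /** Lists the jar files under the project folder; returns an empty list if the folder is missing. */
    public static List<Path> listDeployedJars(Path installationRoot, String talendProject) {
        Path dir = projectDir(installationRoot, talendProject);
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // args[0]: installation root folder (assumption), args[1]: Talend project name
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String project = args.length > 1 ? args[1] : "Foodmart";
        for (Path jar : listDeployedJars(root, project)) {
            System.out.println(jar);
        }
    }
}
```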

Template building

As with any other Knowage document, you need to define a template for the ETL document that you wish to create. The ETL template has a very simple structure, as shown in the example below:

Listing 27 ETL template.
<etl>
    <job project="Foodmart"
         jobName="sales_by_month_country_product_familiy"
         context="Default"
         version="0.1"
         language="java"
    />
</etl>

Where the tag job includes all the following configuration attributes:

  • project is the name of the Talend project.
  • jobName is the label assigned to the job in Talend's repository.
  • context is the name of the context grouping all job parameters. Typically it is the standard context, denoted with the name Default.
  • version is the job version.
  • language is the chosen language for code generation. The two possible options are Java and Perl.

Values in the template must be consistent with those defined in Talend, in order to ensure the proper execution of the ETL document on Knowage Server.
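Since a mismatch between the template and the Talend job only shows up at execution time, it may help to check the template before uploading it. The following sketch reads the five attributes of the job tag with the standard JDK XML parser. It is an illustration, not the way the engine itself parses templates, and it assumes that all five attributes are required.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EtlTemplateReader {

    // Attribute names taken from the template description above
    private static final String[] JOB_ATTRIBUTES =
            {"project", "jobName", "context", "version", "language"};

    /** Returns the attributes of the first job element, failing if one is missing (assumption). */
    public static Map<String, String> readJobAttributes(String templateXml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(templateXml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Template is not well-formed XML", e);
        }
        Element job = (Element) root.getElementsByTagName("job").item(0);
        if (job == null) {
            throw new IllegalArgumentException("Template has no job element");
        }
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String name : JOB_ATTRIBUTES) {
            if (!job.hasAttribute(name)) {
                throw new IllegalArgumentException("Missing attribute: " + name);
            }
            attributes.put(name, job.getAttribute(name));
        }
        return attributes;
    }

    public static void main(String[] args) {
        String template = "<etl><job project=\"Foodmart\""
                + " jobName=\"sales_by_month_country_product_familiy\""
                + " context=\"Default\" version=\"0.1\" language=\"java\"/></etl>";
        // Print each attribute so it can be compared with the values defined in Talend
        readJobAttributes(template).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```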

Creating the analytical document

Once we have created the template, we can create a new analytical document.

Before starting to create the document, it is recommended to check whether the engine is properly installed and configured. In case the engine is not visible in the Engine Configuration list (Data Providers > Engine Management), you should check that the web application is active by invoking the URL http://myhost:myport/KnowageTalendEngine.
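The same check can be scripted. The sketch below builds the engine URL from a host and a port and issues a plain HTTP GET with the JDK's HttpURLConnection. Treating any response with a status below 400 as "active" is an assumption of this example, and the host and port used in main are placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class EngineAvailabilityCheck {

    /** Builds the engine URL in the form http://host:port/KnowageTalendEngine. */
    public static String engineUrl(String host, int port) {
        return "http://" + host + ":" + port + "/KnowageTalendEngine";
    }

    /** Returns true if the URL answers with an HTTP status below 400 (assumption). */
    public static boolean isReachable(String url, int timeoutMillis) {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            int status = connection.getResponseCode();
            connection.disconnect();
            return status < 400;
        } catch (IOException e) {
            // Malformed URL, refused connection, timeout, ...
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder host and port: replace with those of your Knowage Server
        String url = engineUrl("myhost", 8080);
        System.out.println(url + " reachable: " + isReachable(url, 3000));
    }
}
```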

Now you can create the analytical document on the Server, following the standard procedure. The template for this document is the one we have just created. If the job has parameters, they should be associated with the corresponding analytical drivers, as usual. In other words, you have to create an analytical driver for each context variable defined in the Talend job.

Job execution

A Talend job can be executed directly from the web interface of Knowage Server and, of course, from a Talend client. To execute the job on Knowage, click on the document icon in the document browser, as with any other analytical document. The execution page will show a message informing you that the process has started.

Job scheduling

Most often it is useful to schedule the execution of ETL jobs instead of directly running them. You can rely on Knowage scheduling functionality to plan the execution of Talend jobs. While defining a scheduled execution, you can set a notification option which will send an email to a set of recipients or a mailing list once the job has completed its execution. To enable this option, check the flag Send Mail.

External processes

Knowage supports the execution of processes that are external to its own activity. When analyzing data, for example through the real time console, it may be useful to perform activities such as sending notification emails or taking actions on the components of the monitored system (e.g., business processes, network nodes).

These products provide the KnowageProcessEngine, which supports the execution and management of external processes.

With the term process we refer to a Java instruction, however complex it may be. Processes can be executed in background or via the interface of the Console Engine. It is also possible to schedule their start and stop.

To enable the management of an external process, the following steps are required:

  • Create a Java class defining the execution logic;
  • Optionally, create a Java class defining the logic of the process, i.e., which tasks the process is supposed to perform;
  • Create a template that will be associated to the Knowage document;
  • Create the Knowage CommonJ analytical document.

In the following sections, we provide details about both class and template creation, and document creation.

Class definition

First of all, the developer should write a Java class that defines the desired logic for starting and stopping the process. In particular, this class must extend one of these two classes of the engine:

KnowageWork
In this case the class to be defined only needs to reimplement the run() method. This is the base case: the logic of the external process is contained in the run() method. Note that the code listings in this section refer to this class as SpagoBIWork.

CmdExecWork
In this case, the class to be defined calls the method execCommand() from its run() method. The logic of the external process can be delegated to an external class, which is launched by execCommand(). To stop the process, the developer is responsible for checking programmatically whether the process is still running, using the method isRunning().

Listing 28 Class template
package it.eng.spagobi.job;

import java.util.Iterator;
import java.util.Map;

import it.eng.spagobi.engines.commonj.process.SpagoBIWork;

public class CommandJob extends SpagoBIWork {

    @Override
    public boolean isDaemon() {
        return true;
    }

    @Override
    public void release() {
        System.out.println("Release!!");
        super.release();
    }

    @Override
    public void run() {
        super.run();
        System.out.println("Job started!");

        // Print the parameters received by the process
        Map parameters = getSbiParameters();
        for (Iterator iterator = parameters.keySet().iterator(); iterator.hasNext();) {
            String name = (String) iterator.next();
            Object value = parameters.get(name);
            System.out.println("Parameter " + name + " value " + value);
        }

        // Simulated work: isRunning() is checked at every step so the loop can end early
        for (int i = 0; i < 50 && isRunning(); i++) {
            System.out.println("Job is running!");
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and leave the loop
                Thread.currentThread().interrupt();
                break;
            }
        }
        System.out.println("Job finished!");
    }
}

Note that the class CmdExecWork extends SpagoBIWork by providing additional methods. To better understand the difference between the two options, let us have a look at some code snippets. Listing 28 above shows a class implemented as an extension of SpagoBIWork: the logic of the process is embedded directly in its run() method. Below you can see an example extension of CmdExecWork, also called CommandJob:

Listing 29 Example extension of CmdExecWork.
package it.eng.spagobi.job;
import it.eng.spagobi.engines.commonj.process.CmdExecWork;
import java.io.IOException;

public class CommandJob extends CmdExecWork {

    @Override
    public boolean isDaemon() {
        return true;
    }

    @Override
    public void release() {
        super.release();
    }

    @Override
    public void run() {
        super.run();
        // Launch the external command only if the process has not been stopped
        if (isRunning()) {
            try {
                execCommand();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Note that this class calls the execCommand() method after checking isRunning(). No logic is directly embedded in this class. Therefore, we also define an external class, called ProcessTest, which contains the actual logic (in our example, writing a new row to a file every five seconds):

Listing 30 ProcessTest
package it.eng.test;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class ProcessTest {

    public static void main(String[] args) {
        // try-with-resources closes the file when the loop ends
        try (PrintStream output = new PrintStream(new FileOutputStream("C:/file.txt"))) {
            while (true) {
                output.println("New row");
                output.flush();
                try {
                    Thread.sleep(5000L);
                } catch (InterruptedException e) {
                    // Stop writing once the thread is interrupted
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } catch (FileNotFoundException e) {
            // The output file could not be opened: nothing to write to
            e.printStackTrace();
        }
    }
}

Now that the classes are ready, we pack them in a .jar file containing all classes and their paths. Then we copy the jar file under the resource folder of Knowage at [RESOURCE_PATH]/commonj/CommonjRepository/[JAR_NAME]. In the next section we will explain how to define the template, based on the class definition chosen above.
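The packaging step is normally done with the jar tool or with your build system. As an illustration of what "all classes and their paths" means, here is a minimal sketch based on the standard java.util.jar API: every file under a folder of compiled classes becomes a jar entry named after its relative path. The class name JarPacker and the command-line arguments are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class JarPacker {

    /** Entry name = path relative to the classes folder, always with '/' separators. */
    public static String entryName(Path classesDir, Path file) {
        return classesDir.relativize(file).toString().replace(File.separatorChar, '/');
    }

    /** Writes every regular file found under classesDir into jarFile. */
    public static void pack(Path classesDir, Path jarFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(classesDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out)) {
            for (Path file : files) {
                jar.putNextEntry(new JarEntry(entryName(classesDir, file)));
                Files.copy(file, jar);
                jar.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: JarPacker <classes folder> <output jar>");
            return;
        }
        pack(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

With this layout, the class it.eng.test.ProcessTest ends up in the entry it/eng/test/ProcessTest.class, which is what the Java launcher expects when the jar is put on the classpath.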

Template definition

As with any other Knowage document, we need to define a template for an external process document. The example below shows a template that corresponds to the classes CommandJob and ProcessTest defined in the examples above. Let us note that this template corresponds to the option of implementing an extension of CmdExecWork.

Listing 31 Template Definition
<COMMONJ>
    <WORK workName='JobTest' className='it.eng.spagobi.job.CommandJob'>
        <PARAMETERS>
            <PARAMETER name='cmd' value='C:/Programmi/Java/jdk1.5.0_16/bin/java'/>
            <PARAMETER name='classpath' value='C:/resources/commonj/CommonjRepository/JobTest/process.jar'/>
            <PARAMETER name='cmd_par' value='it.eng.test.ProcessTest'/>
            <PARAMETER name='sbi_analytical_driver' value='update'/>
            <PARAMETER name='sbi_analytical_driver' value='level'/>
        </PARAMETERS>
    </WORK>
</COMMONJ>

Where:

  • <COMMONJ> is the main tag and includes all the document.
  • The tag <WORK> specifies the process. In particular:
    • workName is the id of the process
    • className contains the name of the class implementing the process (as defined above).
  • The tag <PARAMETERS> contains all parameters. Each <PARAMETER> tag includes a parameter. Some of them are mandatory.

Table 11 CommonJ document template parameters.
Parameter              Value
cmd                    Specifies the java command that will be launched, with its complete path
classpath              Specifies the classpath containing the jar file. This path will be added to the classpath for the process to run correctly.
cmd_par                Optional. In case it is defined, its value contains the Java class that will be launched instead of the job (i.e., the extension of CmdExecWork or KnowageWork).
sbi_analytical_driver  Optional and repeatable. Each line with this attribute defines an analytical driver that should be associated with the process.

The class CmdExecWork (and its extensions) allows the execution of the command specified in the template. In particular, the template above would produce the following command at runtime:

Listing 32 Runtime command line
      C:/Programmi/Java/jdk1.5.0_16/bin/java 'it.eng.test.ProcessTest' update={val} level={val}
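To show how the template parameters relate to this command line, the sketch below assembles a string of the same shape from the cmd and cmd_par values and from the values of the analytical drivers. It is not the engine's implementation: the class and method names are hypothetical, the driver values in main are made up, and the classpath parameter is left out because it does not appear in Listing 32.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CommandLineSketch {

    /**
     * Builds "cmd 'cmd_par' driver=value ..." in the order the drivers are given.
     * cmdPar may be null, since cmd_par is optional in the template.
     */
    public static String buildCommand(String cmd, String cmdPar, Map<String, String> driverValues) {
        List<String> parts = new ArrayList<>();
        parts.add(cmd);
        if (cmdPar != null) {
            parts.add("'" + cmdPar + "'");
        }
        for (Map.Entry<String, String> driver : driverValues.entrySet()) {
            parts.add(driver.getKey() + "=" + driver.getValue());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Values chosen at execution time for the two analytical drivers (made up)
        Map<String, String> driverValues = new LinkedHashMap<>();
        driverValues.put("update", "true");
        driverValues.put("level", "2");

        System.out.println(buildCommand(
                "C:/Programmi/Java/jdk1.5.0_16/bin/java",
                "it.eng.test.ProcessTest",
                driverValues));
    }
}
```

A LinkedHashMap is used so that the drivers appear in the command in the same order as the sbi_analytical_driver lines of the template.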