Thursday, December 8, 2011

Talend – an ETL Tool


ETL is an abbreviation of the three words Extract, Transform and Load. An ETL process extracts data, mostly from different types of source systems, transforms it into a structure that is more appropriate for reporting and analysis, and finally loads it into a database and/or cube(s).


Real-life ETL cycle
Build reference data
Extract (from sources)
Validate
Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
Stage (load into staging tables, if used)
Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)
Publish (to target tables)
Archive
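
To make the cycle concrete, here is a minimal, hypothetical Java sketch of how these stages might be chained together. All class and method names (EtlCycle, extract, stage, publish, and so on) are illustrative assumptions, not code from any particular tool; a tool such as Talend would generate equivalent logic for you.

import java.util.ArrayList;
import java.util.List;

public class EtlCycle {

    record Row(String key, double amount) {}   // simplified stand-in for a source record

    public void run() {
        List<Row> raw = extract();              // extract from sources
        List<Row> valid = validate(raw);        // validate
        List<Row> shaped = transform(valid);    // clean, apply business rules, aggregate
        stage(shaped);                          // load into staging tables, if used
        publish(shaped);                        // publish to target tables
        archive(raw);                           // archive the raw extract for later reloads
    }

    List<Row> extract()                 { return new ArrayList<>(); } // read the source systems
    List<Row> validate(List<Row> rows)  { return rows; }              // reject unacceptable records
    List<Row> transform(List<Row> rows) { return rows; }              // apply business rules
    void stage(List<Row> rows)          { }                           // write to staging tables
    void publish(List<Row> rows)        { }                           // write to target tables
    void archive(List<Row> rows)        { }                           // keep raw data for audit/repair

    public static void main(String[] args) { new EtlCycle().run(); }
}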


Extract
The purpose of the extraction process is to connect to the source systems and collect the data needed for the data warehouse.
Usually data is consolidated from different source systems that may use different data organizations or formats, so the extraction must convert the data into a format suitable for transformation processing.
The complexity of the extraction process varies with the type of source data.
The extraction process also includes selection of the data, as the source usually contains redundant data or data of little interest.

For the ETL extraction to be successful, it requires an understanding of the source data layout. A good ETL tool additionally allows an intermediate version of the extracted data to be stored in a so-called "staging area", which makes it possible to reload the raw data if a later loading step fails, without re-extracting it from the source. The raw data should also be backed up and archived.
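
As a rough illustration of this step, the following Java/JDBC sketch pulls selected columns from a source table and writes them to a flat staging file, so that later stages can be re-run without touching the source again. The connection URL, credentials, table and file names are all assumptions made up for the example.

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical extraction step: select only the needed columns from a source table
// and keep a raw copy in a staging file for possible reloads.
public class ExtractToStaging {
    public static void main(String[] args) throws Exception {
        String sourceUrl = "jdbc:postgresql://source-host/sales";  // assumed source database
        Path stagingFile = Path.of("staging/orders.csv");          // assumed staging area
        Files.createDirectories(stagingFile.getParent());

        try (Connection con = DriverManager.getConnection(sourceUrl, "etl_user", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id, customer_id, amount FROM orders");
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(stagingFile))) {

            out.println("order_id,customer_id,amount");
            while (rs.next()) {
                out.printf("%d,%d,%.2f%n",
                        rs.getLong("order_id"), rs.getLong("customer_id"), rs.getDouble("amount"));
            }
        }
    }
}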

Transform
The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Some data sources require very little or even no manipulation of the data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), such as roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
Encoding free-form values (e.g., mapping "Male" to "1")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Sorting
Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
Looking up and validating the relevant data from tables or reference files for slowly changing dimensions.
Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
The transform stage of an ETL process thus involves applying a series of rules or functions to the extracted data. It includes validation of records and their rejection if they are not acceptable, as well as an integration step. The amount of manipulation needed in the transformation process depends on the data: good data sources will require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common transformation processes are conversion, removing duplicates, standardizing, filtering, sorting, translating, and looking up or verifying inconsistent data sources. A good ETL tool must enable building complex processes and extending its tool library so that custom user functions can be added.
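
The following hypothetical Java sketch illustrates a few of the transformations listed above: translating a coded value, deriving a calculated value, deduplicating on a natural key, generating surrogate keys, and raising an exception for an unknown code. It is hand-written example code, not what Talend actually generates, and all field and class names are assumptions for the example.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class TransformExamples {

    record SourceRow(String rollNo, int genderCode, int qty, double unitPrice) {}
    record TargetRow(long surrogateKey, String rollNo, String gender, double saleAmount) {}

    private static final AtomicLong KEY_SEQ = new AtomicLong(1);   // surrogate-key generator

    static List<TargetRow> transform(List<SourceRow> rows) {
        Map<String, TargetRow> byNaturalKey = new LinkedHashMap<>();  // deduplicate on roll_no
        for (SourceRow r : rows) {
            String gender = switch (r.genderCode()) {                 // translate coded value
                case 1 -> "M";
                case 2 -> "F";
                default -> throw new IllegalArgumentException("unknown gender code: " + r.genderCode());
            };
            double saleAmount = r.qty() * r.unitPrice();              // derive a calculated value
            byNaturalKey.putIfAbsent(r.rollNo(),
                    new TargetRow(KEY_SEQ.getAndIncrement(), r.rollNo(), gender, saleAmount));
        }
        return List.copyOf(byNaturalKey.values());
    }

    public static void main(String[] args) {
        List<SourceRow> src = List.of(
                new SourceRow("R001", 1, 3, 10.0),
                new SourceRow("R002", 2, 1, 25.0),
                new SourceRow("R001", 1, 3, 10.0));   // duplicate, will be dropped
        transform(src).forEach(System.out::println);
    }
}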

Load
Loading is the last stage of the ETL process; it loads the extracted and transformed data into the target repository. There are various ways in which ETL tools load the data. Some of them physically insert each record as a new row into the table of the target warehouse using a built-in SQL INSERT statement, whereas others link the extraction, transformation and loading processes for each record from the source. The loading part is usually the bottleneck of the whole process. To increase efficiency with larger volumes of data, we may need to bypass SQL and database recovery mechanisms (for example, by using bulk-load utilities) or apply an external high-performance sort, which additionally improves performance.
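
As a simple illustration of the loading step, here is a Java/JDBC sketch that inserts each transformed record with a batched SQL INSERT and a single commit. The target URL, credentials and the dw.fact_sales table are assumptions for the example; for large volumes a database-specific bulk-load utility would normally replace this approach.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class LoadFacts {

    record Fact(long storeId, String saleDate, double totalSales) {}   // saleDate as "yyyy-MM-dd"

    public static void load(List<Fact> facts) throws Exception {
        String targetUrl = "jdbc:postgresql://dw-host/warehouse";      // assumed target warehouse
        String sql = "INSERT INTO dw.fact_sales (store_id, sale_date, total_sales) VALUES (?, ?, ?)";

        try (Connection con = DriverManager.getConnection(targetUrl, "etl_user", "secret");
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);                     // one commit for the whole batch
            for (Fact f : facts) {
                ps.setLong(1, f.storeId());
                ps.setDate(2, Date.valueOf(f.saleDate()));
                ps.setDouble(3, f.totalSales());
                ps.addBatch();
            }
            ps.executeBatch();                            // send the inserts in one round trip
            con.commit();
        }
    }
}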
Talend Profile
Talend Open Studio's GUI combines a metadata repository and a graphical job designer. The metadata repository contains the definitions and configuration for each job, but not the actual data being transformed or moved.
The information in the metadata repository is used by all of the components of Talend Open Studio.
Talend can be used for:
Synchronization or replication of databases
Right-time or batch exchanges of data
ETL (Extract Transform Load) for analytics
Data migration
Complex data transformation and loading
Data quality
Best Part
Talend Open Studio operates as a code generator allowing data transformation scripts and underlying programs to be generated either in Java (since v2.0) or Perl (since v1.0).
Of course, the generated code can be customized wherever needed.