•ETL is an abbreviation of the three words Extract, Transform and Load. An ETL process extracts data, usually from several different types of source systems, transforms it into a structure better suited for reporting and analysis, and finally loads it into the target database and/or cube(s).
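To make the three stages concrete, here is a minimal sketch in Java (the language Talend itself generates); the file name sales.csv and the in-memory list standing in for a warehouse table are illustrative assumptions, not part of any real tool:

    // Minimal ETL sketch: extract rows from a (hypothetical) CSV file,
    // transform them, and load them into a stand-in target list.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read raw lines from the assumed source file.
            List<String> raw = Files.readAllLines(Paths.get("sales.csv"));

            // Transform: trim, normalize case, and skip empty lines.
            List<String> clean = new ArrayList<>();
            for (String line : raw) {
                if (!line.isBlank()) {
                    clean.add(line.trim().toUpperCase());
                }
            }

            // Load: here the "target" is just a list; a real job would
            // write to warehouse tables or a cube instead.
            List<String> warehouse = new ArrayList<>(clean);
            System.out.println("Loaded " + warehouse.size() + " rows");
        }
    }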
Real-life ETL cycle
•Build reference data
•Extract (from sources)
•Validate
•Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
•Stage (load into staging tables, if used)
•Audit reports (for example, on compliance with business rules; also, in case of failure, they help to diagnose and repair)
•Publish (to target tables)
•Archive
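One way to picture this cycle is as a single job whose steps run in the order listed. The following Java skeleton is only an outline of that idea; every method name is illustrative, chosen to mirror the list above, not any tool's actual API:

    // Skeleton of the real-life ETL cycle as one job class.
    public class EtlCycleJob {
        void buildReferenceData() { /* load lookup/code tables */ }
        void extract()            { /* pull data from source systems */ }
        void validate()           { /* reject malformed records */ }
        void transform()          { /* clean, apply business rules, aggregate */ }
        void stage()              { /* write into staging tables, if used */ }
        void audit()              { /* report rule compliance, aid diagnosis */ }
        void publish()            { /* move data to target tables */ }
        void archive()            { /* keep raw and published data for reload */ }

        public void run() {
            buildReferenceData();
            extract();
            validate();
            transform();
            stage();
            audit();
            publish();
            archive();
        }
    }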
Extract
•The purpose of the extraction process is to reach the source systems and collect the data needed for the data warehouse.
•Usually data is consolidated from different source systems that may use a different data organization or format, so the extraction must convert the data into a format suitable for transformation processing.
•The complexity of the extraction process varies and depends on the type of source data.
•The extraction process also includes selection of the data, as the source usually contains redundant data or data of little interest.
For the ETL extraction to be successful, it requires an understanding of the data layout. A good ETL tool additionally enables storage of an intermediate version of the data being extracted. This is called the "staging area" and makes it possible to reload the raw data after a later loading problem, without re-extraction. The raw data should also be backed up and archived. A sketch of this staging pattern follows.
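A minimal Java sketch of extraction with a staging area, assuming hypothetical file paths; the point is that a failed load can be retried from the staged copy without touching the source system again:

    // Sketch: copy the raw extract into a staging file first, so later
    // steps (and reloads) read only from staging. Paths are assumptions.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class ExtractToStaging {
        public static void main(String[] args) throws IOException {
            Path source  = Paths.get("source_export.csv");    // hypothetical source dump
            Path staging = Paths.get("staging/raw_copy.csv"); // intermediate raw version

            Files.createDirectories(staging.getParent());
            // Keep an untouched raw copy; it also serves as the backup/archive input.
            Files.copy(source, staging, StandardCopyOption.REPLACE_EXISTING);

            // A reload after a failed load re-reads the staged file:
            for (String line : Files.readAllLines(staging)) {
                // ... hand each raw record to the transform step ...
            }
        }
    }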
Transform
•The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (a code sketch combining several of these follows the list):
•Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), say roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all records where salary is not present (salary = null).
•Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
•Encoding free-form values (e.g., mapping "Male" to "1")
•Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
•Sorting
•Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
•Generating surrogate-key values
•Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
•Disaggregating repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
•Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
•Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data; thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may raise exceptions, for example when a code translation encounters an unknown code in the extracted data.
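As a rough illustration, the following Java sketch (Java 16+ for records and switch expressions) combines three transformations from the list: rejecting records with a null salary, translating a coded gender value, and deriving sale_amount; it also shows the unknown-code exception case. The record layout and field names are assumptions made for this example:

    // Sketch of column selection, code translation, and value derivation.
    public class TransformExamples {
        record SourceRow(String rollNo, Integer salary, int genderCode,
                         int qty, double unitPrice) {}
        record TargetRow(String rollNo, int salary, String gender,
                         double saleAmount) {}

        static TargetRow transform(SourceRow r) {
            if (r.salary() == null) return null;        // reject: salary missing
            String gender = switch (r.genderCode()) {   // translate coded value
                case 1 -> "M";
                case 2 -> "F";
                default -> throw new IllegalArgumentException(
                    "unknown gender code " + r.genderCode()); // exception case
            };
            double saleAmount = r.qty() * r.unitPrice(); // derived value
            return new TargetRow(r.rollNo(), r.salary(), gender, saleAmount);
        }
    }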
The transform stage of an ETL process applies a series of rules or functions to the extracted data. It includes validation of records and their rejection if they are not acceptable, as well as integration of data from multiple sources. The amount of manipulation needed in the transformation process depends on the data: good data sources will require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common processes used for transformation are conversion, removal of duplicates, standardization, filtering, sorting, translation, and lookup or verification when data sources are inconsistent. A good ETL tool must enable building complex processes and extending its library so that custom user functions can be added. A sketch of rule-based validation with partial rejection follows.
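A small Java sketch of such validation, assuming a single made-up rule (amounts must be non-negative) over semicolon-separated records; valid rows move on to the next step, rejected rows are kept for the audit report:

    // Sketch of rule-based validation with partial rejection.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class ValidationStep {
        public static void main(String[] args) {
            List<String> extracted = List.of("A;100", "B;-5", "C;42");
            // Illustrative rule: the amount field must be non-negative.
            Predicate<String> amountNonNegative =
                rec -> Integer.parseInt(rec.split(";")[1]) >= 0;

            List<String> accepted = new ArrayList<>();
            List<String> rejected = new ArrayList<>();
            for (String rec : extracted) {
                (amountNonNegative.test(rec) ? accepted : rejected).add(rec);
            }
            // Partial rejection: accepted rows continue, rejected rows are audited.
            System.out.println("accepted=" + accepted + " rejected=" + rejected);
        }
    }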
Load
•Loading is the last stage of the ETL process; it loads the extracted and transformed data into a target repository. There are various ways in which ETL tools load the data. Some physically insert each record as a new row into the target warehouse table using a built-in SQL INSERT statement, whereas others link the extraction, transformation, and loading processes for each record from the source. The loading part is usually the bottleneck of the whole process. To increase efficiency with larger volumes of data, we may need to bypass SQL and data-recovery mechanisms, or apply an external high-performance sort that additionally improves performance.
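One common way to attack that bottleneck is batching inserts, so rows are not sent to the database one round-trip at a time. A sketch using plain JDBC follows; the connection URL, credentials, and sales table are assumptions for illustration:

    // Sketch of the load step using JDBC batch inserts.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class LoadStep {
        public static void load(List<double[]> rows) throws SQLException {
            String url = "jdbc:postgresql://localhost/warehouse"; // hypothetical target
            try (Connection con = DriverManager.getConnection(url, "etl", "secret");
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO sales (qty, unit_price) VALUES (?, ?)")) {
                con.setAutoCommit(false);           // commit once per batch
                for (double[] row : rows) {
                    ps.setDouble(1, row[0]);
                    ps.setDouble(2, row[1]);
                    ps.addBatch();                  // queue instead of a round-trip
                }
                ps.executeBatch();                  // send all queued rows at once
                con.commit();
            }
        }
    }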
Talend Profile
•Its GUI consists of a metadata repository and a graphical designer. The metadata repository contains the definitions and configuration for each job, but not the actual data being transformed or moved.
•The information in the metadata repository is used by all of the components of Talend Open Studio.
Talend can be used for
•Synchronization or replication of databases
•Right-time or batch exchanges of data
•ETL (Extract Transform Load) for analytics
•Data migration
•Complex data transformation and loading
•Data quality
Best Part
•Talend Open Studio operates as a code generator, allowing data transformation scripts and the underlying programs to be generated in either Java (since v2.0) or Perl (since v1.0).
•Of course, we can customize the generated code wherever needed.