Sunday 24 January 2016

DataStage

Extraction, Transformation & Loading

In computing, Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:
  • Extracts data from homogeneous or heterogeneous data sources
  • Transforms the data for storing it in the proper format or structure for the purposes of querying and analysis
  • Loads it into the final target (a database; more specifically an operational data store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a transformation process runs while data is still being pulled, processing the records already received and preparing them for loading; as soon as some data is ready to be loaded into the target, loading kicks off without waiting for the previous phases to complete.
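The pipelined behaviour described above can be illustrated with a small sketch (illustrative only, not DataStage code): three threads connected by queues, so that loading starts as soon as the first transformed rows arrive. The source, target and cleansing rule are made up for the example.

import queue
import threading

source_rows = [{"id": i, "name": f" user{i} "} for i in range(10)]  # stand-in source
target_rows = []                                                    # stand-in target

extract_q = queue.Queue()
load_q = queue.Queue()
DONE = object()  # sentinel marking the end of the stream

def extract():
    for row in source_rows:              # pull rows from the source one by one
        extract_q.put(row)
    extract_q.put(DONE)

def transform():
    while (row := extract_q.get()) is not DONE:
        row["name"] = row["name"].strip().upper()   # simple cleansing rule
        load_q.put(row)
    load_q.put(DONE)

def load():
    while (row := load_q.get()) is not DONE:
        target_rows.append(row)          # write to the target as rows arrive

threads = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in threads: t.start()
for t in threads: t.join()
print(len(target_rows), "rows loaded")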

ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
The ETL process consists of the following steps:
  1. Initiation
  2. Build reference data
  3. Extract from sources
  4. Validate
  5. Transform
  6. Load into staging tables
  7. Audit reports
  8. Publish
  9. Archive
  10. Clean up

How do the ETL tools work?

  • The first task is data extraction from internal or external sources. After queries are sent to the source system, the data may go straight to the database; usually, however, there is a need to monitor or enrich it first, so it goes to a staging area. Some tools automatically extract only new or changed information, so we do not have to track it ourselves. 
  • The second task is transformation, which is a broad category: converting the data into the structure required for further processing (extracted data usually keeps a structure typical of the source system), for example: 
  1. sorting data 
  2. joining or splitting data 
  3. cleansing 
  4. checking quality 
  • The third task is loading the data into a data warehouse (a small sketch of these steps follows this list).
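As a rough illustration of the transformation step (the field names and rules below are made up for the example), the following sketch cleanses, quality-checks and sorts a handful of staged rows before they would be loaded:

staged = [
    {"cust_id": "2", "name": "  Bob ", "amount": "15.50"},
    {"cust_id": "1", "name": "Alice",  "amount": "bad"},
    {"cust_id": "3", "name": "",       "amount": "7.00"},
]

def cleanse(row):
    row = dict(row)
    row["name"] = row["name"].strip().title()   # trim and normalise names
    try:
        row["amount"] = float(row["amount"])    # coerce to the target data type
    except ValueError:
        row["amount"] = None
    return row

def passes_quality(row):
    # reject rows that would violate the target structure
    return bool(row["name"]) and row["amount"] is not None

transformed = sorted(
    (r for r in map(cleanse, staged) if passes_quality(r)),
    key=lambda r: r["cust_id"],                 # sort on the business key
)
print(transformed)   # only the "Bob" row survives cleansing and quality checks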
Commercial ETL Tools:

  • Informatica - Power Center

  • IBM - WebSphere DataStage (Formerly known as Ascential DataStage)
  • SAP - BusinessObjects Data Integrator
  • IBM - Cognos Data Manager (Formerly known as Cognos DecisionStream)
  • Microsoft - SQL Server Integration Services
  • Oracle - Data Integrator (Formerly known as Sunopsis Data Conductor)
  • SAS - Data Integration Studio
  • Oracle - Warehouse Builder

Freeware, open source ETL tools:
  • Pentaho Data Integration (Kettle)
  • Talend Integrator Suite
  • CloverETL
  • Jasper ETL
DataStage Introduction

    IBM InfoSphere DataStage is an ETL tool built on client-server technology: an integrated tool set used for designing, running, monitoring and administering “data acquisition” applications, which are known as “jobs”.
    IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.

    DataStage was acquired by IBM in 2005 and added to the WebSphere family. From 2006 it was officially known as IBM WebSphere DataStage, and the name was later changed again to IBM InfoSphere DataStage.

    A job is a graphical representation of data flow from source to target; it is designed with source definitions, target definitions and transformation rules. The DataStage software consists of client and server components.




    The client (installed on the PC) has four components:
    1.  DATA STAGE ADMINISTRATOR : This component is used to create or delete projects, clean up metadata stored in the repository, and install NLS (National Language Support).
    2.  DATA STAGE DESIGNER :  It is used to create DataStage applications, known as jobs. The following activities can be performed in the Designer window:
      a) Create the source definition.
      b) Create the target definition.
      c) Develop Transformation Rules
      d) Design Jobs.
    3.  DATA STAGE DIRECTOR :  It is used to validate, schedule, run and monitor DataStage jobs.
    4.  DATA STAGE MANAGER :  It is used to perform the following tasks:
      a) Create the table definitions.
      b) Metadata back-up and recovery can be performed.
      c) Create the customized components.
    Data Stage Repository:- It is one of the server-side components; it stores the information required to build the data warehouse.
    Jobs:
    A job is an ordered series of individual stages linked together to describe the flow of data from source to target. Three types of jobs can be designed:
    a) Server jobs
    b) Parallel Jobs
    c) Mainframe Jobs


    Difference between server jobs and parallel jobs
    1. Server jobs:-

    •  They handle smaller volumes of data.
    •  They have fewer components.
    •  Data processing is slower.
    •  They work purely on SMP (Symmetric Multi-Processing).
    •  They rely heavily on the Transformer stage.

    2. Parallel jobs:-
    •  They handle high volumes of data.
    •  They work on parallel processing concepts.
    •  They apply parallelism techniques (pipeline and partition parallelism).
    •  They follow MPP (Massively Parallel Processing).
    •  They have more components than server jobs.
    •  They work on the Orchestrate framework.

    1. SMP (symmetric multiprocessing)
    • Some hardware resources may be shared among processors. The processors communicate via shared memory and run a single operating system.
    • In a symmetric multi-processing environment the CPUs share the same memory, so code running on one CPU can affect the memory used by another.
    2. Cluster or MPP (massively parallel processing)
    • Also known as shared-nothing: each processor has exclusive access to its hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each run their own operating system and communicate via a high-speed network.
    • MPP (Massively Parallel Processing): a computer system with many independent arithmetic units or entire microprocessors that run in parallel.
    Parallelism in DataStage
    • Parallelism is the approach of performing the ETL work needed to build the data warehouse in parallel. Parallel jobs support hardware systems such as SMP and MPP to achieve parallelism.
    There are two parallelism techniques:
    1. Pipeline parallelism:- data flows continuously through the pipeline, and all stages in the job operate simultaneously. For example, if the source has 4 records, then as soon as the first record starts processing, the remaining records are processed simultaneously in the downstream stages.
    2. Partition parallelism:- the same job effectively runs simultaneously on several processors, each handling a separate subset of the total records. For example, if the source has 100 records and 4 partitions, the data is divided equally so that each partition gets 25 records; all four partitions are processed simultaneously and in parallel (see the sketch below).
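    A minimal Python sketch of partition parallelism, using the 100-records / 4-partitions example above (it only imitates the idea with worker processes; it is not how DataStage itself is implemented):

from multiprocessing import Pool

records = list(range(100))
NUM_PARTITIONS = 4

# each record lands in partition (index mod 4), giving 25 records per partition
partitions = [records[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

def transform_partition(subset):
    return [value * 10 for value in subset]   # stand-in transformation rule

if __name__ == "__main__":
    with Pool(NUM_PARTITIONS) as pool:
        results = pool.map(transform_partition, partitions)
    print([len(part) for part in results])    # -> [25, 25, 25, 25]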
    Configuration file
    It is a plain text file. It holds information about the processing and storage resources that are available for use during parallel job execution.

    The default configuration file contains entries such as the following (a minimal example appears after this list):
    1.  Node:- a logical processing unit which performs the ETL operations.
    2.  Pools:- collections of nodes.
    3.  Fastname:- the server (host) name; ETL jobs execute on the machine identified by this name.
    4.  Resource disk:- a permanent storage area which stores the repository components.
    5.  Resource scratchdisk:- a temporary storage area where staging operations are performed.
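    For illustration, a minimal single-node configuration file might look like the following; the node name, host name and directory paths are placeholders and will differ for every installation.

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/opt/datastage/datasets" {pools ""}
    resource scratchdisk "/opt/datastage/scratch" {pools ""}
  }
}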
    Partitioning Techniques
    1. Partitioning divides data into smaller segments, which are then processed independently by each node in parallel. It helps take advantage of parallel architectures such as SMP, MPP, grid computing and clusters.
    2. Collecting is the opposite of partitioning: it brings the data partitions back together into a single sequential stream (one data partition). 


    Data partitioning methods

    DataStage supports several data partitioning methods which can be implemented in parallel stages:


  • Auto (default) : DataStage Enterprise Edition decides between using Same or Round Robin partitioning. Typically Same partitioning is used between two parallel stages and Round Robin is used between a sequential and an EE stage.


  • Same : existing partitioning remains unchanged. No data is moved between nodes.
  • Round robin : rows are alternated evenly across partitions. This partitioning method guarantees an exact load balance (the same number of rows processed) between nodes and is very fast.


  • Hash : rows with the same key column (or multiple columns) go to the same partition. Hash is used very often and sometimes improves performance; however, it is important to keep in mind that hash partitioning does not guarantee load balance, and misuse may lead to skewed data and poor performance (see the sketch after this list).


  • Entire : all rows from a dataset are distributed to each partition. Rows are duplicated and the data volume is significantly increased.


  • Random : rows are randomly distributed across partitions.


  • Range : an expensive refinement of hash partitioning. It is similar to hash, but the partition mapping is user-determined and the partitions are ordered. Rows are distributed according to the values in one or more key fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning requires processing the data twice, which makes it hard to find a reason for using it.


  • Modulus : data is partitioned on one specified numeric field by calculating the modulus against the number of partitions. Not used very often.
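    The difference between Round robin and Hash partitioning can be sketched in a few lines of Python (illustrative only; the key values and partition count are made up):

NUM_PARTITIONS = 3
rows = [{"id": i, "dept": dept} for i, dept in enumerate("AAAABBC")]

def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)               # alternate rows evenly
    return parts

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)  # same key -> same partition
    return parts

print([len(p) for p in round_robin(rows, NUM_PARTITIONS)])             # always balanced: [3, 2, 2]
print([len(p) for p in hash_partition(rows, NUM_PARTITIONS, "dept")])  # may be skewed, e.g. [4, 2, 1]
# (the exact hash split varies between runs because Python salts str hashes)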


    Data collecting methods

    A collector combines partitions into a single sequential stream. DataStage EE supports the following collecting algorithms:

  • Auto : the default algorithm reads rows from a partition as soon as they are ready. This may lead to different row orders in different runs with identical data; the execution is non-deterministic.


  • Round Robin : picks rows from the input partitions in strict turn, for instance: first row from partition 0, next from partition 1, even if other partitions could produce rows faster than partition 1. 


  • Ordered : reads all rows from the first partition, then the second partition, then the third, and so on.


  • Sort Merge : produces a globally sorted sequential stream from rows that are already sorted within each partition, using the following algorithm: always pick the partition whose next row has the smallest key value. The resulting order of un-keyed columns is non-deterministic (a small sketch follows).
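    A small sketch of the Sort Merge idea in Python, assuming each partition is already sorted on the key:

import heapq

partition_0 = [1, 4, 9]
partition_1 = [2, 3, 10]
partition_2 = [5, 6, 7]

# repeatedly pick the partition whose next row has the smallest key value
collected = list(heapq.merge(partition_0, partition_1, partition_2))
print(collected)   # -> [1, 2, 3, 4, 5, 6, 7, 9, 10]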

