A major obstacle preventing companies from utilizing data effectively is the lack of reliable data consolidated in a standardized format. This data often resides in disparate sources, rendering it unusable for critical business decisions.
Consider a scenario where you are the IT manager for a nationwide cell phone case retailer. Your board of directors needs sales data to make informed decisions, but the information is scattered across various systems. You’re tasked with creating a sales dashboard for the next board meeting.
However, consolidating this data is nearly impossible because of its varied formats and structures: some kiosks run a legacy Oracle database, the larger stores use Salesforce, and the newest kiosks rely on shared spreadsheets.
The challenge lies in merging this data from diverse locations, formats, and structures into a unified database for dashboard creation.
ETL Fundamentals
ETL, which stands for Extract, Transform, and Load, is a set of processes that addresses this challenge. It turns a complex data landscape into an organized, reliable, and repeatable process, enabling companies to leverage the data they already have to increase sales.
In our example, data from the Oracle database, Salesforce, and spreadsheets needs to be extracted, transformed if required, and loaded into a central data warehouse. This data warehouse then becomes accessible to reporting tools, facilitating the generation of dashboards and scorecards.
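To make the flow concrete, here is a minimal sketch of such a pipeline in Python. The connection strings, file names, and table names (a sales table in Oracle, a Salesforce export, a kiosk spreadsheet, and a fact_sales table in the warehouse) are all hypothetical; a real implementation would use the appropriate database drivers and the Salesforce API rather than a CSV export.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings and file paths -- replace with real ones.
ORACLE_URL = "oracle+oracledb://user:password@legacy-host:1521/?service_name=SALES"
WAREHOUSE_URL = "postgresql+psycopg2://etl_user:password@warehouse-host/analytics"
KIOSK_SHEET = "new_kiosk_sales.xlsx"

def extract() -> list[pd.DataFrame]:
    """Pull raw sales data from each source system."""
    oracle = pd.read_sql("SELECT store_id, sold_at, amount FROM sales",
                         create_engine(ORACLE_URL))
    kiosks = pd.read_excel(KIOSK_SHEET)  # newest kiosks: shared spreadsheet
    # Salesforce extraction would normally go through its API;
    # represented here as a placeholder CSV export.
    salesforce = pd.read_csv("salesforce_sales_export.csv")
    return [oracle, kiosks, salesforce]

def transform(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Consolidate the sources into one table with consistent columns and types."""
    combined = pd.concat(frames, ignore_index=True)
    combined["sold_at"] = pd.to_datetime(combined["sold_at"])
    return combined.drop_duplicates()

def load(df: pd.DataFrame) -> None:
    """Append the cleaned data to the central sales fact table."""
    df.to_sql("fact_sales", create_engine(WAREHOUSE_URL),
              if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```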
Let’s delve into the three steps of ETL:
Extraction
Extraction involves gathering data from one or more sources. These sources can range from documents and spreadsheets to CSV files, flat files, relational databases like Oracle, MySQL, and SQL Server, as well as non-relational databases.
There are two primary extraction types: full and partial.
- Full extraction is suitable for initial data pulls or when the data volume and extraction time are manageable.
- Partial extraction is preferred when extracting only updated or new data, especially when the data volume is large.
The choice between full and partial extraction is also shaped by data availability and integrity. In practice, only complete transactions are extracted so that partial records do not compromise integrity. For instance, in an online test, only completed responses would be extracted, keeping incomplete answers from skewing the results.
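To illustrate the two extraction types in code, the sketch below assumes a hypothetical sales table with a last_updated timestamp column that can serve as a watermark for partial extraction.

```python
from datetime import datetime

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical legacy Oracle source.
engine = create_engine("oracle+oracledb://user:password@legacy-host:1521/?service_name=SALES")

def full_extract() -> pd.DataFrame:
    """Pull every row -- suitable for the initial load or small tables."""
    return pd.read_sql(text("SELECT * FROM sales"), engine)

def partial_extract(last_run: datetime) -> pd.DataFrame:
    """Pull only rows changed since the previous run (requires a reliable timestamp)."""
    query = text("SELECT * FROM sales WHERE last_updated > :since")
    return pd.read_sql(query, engine, params={"since": last_run})
```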
Transformation
Once extracted, data undergoes transformation, which includes cleaning, manipulation, and reformatting based on business rules and technical standards. This ensures data quality.
A staging area, a temporary storage space, is often used to hold the extracted data before transformation.
While smaller projects may not require a staging area, it’s common in larger projects.
Several tasks are performed during transformation:
- Selection: Applying criteria to filter the extracted data.
- Integration: Combining the extracted data in the staging area through additions, deletions, and updates.
- Joins: Linking extracted data similar to SQL joins (inner, left, right, full).
- Cleansing or scrubbing: Removing inconsistent, invalid, or erroneous data to enhance quality. Common issues include:
  - Referential integrity violations (a customer linked to a nonexistent category)
  - Missing values (a customer without an ID)
  - Uniqueness violations (duplicate Social Security numbers)
  - Misspellings (e.g., “Sun Diego,” “Cannada”)
  - Contradictory values (conflicting birthdates)
- Summarizations: Condensing data sets.
- Aggregations: Grouping and summarizing data.
- Consolidations: Combining data from various sources into a unified set.
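The short sketch below illustrates several of these tasks (selection, cleansing, an inner join that enforces referential integrity, and aggregation) using pandas; the table and column names are purely illustrative.

```python
import pandas as pd

# Hypothetical staged data: column names are illustrative only.
sales = pd.DataFrame({
    "sale_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 99],   # 99 has no matching customer
    "amount": [25.0, 40.0, 40.0, 15.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "state": ["CA", "NY"]})

# Selection: keep only rows that meet the business criteria.
sales = sales[sales["amount"] > 0]

# Cleansing: drop duplicate entries and rows with missing keys.
sales = sales.drop_duplicates(subset="sale_id").dropna(subset=["customer_id"])

# Join: an inner merge also enforces referential integrity --
# sales pointing at nonexistent customers are excluded.
enriched = sales.merge(customers, on="customer_id", how="inner")

# Aggregation: group and summarize for the dashboard.
summary = enriched.groupby("state", as_index=False)["amount"].sum()
print(summary)
```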
Common transformation tasks:
- Removing duplicate entries
- Splitting and merging data
- Converting formats (date, time, numbers, measurements)
- Encoding data (e.g., “Male” to “M”)
- Performing calculations (e.g., item_value = unit_price * quantity)
- Generating unique keys
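A few of these tasks might look like the following in pandas; again, the column names and values are only examples.

```python
import uuid

import pandas as pd

orders = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "unit_price": [12.5, 8.0, 15.0],
    "quantity": [2, 1, 3],
    "order_date": ["2024-01-05", "2024-02-05", "2024-03-10"],
})

# Encoding: map verbose values to compact codes.
orders["gender"] = orders["gender"].map({"Male": "M", "Female": "F"})

# Format conversion: coerce date strings into a proper datetime type.
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Calculation: derive item_value from unit_price and quantity.
orders["item_value"] = orders["unit_price"] * orders["quantity"]

# Key generation: assign a unique surrogate key to each row.
orders["order_key"] = [uuid.uuid4().hex for _ in range(len(orders))]
```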
Loading
The final stage, loading, involves transferring the transformed data into the target repository, typically a data warehouse.
Three primary loading types exist:
- Full or initial load: Transferring all extracted and transformed data into the destination.
- Incremental load: Comparing transformed data with the destination and loading only new data.
- Refresh load: Updating the destination data to reflect source changes, either fully or incrementally.
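As a sketch of an incremental load, the snippet below assumes a PostgreSQL warehouse with a fact_sales table keyed on sale_id and uses an upsert so that new rows are inserted and existing ones are refreshed; the table, columns, and connection string are hypothetical.

```python
from sqlalchemy import create_engine, text

# Hypothetical warehouse connection.
engine = create_engine("postgresql+psycopg2://etl_user:password@warehouse-host/analytics")

UPSERT = text("""
    INSERT INTO fact_sales (sale_id, store_id, sold_at, amount)
    VALUES (:sale_id, :store_id, :sold_at, :amount)
    ON CONFLICT (sale_id) DO UPDATE
        SET store_id = EXCLUDED.store_id,
            sold_at  = EXCLUDED.sold_at,
            amount   = EXCLUDED.amount
""")

def incremental_load(rows: list[dict]) -> None:
    """Insert new rows and refresh existing ones instead of reloading everything."""
    with engine.begin() as conn:       # commits on success, rolls back on error
        conn.execute(UPSERT, rows)     # executemany over the batch of row dicts
```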
Essentially, ETL empowers companies of all sizes to integrate existing data, create valuable insights, and unlock new revenue streams.
Testing
Testing is crucial but often overlooked in ETL. Transforming and loading data from multiple sources is complex, so errors are likely; the most common error types were covered in the transformation section above.
Accurate data is paramount; inaccurate data can be disastrous. ETL professionals are responsible for ensuring data integrity throughout the process. Each phase requires testing. Extracted data, whether from one or many sources, must be validated.
Similarly, transformed data, such as summarized data, needs verification for accuracy and completeness.
After loading data into the data warehouse, testing continues. The loaded data is compared against both the transformed and extracted data.
In our cell phone case scenario, with three data sources, sample data from each source (stores, old kiosks, new kiosks) can be compared against the staging area data to ensure error-free extraction.
This sample data is then compared to the original source. Discrepancies, if any, are analyzed to determine if they are acceptable or errors.
Errors must be fixed. This means deciding whether the original data should be corrected, whether doing so is even feasible, and whether the transformation step can rectify the errors instead.
In some situations, discarding erroneous data and triggering alerts may be necessary.
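One way to automate part of this check is to reconcile a sample of source rows against the staging area, as sketched below; the key column (sale_id) and the compared column (amount) are assumptions for illustration.

```python
import pandas as pd

def reconcile(source: pd.DataFrame, staged: pd.DataFrame, key: str = "sale_id") -> pd.DataFrame:
    """Compare a sample of source rows against the staging area and report discrepancies."""
    sample = source.sample(n=min(100, len(source)), random_state=0)
    merged = sample.merge(staged, on=key, how="left",
                          suffixes=("_src", "_stg"), indicator=True)

    # Rows missing from staging point to extraction errors.
    missing = merged[merged["_merge"] == "left_only"]

    # Rows present in both are checked value by value (amount used as an example).
    both = merged[merged["_merge"] == "both"]
    mismatched = both[both["amount_src"] != both["amount_stg"]]

    return pd.concat([missing, mismatched])
```

Any rows returned by such a check would then be reviewed to decide whether the discrepancy is acceptable or an error.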
Testing encompasses:
- Data validation
- Data quality assessment
- Performance evaluation
- Data rule compliance
- Data model validation
Logging
Thorough logging is essential for maintainability and troubleshooting ETL processes.
Proper logging enables continuous improvement by assisting in bug detection and issue resolution related to data sources, formats, transformations, destinations, and more.
Robust logs expedite problem identification, saving teams valuable time. Without logs, pinpointing errors, especially in large data volumes, becomes significantly harder, potentially requiring re-running the entire process. Logs, however, help pinpoint specific problematic files and rows for efficient resolution.
Logs might be less crucial in small, manual systems with manageable data.
However, automated ETL processes handling large data volumes benefit greatly from logging. Well-designed logging systems pay off by facilitating faster error identification, improving data reliability, and highlighting areas for improvement.
Creating a logging system involves three main steps:
- Generation: Documenting the ETL pipeline execution, including start times, extracted files or tables, staged data, error messages, and other relevant information.
- Archiving: Maintaining a history of past executions for error analysis and improvement tracking. Irrelevant historical data, such as outdated structures, can be purged.
- Analysis: Log analysis is critical. Storing logs without analysis is wasteful. Analysis helps identify errors, improvement opportunities, and enhances data quality.
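A minimal generation step could be built on Python's standard logging module, as sketched below; the log file name, logger name, and row counts are illustrative only.

```python
import logging
from datetime import datetime, timezone

# Generation: write structured, timestamped records for every run to a file
# that can later be archived and analyzed.
logging.basicConfig(
    filename="etl_runs.log",                      # hypothetical log destination
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("sales_etl")

def run_pipeline() -> None:
    start = datetime.now(timezone.utc)
    log.info("run started at %s", start.isoformat())
    try:
        # extract(), transform(), and load() would be called here.
        log.info("extracted %d rows from %s", 1250, "oracle.sales")  # illustrative count
    except Exception:
        log.exception("pipeline failed")          # full traceback for troubleshooting
        raise
    else:
        elapsed = (datetime.now(timezone.utc) - start).total_seconds()
        log.info("run finished in %.1fs", elapsed)
```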
Performance
ETL processes, which may handle massive amounts of data, can be resource-intensive in terms of both time and computational power. When designing integrations, question whether all of the data is actually needed; keep the data volume as small as the objectives allow to avoid unnecessary costs, and anticipate future data growth as well.
The complexity of a logging system depends on factors like data volume and processing frequency. Small, infrequent data processes might suffice with manual logging, while companies with large datasets, multiple processes, diverse data sources and formats, and complex pipelines require robust logging systems for cost and time savings.
Regardless of these factors, ETL processes must be profitable. Investments in time, resources, and technology should yield positive returns.
In conclusion, ETL helps organizations unlock the value of their data, even if it’s scattered and underutilized. By integrating disparate systems, spreadsheets, and departmental data silos, ETL empowers businesses to maximize data utilization and gain a competitive edge.