civil-and-structural-engineering
Interview Questions on Data Warehousing and Etl Processes
Table of Contents
Basic Concepts of Data Warehousing
A solid grasp of data warehousing fundamentals is the first thing interviewers assess. You need to not only define terms but also explain how they apply in real-world scenarios.
What is a data warehouse?
A data warehouse is a centralized repository that stores large volumes of structured, historical data from multiple source systems. It is optimized for query and analysis rather than transaction processing. Data warehouses support business intelligence activities such as reporting, dashboards, and ad-hoc analytics. Unlike operational databases, a data warehouse holds integrated, subject-oriented, time-variant, and non-volatile data.
What are the key characteristics of a data warehouse?
- Subject-oriented: Organized around major subjects (e.g., customers, products, sales) rather than application processes.
- Integrated: Data from disparate sources is cleansed, transformed, and standardized into a consistent format.
- Non-volatile: Data is read-only once loaded; historical changes are tracked via versioning, not overwrites.
- Time-variant: Data contains time dimension attributes (e.g., date stamps, periods) to support historical analysis.
How does a data warehouse differ from a data lake?
A data lake stores raw, unprocessed data in its native format (structured, semi-structured, or unstructured). A data warehouse stores processed, cleaned, and structured data. Organizations often use both: the lake for exploratory analytics and machine learning, and the warehouse for structured reporting. Interviewers may ask about the use cases where one is preferred over the other.
What is an Operational Data Store (ODS)?
An ODS is a database designed to integrate data from multiple operational systems for near-real-time reporting. Unlike a data warehouse, the ODS is updated frequently (often in real time) and typically does not retain historical snapshots. It acts as a staging area for operational reporting before data is moved into the data warehouse.
Data Modeling in Data Warehousing
Data modeling is the blueprint of a data warehouse. Two common approaches are the star schema and the snowflake schema.
What is a star schema?
A star schema has a central fact table linked to one or more dimension tables via foreign keys. Dimensions are denormalized (e.g., a single product dimension table containing category, brand, and subcategory). This structure simplifies queries and improves read performance. It is the most common model in data warehousing.
What is a snowflake schema?
A snowflake schema normalizes dimension tables into multiple related tables. For example, a product dimension might be split into separate product, brand, and category tables. While this reduces data redundancy, it increases the number of joins and can slow query performance. It is used when storage efficiency is prioritized over query speed.
What is a fact table? What are the types of facts?
A fact table stores quantitative measures (e.g., sales amount, quantity, profit) and foreign keys linking to dimension tables. Fact tables can be classified as:
- Transactional – records individual events (e.g., each sale line item).
- Periodic snapshot – captures measures at regular intervals (e.g., daily inventory levels).
- Accumulating snapshot – tracks processes with a fixed start and end (e.g., order fulfillment stages).
Interviewers may ask you to choose the appropriate fact type for a given business scenario.
What are dimension tables? Explain conformed dimensions.
Dimension tables contain descriptive attributes (e.g., customer name, product color, store location). Conformed dimensions are shared across multiple fact tables within a data warehouse or across different data marts. They ensure consistency so that reports can be combined meaningfully. For example, a date dimension used in both sales and inventory fact tables must have the same structure and granularity.
Slowly Changing Dimensions (SCDs)
Handling changes in dimension attributes over time is a critical skill in ETL design. Interviewers frequently ask about SCD types.
Explain Type 1, Type 2, and Type 3 slowly changing dimensions.
- Type 1: Overwrites the old value with the new value. No history is retained. Suitable when historical accuracy is not required (e.g., correcting a typo in a product name).
- Type 2: Adds a new row to track the change, with effective date ranges (start date, end date) and a current flag. This preserves full history. Most common for attributes like customer address or employee department.
- Type 3: Adds a new column to store the previous value while keeping the current value. This allows limited history (usually one previous version). Used for attributes that change infrequently (e.g., product category realignment).
Be prepared to discuss trade-offs: Type 2 increases row count but gives complete audit trail; Type 1 is simple but loses history.
ETL Process Overview
The ETL process is the backbone of data integration. A thorough understanding of each phase and common challenges is essential.
Explain each step of ETL in detail.
Extract: Data is pulled from various source systems — relational databases, flat files (CSV, JSON, XML), APIs, cloud storage, or streaming platforms. Extraction can be full (all data) or incremental (only new/modified records since last run). Challenges include handling different data formats, network latency, and source system load.
Transform: Data is cleaned, validated, and converted into a consistent format. Transformations include:
- Data type conversions (e.g., string to date)
- Deduplication and null handling
- Business rule application (e.g., calculating margin = revenue – cost)
- Aggregation and pivoting
- Lookups to surrogate keys
Load: Transformed data is inserted into the target data warehouse. Loading strategies: full refresh (truncate and reload), incremental append, and upsert (merge). Consider index rebuilding, partition switching, and transaction management during load.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the warehouse. ELT (Extract, Load, Transform) loads raw data first and then transforms it using the data warehouse’s processing power (e.g., SQL or MapReduce). ELT is common in modern cloud data warehouses like Snowflake, BigQuery, and Redshift, where storage and compute are decoupled. ETL is still preferred when transformations require complex business logic or when source data quality is low.
What are common ETL tools?
Popular tools include Informatica PowerCenter, Talend, IBM DataStage, Microsoft SSIS, Apache NiFi, and cloud-native services like AWS Glue, Azure Data Factory, and Google Dataflow. Open-source options: Pentaho (Kettle), Apache Airflow (orchestration), and dbt (data build tool for transformations). Interviewers may ask about your experience with specific tools and how you handled performance or debugging.
Common Interview Questions and Their Detailed Answers
1. What are the major challenges faced in ETL processes, and how do you mitigate them?
Challenges include:
- Data quality issues – missing values, duplicates, inconsistent formats. Mitigation: implement profiling and validation rules early; use staging tables to quarantine bad records.
- Performance bottlenecks – slow extraction from source systems, heavy transformations, or inefficient loads. Mitigation: use incremental extraction, parallel processing, batch partitioning, and optimize SQL join strategies.
- Data volume growth – loading terabytes daily. Mitigation: implement partition pruning, compression, and scalable cloud infrastructure.
- Data latency requirements – need for near-real-time updates. Mitigation: use change data capture (CDC) and streaming ingestion tools (Kafka, Kinesis).
- Dependency management – ETL jobs that fail due to resource contention or scheduling conflicts. Mitigation: use orchestration tools with retry logic and alerting.
2. How do you optimize ETL processes for performance?
Performance optimization spans multiple areas:
- Extraction: Use incremental extraction instead of full loads; implement CDC (e.g., log-based or timestamp-based); use bulk copy utilities.
- Transformation: Push transformations where possible (e.g., use SQL in the database); avoid row-by-row operations; use set-based logic; parallelize independent tasks.
- Load: Disable indexes and constraints during the load and rebuild afterwards; use batch inserts; consider partition switching for large tables.
- Infrastructure: Use SSDs, scale compute resources, and use caching layers. Monitor with profiling tools to identify bottlenecks.
3. What is the difference between OLAP and OLTP systems?
OLTP (Online Transaction Processing) is designed for high-volume, short, atomic transactions (e.g., order entry, inventory updates). Data is normalized, and queries touch a small number of records. OLAP (Online Analytical Processing) is designed for complex queries that aggregate large volumes of historical data. OLAP systems are typically denormalized (star schema) and support multidimensional analysis (slice, dice, drill-down). A typical interview question: "When would you choose an OLAP database over an OLTP database?" Answer: For analytical reports, dashboards, and data mining.
4. Explain the concept of surrogate keys vs natural keys in data warehousing.
A surrogate key is an artificial, system-generated unique identifier (e.g., integer sequence) used as the primary key in a dimension table. A natural key is a business identifier from the source (e.g., product code, customer ID). Surrogate keys are recommended because they are stable (business keys can change, causing ripple effects), support SCD Type 2 (multiple rows per business key), and improve join performance (narrow numeric keys). Natural keys should still be kept as attributes for audit and traceability.
5. How do you handle error handling in an ETL pipeline?
Implement a robust error-handling framework:
- Use try-catch blocks and log errors to a separate error table with job ID, timestamp, row data, and error description.
- Define data quality rules and reject records that fail validation into a quarantine folder or table.
- Set up alerts (email, Slack) for critical failures.
- Implement retry logic for transient errors (network timeouts).
- Maintain a run history table to track success/failure status for each job step.
6. What is change data capture (CDC)?
CDC is a technique to capture changes (inserts, updates, deletes) in source data and apply them to a target system. Methods include:
- Log-based CDC (e.g., Oracle GoldenGate, Debezium)
- Timestamp-based (using last modified columns)
- Trigger-based (database triggers)
- Diff-based (comparing snapshots)
Advanced Interview Questions
7. How do you design an ETL process for a data warehouse that supports both batch and real-time ingestion?
Hybrid architectures are common. For batch: schedule nightly jobs using incremental loads. For real-time: use a streaming layer (e.g., Kafka) to capture events, then apply lightweight transformations and load into a real-time fact table or a delta layer (e.g., in a lakehouse). The batch and real-time paths should converge in the warehouse using upsert logic. Consider partitioning by time to merge the two streams consistently. Use tools like Apache Flink or Spark Structured Streaming.
8. Explain data lineage and why it is important.
Data lineage tracks the origin, transformations, and movement of data from source to target. It helps in impact analysis (what downstream reports break if a source changes), debugging (trace why a value is wrong), and auditing (compliance with regulations like GDPR or SOX). Tools like Apache Atlas, Marquez, or commercial solutions (Collibra, Alation) provide automated lineage. Interviewers may ask how you would document lineage in an ETL project.
9. What is the difference between a data warehouse and a data mart?
A data warehouse is an enterprise-wide repository covering multiple subject areas. A data mart is a subset focused on a single business function (e.g., sales, finance). Data marts can be built on top of the warehouse (dependent) or independently (independent). Choosing between them involves trade-offs in cost, governance, and agility.
10. How do you handle slowly changing dimensions in ETL?
The approach depends on the SCD type:
- Type 1: Use UPDATE statements to overwrite the record.
- Type 2: Use a MERGE (upsert) to close the previous version (set end date) and insert a new row with start date = now and current flag = true.
- Type 3: UPDATE the current column and move the old value to the previous column.
For large dimensions, implement a lookup cache to reduce database round trips. Also consider using hash comparison to detect actual changes and avoid unnecessary updates.
ETL Best Practices
Interviewers will look for practical experience. Mention these best practices during discussions:
- Modular design: Break ETL jobs into reusable components (e.g., reusable staging load, standard transformation library).
- Idempotency: Ensure that rerunning a job produces the same result (no duplicates). Use upsert logic and transactional boundaries.
- Metadata management: Maintain a data dictionary and job dependency graph.
- Performance monitoring: Track key metrics: rows processed per minute, duration, error rates, and skew. Use dashboards.
- Version control: Store ETL code in Git along with SQL scripts and configuration files.
- Testing: Write unit tests for transformations, integration tests for end-to-end pipelines, and data comparison tests against source and target.
Reference external resources for deeper learning: IBM on ETL, Snowflake: ETL vs ELT, and Martin Fowler on evolutionary data.
Conclusion
Mastering interviews on data warehousing and ETL processes requires both theoretical knowledge and practical experience. Focus on core concepts – data warehouse characteristics, dimensional modeling, SCDs, and ETL optimization – and be ready to discuss real-world challenges with specific solutions. Practice explaining your thought process clearly. By preparing for these common and advanced questions, you will demonstrate the expertise needed for successful data management roles.