Azure Data Engineer Interview Questions
1. What are the best practices for managing and optimizing storage costs in ADLS?
- Use storage tiers – Hot, Cool, Archive based on access frequency.
- Enable lifecycle policies – Auto-move or delete old data.
- Use compressed formats – Like Parquet or Avro.
- Avoid small files – Merge to reduce overhead.
- Clean up unused data – Delete temp or obsolete files.
- Monitor with Cost Management – Set budgets and alerts.
- Use hierarchical namespace – For efficient file handling.
2. How do you implement security measures for data in transit and at rest in Azure?
Security measures in Azure:
- Data in Transit:
  - Use TLS encryption (enabled by default).
  - Use private endpoints and VPNs for secure connections.
- Data at Rest:
  - Use Azure Storage Service Encryption (SSE) (enabled by default).
  - Enable Azure Disk Encryption for VMs.
  - Use customer-managed keys (CMK) for added control.
3 . Describe the role of triggers and schedules in Azure Data Factory.
Role of Triggers in ADF
Triggers determine when and how a pipeline should run. ADF supports three main types:
- Schedule Trigger
  - Runs pipelines at specific times or intervals (e.g., daily, hourly).
  - Ideal for regular ETL jobs.
- Event-based Trigger
  - Starts pipelines in response to events, such as the arrival of a file in Azure Blob Storage.
  - Useful for real-time or near-real-time processing.
- Manual Trigger
  - Pipelines are started manually by a user or system.
  - Useful for testing or ad-hoc runs.
Schedules in ADF
Schedules define time-based rules for execution:
- Specify start time, recurrence, and time zone.
- Can be linked to schedule triggers to automate runs.
4 . How do you optimize data storage and retrieval in Azure Data Lake Storage?
1. Use Efficient File Formats
Store data in Parquet or Avro formats, which are compressed and columnar, reducing both storage space and read times during analytics.
2. Partition Data
Organize your data into logical folders (e.g., by date or region). This helps in minimizing data scanned during queries, improving performance.
3. Avoid Small Files
Too many small files cause metadata overhead and slow down processing. Combine them into larger files for better efficiency.
4. Use Hierarchical Namespace (HNS)
ADLS Gen2 with HNS enabled supports directory operations and improves performance and manageability.
5. Storage Tiering
Use Hot tier for frequently accessed data, Cool for infrequent, and Archive for rarely accessed data to reduce costs.
6. Automate with Lifecycle Policies
Set lifecycle rules to automatically move or delete old data, keeping storage optimized.
5 . How do you optimize query performance in Azure SQL Database?
To optimize query performance in Azure SQL Database:
- Create and maintain appropriate indexes (clustered, non-clustered, columnstore) on frequently filtered and joined columns.
- Keep statistics up to date and use Query Store to find and tune regressed or long-running queries.
- Write efficient queries: avoid SELECT *, filter early, and use parameterized queries for plan reuse.
- Partition or archive large tables to keep the active working set small.
- Scale the service tier (DTU/vCore) or use read replicas when the workload outgrows the current capacity.
12 . What is the significance of Z-ordering in Delta tables in Azure Databricks?
Z-ordering in Delta tables (Azure Databricks) is a technique used to optimize data layout for faster query performance.
Significance:
- Improves query speed by co-locating related data (e.g., filtering columns) on disk.
- Reduces the amount of data scanned during queries by enabling data skipping.
- Especially useful for high-cardinality columns like timestamps, user IDs, or product codes.
- Enhances performance for range queries and filters in large datasets.
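A minimal sketch of applying Z-ordering from a Databricks notebook, assuming an existing Delta table named `events` (a placeholder) registered in the metastore:
# Co-locate rows with similar user_id and event_date values in the same files
# so data skipping can prune more files for selective queries.
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")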
13 . How do you handle incremental data load in Azure Databricks?
To handle incremental data load in Azure Databricks:
- Use a watermark column (e.g., `LastModifiedDate` or `UpdatedAt`) to filter new or changed records.
- Query only the data that has changed since the last load using Spark SQL or DataFrame filters.
- Store the checkpoint or last processed value (e.g., in a Delta table or metadata file).
- Merge incremental data into the target Delta table using `MERGE INTO` for upserts (insert/update), as in the sketch below.
- Automate the process using Databricks Jobs or integrate with ADF pipelines for orchestration.
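A minimal sketch of the merge step, assuming a Delta target table named `target_table` and an incremental DataFrame `updates_df` keyed on `id` (both names are placeholders):
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert brand-new rows
    .execute()
)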
22. How do you implement error handling in Azure Data Factory pipelines?
Configure retry policies on individual activities for transient failures, route failures through the On Failure dependency path to logging or notification activities, and use control activities such as If Condition and Execute Pipeline to branch the workflow. Failed records or activity details can be logged to a separate store for later reprocessing.
23 . Describe the process of integrating ADF with Azure Databricks for ETL workflows.
To integrate Azure Data Factory (ADF) with Azure Databricks for ETL workflows:
- Create a Linked Service in ADF to connect to your Azure Databricks workspace using a workspace URL and access token.
- In your ADF pipeline, add a Databricks Notebook activity to call a specific notebook for ETL logic (e.g., data transformation, cleansing).
- Pass parameters from ADF to Databricks using the base parameters option.
- Use ADF triggers or scheduling to automate and orchestrate the ETL workflow.
- Monitor and log execution results in ADF's Monitor tab to track success or failure.
25 . How do you handle schema evolution in Delta Lake (Databricks on Azure)?
- Use the `mergeSchema` option when writing data to allow automatic schema updates (see the sketch below).
- Enable schema enforcement to prevent accidental writes with incompatible schemas.
- Use the `ALTER TABLE` command to manually add or update columns when needed.
- For streaming data, use Auto Loader with `cloudFiles.schemaEvolutionMode` set to `addNewColumns`.
- Track schema changes using Delta Lake's transaction log and the `DESCRIBE HISTORY` command.
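A minimal sketch of the `mergeSchema` option, assuming `new_df` contains columns not yet present in a Delta table stored at a placeholder path:
(
    new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # add any new columns to the table schema
    .save("/mnt/delta/events")       # placeholder path
)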
26 . How do you secure data pipelines in Azure?
To secure data pipelines in Azure, follow these best practices:
- Use Managed Identity to authenticate ADF, Databricks, or Synapse with other Azure services without storing secrets.
- Enable encryption:
  - In transit using HTTPS/TLS
  - At rest using Azure Storage encryption with Microsoft or customer-managed keys (CMK)
- Restrict access using Azure RBAC and Access Control Lists (ACLs) on resources like ADLS or Key Vault.
- Use Private Endpoints and VNET Integration to keep data movement within secure networks.
- Audit and monitor activity using Azure Monitor, Log Analytics, and Defender for Cloud.
- Store secrets securely in Azure Key Vault and reference them in pipelines instead of hardcoding.
27 . What are the best practices for managing large datasets in Azure Databricks?
- Use Delta Lake format to ensure data reliability, support for ACID transactions, and efficient updates.
- Optimize data layout by managing partitions effectively and using Z-Ordering for faster query filtering.
- Minimize small files by batching writes or using tools like Auto Optimize to combine data efficiently.
- Scale clusters appropriately using autoscaling and choose the right node types for compute-heavy workloads.
- Monitor and tune performance with the Spark UI, job metrics, and built-in Databricks performance tools.
- Use caching carefully for frequently reused data to reduce computation time.
- Implement access controls with Unity Catalog, table ACLs, and Azure security features to govern large datasets securely.
28 . Explain the difference between streaming and batch processing in Spark (Azure context).
In the Azure context (e.g., Azure Databricks with Spark), the difference between streaming and batch processing lies in how data is ingested and processed:
Batch Processing:
- Processes static or finite datasets at scheduled intervals.
- Ideal for ETL jobs, historical data analysis, and data warehouse loads.
- Uses Spark APIs like `DataFrame`, `read`, and `write`.
Streaming Processing:
- Handles real-time or continuous data from sources like Event Hubs, Kafka, or IoT Hub.
- Suitable for real-time analytics, fraud detection, or alerting systems.
- Uses the Structured Streaming API with `readStream` and `writeStream`.
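A minimal sketch of the two read paths (file locations are placeholders):
# Batch: read a finite Parquet dataset once
batch_df = spark.read.parquet("/mnt/raw/sales/")

# Streaming: continuously pick up new JSON files as they arrive
stream_df = (
    spark.readStream.format("json")
    .schema(batch_df.schema)      # streaming reads require an explicit schema
    .load("/mnt/landing/sales/")
)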
29 . What is the purpose of caching in PySpark and how is it used in Azure Databricks?
- To Speed Up Workflows: When a DataFrame is used multiple times in transformations or actions, caching it with `.cache()` or `.persist()` keeps it in memory for faster access.
- Monitoring: You can track cache usage and storage through the Spark UI in Databricks for optimization.
- Best Practices:
  - Cache only when data fits in memory.
  - Unpersist unused data to free up memory.
30 . How to implement incremental load in ADF?
Incremental load in Azure Data Factory is implemented using watermark columns (e.g., LastModifiedDate or
ID).
You can use the ‘Lookup’ activity to retrieve the last loaded value, pass it as a parameter to the source
dataset, and use a ‘Filter’ or query condition to load only new or updated records.
31 . How do you design and implement data pipelines using Azure Data Factory?
Designing pipelines in ADF involves defining source and destination datasets, creating linked services for
connectivity, and using activities like Copy, Data Flow, or stored procedure. Pipelines can include conditional
logic, loops, parameters, and triggers to orchestrate the flow of data.
32 . How do you handle late-arriving data in ADF?
Late-arriving data can be handled using time window-based watermarking, storing late data in a staging area, or using tumbling window triggers. You can also reprocess specific partitions using ADF pipeline parameters
and conditional branching.
33 . Describe the process of setting up CI/CD for Azure Data Factory.
CI/CD in ADF is achieved using Git integration with Azure Repos or GitHub. You create feature branches for
development, publish changes to the collaboration branch, and use Azure DevOps pipelines or ARM
templates to deploy to other environments like test and production.
34 . What are the types of Integration Runtimes (IR) in ADF?
ADF supports three types of Integration Runtimes:
– Azure IR for cloud data movement and transformation
– Self-hosted IR for on-premises and VNet access
– Azure-SSIS IR for running SSIS packages in ADF
35 . How do you ensure data quality and validation in ADLS?
Data quality in ADLS can be ensured using ADF Data Flows with derived columns, conditional splits, and
assertion transformations. You can also implement row-level validation checks and log invalid records into
separate datasets for analysis.
36 . Describe the role of triggers in ADF pipelines.
Triggers in ADF automate pipeline execution. Types include:
– Schedule Trigger: runs at defined intervals
– Tumbling Window Trigger: used for time-based partitioning
– Event-based Trigger: responds to blob events in Azure Storage
– Manual Trigger: used for on-demand runs.
37. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
Use a metadata table that stores source and destination table names. Create a ForEach activity in ADF that reads the metadata and uses Copy activity inside it to copy data dynamically.
38. How do you monitor ADF pipeline performance?
- Use Monitor tab in ADF Studio.
- Enable diagnostic logs to route data to Log Analytics.
- Use Azure Monitor or custom alerts for errors or performance bottlenecks.
39. How do you implement error handling in ADF using retry, try-catch blocks, and failover mechanisms?
ADF provides robust mechanisms for error handling to ensure data reliability and fault tolerance. You can apply Retry Policies directly in each activity to automatically retry upon transient failures. Use control activities like If Condition, Switch, and Execute Pipeline along with the On Failure path to route the workflow logically based on the outcome. Additionally, log failed rows or activities into a separate error-handling pipeline or storage location to allow for future reprocessing, minimizing data loss.
40. How to track file names in the output table while performing copy operations in ADF?
In Azure Data Factory, you can track file names during copy operations by using sourceInfo().fileName
in Mapping Data Flows. This expression allows you to capture and store the source file name as a new column in the output table. This is useful for audit and traceability, especially when ingesting data from multiple files.
41 . How do you handle schema evolution in ADF?
Use Mapping Data Flows with Auto Mapping and enable “Allow Schema Drift” to handle dynamic schema changes. You can also validate schema using metadata checks before processing to ensure consistency.
42. What are the key considerations for designing scalable pipelines in ADF?
To design scalable pipelines in ADF, use parallelism by configuring the ForEach activity with a batch count. Structure your pipelines modularly for reusability and better maintainability. Leverage Integration Runtime scaling to manage large workloads efficiently, and ensure robust error handling with proper retry and failover strategies.
43. How do you manage schema drift in ADF?
To manage schema drift in Azure Data Factory, enable the “Allow Schema Drift” option in Mapping Data Flows. Use dynamic mapping or schema projection to accommodate changing schemas during runtime. Additionally, implement schema validation logic to audit and control any unexpected schema changes.
44. How do you integrate Azure Key Vault with ADF pipelines?
Store credentials and connection strings as secrets in Azure Key Vault and reference them from ADF linked services instead of hardcoding them. Grant ADF's managed identity access to the Key Vault (via access policies or RBAC) so the secrets can be retrieved securely at runtime.
55 . Explain the difference between narrow and wide transformations in PySpark
Narrow transformations (e.g., map, filter) operate on a single partition.
Wide transformations (e.g., groupByKey, reduceByKey) require data shuffling across partitions.
56. How do you optimize PySpark jobs for large datasets?
– Use partitioning wisely.
– Cache/persist intermediate results.
– Avoid using collect() on large datasets.
– Minimize data shuffles and use broadcast joins when possible.
57 . Write PySpark code to perform an inner join between two DataFrames
df1.join(df2, df1.id == df2.id, "inner")
58 . Describe the concept of fault tolerance in Spark.
Spark achieves fault tolerance through lineage information in RDDs and the ability to recompute
lost data using DAGs (Directed Acyclic Graphs).
59 . Explain the concept of partitioning in PySpark
Partitioning controls data distribution across clusters. Use `repartition()` to increase/decrease
partitions and `coalesce()` to reduce them efficiently.
60 . Explain the difference between Spark SQL and PySpark DataFrame APIs.
Spark SQL provides a SQL interface to Spark, allowing users to run SQL queries directly on
structured data. PySpark DataFrame APIs, on the other hand, offer a Pythonic way to perform data
manipulation, transformation, and aggregation using Spark’s DataFrame abstraction. While Spark
SQL is more suited for users familiar with SQL, PySpark DataFrames are better for complex data
engineering and transformations in code, allowing chaining of transformations with better type safety
and scalability.
61 . How do you manage partitioning in PySpark?
Partitioning in PySpark refers to dividing the data into logical chunks across nodes to enable
distributed computing. You can manage partitioning using the repartition() and coalesce() functions.
repartition() increases or decreases the number of partitions and reshuffles the data, while
coalesce() reduces the number of partitions without a full shuffle. Proper partitioning improves
parallelism and minimizes data shuffling during operations like joins, groupBy, and aggregations.
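A minimal sketch of both calls, assuming an existing DataFrame `df` with a `customer_id` column (placeholder names):
# Increase to 200 partitions hashed on the join key before a large join
df_repart = df.repartition(200, "customer_id")

# Reduce to 10 partitions without a full shuffle before writing out
df_small = df_repart.coalesce(10)
print(df_small.rdd.getNumPartitions())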
62 . How do you optimize joins in PySpark for large datasets?
To optimize joins in PySpark, consider broadcasting the smaller dataset using broadcast() when
joining with a significantly larger dataset. Also, ensure both datasets are partitioned properly on the
join key using repartition(). Avoid wide transformations and unnecessary shuffles, and filter data
before the join if possible. Using Delta Lake for data storage and caching frequently used tables can
further improve performance
63 . Write PySpark code to calculate the average salary by department
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
avg_salary = df.groupBy("department").agg(avg("salary").alias("average_salary"))
avg_salary.show()
64 . How do you implement parallel processing in PySpark?
PySpark enables parallel processing inherently through its distributed computing architecture. Each
transformation on a DataFrame or RDD is executed in parallel across partitions. You can influence
the degree of parallelism using repartition() to increase the number of partitions, allowing more tasks
to run concurrently. Additionally, actions like mapPartitions() and foreachPartition() help perform
operations in parallel across data partitions.
65 . What is Z-ordering in Spark?
Z-ordering is a data clustering technique used in Delta Lake to colocate related information in the
same set of files. It improves read performance by reducing the amount of data scanned during
queries. In Databricks, Z-ordering is applied with the ZORDER BY clause of the OPTIMIZE command on Delta tables rather than as a write-time option. This is particularly effective for queries with filters on columns that are Z-ordered.
66 . Write PySpark code to perform an inner join between two DataFrames.
df1 = spark.read.csv("employees.csv", header=True, inferSchema=True)
df2 = spark.read.csv("departments.csv", header=True, inferSchema=True)
joined_df = df1.join(df2, df1.dept_id == df2.id, "inner")
joined_df.show()
67 . What is AQE (Adaptive Query Execution) in Databricks?
Adaptive Query Execution (AQE) in Spark dynamically optimizes query plans at runtime based on
actual data statistics. AQE can change the join strategy, adjust skewed partition handling, and
optimize the number of shuffle partitions. In PySpark on Databricks, AQE is enabled by default,
making queries more efficient without requiring manual tuning.
68 . Explain the difference between narrow and wide transformations in PySpark.
Narrow transformations (like map, filter, and union) operate on a single partition of data and do not
require shuffling. These are fast and efficient. Wide transformations (like groupBy, join, and distinct)
require data shuffling across partitions, which is more expensive and slower. Understanding this
distinction is crucial for optimizing PySpark applications.
69 . Write a PySpark code to join two DataFrames and perform aggregation.
To join two DataFrames and aggregate the result, use the `join()` and `groupBy()` methods. Here's an example where we join sales data with product data and compute total sales by category:
from pyspark.sql.functions import sum

df_sales = spark.createDataFrame([(1, 100), (2, 200), (1, 150)], ["product_id", "sales"])
df_products = spark.createDataFrame([(1, "Electronics"), (2, "Furniture")], ["product_id", "category"])
joined_df = df_sales.join(df_products, on="product_id", how="inner")
agg_df = joined_df.groupBy("category").agg(sum("sales").alias("total_sales"))
agg_df.show()
70 . What is the difference between wide and narrow transformations in Spark?
In Spark, narrow transformations (like `map`, `filter`, and `union`) are transformations where each input partition contributes to only one output partition. These do not require data movement between partitions. Wide transformations (like `groupByKey`, `reduceByKey`, and `join`) involve shuffling, which means data is transferred across nodes to re-organize it based on keys. Wide transformations are more expensive and often require tuning for performance.
71 . Explain lazy evaluation in PySpark.
Lazy evaluation means that Spark does not immediately execute transformations like `map()` or `filter()`. Instead, it builds a logical execution plan. Actual computation is triggered only when an action like `collect()`, `count()`, or `show()` is called. This allows Spark to optimize execution plans and minimize data scans, improving performance, as in the sketch below.
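A minimal sketch, assuming a CSV file at a placeholder path with an `amount` column:
df = spark.read.csv("/mnt/raw/orders.csv", header=True, inferSchema=True)

# Nothing runs yet: these transformations only extend the logical plan
filtered = df.filter(df.amount > 100).select("order_id", "amount")

# The action below triggers the actual read, filter, and projection
print(filtered.count())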
72. How does caching work in PySpark?
Caching in PySpark means storing DataFrame results in memory so that subsequent actions can reuse the results without recomputing. You can cache a DataFrame using `.cache()` or `.persist()`. It's particularly useful when you're performing multiple actions on the same DataFrame. This reduces computation time but increases memory usage.
df.cache()
df.count()  # the first action triggers caching
73 . Write PySpark code to calculate the total sales for each product
from pyspark.sql.functions import sum

df = spark.createDataFrame([("Electronics", 100), ("Furniture", 200), ("Electronics", 150)], ["category", "sales"])
df.groupBy("category").agg(sum("sales").alias("total_sales")).show()
74 . Explain how broadcast joins improve performance in PySpark.
Broadcast joins are used when one DataFrame is small enough to fit in memory. Spark broadcasts the small DataFrame to all worker nodes, avoiding a shuffle. This significantly improves join performance in scenarios where one dataset is large and the other is small.
from pyspark.sql.functions import broadcast

large_df.join(broadcast(small_df), "id").show()
75 . Describe the role of the driver and executor in Spark architecture.
In Spark, the driver is the main program that coordinates all tasks. It converts user code into a Directed Acyclic Graph (DAG), plans job execution, and manages metadata. Executors are worker processes running on cluster nodes. They execute the tasks, perform data processing, and return results to the driver. Together, the driver and executors enable parallel data processing in a distributed environment.
76 . Explain Adaptive Query Execution (AQE) in Spark.
Adaptive Query Execution (AQE) is a feature in Spark that allows the engine to dynamically adjust the execution plan at runtime based on actual data statistics. AQE helps optimize:
- Join strategies (e.g., switching to broadcast join)
- Skew handling (by splitting skewed partitions)
- Shuffle partition sizes (by coalescing small partitions)
AQE improves query performance, especially in cases where static optimization falls short.
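AQE is on by default in recent Databricks runtimes; a minimal sketch of setting and checking the relevant Spark configs explicitly:
# Enable AQE and its two most common sub-features
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

print(spark.conf.get("spark.sql.adaptive.enabled"))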
77 . Write PySpark code to perform a left join between two DataFrames.
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "Sales")], ["id", "dept"])
df1.join(df2, on="id", how="left").show()
78. Join Two DataFrames and Aggregate
result = df1.join(df2, "id").groupBy("category").agg({"sales": "sum"})
96 . Describe the process of setting up CI/CD for Azure Data Factory
CI/CD for ADF is done using Azure DevOps. Create a Git repository linked to ADF, manage branches, publish changes, and create build and release pipelines using YAML or classic pipelines.
97. What are the types of Integration Runtimes (IR) in ADF?
There are three types of IR:
(1) Azure IR (for cloud data movement),
(2) Self-hosted IR (on-premises or VNet),
(3) Azure-SSIS IR (for SSIS packages).
98. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
ADLS is built on top of Blob Storage but supports hierarchical namespaces, optimized performance for big data analytics, and fine-grained access control.
99. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
Use Azure Repos or GitHub to manage notebooks. Use Azure Pipelines to automate testing and deployment using Databricks CLI or REST APIs.
100. How do you ensure data quality and validation in ADLS?
Implement validation checks in ADF or Databricks, use custom logging for anomalies, schema enforcement via Delta Lake, and apply data profiling.
101. Explain the use of hierarchical namespaces in ADLS.
Hierarchical namespace in ADLS Gen2 allows directories and files to behave like a traditional filesystem, enabling better performance and ACL-based security.
102. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
Provision a workspace, configure dedicated/ serverless SQL pools, link storage accounts, and integrate pipelines, Spark pools, and data flows for analytics.
103. How do you handle late-arriving data in ADF?
Use watermark columns with tumbling window triggers, or design pipelines to reprocess data based on arrival time.
104 . Explain the concept of Data Lakehouse.
A Data Lakehouse combines the storage of a data lake with the structure and management of a data warehouse using Delta Lake or Apache Iceberg.
105. How do you implement disaster recovery for ADLS?
Use Geo-redundant storage (GRS), data backup strategies, lifecycle policies, and automation scripts to recover and sync data.
106. How do autoscaling clusters work in Databricks?
Databricks autoscaling adjusts the number of worker nodes based on workload. It helps optimize costs by scaling down when idle and scaling up during high demand.
107. How do you manage access control in Azure Data Lake?
Use role-based access control (RBAC) and access control lists (ACLs) for files and directories, integrated with Azure AD for authentication.
108. What are the challenges in integrating on-premises data with Azure services?
Challenges include network latency, VPN setup, firewall rules, hybrid identity, data consistency, and secure data movement.
109. Describe the role of triggers in ADF pipelines.
Triggers initiate pipeline runs. Types include Schedule (time-based), Tumbling Window (interval + state), and Event-based (file arrival, etc).
110. How do you monitor and troubleshoot Spark jobs?
Use Spark UI, logs, Ganglia metrics, job history server, and Databricks job run details to identify slow stages and bottlenecks.
111 . How to copy all tables using metadata-driven pipelines in ADF?
To copy all tables using metadata-driven pipelines in ADF, you can maintain a metadata table or configuration file that contains information like source and target table names and schemas. Azure Data Factory uses a Lookup activity to fetch this metadata and a ForEach activity to loop over each table configuration. Inside the loop, a Copy Data activity dynamically reads from and writes to datasets using parameters and expressions, which allows for scalable and reusable pipeline design without manually creating separate activities for each table.
112 . How do you implement data encryption in Azure SQL Database?
Data encryption in Azure SQL Database is implemented using multiple layers of protection. For data at rest, Transparent Data Encryption (TDE) is enabled by default to encrypt stored data using certificates. For data in transit, Azure SQL uses Transport Layer Security (TLS) to protect data sent over the network. Additionally, key management can be handled through Azure Key Vault, which provides secure key rotation and storage capabilities.
113 . What are the best practices for managing and optimizing storage costs in ADLS?
To optimize storage costs in Azure Data Lake Storage, use lifecycle policies to automatically delete or move older data to cooler tiers. Compress files (e.g., using Parquet or Avro) to reduce storage footprint. Avoid small file issues by batching writes, and use partitioning to optimize data access and reduce scanning. Regularly monitor usage and configure alerts for unusual storage patterns.
114 . How do you implement security for data in transit and at rest in Azure?
Data at rest is secured using encryption methods like TDE in SQL databases and SSE in ADLS. Data in transit is encrypted using HTTPS and TLS 1.2 or higher. Role-Based Access Control (RBAC) and Access Control Lists (ACLs) help restrict access, while Azure Key Vault stores and manages encryption keys securely. Network-level security using Private Endpoints and Firewalls adds further protection.
115 . Describe the role of triggers and schedules in Azure Data Factory.
Triggers in ADF automate pipeline execution. There are three types: Schedule Trigger (runs pipelines on a time-based schedule), Tumbling Window Trigger (ensures interval-based data movement and supports retries), and Event Trigger (starts pipeline based on events like file creation in a blob). Triggers help build robust, time- or event-based workflows with minimal manual intervention.
116 . How do you optimize data storage and retrieval in Azure Data Lake Storage?
Optimization involves storing data in columnar formats like Parquet, organizing it using partitioned folders, and minimizing the number of small files. Use hierarchical namespace for better performance with directory-based operations. Employ caching and filter pushdowns in querying tools. Enable data tiering and compression to reduce costs and improve performance.
117 . How do you monitor ADF pipeline performance?
You can monitor pipelines using Azure Monitor, which integrates with ADF and provides metrics and logs. Activity runs and trigger runs can be visualized in the ADF Monitoring tab. For detailed insights, you can send diagnostic logs to Log Analytics or use custom logging inside pipelines using Web activities and Azure Functions.
118 . How do you implement incremental load in Databricks?
Incremental loading in Databricks can be done using watermarks (last updated timestamp or surrogate key). You filter new/changed records during each run using a value stored in a checkpoint or metadata table. Delta Lake simplifies this by supporting merge (upsert) operations and built-in versioning for CDC-like behavior.
119 . What are key considerations for designing scalable pipelines in ADF?
Scalability in ADF pipelines involves using parameterized datasets and linked services to build reusable and dynamic pipelines. Use ForEach with batch controls for concurrency, Data Flows for scalable transformations, and leverage Integration Runtime for distributed data movement. Monitor pipeline performance and break large workflows into smaller reusable units to ensure modularity and reusability.
120 . How do you handle error handling in ADF (retry, try-catch, failover)?
ADF provides built-in retry policies at the activity level and error handling using the If Condition, Until, and Execute Pipeline activities. You can wrap failure-prone steps in try-catch-finally style patterns using control flow logic. Failed rows can be routed to error paths or logged into error tables/files using Data Flows. Alerts and logging through Log Analytics or custom email/Teams alerts help monitor and recover.
121. How to track file names in ADF output during copy operations?
To track file names during copy operations, you can use the `@item().name` or dynamic expressions in the Copy Data activity to capture filenames during iteration. Logging the filename and status in a sink table or using a metadata store like Azure SQL can help in auditing and debugging. The filename can also be written into the target file using a derived column or parameter in Mapping Data Flow.
122 . How do you manage and automate ETL workflows using Databricks Workflows?
Databricks Workflows lets you orchestrate notebooks, scripts, and SQL in a DAG-like fashion. You can schedule workflows, define dependencies, retry logic, and pass parameters between tasks. Integration with Azure DevOps or GitHub allows for CI/CD. Workflow results and logs can be monitored in the UI or exported to external monitoring tools for alerts and metrics.
123 . Describe disaster recovery for ADLS.
ADLS provides geo-redundant storage (GRS) and zone-redundant storage (ZRS) to replicate data across regions. For mission-critical workloads, you can implement cross-region replication manually using tools like AzCopy or Azure Data Factory. Access can be managed through backup SAS tokens, and automated scripts or Logic Apps can restore metadata and access settings during a DR event.
124. How do you handle incremental data loads in ADLS?
Incremental loads in ADLS are typically based on timestamps or surrogate keys. You can filter new or changed data at the source using parameters and only copy deltas. Delta Lake format makes it easy by supporting ACID transactions, versioning, and merge operations. ADF pipelines or Databricks jobs can update checkpoints or metadata tables to track the last successful load.
125 . What are the security features in Azure Synapse Analytics?
Synapse provides multiple layers of security: RBAC for workspace access, firewall rules and private endpoints for network isolation, managed identities for secure data access, and integration with Azure AD for authentication. Data is encrypted at rest and in transit, and column-level and row-level security controls ensure fine-grained access. Synapse also supports auditing and diagnostic logging.
126 . How do you implement real-time processing in Databricks using Azure Event Hubs/Kafka?
Real-time processing is implemented by connecting Structured Streaming in Databricks to Azure Event Hubs or Kafka. The streaming source ingests events in micro-batches, applies transformations, and writes to sinks like ADLS, Delta Lake, or Synapse. You define checkpoints to ensure fault tolerance and use watermarking for late data handling. Auto-scaling and trigger intervals help optimize performance.
127 . How do you integrate Azure Key Vault with ADF?
Azure Key Vault can be linked to ADF by referencing secrets in linked services. When creating a linked service (e.g., Azure SQL or Blob), choose “store in Azure Key Vault” for credentials. You must grant ADF’s managed identity access to the Key Vault using access policies. This approach avoids hardcoding sensitive information in pipeline configurations.
128 . What are best practices for optimizing ADLS storage costs?
Use compression (e.g., GZIP, Snappy) and columnar formats like Parquet to reduce file sizes. Implement lifecycle management policies to move infrequently accessed data to cool or archive tiers. Avoid small files by batching writes and optimizing partition strategies. Monitor and clean up unused or outdated data using automation.
129 . How do you implement CI/CD for Azure Synapse Analytics?
Synapse integrates with Git repositories like Azure DevOps or GitHub for source control. You can commit pipelines, notebooks, and SQL scripts to branches and use YAML-based pipelines to deploy to different environments. ARM templates or Synapse Workspace Deployment Tool can automate resource provisioning. Use environment variables and parameterization for flexibility across environments.
130 . What is the role of Integration Runtime in ADF?
Integration Runtime (IR) is the compute infrastructure used by ADF to move and transform data. The Azure IR supports cloud data movement and transformations. Self-hosted IR allows access to on-prem or private network resources. SSIS IR supports running SSIS packages. IR is responsible for scaling, secure data transfer, and region-based performance optimization.
131 . How do you secure sensitive data in Azure?
Sensitive data is secured using encryption (at rest and in transit), Azure Key Vault for managing secrets, RBAC for access control, and data masking features in Azure SQL. Network security is enforced using firewalls, private endpoints, and NSGs. You can use tools like Microsoft Defender for Cloud for threat detection and compliance checks.
132 . Describe a data pipeline for real-time analytics using Azure tools.
A real-time analytics pipeline in Azure often starts with data ingestion using Azure Event Hubs, IoT Hub, or Kafka. Streamed data is processed using Azure Databricks Structured Streaming or Stream Analytics, and enriched or aggregated data is stored in Delta Lake or Azure Synapse Analytics. Power BI is used for near real-time dashboarding. The pipeline uses Azure Key Vault, ADF triggers, and Log Analytics for security, orchestration, and monitoring.
133 . What is the difference between a job cluster and an interactive cluster in Databricks?
A job cluster is created for a single job run and auto-terminates afterward, saving costs. An interactive cluster is manually started and used for development and collaboration, staying active until manually shut down.
134. How do you implement data deduplication in PySpark?
Use `dropDuplicates()` on specific columns to remove duplicates. Example: `df.dropDuplicates(["email", "phone"])`. It ensures clean data in ETL workflows.
135 . Explain the concept of Delta Lake compaction.
Compaction merges small files into larger ones using the `OPTIMIZE` command, improving read performance in Delta Lake by reducing file overhead.
136 . How do you handle null values in PySpark?
Use functions like `fillna()`, `dropna()`, or `na.replace()` to handle nulls. Choose based on whether you want to replace, drop, or impute missing data.
137 . What is AQE (Adaptive Query Execution) in Databricks?
AQE optimizes query plans at runtime by dynamically changing join types, fixing data skew, and optimizing partitions.
138 . Write PySpark code to perform an inner join between two DataFrames.
result = df1.join(df2, df1.id == df2.id, "inner")
139 . Explain the difference between narrow and wide transformations in PySpark.
Narrow transformations (e.g., `map`) don't require data movement. Wide transformations (e.g., `groupBy`) involve shuffling across partitions.
140 . How do you optimize PySpark jobs for large datasets?
Use partitioning, caching, broadcast joins, and avoid wide transformations. Monitor Spark UI for bottlenecks.
141 . Explain the concept of partitioning in PySpark.
Partitioning splits data across executors. It improves performance by enabling parallel processing and reducing shuffles.
142 . How do you implement real-time data processing in Databricks using Structured Streaming?
Use `readStream` and `writeStream` with supported sources like Kafka/Event Hubs. Define the processing logic and an output sink.
143 . Describe the concept of fault tolerance in Spark.
Spark tracks lineage using DAGs and can recompute lost data partitions upon failure, ensuring data reliability.
144 . Explain the concept of shuffling in Spark.
Shuffling is data movement across partitions during wide transformations. It’s expensive and can affect performance.
145 . What is a broadcast join in PySpark?
A broadcast join is used when one of the DataFrames is small. Spark broadcasts the smaller DataFrame to all executors to avoid shuffling. This improves join performance significantly. Use `broadcast()` from `pyspark.sql.functions` during the join.
146 . How to create a rank column using the Window function in PySpark?
Use PySpark's `Window` with `rank()` or `dense_rank()` to rank rows within partitions. Define a `WindowSpec` with `partitionBy` and `orderBy`, then apply the rank using `.withColumn()` and `.over(windowSpec)`, as in the sketch below.
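A minimal sketch, assuming a DataFrame `df` with `department` and `salary` columns (placeholders):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank

window_spec = Window.partitionBy("department").orderBy(df.salary.desc())

ranked = (
    df.withColumn("rank", rank().over(window_spec))
      .withColumn("dense_rank", dense_rank().over(window_spec))
)
ranked.show()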
147 . Difference between repartition() and coalesce() in PySpark
`repartition()` increases or reshuffles partitions via a full shuffle; used to balance data. `coalesce()` merges existing partitions without a shuffle; ideal for reducing partitions before writing. Repartition is costlier; coalesce is faster.
148 . How to persist and cache data in PySpark?
Use `.cache()` to store data in memory. Use `.persist()` for control over the storage level (memory, disk, etc.). Always unpersist data after use to free resources, as in the sketch below.
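A minimal sketch, assuming an existing DataFrame `df`:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it does not fit in memory
df.count()                                # the first action materializes the persisted data
# ... reuse df in further actions ...
df.unpersist()                            # release memory/disk when done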
149 . Explain the concept of partitioning in PySpark.
Partitioning splits data into chunks processed in parallel. Use `.repartition()` or `.coalesce()` for control. Efficient partitioning improves parallelism and reduces shuffle overhead.
150 . Explain the concept of Delta Lake compaction.
Compaction merges small Delta files into larger ones to reduce file overhead and improve query performance. Use the `OPTIMIZE` command in Databricks for compaction.
151 . How to handle null values in PySpark?
Use `fillna()` to replace nulls, `dropna()` to remove rows, and `na.replace()` for custom logic. Use `isNull()`/`isNotNull()` in filters, as in the sketch below.
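A minimal sketch (column names are placeholders):
from pyspark.sql.functions import col

df_filled = df.fillna({"country": "Unknown", "amount": 0})   # replace nulls per column
df_clean = df.dropna(subset=["customer_id"])                 # drop rows missing the key
df_not_null = df.filter(col("email").isNotNull())            # filter on null checks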
152 . How do you implement real-time data processing using Structured Streaming?
Use `readStream` for the source (e.g., Kafka), apply transformations, then write with `writeStream` to a sink (e.g., console, Delta). Ensure checkpointing is configured for fault tolerance, as in the sketch below.
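A minimal sketch using the built-in Kafka source (broker address, topic, and paths are placeholders):
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

parsed = events.selectExpr("CAST(value AS STRING) AS body", "timestamp")

query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")  # required for fault tolerance
    .outputMode("append")
    .start("/mnt/delta/orders")
)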
153 . Explain fault tolerance in Spark.
Spark uses lineage info (DAG) to recompute lost data. RDD/DataFrame transformations are deterministic, enabling automatic recovery upon executor failure.
154 . Explain shuffling in Spark.
Shuffling is data movement across partitions due to wide transformations. It’s costly and can lead to performance bottlenecks. Minimize with techniques like broadcast join and proper partitioning.
155 . How do you monitor and optimize performance in Azure Synapse?
Use Monitor Hub, DMVs, and SQL insights. Optimize using result caching, materialized views, and appropriate table distribution. Avoid excessive shuffling and use partitions.
156 . How do you handle schema drift in ADF?
Enable “Allow schema drift” in Mapping Data Flows. Use dynamic mappings and parameterized datasets. Schema projection helps handle unexpected schema changes.
157 . Explain denormalization and when it should be used.
Denormalization combines tables for faster read performance. Use in OLAP systems or reporting scenarios. It simplifies joins but introduces redundancy.
158 . What are common ADF activities?
Copy Activity, Data Flow, Notebook Activity, Web Activity, Lookup, and Stored Procedure Activity. Each helps build ETL/ELT pipelines.
159 . How do you integrate ADLS with Databricks?
Mount ADLS using a service principal and OAuth config in Databricks. Read/write using the mounted path (e.g., `/mnt/...`). Use secrets from Azure Key Vault for secure integration.
160 . How do you automate workflows using Azure Logic Apps?
Use triggers (timer, HTTP) and actions (SQL, email, Power BI). Example: Query SQL DB and send alerts. Good for lightweight event-driven workflows.
161 . How to implement data masking in ADF?
Use Derived Column transformation to mask data. Also, apply Dynamic Data Masking in Azure SQL. Combine with Key Vault and role-based access for security.
162 . How to ensure high availability and disaster recovery for Azure SQL DB?
Use Business Critical tier with zone redundancy. Set up geo-replication for DR. Enable auto-failover groups and long-term backup retention.
163 . Differences between Azure SQL DB and Managed Instance
SQL DB is fully managed, best for modern apps. Managed Instance supports full SQL Server features (e.g., SQL Agent, VNET), better for lift-and-shift workloads.
164 . What are the security features in ADLS Gen2?
Supports RBAC, ACLs, VNet, and encryption at rest and in transit. Use Key Vault for CMKs and enforce HTTPS. Combine with firewalls for added security.
165 . How to manage data lifecycle in ADLS?
Use Azure Blob Lifecycle policies to tier or delete data. Define rules by blob age or last modified. Helps reduce costs for infrequent-access data.
166 . How do you implement CI/CD in Azure DevOps?
Use pipelines for build and release. Store code in Git, package templates, and deploy via YAML or Classic pipelines. Use approvals, secrets, and stages.
167 . How do you integrate Azure Key Vault with other services?
Grant access to services via managed identity. Reference secrets in ADF, Synapse, Databricks using Key Vault integration. Helps secure credentials.
168. What are key features of Azure DevOps?
Includes Git-based repos, CI/CD pipelines, Boards for agile, Test Plans, and Artifact storage. Integrates with VS Code, GitHub, and Azure services.
169. How to monitor and troubleshoot Azure SQL DB?
Use Query Performance Insight, SQL Auditing, and DMVs. Set alerts and track long-running queries. Use Log Analytics for deeper monitoring.
170 . What is the role of metadata in data architecture?
Metadata defines schema, relationships, and lineage. Used in governance (Purview), cataloging, and auditing. Improves discoverability and trust in data.
171 . Explain the concept of Delta Lake compaction.
Delta Lake compaction is the process of combining many small files generated during streaming or frequent batch writes into fewer large files. This improves read performance by reducing file overhead during query execution. Compaction can be triggered manually or scheduled periodically. It typically uses `OPTIMIZE` in Databricks for efficient file merging.
172 . How do you handle null values in PySpark?
Null values in PySpark can be handled using functions like `fillna()`, `dropna()`, and `na.replace()`. You can choose to either replace nulls with default values or drop rows/columns containing them. Custom logic can also be applied using `when()` and `isNull()` for complex transformations.
173 . Describe the concept of fault tolerance in Spark.
Spark ensures fault tolerance using lineage and DAGs. If a task fails, Spark can recompute lost partitions based on their lineage. Data stored in resilient formats like Delta or checkpoints in streaming helps recover from failures without reprocessing everything.
174. What are the differences between RDD, DataFrame, and Dataset in PySpark?
| Feature | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Type Safety | No | No | Yes |
| API Level | Low-level | High-level | High-level |
| Performance | Less optimized | More optimized | More optimized |
| Data Structure | Unstructured | Structured (schema) | Structured (schema) |
| Use Cases | Complex, unstructured data | Data analysis, SQL | Type-safe, structured |
| Fault Tolerance | Yes | Yes | Yes |
- RDD: Immutable distributed collection, fault-tolerant, gives more control but less performance.
- DataFrame: Structured, schema-based, high-level API, supports Catalyst/Tungsten optimizations.
- Dataset: Type-safe like RDD, optimized like DataFrame (more relevant in Scala/Java).
175. How is lazy evaluation implemented in PySpark?
- Transformations (e.g., `map`, `filter`) are lazy; they build a lineage graph.
- Actual computation occurs when an action (e.g., `collect`, `count`, `show`) is called.
- Enables execution plan optimization, memory efficiency, and fault tolerance.
176. What is DataFrame lineage and how does Spark handle fault tolerance?
- Lineage: DAG (Directed Acyclic Graph) of transformations.
- Spark recomputes only the lost partitions using lineage.
- No need for frequent checkpoints, though they are optional for long chains.
- Caching improves recovery speed but is not mandatory.
177. What is the role of the Catalyst Optimizer in PySpark?
Catalyst Optimizer improves performance via:
- Query Analysis: Parses and validates logical plans.
- Logical Optimization: Predicate pushdown, projection pruning, etc.
- Physical Planning: Generates multiple physical plans and chooses the best.
- Code Generation: Runtime bytecode generation.
- Extensibility: Add custom rules and support various data sources.
178. How to read CSV, JSON, and Parquet files?
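A minimal sketch using the standard DataFrameReader API (file paths are placeholders):
csv_df = spark.read.csv("/mnt/raw/customers.csv", header=True, inferSchema=True)
json_df = spark.read.json("/mnt/raw/events.json")
parquet_df = spark.read.parquet("/mnt/raw/sales/")

# Equivalent generic form
parquet_df2 = spark.read.format("parquet").load("/mnt/raw/sales/")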
179. How to join multiple DataFrames and what are the types?
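A minimal sketch of chaining joins and the common join types (DataFrame and column names are placeholder assumptions):
# Common join types: inner, left, right, full/outer, left_semi, left_anti, cross
orders_customers = orders.join(customers, on="customer_id", how="inner")

# Chain a second join to bring in product attributes
full = orders_customers.join(products, on="product_id", how="left")
full.show()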
180. Difference between select() and selectExpr()?
| Feature | select() | selectExpr() |
| --- | --- | --- |
| Input Type | Column objects or strings | SQL-like expressions (strings) |
| Transformations | Basic column ops | Complex SQL-like ops |
| Use Case | Simple selections | Complex expressions |
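A minimal sketch of both, assuming a DataFrame `df` with `name`, `salary`, and `bonus` columns (placeholders):
from pyspark.sql.functions import col

# select(): column objects or names
selected = df.select("name", (col("salary") * 1.1).alias("adjusted_salary"))

# selectExpr(): SQL-style expression strings
expressed = df.selectExpr("name", "salary * 1.1 AS adjusted_salary", "salary + bonus AS total_comp")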
181. How to optimize PySpark jobs?
- Repartition/Coalesce
- Persist/Cache intermediate DataFrames
- Use broadcast joins for small datasets
- Filter early
- Tune `spark.sql.shuffle.partitions`
- Avoid UDFs; use built-in functions
- Monitor via Spark UI
182. What's the difference between cache() and persist()?
| Feature | cache() | persist() |
| --- | --- | --- |
| Storage Level | Default only (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) | Custom (e.g., MEMORY_AND_DISK, DISK_ONLY) |
| Flexibility | Less | More |
183. How to handle skewed data in Spark?
- Use salting (see the sketch below)
- Use repartition() or coalesce()
- Prefer `reduceByKey()` over `groupByKey()`
- Enable AQE: `spark.sql.adaptive.enabled = true`
- Increase shuffle partitions
- Use Parquet/ORC, not CSV
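A minimal sketch of salting a skewed join key (the DataFrames, the key `join_key`, and the salt count are placeholder assumptions):
from pyspark.sql.functions import rand, lit, explode, array

NUM_SALTS = 10  # tune to the degree of skew

# Spread the hot keys of the large side across NUM_SALTS buckets
big_salted = big_df.withColumn("salt", (rand() * NUM_SALTS).cast("int"))

# Replicate every row of the small side once per salt value
small_salted = small_df.withColumn(
    "salt", explode(array(*[lit(i) for i in range(NUM_SALTS)]))
)

# Join on the original key plus the salt, then drop the helper column
joined = big_salted.join(small_salted, ["join_key", "salt"]).drop("salt")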
184. What are broadcast joins and how to use them?
- Used when joining a large DataFrame with a small one.
- Reduces shuffling by sending the small DataFrame to all executors.
- Wrap the small DataFrame with `broadcast()` from `pyspark.sql.functions` in the join.
185. How to register a DataFrame as a temporary view?
Use `df.createOrReplaceTempView("view_name")` to register a session-scoped temporary view that can be queried with `spark.sql()`. Use `createGlobalTempView()` when the view must be visible across sessions (queried via the `global_temp` database).
186. Can you run SQL queries in PySpark?
Yes, using `spark.sql()` on views registered with `.createOrReplaceTempView()`.
187. Explain window functions with an example.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
188. How to handle missing/null values?
- Identify: `isNull()` / `isNotNull()`
- Drop: `dropna()` / `dropna(subset=[...])`
- Fill: `fillna()` with a single value or a column-wise dict
- Replace: `replace(to_replace=..., value=...)`
- Aggregate functions ignore nulls by default
189. How to detect and remove duplicates?
To detect duplicates, group by the key columns and filter groups with a count greater than one; to remove them, use `dropDuplicates()`, as in the sketch below.
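A minimal sketch (the key columns are placeholders):
# Detect: key combinations that appear more than once
dupes = df.groupBy("email", "phone").count().filter("count > 1")
dupes.show()

# Remove: keep one row per key combination
deduped = df.dropDuplicates(["email", "phone"])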
190. What is Databricks Runtime?
A curated, tested software stack that runs on Databricks clusters. It bundles:
- OS (Ubuntu LTS) and JVM/Scala/Python/R.
- Apache Spark (tuned and pre-integrated).
- Delta Lake, MLflow, DBFS, and Databricks utilities (e.g., `dbutils`).
- Databricks optimizations and services (e.g., Photon on compatible SQL compute, cluster management, security hardening).
You choose the runtime version (e.g., 12.x/13.x/14.x) when creating/editing a cluster. Each version pins compatible Spark + libraries.
191. What are the types of Databricks Runtimes?
Major flavors (availability can vary by cloud/region):
a) Databricks Runtime (Standard) – General-purpose Spark + Delta with Databricks performance, security, and reliability improvements.
b) Databricks Runtime for Machine Learning (ML) – Standard + popular ML/DL libs (e.g., scikit-learn, XGBoost, TensorFlow/PyTorch/Keras), MLflow tracking, and GPU support where available.
c) Databricks Runtime for Genomics – Tuned stack for genomic/biomedical workloads (specialized libs and IO optimizations).
d) Databricks Light – Minimal footprint for simple batch jobs where you don’t need advanced performance features; reduced components and features.
192. How do you share a notebook with other developers in Workspace?
To share a notebook in Databricks:
- Direct Sharing:
  - Open the notebook
  - Click on the "Share" button in the top-right corner
  - Enter the username or email of the colleague
  - Set permissions (Can View, Can Run, Can Edit)
- Workspace Permissions:
  - Right-click the notebook/folder in Workspace
  - Select "Permissions"
  - Add users/groups and set appropriate access levels
- Export/Import:
  - Export notebook as a .dbc or .ipynb file
  - Share the file with others who can import it
- Git Integration:
  - Connect the notebook to a Git repository
  - Collaborators can clone the repo
193. How to access one notebook's variables in other notebooks?
There are several ways to share variables between notebooks:
- %run command:
  %run /path/to/notebook  # variables defined in the called notebook become available
- dbutils.notebook.run() (for notebook workflows):
  result = dbutils.notebook.run("/path/to/notebook", timeout_seconds=60, arguments={"param": "value"})
- Spark tables/views:
  - Create a temporary view in one notebook
  - Access it in another notebook
- DBFS storage:
  - Save data to DBFS in one notebook
  - Read from DBFS in another notebook
- Widgets for parameter passing
194. How to call one notebook from another?
Two primary options:
- Inline import style
- Workflow/task style
Use `%run` to reuse code, and `dbutils.notebook.run` to orchestrate and pass/return parameters safely.
195. How to exit a notebook and return output to the caller?
Use `dbutils.notebook.exit(value: str)` inside the callee. The callee (producer) returns a string value, and the caller (consumer) receives it as the return value of `dbutils.notebook.run`, as in the sketch below.
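A minimal sketch (the notebook path and payload are placeholders):
# Callee (producer) - last cell of /Shared/child_notebook
import json
dbutils.notebook.exit(json.dumps({"status": "ok", "rows_processed": 123}))

# Caller (consumer) - separate notebook
import json
result = json.loads(dbutils.notebook.run("/Shared/child_notebook", 600))
print(result["status"], result["rows_processed"])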
196. How to create Internal (Managed) & External tables?
- Managed (Internal) table: Databricks manages both data and metadata in the workspace's managed storage. Dropping the table deletes the data.
- External table: Metadata in the metastore, data remains in your specified path (e.g., ADLS/Blob). Dropping the table does not delete external data.
SQL examples (Delta recommended), one managed table (no LOCATION) and one external table (with LOCATION), are sketched below.
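A minimal sketch run from a notebook (table names and the external path are placeholders):
# Managed (internal) table: no LOCATION, data stored in managed storage
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (
        id INT,
        amount DOUBLE
    ) USING DELTA
""")

# External table: metadata in the metastore, data stays at the given path
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (
        id INT,
        amount DOUBLE
    ) USING DELTA
    LOCATION 'abfss://data@mystorageacct.dfs.core.windows.net/tables/sales'
""")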
197. How to access ADLS/Blob Storage in Databricks?
Three common patterns (Azure):
- Direct ABFS/ABFSS paths (recommended) with credential passthrough or a service principal (see the sketch below).
- DBFS mount (legacy/OK for simple cases), then read/write via `/mnt/raw/...`.
- Unity Catalog External Locations (governed, recommended for prod): define a credential + external location and create external tables over that path.
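A minimal sketch of the service-principal pattern with direct `abfss://` access; the storage account, container, secret scope/key names, and tenant are placeholders, and the `fs.azure.*` keys follow the standard ABFS OAuth settings:
account = "mystorageacct"
client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

df = spark.read.parquet(f"abfss://raw@{account}.dfs.core.windows.net/sales/")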
198. What are the types of Cluster Modes in Databricks?
- Standard Mode:
  - Default and most common
  - Provides a secure, isolated environment
  - Best for single-user or production workloads
- High Concurrency Mode:
  - Designed for multiple users
  - Provides fine-grained sharing and isolation
  - Supports SQL, Python, R, and Scala
  - Requires Premium plan or above
- Single Node Mode:
  - Runs all workloads on the driver node only
  - No worker nodes
  - Good for small jobs or testing
  - Lower cost but limited scalability
199. What workload types exist for Standard clusters?
- Interactive Workloads:
  - Notebook development
  - Ad-hoc queries
  - Data exploration
- Batch Workloads:
  - ETL pipelines
  - Scheduled jobs
  - Large-scale data processing
- Machine Learning:
  - Training models
  - Feature engineering
  - Hyperparameter tuning
- Streaming Workloads:
  - Real-time data processing
  - Structured Streaming applications
- SQL Analytics:
  - BI and dashboarding
  - SQL queries
  - Data visualization
200. Can I use both Python 2 and Python 3 notebooks on the same cluster?
No, you cannot use both Python 2 and Python 3 notebooks on the same cluster in Databricks.
- Each cluster is configured with either Python 2 or Python 3 (selected during cluster creation)
- All notebooks running on that cluster must use the same Python version
- Modern Databricks runtimes (9.1 LTS and above) only support Python 3
- Python 2 was deprecated in Databricks Runtime 7.x and removed in later versions
Workaround:
- Create separate clusters for Python 2 and Python 3 workloads
- Migrate to Python 3 (recommended, as Python 2 is end-of-life)
201. What is a pool? Why use it? How to create one?
What is a pool?
A pool (formerly called instance pool) is a set of idle, ready-to-use cloud instances that reduce cluster start and auto-scaling times.
Why use pools?
- Faster cluster startup (instances are pre-provisioned)
- Reduced costs (instances can be shared across clusters)
- Better resource management
- Minimizes cold start times
How to create a pool:
- Using the UI:
  - Go to Compute > Pools > Create Pool
  - Configure:
    - Pool name
    - Instance type
    - Min/Max idle instances
    - Autoscaling
    - Preloaded Spark versions
- Using the API:
  import requests

  headers = {"Authorization": "Bearer <token>"}
  data = {
      "instance_pool_name": "my-pool",
      "node_type_id": "Standard_DS3_v2",
      "min_idle_instances": 1,
      "max_capacity": 10,
      "idle_instance_autotermination_minutes": 15,
  }
  response = requests.post(
      "https://<databricks-instance>/api/2.0/instance-pools/create",
      headers=headers,
      json=data,
  )
- Using Terraform:
  resource "databricks_instance_pool" "pool" {
    instance_pool_name                    = "my-pool"
    min_idle_instances                    = 1
    max_capacity                          = 10
    node_type_id                          = "Standard_DS3_v2"
    idle_instance_autotermination_minutes = 15
  }
202. How many ways can we create/pass variables in Databricks?
There are several ways to create variables in Databricks notebooks:
- Standard Python variables:
  x = 10
  name = "Databricks"
- Spark SQL variables:
  spark.sql("SET my_var = 10")
- Widgets (for parameterization):
  dbutils.widgets.text("input", "default_value", "Label")
  input_value = dbutils.widgets.get("input")
- Environment variables:
  import os
  os.environ["MY_VAR"] = "value"
- Notebook-scoped variables (using %run):
  # In notebook1:
  var1 = "hello"
  # In notebook2:
  %run /path/to/notebook1
  print(var1)  # accessible after %run
- Shared variables via the Spark session conf:
  spark.conf.set("shared.var", "value")
  value = spark.conf.get("shared.var")
203. What are important Jobs limits to remember?
Databricks Jobs have several limitations:
- Timeout Limits:
  - Maximum timeout is 30 days for a single run
  - Notebook jobs have a 1-year retention limit for results
- Size Limits:
  - Maximum of 1000 jobs per workspace
  - Notebook size limit (several MB, depends on runtime)
- Concurrency Limits:
  - Maximum concurrent runs per workspace (depends on tier)
  - Default is 1000 for most tiers
- Parameter Limits:
  - Notebook jobs accept up to 100 parameters
  - Maximum parameter size is 10KB
- Cluster Limitations:
  - Jobs can't use High Concurrency clusters
  - Some instance types may be restricted
- Scheduling Limits:
  - Minimum schedule interval is 10 minutes
  - Cron syntax has some cloud-specific limitations
- API Limitations:
  - Rate limits on Jobs API calls
  - Maximum of 1000 runs returned per list operation
204. Can I use %pip
to install packages in notebooks?
Yes, you can use %pip
in Databricks notebooks to install Python packages. This is the recommended approach for package management in notebooks.
# Basic package installation %pip install pandas==1.2.0 # Install multiple packages %pip install numpy matplotlib seaborn # Install from requirements file %pip install -r requirements.txt # Install from GitHub %pip install git+https://github.com/user/repo.git # Uninstall packages %pip uninstall package-name -y
-
Installed packages are available only to the current notebook session
-
For cluster-wide packages, use cluster-scoped libraries (via UI or API)
-
%pip commands should typically be in the first cell of the notebook
-
Changes take effect immediately (no need to restart the kernel)
-
You can also use %conda for Conda packages in some runtimes
Notes:
-
%pip/%conda take effect on the currently attached cluster; rebuilt or new clusters require re-installation (use init scripts or the Libraries UI for cluster-level pinning)
-
Prefer Repos + requirements.txt / environment.yml for reproducibility
205 . Explain all the activities available in Azure Data Factory.
ADF activities fall into three broad groups: Data Movement (the Copy activity), Data Transformation (Mapping Data Flow, Databricks Notebook/JAR/Python, HDInsight, Stored Procedure, Azure Function), and Control Flow (ForEach, Until, If Condition, Switch, Wait, Lookup, Get Metadata, Set Variable, Execute Pipeline, Web). Together they let you move, transform, and orchestrate data pipelines.
206 .Difference between Integration Runtimes in ADF?
Integration Runtime | Description | When to Use |
---|---|---|
Azure IR | Fully managed compute in Azure for data movement, data flow, and activity dispatch. | For cloud-to-cloud data copy and transformations. |
Self-hosted IR | Installed on-premises or on a VM, connects private networks with ADF. | For on-premises ↔ cloud or private network data access. |
Azure-SSIS IR | Dedicated cluster to run SSIS packages in Azure without redevelopment. | For lift-and-shift SSIS ETL workloads to the cloud. |
207 . Explain the types of triggers in ADF. Which ones have you used in projects and why?
-
Schedule trigger (time-based),
-
Tumbling window trigger (time slices with retries),
-
Event-based trigger (on blob events).
I’ve used tumbling window for incremental loads and event triggers for real-time ingestion.
208 . How do you enable and schedule pipelines in ADF?
Create a trigger (schedule, tumbling window, or event) and attach it to the pipeline. Pipelines can also be triggered manually or via REST API/PowerShell for automation.
209 . How do you send only the last 5 days of data to Databricks?
Use a date filter condition in source queries (e.g., WHERE date >= GETDATE()-5) or parameterize pipeline variables with system functions to pass only the last 5 days of data to Databricks.
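For example, the source query passed to the activity might look like the following sketch (SalesData and OrderDate are illustrative names):
SELECT *
FROM SalesData                                   -- hypothetical source table
WHERE OrderDate >= DATEADD(day, -5, GETDATE());  -- keep only the last 5 days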
210 . How do you define the schema in ADF?
Schema is auto-detected from linked service datasets, but you can also manually define columns, data types, or use schema drift in Mapping Data Flows for flexibility.
211 . How do you connect ADF to a database?
Create a linked service for the database (Azure SQL, SQL Server, Oracle, etc.), provide connection details (server, DB, credentials/Key Vault), and then use it in datasets.
212 . Explain the Data Flow activity in detail.
Mapping Data Flow is a visual, code-free transformation engine in ADF. It allows joins, aggregations, derived columns, lookups, and more at scale, executed on Spark under the hood.
213 . What transformations have you performed in ADF?
Common ones include Filter, Join, Aggregate, Derived Column, Lookup, Conditional Split, Surrogate Key, Pivot/Unpivot. These are used for cleaning, reshaping, and enriching data.
214 . What is the tumbling window trigger in ADF?
It triggers pipelines in fixed, contiguous, non-overlapping time slices (e.g., every 15 minutes). Useful for batch/stream-like processing with retry and catch-up options.
215 . What is the Filter activity in ADF?
Filter activity lets you apply conditions on an array and pass only matching elements to the next step. Fields include items (input array) and condition (Boolean expression).
216 . How do you get metadata in ADF?
Use the Get Metadata activity, which retrieves properties like structure, last modified, size, and schema from a dataset or file system, then pass values dynamically.
217 . What are the limits for Lookup activity in ADF?
Lookup can return a single row or up to 5,000 rows, with the output capped at about 4 MB. It’s generally used for config tables, parameters, or small reference data.
SQL
Employee table
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
1 | Alice | 90000 | 3 | 101 | 2025-01-10 |
2 | Bob | 60000 | 3 | 101 | 2024-11-05 |
3 | Charlie | 120000 | NULL | 101 | 2023-07-01 |
4 | David | 60000 | 3 | 102 | 2025-06-15 |
5 | Anita | 75000 | 1 | 102 | 2025-05-20 |
6 | Arjun | 90000 | 1 | 103 | 2024-12-01 |
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Q1. Fetch the second-highest salary
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary < (SELECT MAX(Salary) FROM Employee);
Output
SecondHighestSalary |
---|
90000 |
Q2. Get duplicate records from a table
SELECT EmpName, Salary, COUNT(*) AS Count
FROM Employee
GROUP BY EmpName, Salary
HAVING COUNT(*) > 1;
Output:
(No two employees share both the same name and salary in the sample data, so the result set is empty.)
Q5. Count of employees in each department
Output:
DeptID | EmployeeCount |
---|---|
101 | 3 |
102 | 2 |
103 | 2 |
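A query that would produce the counts above (a sketch):
SELECT DeptID, COUNT(*) AS EmployeeCount  -- employees per department
FROM Employee
GROUP BY DeptID;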
Q6. Department with highest number of employees
SELECT DeptID, COUNT(*) AS EmployeeCount
FROM Employee
GROUP BY DeptID
ORDER BY EmployeeCount DESC
FETCH FIRST 1 ROW ONLY;
Output:
DeptID | EmployeeCount |
---|---|
101 | 3 |
Q7. Employees with the same salary
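The query is omitted in the source; one that matches the output below (a sketch):
SELECT Salary, COUNT(*) AS EmpCount  -- salaries shared by more than one employee
FROM Employee
GROUP BY Salary
HAVING COUNT(*) > 1;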
Salary | EmpCount |
---|---|
60000 | 3 |
90000 | 2 |
Q8. Employees whose name starts with ‘A’
SELECT EmpName FROM Employee WHERE EmpName LIKE 'A%';
Output:
EmpName |
---|
Alice |
Anita |
Arjun |
Q9. Get the last record from a table
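The query is omitted in the source; assuming “last record” means the row with the highest EmpID (an assumption), one approach is:
SELECT *
FROM Employee
ORDER BY EmpID DESC      -- highest EmpID first
FETCH FIRST 1 ROW ONLY;  -- keep only the last row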
Output
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Q12. Remove duplicate rows (without DISTINCT)
DELETE FROM Employee E1
WHERE ROWID > (
  SELECT MIN(ROWID)
  FROM Employee E2
  WHERE E1.EmpID = E2.EmpID
);
Output
No duplicates in given input → table remains same.
Q13. Find missing numbers in EmpID sequence
SELECT LEVEL AS Missing_ID
FROM dual
CONNECT BY LEVEL <= (SELECT MAX(EmpID) FROM Employee)
MINUS
SELECT EmpID FROM Employee;
Output
Missing_ID |
---|
(None – IDs 1 to 7 are continuous) |
Q14. Display first and last name in single column
(Assuming the EmpName column holds first names only; for illustration, we simulate a last name using DeptID.)
SELECT EmpName || ' ' || DeptID AS FullName FROM Employee;
Output
FullName |
---|
Alice 101 |
Bob 101 |
Charlie 101 |
David 102 |
Anita 102 |
Arjun 103 |
Meena 103 |
Q15. Cumulative sum of salaries
SELECT EmpID, EmpName, Salary,
       SUM(Salary) OVER (ORDER BY EmpID) AS Cumulative_Sum
FROM Employee;
Output
EmpID | EmpName | Salary | Cumulative_Sum |
---|---|---|---|
1 | Alice | 90000 | 90000 |
2 | Bob | 60000 | 150000 |
3 | Charlie | 120000 | 270000 |
4 | David | 60000 | 330000 |
5 | Anita | 75000 | 405000 |
6 | Arjun | 90000 | 495000 |
7 | Meena | 60000 | 555000 |
Q16. Swap two columns (Salary and DeptID)
UPDATE Employee SET Salary = DeptID, DeptID = Salary;
(In Oracle, the right-hand side of each assignment uses the pre-update column values, so a direct reassignment swaps the two columns; the add/subtract trick is unnecessary, and repeating the Salary column in a single UPDATE would raise ORA-00957.)
After swap (just showing EmpID, Salary, DeptID):
EmpID | Salary | DeptID |
---|---|---|
1 | 101 | 90000 |
2 | 101 | 60000 |
3 | 101 | 120000 |
4 | 102 | 60000 |
5 | 102 | 75000 |
6 | 103 | 90000 |
7 | 103 | 60000 |
Q17. Employees whose names contain only vowels
SELECT EmpName FROM Employee WHERE REGEXP_LIKE(EmpName, '^[AEIOUaeiou]+$');
Output
(No employee name in the sample data consists only of vowels, so the result set is empty.)
Q18. Total salary in each department (pivoted)
Output
Dept101 | Dept102 | Dept103 |
---|---|---|
270000 | 135000 | 150000 |
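The Q18 query is not shown in the source; one that would produce the pivoted totals above, as a sketch using Oracle's PIVOT clause:
SELECT *
FROM (SELECT DeptID, Salary FROM Employee)
PIVOT (SUM(Salary) FOR DeptID IN (101 AS Dept101, 102 AS Dept102, 103 AS Dept103));  -- one column per department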
Q19. Employees with highest salary in each department
SELECT EmpID, EmpName, DeptID, Salary
FROM Employee E
WHERE Salary = (
  SELECT MAX(Salary)
  FROM Employee
  WHERE DeptID = E.DeptID
);
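An equivalent window-function version (a sketch; it scans Employee only once, which often performs better on large tables):
SELECT EmpID, EmpName, DeptID, Salary
FROM (
  SELECT e.*,
         DENSE_RANK() OVER (PARTITION BY DeptID ORDER BY Salary DESC) AS rnk  -- rank salaries within each department
  FROM Employee e
)
WHERE rnk = 1;  -- keep the top salary per department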
20. Customers who made multiple purchases on the same day
Input Table – Orders
OrderID | CustID | OrderDate |
---|---|---|
1 | 101 | 2025-08-01 |
2 | 101 | 2025-08-01 |
3 | 102 | 2025-08-02 |
4 | 103 | 2025-08-02 |
5 | 103 | 2025-08-02 |
SELECT CustID, OrderDate, COUNT(*) AS NumOrders
FROM Orders
GROUP BY CustID, OrderDate
HAVING COUNT(*) > 1;
Output:
CustID | OrderDate | NumOrders |
---|---|---|
101 | 2025-08-01 | 2 |
103 | 2025-08-02 | 2 |
Input Tables
Employee
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
1 | Alice | 90000 | 3 | 101 | 2025-01-10 |
2 | Bob | 60000 | 3 | 101 | 2024-11-05 |
3 | Charlie | 120000 | NULL | 101 | 2023-07-01 |
4 | David | 60000 | 3 | 102 | 2025-06-15 |
5 | Anita | 75000 | 1 | 102 | 2025-05-20 |
6 | Arjun | 90000 | 1 | 103 | 2024-12-01 |
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Sales
SaleID | SaleDate | Amount |
---|---|---|
1 | 2025-01-01 | 1000 |
2 | 2025-02-01 | 1500 |
3 | 2025-03-01 | 2000 |
4 | 2025-04-01 | 2500 |
5 | 2025-05-01 | 3000 |
Orders
OrderID | CustID | OrderDate |
---|---|---|
1 | 101 | 2025-01-10 |
2 | 101 | 2025-01-10 |
3 | 102 | 2025-02-05 |
4 | 103 | 2025-03-01 |
5 | 104 | 2025-03-01 |
6 | 104 | 2025-03-01 |
21. Moving Average of Sales (Last 3 Months)
SELECT SaleDate, Amount,
       AVG(Amount) OVER (ORDER BY SaleDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg
FROM Sales;
Output
SaleDate | Amount | MovingAvg |
---|---|---|
2025-01-01 | 1000 | 1000 |
2025-02-01 | 1500 | 1250 |
2025-03-01 | 2000 | 1500 |
2025-04-01 | 2500 | 2000 |
2025-05-01 | 3000 | 2500 |
22. Rank Employees by Salary in Each Department
SELECT DeptID, EmpName, Salary,
       RANK() OVER (PARTITION BY DeptID ORDER BY Salary DESC) AS RankInDept
FROM Employee;
Output
DeptID | EmpName | Salary | RankInDept |
---|---|---|---|
101 | Charlie | 120000 | 1 |
101 | Alice | 90000 | 2 |
101 | Bob | 60000 | 3 |
102 | Anita | 75000 | 1 |
102 | David | 60000 | 2 |
103 | Arjun | 90000 | 1 |
103 | Meena | 60000 | 2 |
23. Employees with More Than One Manager
SELECT EmpName, COUNT(DISTINCT ManagerID) AS ManagerCount
FROM Employee
WHERE ManagerID IS NOT NULL
GROUP BY EmpName
HAVING COUNT(DISTINCT ManagerID) > 1;
Output
In our data, none has multiple managers, so result = empty set.
24. Most Frequent Order Date
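The query is omitted in the source; one that matches the output below (a sketch):
SELECT OrderDate, COUNT(*) AS OrderCount  -- count orders per date
FROM Orders
GROUP BY OrderDate
ORDER BY OrderCount DESC
FETCH FIRST 1 ROW ONLY;                   -- keep the most frequent date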
OrderDate | OrderCount |
---|---|
2025-03-01 | 3 |
25. Compare Two Tables (Mismatched Records)
(SELECT * FROM Employee_2024
 MINUS
 SELECT * FROM Employee_2025)
UNION
(SELECT * FROM Employee_2025
 MINUS
 SELECT * FROM Employee_2024);
(The parentheses matter: set operators evaluate left to right with equal precedence, so without them the query would not return both directions of the difference.)
Output
Shows rows present in one table but not the other.
26. Difference Between Consecutive Rows
SELECT SaleDate, Amount,
       Amount - LAG(Amount) OVER (ORDER BY SaleDate) AS DiffFromPrev
FROM Sales;
Output
SaleDate | Amount | DiffFromPrev |
---|---|---|
2025-01-01 | 1000 | NULL |
2025-02-01 | 1500 | 500 |
2025-03-01 | 2000 | 500 |
2025-04-01 | 2500 | 500 |
2025-05-01 | 3000 | 500 |
27. Pivot Table Data Dynamically
SELECT *
FROM (SELECT DeptID, EmpName FROM Employee)
PIVOT (COUNT(EmpName) FOR DeptID IN (101, 102, 103));
Note: the IN list above is static; a truly dynamic pivot needs the column list built with dynamic SQL or Oracle's PIVOT XML (see the sketch after the output below).
Output
DeptID_101 | DeptID_102 | DeptID_103 |
---|---|---|
3 | 2 | 2 |
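For a department list that is not known in advance, Oracle's PIVOT XML variant accepts ANY; note that it returns the result as a single XMLTYPE column, so fully relational dynamic pivots usually require building the IN list with dynamic SQL. A sketch:
SELECT *
FROM (SELECT DeptID, EmpName FROM Employee)
PIVOT XML (COUNT(EmpName) FOR DeptID IN (ANY));  -- pivots on every distinct DeptID; result is XML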
28. Delete Every Alternate Row
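The solution is omitted in the source; a common Oracle approach, assuming “alternate” means even-numbered EmpIDs (an assumption), is:
DELETE FROM Employee
WHERE MOD(EmpID, 2) = 0;  -- remove rows with even EmpID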
29. First Purchase Date per Customer
Output
CustID | FirstPurchase |
---|---|
101 | 2025-01-10 |
102 | 2025-02-05 |
103 | 2025-03-01 |
104 | 2025-03-01 |
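A query that would produce the output above (a sketch):
SELECT CustID, MIN(OrderDate) AS FirstPurchase  -- earliest order per customer
FROM Orders
GROUP BY CustID
ORDER BY CustID;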
30. Running Total of Sales per Month
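The query is omitted in the source; a window-function version that matches the output below (a sketch):
SELECT SaleDate, Amount,
       SUM(Amount) OVER (ORDER BY SaleDate) AS RunningTotal  -- cumulative sum by sale date
FROM Sales;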
SaleDate | Amount | RunningTotal |
---|---|---|
2025-01-01 | 1000 | 1000 |
2025-02-01 | 1500 | 2500 |
2025-03-01 | 2000 | 4500 |
2025-04-01 | 2500 | 7000 |
2025-05-01 | 3000 | 10000 |
At Learnomate Technologies, we don’t just teach tools, we train you with real-world, hands-on knowledge that sticks. Our Azure Data Engineering training program is designed to help you crack job interviews, build solid projects, and grow confidently in your cloud career.
- Want to see how we teach? Hop over to our YouTube channel for bite-sized tutorials, student success stories, and technical deep-dives explained in simple English.
- Ready to get certified and hired? Check out our Azure Data Engineering course page for full curriculum details, placement assistance, and batch schedules.
- Curious about who’s behind the scenes? I’m Ankush Thavali, founder of Learnomate and your trainer for all things cloud and data. Let’s connect on LinkedIn—I regularly share practical insights, job alerts, and learning tips to keep you ahead of the curve.
And hey, if this article got your curiosity going…
Explore more on our blog where we simplify complex technologies across data engineering, cloud platforms, databases, and more.
Thanks for reading. Now it’s time to turn this knowledge into action. Happy learning and see you in class or in the next blog!
Happy Vibes!
ANKUSH