Azure Data Engineer Interview Questions
1. What are the best practices for managing and optimizing storage costs in ADLS?
- Use storage tiers – Hot, Cool, Archive based on access frequency.
- Enable lifecycle policies – Auto-move or delete old data.
- Use compressed formats – Like Parquet or Avro.
- Avoid small files – Merge to reduce overhead.
- Clean up unused data – Delete temp or obsolete files.
- Monitor with Cost Management – Set budgets and alerts.
- Use hierarchical namespace – For efficient file handling.
2. How do you implement security measures for data in transit and at rest in Azure?
Security measures in Azure:
- Data in Transit:
  - Use TLS encryption (enabled by default).
  - Use private endpoints and VPNs for secure connections.
- Data at Rest:
  - Use Azure Storage Service Encryption (SSE) (enabled by default).
  - Enable Azure Disk Encryption for VMs.
  - Use customer-managed keys (CMK) for added control.
3 . Describe the role of triggers and schedules in Azure Data Factory.
Role of Triggers in ADF
Triggers determine when and how a pipeline should run. ADF supports three main types:
- Schedule Trigger
  - Runs pipelines at specific times or intervals (e.g., daily, hourly).
  - Ideal for regular ETL jobs.
- Event-based Trigger
  - Starts pipelines in response to events, such as the arrival of a file in Azure Blob Storage.
  - Useful for real-time or near-real-time processing.
- Manual Trigger
  - Pipelines are started manually by a user or system.
  - Useful for testing or ad-hoc runs.
Schedules in ADF
Schedules define time-based rules for execution:
- Specify start time, recurrence, and time zone.
- Can be linked to schedule triggers to automate runs.
4 . How do you optimize data storage and retrieval in Azure Data Lake Storage?
1. Use Efficient File Formats
Store data in Parquet or Avro formats, which are compressed and columnar, reducing both storage space and read times during analytics.
2. Partition Data
Organize your data into logical folders (e.g., by date or region). This helps in minimizing data scanned during queries, improving performance.
3. Avoid Small Files
Too many small files cause metadata overhead and slow down processing. Combine them into larger files for better efficiency.
4. Use Hierarchical Namespace (HNS)
ADLS Gen2 with HNS enabled supports directory operations and improves performance and manageability.
5. Storage Tiering
Use Hot tier for frequently accessed data, Cool for infrequent, and Archive for rarely accessed data to reduce costs.
6. Automate with Lifecycle Policies
Set lifecycle rules to automatically move or delete old data, keeping storage optimized.
5 . How do you optimize query performance in Azure SQL Database?
To optimize query performance in Azure SQL Database:
- Create and maintain appropriate indexes (clustered, non-clustered, columnstore) on frequently filtered and joined columns.
- Keep statistics up to date and use Query Store to find and tune regressed or long-running queries.
- Write efficient queries: avoid SELECT *, filter early, and use parameterized queries for plan reuse.
- Partition or archive large tables to keep the active working set small.
- Scale the service tier (DTU/vCore) or use read replicas when the workload outgrows the current capacity.
12 . What is the significance of Z-ordering in Delta tables in Azure Databricks?
Z-ordering in Delta tables (Azure Databricks) is a technique used to optimize data layout for faster query performance.
Significance:
- Improves query speed by co-locating related data (e.g., filtering columns) on disk.
- Reduces the amount of data scanned during queries by enabling data skipping.
- Especially useful for high-cardinality columns like timestamps, user IDs, or product codes.
- Enhances performance for range queries and filters in large datasets.
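A minimal sketch of applying Z-ordering from a Databricks notebook, assuming an existing Delta table named `events` (a placeholder) registered in the metastore:
# Co-locate rows with similar user_id and event_date values in the same files
# so data skipping can prune more files for selective queries.
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")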
13 . How do you handle incremental data load in Azure Databricks?
To handle incremental data load in Azure Databricks:
- Use a watermark column (e.g., `LastModifiedDate` or `UpdatedAt`) to filter new or changed records.
- Query only the data that has changed since the last load using Spark SQL or DataFrame filters.
- Store the checkpoint or last processed value (e.g., in a Delta table or metadata file).
- Merge incremental data into the target Delta table using `MERGE INTO` for upserts (insert/update), as in the sketch below.
- Automate the process using Databricks Jobs or integrate with ADF pipelines for orchestration.
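A minimal sketch of the merge step, assuming a Delta target table named `target_table` and an incremental DataFrame `updates_df` keyed on `id` (both names are placeholders):
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert brand-new rows
    .execute()
)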
22. How do you implement error handling in Azure Data Factory pipelines?
Configure retry policies on individual activities for transient failures, route failures through the On Failure dependency path to logging or notification activities, and use control activities such as If Condition and Execute Pipeline to branch the workflow. Failed records or activity details can be logged to a separate store for later reprocessing.
23 . Describe the process of integrating ADF with Azure Databricks for ETL workflows.
To integrate Azure Data Factory (ADF) with Azure Databricks for ETL workflows:
- Create a Linked Service in ADF to connect to your Azure Databricks workspace using a workspace URL and access token.
- In your ADF pipeline, add a Databricks Notebook activity to call a specific notebook for ETL logic (e.g., data transformation, cleansing).
- Pass parameters from ADF to Databricks using the base parameters option.
- Use ADF triggers or scheduling to automate and orchestrate the ETL workflow.
- Monitor and log execution results in ADF's Monitor tab to track success or failure.
25 . How do you handle schema evolution in Delta Lake (Databricks on Azure)?
- Use the `mergeSchema` option when writing data to allow automatic schema updates (see the sketch below).
- Enable schema enforcement to prevent accidental writes with incompatible schemas.
- Use the `ALTER TABLE` command to manually add or update columns when needed.
- For streaming data, use Auto Loader with `cloudFiles.schemaEvolutionMode` set to `addNewColumns`.
- Track schema changes using Delta Lake's transaction log and the `DESCRIBE HISTORY` command.
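A minimal sketch of the `mergeSchema` option, assuming `new_df` contains columns not yet present in a Delta table stored at a placeholder path:
(
    new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # add any new columns to the table schema
    .save("/mnt/delta/events")       # placeholder path
)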
26 . How do you secure data pipelines in Azure?
To secure data pipelines in Azure, follow these best practices:
- Use Managed Identity to authenticate ADF, Databricks, or Synapse with other Azure services without storing secrets.
- Enable encryption:
  - In transit using HTTPS/TLS
  - At rest using Azure Storage encryption with Microsoft or customer-managed keys (CMK)
- Restrict access using Azure RBAC and Access Control Lists (ACLs) on resources like ADLS or Key Vault.
- Use Private Endpoints and VNET Integration to keep data movement within secure networks.
- Audit and monitor activity using Azure Monitor, Log Analytics, and Defender for Cloud.
- Store secrets securely in Azure Key Vault and reference them in pipelines instead of hardcoding.
27 . What are the best practices for managing large datasets in Azure Databricks?
- Use Delta Lake format to ensure data reliability, support for ACID transactions, and efficient updates.
- Optimize data layout by managing partitions effectively and using Z-Ordering for faster query filtering.
- Minimize small files by batching writes or using tools like Auto Optimize to combine data efficiently.
- Scale clusters appropriately using autoscaling and choose the right node types for compute-heavy workloads.
- Monitor and tune performance with the Spark UI, job metrics, and built-in Databricks performance tools.
- Use caching carefully for frequently reused data to reduce computation time.
- Implement access controls with Unity Catalog, table ACLs, and Azure security features to govern large datasets securely.
28 . Explain the difference between streaming and batch processing in Spark (Azure context).
In the Azure context (e.g., Azure Databricks with Spark), the difference between streaming and batch processing lies in how data is ingested and processed:
Batch Processing:
- Processes static or finite datasets at scheduled intervals.
- Ideal for ETL jobs, historical data analysis, and data warehouse loads.
- Uses Spark APIs like `DataFrame`, `read`, and `write`.
Streaming Processing:
- Handles real-time or continuous data from sources like Event Hubs, Kafka, or IoT Hub.
- Suitable for real-time analytics, fraud detection, or alerting systems.
- Uses the Structured Streaming API with `readStream` and `writeStream`.
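A minimal sketch of the two read paths (file locations are placeholders):
# Batch: read a finite Parquet dataset once
batch_df = spark.read.parquet("/mnt/raw/sales/")

# Streaming: continuously pick up new JSON files as they arrive
stream_df = (
    spark.readStream.format("json")
    .schema(batch_df.schema)      # streaming reads require an explicit schema
    .load("/mnt/landing/sales/")
)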
29 . What is the purpose of caching in PySpark and how is it used in Azure Databricks?
- To Speed Up Workflows: When a DataFrame is used multiple times in transformations or actions, caching it with `.cache()` or `.persist()` keeps it in memory for faster access.
- Monitoring: You can track cache usage and storage through the Spark UI in Databricks for optimization.
- Best Practices:
  - Cache only when data fits in memory.
  - Unpersist unused data to free up memory.
30 . How to implement incremental load in ADF?
Incremental load in Azure Data Factory is implemented using watermark columns (e.g., LastModifiedDate or
ID).
You can use the ‘Lookup’ activity to retrieve the last loaded value, pass it as a parameter to the source
dataset, and use a ‘Filter’ or query condition to load only new or updated records.
31 . How do you design and implement data pipelines using Azure Data Factory?
Designing pipelines in ADF involves defining source and destination datasets, creating linked services for
connectivity, and using activities like Copy, Data Flow, or stored procedure. Pipelines can include conditional
logic, loops, parameters, and triggers to orchestrate the flow of data.
32 . How do you handle late-arriving data in ADF?
Late-arriving data can be handled using time window-based watermarking, storing late data in a staging area, or using tumbling window triggers. You can also reprocess specific partitions using ADF pipeline parameters
and conditional branching.
33 . Describe the process of setting up CI/CD for Azure Data Factory.
CI/CD in ADF is achieved using Git integration with Azure Repos or GitHub. You create feature branches for
development, publish changes to the collaboration branch, and use Azure DevOps pipelines or ARM
templates to deploy to other environments like test and production.
34 . What are the types of Integration Runtimes (IR) in ADF?
ADF supports three types of Integration Runtimes:
– Azure IR for cloud data movement and transformation
– Self-hosted IR for on-premises and VNet access
– Azure-SSIS IR for running SSIS packages in ADF
35 . How do you ensure data quality and validation in ADLS?
Data quality in ADLS can be ensured using ADF Data Flows with derived columns, conditional splits, and
assertion transformations. You can also implement row-level validation checks and log invalid records into
separate datasets for analysis.
36 . Describe the role of triggers in ADF pipelines.
Triggers in ADF automate pipeline execution. Types include:
– Schedule Trigger: runs at defined intervals
– Tumbling Window Trigger: used for time-based partitioning
– Event-based Trigger: responds to blob events in Azure Storage
– Manual Trigger: used for on-demand runs.
37. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
Use a metadata table that stores source and destination table names. Create a ForEach activity in ADF that reads the metadata and uses Copy activity inside it to copy data dynamically.
38. How do you monitor ADF pipeline performance?
- Use Monitor tab in ADF Studio.
- Enable diagnostic logs to route data to Log Analytics.
- Use Azure Monitor or custom alerts for errors or performance bottlenecks.
39. How do you implement error handling in ADF using retry, try-catch blocks, and failover mechanisms?
ADF provides robust mechanisms for error handling to ensure data reliability and fault tolerance. You can apply Retry Policies directly in each activity to automatically retry upon transient failures. Use control activities like If Condition, Switch, and Execute Pipeline along with the On Failure path to route the workflow logically based on the outcome. Additionally, log failed rows or activities into a separate error-handling pipeline or storage location to allow for future reprocessing, minimizing data loss.
40. How to track file names in the output table while performing copy operations in ADF?
In Azure Data Factory, you can track file names during copy operations by using sourceInfo().fileName
in Mapping Data Flows. This expression allows you to capture and store the source file name as a new column in the output table. This is useful for audit and traceability, especially when ingesting data from multiple files.
41 . How do you handle schema evolution in ADF?
Use Mapping Data Flows with Auto Mapping and enable “Allow Schema Drift” to handle dynamic schema changes. You can also validate schema using metadata checks before processing to ensure consistency.
42. What are the key considerations for designing scalable pipelines in ADF?
To design scalable pipelines in ADF, use parallelism by configuring the ForEach activity with a batch count. Structure your pipelines modularly for reusability and better maintainability. Leverage Integration Runtime scaling to manage large workloads efficiently, and ensure robust error handling with proper retry and failover strategies.
43. How do you manage schema drift in ADF?
To manage schema drift in Azure Data Factory, enable the “Allow Schema Drift” option in Mapping Data Flows. Use dynamic mapping or schema projection to accommodate changing schemas during runtime. Additionally, implement schema validation logic to audit and control any unexpected schema changes.
44. How do you integrate Azure Key Vault with ADF pipelines?
Store credentials and connection strings as secrets in Azure Key Vault and reference them from ADF linked services instead of hardcoding them. Grant ADF's managed identity access to the Key Vault (via access policies or RBAC) so the secrets can be retrieved securely at runtime.
55 . Explain the difference between narrow and wide transformations in PySpark
Narrow transformations (e.g., map, filter) operate on a single partition.
Wide transformations (e.g., groupByKey, reduceByKey) require data shuffling across partitions.
56. How do you optimize PySpark jobs for large datasets?
– Use partitioning wisely.
– Cache/persist intermediate results.
– Avoid using collect() on large datasets.
– Minimize data shuffles and use broadcast joins when possible.
57 . Write PySpark code to perform an inner join between two DataFrames
df1.join(df2, df1.id == df2.id, "inner")
58 . Describe the concept of fault tolerance in Spark.
Spark achieves fault tolerance through lineage information in RDDs and the ability to recompute
lost data using DAGs (Directed Acyclic Graphs).
59 . Explain the concept of partitioning in PySpark
Partitioning controls data distribution across clusters. Use `repartition()` to increase/decrease
partitions and `coalesce()` to reduce them efficiently.
60 . Explain the difference between Spark SQL and PySpark DataFrame APIs.
Spark SQL provides a SQL interface to Spark, allowing users to run SQL queries directly on
structured data. PySpark DataFrame APIs, on the other hand, offer a Pythonic way to perform data
manipulation, transformation, and aggregation using Spark’s DataFrame abstraction. While Spark
SQL is more suited for users familiar with SQL, PySpark DataFrames are better for complex data
engineering and transformations in code, allowing chaining of transformations with better type safety
and scalability.
61 . How do you manage partitioning in PySpark?
Partitioning in PySpark refers to dividing the data into logical chunks across nodes to enable
distributed computing. You can manage partitioning using the repartition() and coalesce() functions.
repartition() increases or decreases the number of partitions and reshuffles the data, while
coalesce() reduces the number of partitions without a full shuffle. Proper partitioning improves
parallelism and minimizes data shuffling during operations like joins, groupBy, and aggregations.
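A minimal sketch of both calls, assuming an existing DataFrame `df` with a `customer_id` column (placeholder names):
# Increase to 200 partitions hashed on the join key before a large join
df_repart = df.repartition(200, "customer_id")

# Reduce to 10 partitions without a full shuffle before writing out
df_small = df_repart.coalesce(10)
print(df_small.rdd.getNumPartitions())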
62 . How do you optimize joins in PySpark for large datasets?
To optimize joins in PySpark, consider broadcasting the smaller dataset using broadcast() when
joining with a significantly larger dataset. Also, ensure both datasets are partitioned properly on the
join key using repartition(). Avoid wide transformations and unnecessary shuffles, and filter data
before the join if possible. Using Delta Lake for data storage and caching frequently used tables can
further improve performance
63 . Write PySpark code to calculate the average salary by department
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
avg_salary = df.groupBy("department").agg(avg("salary").alias("average_salary"))
avg_salary.show()
64 . How do you implement parallel processing in PySpark?
PySpark enables parallel processing inherently through its distributed computing architecture. Each
transformation on a DataFrame or RDD is executed in parallel across partitions. You can influence
the degree of parallelism using repartition() to increase the number of partitions, allowing more tasks
to run concurrently. Additionally, actions like mapPartitions() and foreachPartition() help perform
operations in parallel across data partitions.
65 . What is Z-ordering in Spark?
Z-ordering is a data clustering technique used in Delta Lake to colocate related information in the
same set of files. It improves read performance by reducing the amount of data scanned during
queries. In Databricks, Z-ordering is applied with the ZORDER BY clause of the OPTIMIZE command on Delta tables rather than as a write-time option. This is particularly effective for queries with filters on columns that are Z-ordered.
66 . Write PySpark code to perform an inner join between two DataFrames.
df1 = spark.read.csv("employees.csv", header=True, inferSchema=True)
df2 = spark.read.csv("departments.csv", header=True, inferSchema=True)
joined_df = df1.join(df2, df1.dept_id == df2.id, "inner")
joined_df.show()
67 . What is AQE (Adaptive Query Execution) in Databricks?
Adaptive Query Execution (AQE) in Spark dynamically optimizes query plans at runtime based on
actual data statistics. AQE can change the join strategy, adjust skewed partition handling, and
optimize the number of shuffle partitions. In PySpark on Databricks, AQE is enabled by default,
making queries more efficient without requiring manual tuning.
68 . Explain the difference between narrow and wide transformations in PySpark.
Narrow transformations (like map, filter, and union) operate on a single partition of data and do not
require shuffling. These are fast and efficient. Wide transformations (like groupBy, join, and distinct)
require data shuffling across partitions, which is more expensive and slower. Understanding this
distinction is crucial for optimizing PySpark applications.
69 . Write a PySpark code to join two DataFrames and perform aggregation.
To join two DataFrames and aggregate the result, use the `join()` and `groupBy()` methods. Here's an example where we join sales data with product data and compute total sales by category:
from pyspark.sql.functions import sum

df_sales = spark.createDataFrame([(1, 100), (2, 200), (1, 150)], ["product_id", "sales"])
df_products = spark.createDataFrame([(1, "Electronics"), (2, "Furniture")], ["product_id", "category"])
joined_df = df_sales.join(df_products, on="product_id", how="inner")
agg_df = joined_df.groupBy("category").agg(sum("sales").alias("total_sales"))
agg_df.show()
70 . What is the difference between wide and narrow transformations in Spark?
In Spark, narrow transformations (like `map`, `filter`, and `union`) are transformations where each input partition contributes to only one output partition. These do not require data movement between partitions. Wide transformations (like `groupByKey`, `reduceByKey`, and `join`) involve shuffling, which means data is transferred across nodes to re-organize it based on keys. Wide transformations are more expensive and often require tuning for performance.
71 . Explain lazy evaluation in PySpark.
Lazy evaluation means that Spark does not immediately execute transformations like `map()` or `filter()`. Instead, it builds a logical execution plan. Actual computation is triggered only when an action like `collect()`, `count()`, or `show()` is called. This allows Spark to optimize execution plans and minimize data scans, improving performance, as in the sketch below.
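A minimal sketch, assuming a CSV file at a placeholder path with an `amount` column:
df = spark.read.csv("/mnt/raw/orders.csv", header=True, inferSchema=True)

# Nothing runs yet: these transformations only extend the logical plan
filtered = df.filter(df.amount > 100).select("order_id", "amount")

# The action below triggers the actual read, filter, and projection
print(filtered.count())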
72. How does caching work in PySpark?
Caching in PySpark means storing DataFrame results in memory so that subsequent actions can reuse the results without recomputing. You can cache a DataFrame using `.cache()` or `.persist()`. It's particularly useful when you're performing multiple actions on the same DataFrame. This reduces computation time but increases memory usage.
df.cache()
df.count()  # the first action triggers caching
73 . Write PySpark code to calculate the total sales for each product
from pyspark.sql.functions import sum

df = spark.createDataFrame([("Electronics", 100), ("Furniture", 200), ("Electronics", 150)], ["category", "sales"])
df.groupBy("category").agg(sum("sales").alias("total_sales")).show()
74 . Explain how broadcast joins improve performance in PySpark.
Broadcast joins are used when one DataFrame is small enough to fit in memory. Spark broadcasts the small DataFrame to all worker nodes, avoiding a shuffle. This significantly improves join performance in scenarios where one dataset is large and the other is small.
from pyspark.sql.functions import broadcast

large_df.join(broadcast(small_df), "id").show()
75 . Describe the role of the driver and executor in Spark architecture.
In Spark, the driver is the main program that coordinates all tasks. It converts user code into a Directed Acyclic Graph (DAG), plans job execution, and manages metadata. Executors are worker processes running on cluster nodes. They execute the tasks, perform data processing, and return results to the driver. Together, the driver and executors enable parallel data processing in a distributed environment.
76 . Explain Adaptive Query Execution (AQE) in Spark.
Adaptive Query Execution (AQE) is a feature in Spark that allows the engine to dynamically adjust the execution plan at runtime based on actual data statistics. AQE helps optimize:
- Join strategies (e.g., switching to broadcast join)
- Skew handling (by splitting skewed partitions)
- Shuffle partition sizes (by coalescing small partitions)
AQE improves query performance, especially in cases where static optimization falls short.
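AQE is on by default in recent Databricks runtimes; a minimal sketch of setting and checking the relevant Spark configs explicitly:
# Enable AQE and its two most common sub-features
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

print(spark.conf.get("spark.sql.adaptive.enabled"))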
77 . Write PySpark code to perform a left join between two DataFrames.
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "Sales")], ["id", "dept"])
df1.join(df2, on="id", how="left").show()
78. Join Two DataFrames and Aggregate
result = df1.join(df2, "id").groupBy("category").agg({"sales": "sum"})
96 . Describe the process of setting up CI/CD for Azure Data Factory
CI/CD for ADF is done using Azure DevOps. Create a Git repository linked to ADF, manage branches, publish changes, and create build and release pipelines using YAML or classic pipelines.
97. What are the types of Integration Runtimes (IR) in ADF?
There are three types of IR:
(1) Azure IR (for cloud data movement),
(2) Self-hosted IR (on-premises or VNet),
(3) Azure-SSIS IR (for SSIS packages).
98. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
ADLS is built on top of Blob Storage but supports hierarchical namespaces, optimized performance for big data analytics, and fine-grained access control.
99. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
Use Azure Repos or GitHub to manage notebooks. Use Azure Pipelines to automate testing and deployment using Databricks CLI or REST APIs.
100. How do you ensure data quality and validation in ADLS?
Implement validation checks in ADF or Databricks, use custom logging for anomalies, schema enforcement via Delta Lake, and apply data profiling.
101. Explain the use of hierarchical namespaces in ADLS.
Hierarchical namespace in ADLS Gen2 allows directories and files to behave like a traditional filesystem, enabling better performance and ACL-based security.
102. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
Provision a workspace, configure dedicated/ serverless SQL pools, link storage accounts, and integrate pipelines, Spark pools, and data flows for analytics.
103. How do you handle late-arriving data in ADF?
Use watermark columns with tumbling window triggers, or design pipelines to reprocess data based on arrival time.
104 . Explain the concept of Data Lakehouse.
A Data Lakehouse combines the storage of a data lake with the structure and management of a data warehouse using Delta Lake or Apache Iceberg.
105. How do you implement disaster recovery for ADLS?
Use Geo-redundant storage (GRS), data backup strategies, lifecycle policies, and automation scripts to recover and sync data.
106. How do autoscaling clusters work in Databricks?
Databricks autoscaling adjusts the number of worker nodes based on workload. It helps optimize costs by scaling down when idle and scaling up during high demand.
107. How do you manage access control in Azure Data Lake?
Use role-based access control (RBAC) and access control lists (ACLs) for files and directories, integrated with Azure AD for authentication.
108. What are the challenges in integrating on-premises data with Azure services?
Challenges include network latency, VPN setup, firewall rules, hybrid identity, data consistency, and secure data movement.
109. Describe the role of triggers in ADF pipelines.
Triggers initiate pipeline runs. Types include Schedule (time-based), Tumbling Window (interval + state), and Event-based (file arrival, etc).
110. How do you monitor and troubleshoot Spark jobs?
Use Spark UI, logs, Ganglia metrics, job history server, and Databricks job run details to identify slow stages and bottlenecks.
111 . How to copy all tables using metadata-driven pipelines in ADF?
To copy all tables using metadata-driven pipelines in ADF, you can maintain a metadata table or configuration file that contains information like source and target table names and schemas. Azure Data Factory uses a Lookup activity to fetch this metadata and a ForEach activity to loop over each table configuration. Inside the loop, a Copy Data activity dynamically reads from and writes to datasets using parameters and expressions, which allows for scalable and reusable pipeline design without manually creating separate activities for each table.
112 . How do you implement data encryption in Azure SQL Database?
Data encryption in Azure SQL Database is implemented using multiple layers of protection. For data at rest, Transparent Data Encryption (TDE) is enabled by default to encrypt stored data using certificates. For data in transit, Azure SQL uses Transport Layer Security (TLS) to protect data sent over the network. Additionally, key management can be handled through Azure Key Vault, which provides secure key rotation and storage capabilities.
113 . What are the best practices for managing and optimizing storage costs in ADLS?
To optimize storage costs in Azure Data Lake Storage, use lifecycle policies to automatically delete or move older data to cooler tiers. Compress files (e.g., using Parquet or Avro) to reduce storage footprint. Avoid small file issues by batching writes, and use partitioning to optimize data access and reduce scanning. Regularly monitor usage and configure alerts for unusual storage patterns.
114 . How do you implement security for data in transit and at rest in Azure?
Data at rest is secured using encryption methods like TDE in SQL databases and SSE in ADLS. Data in transit is encrypted using HTTPS and TLS 1.2 or higher. Role-Based Access Control (RBAC) and Access Control Lists (ACLs) help restrict access, while Azure Key Vault stores and manages encryption keys securely. Network-level security using Private Endpoints and Firewalls adds further protection.
115 . Describe the role of triggers and schedules in Azure Data Factory.
Triggers in ADF automate pipeline execution. There are three types: Schedule Trigger (runs pipelines on a time-based schedule), Tumbling Window Trigger (ensures interval-based data movement and supports retries), and Event Trigger (starts pipeline based on events like file creation in a blob). Triggers help build robust, time- or event-based workflows with minimal manual intervention.
116 . How do you optimize data storage and retrieval in Azure Data Lake Storage?
Optimization involves storing data in columnar formats like Parquet, organizing it using partitioned folders, and minimizing the number of small files. Use hierarchical namespace for better performance with directory-based operations. Employ caching and filter pushdowns in querying tools. Enable data tiering and compression to reduce costs and improve performance.
117 . How do you monitor ADF pipeline performance?
You can monitor pipelines using Azure Monitor, which integrates with ADF and provides metrics and logs. Activity runs and trigger runs can be visualized in the ADF Monitoring tab. For detailed insights, you can send diagnostic logs to Log Analytics or use custom logging inside pipelines using Web activities and Azure Functions.
118 . How do you implement incremental load in Databricks?
Incremental loading in Databricks can be done using watermarks (last updated timestamp or surrogate key). You filter new/changed records during each run using a value stored in a checkpoint or metadata table. Delta Lake simplifies this by supporting merge (upsert) operations and built-in versioning for CDC-like behavior.
119 . What are key considerations for designing scalable pipelines in ADF?
Scalability in ADF pipelines involves using parameterized datasets and linked services to build reusable and dynamic pipelines. Use ForEach with batch controls for concurrency, Data Flows for scalable transformations, and leverage Integration Runtime for distributed data movement. Monitor pipeline performance and break large workflows into smaller reusable units to ensure modularity and reusability.
120 . How do you handle error handling in ADF (retry, try-catch, failover)?
ADF provides built-in retry policies at the activity level and error handling using the If Condition, Until, and Execute Pipeline activities. You can wrap failure-prone steps in try-catch-finally style patterns using control flow logic. Failed rows can be routed to error paths or logged into error tables/files using Data Flows. Alerts and logging through Log Analytics or custom email/Teams alerts help monitor and recover.
121. How to track file names in ADF output during copy operations?
To track file names during copy operations, you can use the `@item().name` or dynamic expressions in the Copy Data activity to capture filenames during iteration. Logging the filename and status in a sink table or using a metadata store like Azure SQL can help in auditing and debugging. The filename can also be written into the target file using a derived column or parameter in Mapping Data Flow.
122 . How do you manage and automate ETL workflows using Databricks Workflows?
Databricks Workflows lets you orchestrate notebooks, scripts, and SQL in a DAG-like fashion. You can schedule workflows, define dependencies, retry logic, and pass parameters between tasks. Integration with Azure DevOps or GitHub allows for CI/CD. Workflow results and logs can be monitored in the UI or exported to external monitoring tools for alerts and metrics.
123 . Describe disaster recovery for ADLS.
ADLS provides geo-redundant storage (GRS) and zone-redundant storage (ZRS) to replicate data across regions. For mission-critical workloads, you can implement cross-region replication manually using tools like AzCopy or Azure Data Factory. Access can be managed through backup SAS tokens, and automated scripts or Logic Apps can restore metadata and access settings during a DR event.
124. How do you handle incremental data loads in ADLS?
Incremental loads in ADLS are typically based on timestamps or surrogate keys. You can filter new or changed data at the source using parameters and only copy deltas. Delta Lake format makes it easy by supporting ACID transactions, versioning, and merge operations. ADF pipelines or Databricks jobs can update checkpoints or metadata tables to track the last successful load.
125 . What are the security features in Azure Synapse Analytics?
Synapse provides multiple layers of security: RBAC for workspace access, firewall rules and private endpoints for network isolation, managed identities for secure data access, and integration with Azure AD for authentication. Data is encrypted at rest and in transit, and column-level and row-level security controls ensure fine-grained access. Synapse also supports auditing and diagnostic logging.
126 . How do you implement real-time processing in Databricks using Azure Event Hubs/Kafka?
Real-time processing is implemented by connecting Structured Streaming in Databricks to Azure Event Hubs or Kafka. The streaming source ingests events in micro-batches, applies transformations, and writes to sinks like ADLS, Delta Lake, or Synapse. You define checkpoints to ensure fault tolerance and use watermarking for late data handling. Auto-scaling and trigger intervals help optimize performance.
127 . How do you integrate Azure Key Vault with ADF?
Azure Key Vault can be linked to ADF by referencing secrets in linked services. When creating a linked service (e.g., Azure SQL or Blob), choose “store in Azure Key Vault” for credentials. You must grant ADF’s managed identity access to the Key Vault using access policies. This approach avoids hardcoding sensitive information in pipeline configurations.
128 . What are best practices for optimizing ADLS storage costs?
Use compression (e.g., GZIP, Snappy) and columnar formats like Parquet to reduce file sizes. Implement lifecycle management policies to move infrequently accessed data to cool or archive tiers. Avoid small files by batching writes and optimizing partition strategies. Monitor and clean up unused or outdated data using automation.
129 . How do you implement CI/CD for Azure Synapse Analytics?
Synapse integrates with Git repositories like Azure DevOps or GitHub for source control. You can commit pipelines, notebooks, and SQL scripts to branches and use YAML-based pipelines to deploy to different environments. ARM templates or Synapse Workspace Deployment Tool can automate resource provisioning. Use environment variables and parameterization for flexibility across environments.
130 . What is the role of Integration Runtime in ADF?
Integration Runtime (IR) is the compute infrastructure used by ADF to move and transform data. The Azure IR supports cloud data movement and transformations. Self-hosted IR allows access to on-prem or private network resources. SSIS IR supports running SSIS packages. IR is responsible for scaling, secure data transfer, and region-based performance optimization.
131 . How do you secure sensitive data in Azure?
Sensitive data is secured using encryption (at rest and in transit), Azure Key Vault for managing secrets, RBAC for access control, and data masking features in Azure SQL. Network security is enforced using firewalls, private endpoints, and NSGs. You can use tools like Microsoft Defender for Cloud for threat detection and compliance checks.
132 . Describe a data pipeline for real-time analytics using Azure tools.
A real-time analytics pipeline in Azure often starts with data ingestion using Azure Event Hubs, IoT Hub, or Kafka. Streamed data is processed using Azure Databricks Structured Streaming or Stream Analytics, and enriched or aggregated data is stored in Delta Lake or Azure Synapse Analytics. Power BI is used for near real-time dashboarding. The pipeline uses Azure Key Vault, ADF triggers, and Log Analytics for security, orchestration, and monitoring.
133 . What is the difference between a job cluster and an interactive cluster in Databricks?
A job cluster is created for a single job run and auto-terminates afterward, saving costs. An interactive cluster is manually started and used for development and collaboration, staying active until manually shut down.
134. How do you implement data deduplication in PySpark?
Use `dropDuplicates()` on specific columns to remove duplicates. Example: `df.dropDuplicates(["email", "phone"])`. It ensures clean data in ETL workflows.
135 . Explain the concept of Delta Lake compaction.
Compaction merges small files into larger ones using the `OPTIMIZE` command, improving read performance in Delta Lake by reducing file overhead.
136 . How do you handle null values in PySpark?
Use functions like `fillna()`, `dropna()`, or `na.replace()` to handle nulls. Choose based on whether you want to replace, drop, or impute missing data.
137 . What is AQE (Adaptive Query Execution) in Databricks?
AQE optimizes query plans at runtime by dynamically changing join types, fixing data skew, and optimizing partitions.
138 . Write PySpark code to perform an inner join between two DataFrames.
result = df1.join(df2, df1.id == df2.id, "inner")
139 . Explain the difference between narrow and wide transformations in PySpark.
Narrow transformations (e.g., `map`) don't require data movement. Wide transformations (e.g., `groupBy`) involve shuffling across partitions.
140 . How do you optimize PySpark jobs for large datasets?
Use partitioning, caching, broadcast joins, and avoid wide transformations. Monitor Spark UI for bottlenecks.
141 . Explain the concept of partitioning in PySpark.
Partitioning splits data across executors. It improves performance by enabling parallel processing and reducing shuffles.
142 . How do you implement real-time data processing in Databricks using Structured Streaming?
Use `readStream` and `writeStream` with supported sources like Kafka/Event Hubs. Define the processing logic and an output sink.
143 . Describe the concept of fault tolerance in Spark.
Spark tracks lineage using DAGs and can recompute lost data partitions upon failure, ensuring data reliability.
144 . Explain the concept of shuffling in Spark.
Shuffling is data movement across partitions during wide transformations. It’s expensive and can affect performance.
145 . What is a broadcast join in PySpark?
A broadcast join is used when one of the DataFrames is small. Spark broadcasts the smaller DataFrame to all executors to avoid shuffling. This improves join performance significantly. Use `broadcast()` from `pyspark.sql.functions` during the join.
146 . How to create a rank column using the Window function in PySpark?
Use PySpark's `Window` with `rank()` or `dense_rank()` to rank rows within partitions. Define a `WindowSpec` with `partitionBy` and `orderBy`, then apply the rank using `.withColumn()` and `.over(windowSpec)`, as in the sketch below.
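A minimal sketch, assuming a DataFrame `df` with `department` and `salary` columns (placeholders):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank

window_spec = Window.partitionBy("department").orderBy(df.salary.desc())

ranked = (
    df.withColumn("rank", rank().over(window_spec))
      .withColumn("dense_rank", dense_rank().over(window_spec))
)
ranked.show()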
147 . Difference between repartition() and coalesce() in PySpark
`repartition()` increases or reshuffles partitions via a full shuffle; used to balance data. `coalesce()` merges existing partitions without a shuffle; ideal for reducing partitions before writing. Repartition is costlier; coalesce is faster.
148 . How to persist and cache data in PySpark?
Use `.cache()` to store data in memory. Use `.persist()` for control over the storage level (memory, disk, etc.). Always unpersist data after use to free resources, as in the sketch below.
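A minimal sketch, assuming an existing DataFrame `df`:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it does not fit in memory
df.count()                                # the first action materializes the persisted data
# ... reuse df in further actions ...
df.unpersist()                            # release memory/disk when done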
149 . Explain the concept of partitioning in PySpark.
Partitioning splits data into chunks processed in parallel. Use `.repartition()` or `.coalesce()` for control. Efficient partitioning improves parallelism and reduces shuffle overhead.
150 . Explain the concept of Delta Lake compaction.
Compaction merges small Delta files into larger ones to reduce file overhead and improve query performance. Use the `OPTIMIZE` command in Databricks for compaction.
151 . How to handle null values in PySpark?
Use `fillna()` to replace nulls, `dropna()` to remove rows, and `na.replace()` for custom logic. Use `isNull()`/`isNotNull()` in filters, as in the sketch below.
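A minimal sketch (column names are placeholders):
from pyspark.sql.functions import col

df_filled = df.fillna({"country": "Unknown", "amount": 0})   # replace nulls per column
df_clean = df.dropna(subset=["customer_id"])                 # drop rows missing the key
df_not_null = df.filter(col("email").isNotNull())            # filter on null checks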
152 . How do you implement real-time data processing using Structured Streaming?
Use `readStream` for the source (e.g., Kafka), apply transformations, then write with `writeStream` to a sink (e.g., console, Delta). Ensure checkpointing is configured for fault tolerance, as in the sketch below.
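A minimal sketch using the built-in Kafka source (broker address, topic, and paths are placeholders):
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

parsed = events.selectExpr("CAST(value AS STRING) AS body", "timestamp")

query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")  # required for fault tolerance
    .outputMode("append")
    .start("/mnt/delta/orders")
)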
153 . Explain fault tolerance in Spark.
Spark uses lineage info (DAG) to recompute lost data. RDD/DataFrame transformations are deterministic, enabling automatic recovery upon executor failure.
154 . Explain shuffling in Spark.
Shuffling is data movement across partitions due to wide transformations. It’s costly and can lead to performance bottlenecks. Minimize with techniques like broadcast join and proper partitioning.
155 . How do you monitor and optimize performance in Azure Synapse?
Use Monitor Hub, DMVs, and SQL insights. Optimize using result caching, materialized views, and appropriate table distribution. Avoid excessive shuffling and use partitions.
156 . How do you handle schema drift in ADF?
Enable “Allow schema drift” in Mapping Data Flows. Use dynamic mappings and parameterized datasets. Schema projection helps handle unexpected schema changes.
157 . Explain denormalization and when it should be used.
Denormalization combines tables for faster read performance. Use in OLAP systems or reporting scenarios. It simplifies joins but introduces redundancy.
158 . What are common ADF activities?
Copy Activity, Data Flow, Notebook Activity, Web Activity, Lookup, and Stored Procedure Activity. Each helps build ETL/ELT pipelines.
159 . How do you integrate ADLS with Databricks?
Mount ADLS using a service principal and OAuth config in Databricks. Read/write using the mounted path (e.g., `/mnt/...`). Use secrets from Azure Key Vault for secure integration.
160 . How do you automate workflows using Azure Logic Apps?
Use triggers (timer, HTTP) and actions (SQL, email, Power BI). Example: Query SQL DB and send alerts. Good for lightweight event-driven workflows.
161 . How to implement data masking in ADF?
Use Derived Column transformation to mask data. Also, apply Dynamic Data Masking in Azure SQL. Combine with Key Vault and role-based access for security.
162 . How to ensure high availability and disaster recovery for Azure SQL DB?
Use Business Critical tier with zone redundancy. Set up geo-replication for DR. Enable auto-failover groups and long-term backup retention.
163 . Differences between Azure SQL DB and Managed Instance
SQL DB is fully managed, best for modern apps. Managed Instance supports full SQL Server features (e.g., SQL Agent, VNET), better for lift-and-shift workloads.
164 . What are the security features in ADLS Gen2?
Supports RBAC, ACLs, VNet, and encryption at rest and in transit. Use Key Vault for CMKs and enforce HTTPS. Combine with firewalls for added security.
165 . How to manage data lifecycle in ADLS?
Use Azure Blob Lifecycle policies to tier or delete data. Define rules by blob age or last modified. Helps reduce costs for infrequent-access data.
166 . How do you implement CI/CD in Azure DevOps?
Use pipelines for build and release. Store code in Git, package templates, and deploy via YAML or Classic pipelines. Use approvals, secrets, and stages.
167 . How do you integrate Azure Key Vault with other services?
Grant access to services via managed identity. Reference secrets in ADF, Synapse, Databricks using Key Vault integration. Helps secure credentials.
168. What are key features of Azure DevOps?
Includes Git-based repos, CI/CD pipelines, Boards for agile, Test Plans, and Artifact storage. Integrates with VS Code, GitHub, and Azure services.
169. How to monitor and troubleshoot Azure SQL DB?
Use Query Performance Insight, SQL Auditing, and DMVs. Set alerts and track long-running queries. Use Log Analytics for deeper monitoring.
170 . What is the role of metadata in data architecture?
Metadata defines schema, relationships, and lineage. Used in governance (Purview), cataloging, and auditing. Improves discoverability and trust in data.
171 . Explain the concept of Delta Lake compaction.
Delta Lake compaction is the process of combining many small files generated during streaming or frequent batch writes into fewer large files. This improves read performance by reducing file overhead during query execution. Compaction can be triggered manually or scheduled periodically. It typically uses `OPTIMIZE` in Databricks for efficient file merging.
172 . How do you handle null values in PySpark?
Null values in PySpark can be handled using functions like `fillna()`, `dropna()`, and `na.replace()`. You can choose to either replace nulls with default values or drop rows/columns containing them. Custom logic can also be applied using `when()` and `isNull()` for complex transformations.
173 . Describe the concept of fault tolerance in Spark.
Spark ensures fault tolerance using lineage and DAGs. If a task fails, Spark can recompute lost partitions based on their lineage. Data stored in resilient formats like Delta or checkpoints in streaming helps recover from failures without reprocessing everything.
174. What are the differences between RDD, DataFrame, and Dataset in PySpark?
| Feature | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Type Safety | No | No | Yes |
| API Level | Low-level | High-level | High-level |
| Performance | Less optimized | More optimized | More optimized |
| Data Structure | Unstructured | Structured (schema) | Structured (schema) |
| Use Cases | Complex, unstructured data | Data analysis, SQL | Type-safe, structured |
| Fault Tolerance | Yes | Yes | Yes |
- RDD: Immutable distributed collection, fault-tolerant, gives more control but less performance.
- DataFrame: Structured, schema-based, high-level API, supports Catalyst/Tungsten optimizations.
- Dataset: Type-safe like RDD, optimized like DataFrame (more relevant in Scala/Java).
175. How is lazy evaluation implemented in PySpark?
- Transformations (e.g., `map`, `filter`) are lazy; they build a lineage graph.
- Actual computation occurs when an action (e.g., `collect`, `count`, `show`) is called.
- Enables execution plan optimization, memory efficiency, and fault tolerance.
176. What is DataFrame lineage and how does Spark handle fault tolerance?
- Lineage: DAG (Directed Acyclic Graph) of transformations.
- Spark recomputes only the lost partitions using lineage.
- No need for frequent checkpoints, though they are optional for long chains.
- Caching improves recovery speed but is not mandatory.
177. What is the role of the Catalyst Optimizer in PySpark?
Catalyst Optimizer improves performance via:
- Query Analysis: Parses and validates logical plans.
- Logical Optimization: Predicate pushdown, projection pruning, etc.
- Physical Planning: Generates multiple physical plans and chooses the best.
- Code Generation: Runtime bytecode generation.
- Extensibility: Add custom rules and support various data sources.
178. How to read CSV, JSON, and Parquet files?
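A minimal sketch using the standard DataFrameReader API (file paths are placeholders):
csv_df = spark.read.csv("/mnt/raw/customers.csv", header=True, inferSchema=True)
json_df = spark.read.json("/mnt/raw/events.json")
parquet_df = spark.read.parquet("/mnt/raw/sales/")

# Equivalent generic form
parquet_df2 = spark.read.format("parquet").load("/mnt/raw/sales/")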
179. How to join multiple DataFrames and what are the types?
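A minimal sketch of chaining joins and the common join types (DataFrame and column names are placeholder assumptions):
# Common join types: inner, left, right, full/outer, left_semi, left_anti, cross
orders_customers = orders.join(customers, on="customer_id", how="inner")

# Chain a second join to bring in product attributes
full = orders_customers.join(products, on="product_id", how="left")
full.show()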
180. Difference between select() and selectExpr()?
| Feature | select() | selectExpr() |
| --- | --- | --- |
| Input Type | Column objects or strings | SQL-like expressions (strings) |
| Transformations | Basic column ops | Complex SQL-like ops |
| Use Case | Simple selections | Complex expressions |
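A minimal sketch of both, assuming a DataFrame `df` with `name`, `salary`, and `bonus` columns (placeholders):
from pyspark.sql.functions import col

# select(): column objects or names
selected = df.select("name", (col("salary") * 1.1).alias("adjusted_salary"))

# selectExpr(): SQL-style expression strings
expressed = df.selectExpr("name", "salary * 1.1 AS adjusted_salary", "salary + bonus AS total_comp")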
181. How to optimize PySpark jobs?
- Repartition/Coalesce
- Persist/Cache intermediate DataFrames
- Use broadcast joins for small datasets
- Filter early
- Tune `spark.sql.shuffle.partitions`
- Avoid UDFs; use built-in functions
- Monitor via Spark UI
182. What's the difference between cache() and persist()?
| Feature | cache() | persist() |
| --- | --- | --- |
| Storage Level | Default only (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) | Custom (e.g., MEMORY_AND_DISK, DISK_ONLY) |
| Flexibility | Less | More |
183. How to handle skewed data in Spark?
- Use salting (see the sketch below)
- Use repartition() or coalesce()
- Prefer `reduceByKey()` over `groupByKey()`
- Enable AQE: `spark.sql.adaptive.enabled = true`
- Increase shuffle partitions
- Use Parquet/ORC, not CSV
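A minimal sketch of salting a skewed join key (the DataFrames, the key `join_key`, and the salt count are placeholder assumptions):
from pyspark.sql.functions import rand, lit, explode, array

NUM_SALTS = 10  # tune to the degree of skew

# Spread the hot keys of the large side across NUM_SALTS buckets
big_salted = big_df.withColumn("salt", (rand() * NUM_SALTS).cast("int"))

# Replicate every row of the small side once per salt value
small_salted = small_df.withColumn(
    "salt", explode(array(*[lit(i) for i in range(NUM_SALTS)]))
)

# Join on the original key plus the salt, then drop the helper column
joined = big_salted.join(small_salted, ["join_key", "salt"]).drop("salt")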
184. What are broadcast joins and how to use them?
- Used when joining a large DataFrame with a small one.
- Reduces shuffling by sending the small DataFrame to all executors.
- Wrap the small DataFrame with `broadcast()` from `pyspark.sql.functions` in the join.
185. How to register a DataFrame as a temporary view?
Use `df.createOrReplaceTempView("view_name")` to register a session-scoped temporary view that can be queried with `spark.sql()`. Use `createGlobalTempView()` when the view must be visible across sessions (queried via the `global_temp` database).
186. Can you run SQL queries in PySpark?
Yes, using `spark.sql()` on views registered with `.createOrReplaceTempView()`.
187. Explain window functions with an example.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
188. How to handle missing/null values?
- Identify: `isNull()` / `isNotNull()`
- Drop: `dropna()` / `dropna(subset=[...])`
- Fill: `fillna()` with a single value or a column-wise dict
- Replace: `replace(to_replace=..., value=...)`
- Aggregate functions ignore nulls by default
189. How to detect and remove duplicates?
To detect duplicates, group by the key columns and filter groups with a count greater than one; to remove them, use `dropDuplicates()`, as in the sketch below.
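A minimal sketch (the key columns are placeholders):
# Detect: key combinations that appear more than once
dupes = df.groupBy("email", "phone").count().filter("count > 1")
dupes.show()

# Remove: keep one row per key combination
deduped = df.dropDuplicates(["email", "phone"])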
190. What is Databricks Runtime?
A curated, tested software stack that runs on Databricks clusters. It bundles:
- OS (Ubuntu LTS) and JVM/Scala/Python/R.
- Apache Spark (tuned and pre-integrated).
- Delta Lake, MLflow, DBFS, and Databricks utilities (e.g., `dbutils`).
- Databricks optimizations and services (e.g., Photon on compatible SQL compute, cluster management, security hardening).
You choose the runtime version (e.g., 12.x/13.x/14.x) when creating/editing a cluster. Each version pins compatible Spark + libraries.
191. What are the types of Databricks Runtimes?
Major flavors (availability can vary by cloud/region):
a) Databricks Runtime (Standard) – General-purpose Spark + Delta with Databricks performance, security, and reliability improvements.
b) Databricks Runtime for Machine Learning (ML) – Standard + popular ML/DL libs (e.g., scikit-learn, XGBoost, TensorFlow/PyTorch/Keras), MLflow tracking, and GPU support where available.
c) Databricks Runtime for Genomics – Tuned stack for genomic/biomedical workloads (specialized libs and IO optimizations).
d) Databricks Light – Minimal footprint for simple batch jobs where you don’t need advanced performance features; reduced components and features.
192. How do you share a notebook with other developers in Workspace?
To share a notebook in Databricks:
- Direct Sharing:
  - Open the notebook
  - Click on the "Share" button in the top-right corner
  - Enter the username or email of the colleague
  - Set permissions (Can View, Can Run, Can Edit)
- Workspace Permissions:
  - Right-click the notebook/folder in Workspace
  - Select "Permissions"
  - Add users/groups and set appropriate access levels
- Export/Import:
  - Export notebook as a .dbc or .ipynb file
  - Share the file with others who can import it
- Git Integration:
  - Connect the notebook to a Git repository
  - Collaborators can clone the repo
193. How to access one notebook's variables in other notebooks?
There are several ways to share variables between notebooks:
- %run command:
  %run /path/to/notebook  # variables defined in the called notebook become available
- dbutils.notebook.run() (for notebook workflows):
  result = dbutils.notebook.run("/path/to/notebook", timeout_seconds=60, arguments={"param": "value"})
- Spark tables/views:
  - Create a temporary view in one notebook
  - Access it in another notebook
- DBFS storage:
  - Save data to DBFS in one notebook
  - Read from DBFS in another notebook
- Widgets for parameter passing
194. How to call one notebook from another?
Two primary options:
- Inline import style
- Workflow/task style
Use `%run` to reuse code, and `dbutils.notebook.run` to orchestrate and pass/return parameters safely.
195. How to exit a notebook and return output to the caller?
Use `dbutils.notebook.exit(value: str)` inside the callee. The callee (producer) returns a string value, and the caller (consumer) receives it as the return value of `dbutils.notebook.run`, as in the sketch below.
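A minimal sketch (the notebook path and payload are placeholders):
# Callee (producer) - last cell of /Shared/child_notebook
import json
dbutils.notebook.exit(json.dumps({"status": "ok", "rows_processed": 123}))

# Caller (consumer) - separate notebook
import json
result = json.loads(dbutils.notebook.run("/Shared/child_notebook", 600))
print(result["status"], result["rows_processed"])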
196. How to create Internal (Managed) & External tables?
- Managed (Internal) table: Databricks manages both data and metadata in the workspace's managed storage. Dropping the table deletes the data.
- External table: Metadata in the metastore, data remains in your specified path (e.g., ADLS/Blob). Dropping the table does not delete external data.
SQL examples (Delta recommended), one managed table (no LOCATION) and one external table (with LOCATION), are sketched below.
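A minimal sketch run from a notebook (table names and the external path are placeholders):
# Managed (internal) table: no LOCATION, data stored in managed storage
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (
        id INT,
        amount DOUBLE
    ) USING DELTA
""")

# External table: metadata in the metastore, data stays at the given path
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (
        id INT,
        amount DOUBLE
    ) USING DELTA
    LOCATION 'abfss://data@mystorageacct.dfs.core.windows.net/tables/sales'
""")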
197. How to access ADLS/Blob Storage in Databricks?
Three common patterns (Azure):
- Direct ABFS/ABFSS paths (recommended) with credential passthrough or a service principal (see the sketch below).
- DBFS mount (legacy/OK for simple cases), then read/write via `/mnt/raw/...`.
- Unity Catalog External Locations (governed, recommended for prod): define a credential + external location and create external tables over that path.
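A minimal sketch of the service-principal pattern with direct `abfss://` access; the storage account, container, secret scope/key names, and tenant are placeholders, and the `fs.azure.*` keys follow the standard ABFS OAuth settings:
account = "mystorageacct"
client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

df = spark.read.parquet(f"abfss://raw@{account}.dfs.core.windows.net/sales/")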
198. What are the types of Cluster Modes in Databricks?
- Standard Mode:
  - Default and most common
  - Provides a secure, isolated environment
  - Best for single-user or production workloads
- High Concurrency Mode:
  - Designed for multiple users
  - Provides fine-grained sharing and isolation
  - Supports SQL, Python, R, and Scala
  - Requires Premium plan or above
- Single Node Mode:
  - Runs all workloads on the driver node only
  - No worker nodes
  - Good for small jobs or testing
  - Lower cost but limited scalability
199. What workload types exist for Standard clusters?
- Interactive Workloads:
  - Notebook development
  - Ad-hoc queries
  - Data exploration
- Batch Workloads:
  - ETL pipelines
  - Scheduled jobs
  - Large-scale data processing
- Machine Learning:
  - Training models
  - Feature engineering
  - Hyperparameter tuning
- Streaming Workloads:
  - Real-time data processing
  - Structured Streaming applications
- SQL Analytics:
  - BI and dashboarding
  - SQL queries
  - Data visualization
200. Can I use both Python 2 and Python 3 notebooks on the same cluster?
No, you cannot use both Python 2 and Python 3 notebooks on the same cluster in Databricks.
- Each cluster is configured with either Python 2 or Python 3 (selected during cluster creation)
- All notebooks running on that cluster must use the same Python version
- Modern Databricks runtimes (9.1 LTS and above) only support Python 3
- Python 2 was deprecated in Databricks Runtime 7.x and removed in later versions
Workaround:
- Create separate clusters for Python 2 and Python 3 workloads
- Migrate to Python 3 (recommended, as Python 2 is end-of-life)
201. What is a pool? Why use it? How to create one?
What is a pool?
A pool (formerly called instance pool) is a set of idle, ready-to-use cloud instances that reduce cluster start and auto-scaling times.
Why use pools?
- Faster cluster startup (instances are pre-provisioned)
- Reduced costs (instances can be shared across clusters)
- Better resource management
- Minimizes cold start times
How to create a pool:
- Using the UI:
  - Go to Compute > Pools > Create Pool
  - Configure:
    - Pool name
    - Instance type
    - Min/Max idle instances
    - Autoscaling
    - Preloaded Spark versions
- Using the API:
  import requests

  headers = {"Authorization": "Bearer <token>"}
  data = {
      "instance_pool_name": "my-pool",
      "node_type_id": "Standard_DS3_v2",
      "min_idle_instances": 1,
      "max_capacity": 10,
      "idle_instance_autotermination_minutes": 15,
  }
  response = requests.post(
      "https://<databricks-instance>/api/2.0/instance-pools/create",
      headers=headers,
      json=data,
  )
- Using Terraform:
  resource "databricks_instance_pool" "pool" {
    instance_pool_name                    = "my-pool"
    min_idle_instances                    = 1
    max_capacity                          = 10
    node_type_id                          = "Standard_DS3_v2"
    idle_instance_autotermination_minutes = 15
  }
202. How many ways can we create/pass variables in Databricks?
There are several ways to create variables in Databricks notebooks:
- Standard Python variables:
  x = 10
  name = "Databricks"
- Spark SQL variables:
  spark.sql("SET my_var = 10")
- Widgets (for parameterization):
  dbutils.widgets.text("input", "default_value", "Label")
  input_value = dbutils.widgets.get("input")
- Environment variables:
  import os
  os.environ["MY_VAR"] = "value"
- Notebook-scoped variables (using %run):
  # In notebook1:
  var1 = "hello"
  # In notebook2:
  %run /path/to/notebook1
  print(var1)  # accessible after %run
- Shared variables via the Spark session conf:
  spark.conf.set("shared.var", "value")
  value = spark.conf.get("shared.var")
203. What are important Jobs limits to remember?
Databricks Jobs have several limitations:
- Timeout Limits:
  - Maximum timeout is 30 days for a single run
  - Notebook jobs have a 1-year retention limit for results
- Size Limits:
  - Maximum of 1000 jobs per workspace
  - Notebook size limit (several MB, depends on runtime)
- Concurrency Limits:
  - Maximum concurrent runs per workspace (depends on tier)
  - Default is 1000 for most tiers
- Parameter Limits:
  - Notebook jobs accept up to 100 parameters
  - Maximum parameter size is 10KB
- Cluster Limitations:
  - Jobs can't use High Concurrency clusters
  - Some instance types may be restricted
- Scheduling Limits:
  - Minimum schedule interval is 10 minutes
  - Cron syntax has some cloud-specific limitations
- API Limitations:
  - Rate limits on Jobs API calls
  - Maximum of 1000 runs returned per list operation
204. Can I use %pip
to install packages in notebooks?
Yes, you can use %pip
in Databricks notebooks to install Python packages. This is the recommended approach for package management in notebooks.
# Basic package installation %pip install pandas==1.2.0 # Install multiple packages %pip install numpy matplotlib seaborn # Install from requirements file %pip install -r requirements.txt # Install from GitHub %pip install git+https://github.com/user/repo.git # Uninstall packages %pip uninstall package-name -y
-
Installed packages are available only to the current notebook session
-
For cluster-wide packages, use cluster-scoped libraries (via UI or API)
-
%pip commands should typically be in the first cell of the notebook
-
Changes take effect immediately (no need to restart the kernel)
-
You can also use %conda for Conda packages in some runtimes
Notes:
-
%pip/%conda take effect on the currently attached cluster; rebuilt or new clusters require re-installation (use init scripts or the Libraries UI for cluster-level pinning)
-
Prefer Repos + requirements.txt / environment.yml for reproducibility
205 . Explain all the activities available in Azure Data Factory.
ADF activities fall into three broad groups: Data Movement (the Copy activity), Data Transformation (Mapping Data Flow, Databricks Notebook/JAR/Python, HDInsight, Stored Procedure, Azure Function), and Control Flow (ForEach, Until, If Condition, Switch, Wait, Lookup, Get Metadata, Set Variable, Execute Pipeline, Web). Together they let you move, transform, and orchestrate data pipelines.
206 .Difference between Integration Runtimes in ADF?
Integration Runtime | Description | When to Use |
---|---|---|
Azure IR | Fully managed compute in Azure for data movement, data flow, and activity dispatch. | For cloud-to-cloud data copy and transformations. |
Self-hosted IR | Installed on-premises or on a VM, connects private networks with ADF. | For on-premises ↔ cloud or private network data access. |
Azure-SSIS IR | Dedicated cluster to run SSIS packages in Azure without redevelopment. | For lift-and-shift SSIS ETL workloads to the cloud. |
207 . Explain the types of triggers in ADF. Which ones have you used in projects and why?
-
Schedule trigger (time-based),
-
Tumbling window trigger (time slices with retries),
-
Event-based trigger (on blob events).
I’ve used tumbling window for incremental loads and event triggers for real-time ingestion.
208 . How do you enable and schedule pipelines in ADF?
Create a trigger (schedule, tumbling window, or event) and attach it to the pipeline. Pipelines can also be triggered manually or via REST API/PowerShell for automation.
209 . How do you send only the last 5 days of data to Databricks?
Use a date filter condition in source queries (e.g., WHERE date >= GETDATE()-5) or parameterize pipeline variables with system functions to pass only the last 5 days of data to Databricks.
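For example, the source query passed to the activity might look like the following sketch (SalesData and OrderDate are illustrative names):
SELECT *
FROM SalesData                                   -- hypothetical source table
WHERE OrderDate >= DATEADD(day, -5, GETDATE());  -- keep only the last 5 days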
210 . How do you define the schema in ADF?
Schema is auto-detected from linked service datasets, but you can also manually define columns, data types, or use schema drift in Mapping Data Flows for flexibility.
211 . How do you connect ADF to a database?
Create a linked service for the database (Azure SQL, SQL Server, Oracle, etc.), provide connection details (server, DB, credentials/Key Vault), and then use it in datasets.
212 . Explain the Data Flow activity in detail.
Mapping Data Flow is a visual, code-free transformation engine in ADF. It allows joins, aggregations, derived columns, lookups, and more at scale, executed on Spark under the hood.
213 . What transformations have you performed in ADF?
Common ones include Filter, Join, Aggregate, Derived Column, Lookup, Conditional Split, Surrogate Key, Pivot/Unpivot. These are used for cleaning, reshaping, and enriching data.
214 . What is the tumbling window trigger in ADF?
It triggers pipelines in fixed, contiguous, non-overlapping time slices (e.g., every 15 minutes). Useful for batch/stream-like processing with retry and catch-up options.
215 . What is the Filter activity in ADF?
Filter activity lets you apply conditions on an array and pass only matching elements to the next step. Fields include items (input array) and condition (Boolean expression).
216 . How do you get metadata in ADF?
Use the Get Metadata activity, which retrieves properties like structure, last modified, size, and schema from a dataset or file system, then pass values dynamically.
217 . What are the limits for Lookup activity in ADF?
Lookup can return a single row or up to 5,000 rows, with the output capped at about 4 MB. It’s generally used for config tables, parameters, or small reference data.
SQL
Employee table
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
1 | Alice | 90000 | 3 | 101 | 2025-01-10 |
2 | Bob | 60000 | 3 | 101 | 2024-11-05 |
3 | Charlie | 120000 | NULL | 101 | 2023-07-01 |
4 | David | 60000 | 3 | 102 | 2025-06-15 |
5 | Anita | 75000 | 1 | 102 | 2025-05-20 |
6 | Arjun | 90000 | 1 | 103 | 2024-12-01 |
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Q1. Fetch the second-highest salary
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary < (SELECT MAX(Salary) FROM Employee);
Output
SecondHighestSalary |
---|
90000 |
Q2. Get duplicate records from a table
SELECT EmpName, Salary, COUNT(*) AS Count
FROM Employee
GROUP BY EmpName, Salary
HAVING COUNT(*) > 1;
Output:
(No two employees share both the same name and salary in the sample data, so the result set is empty.)
Q5. Count of employees in each department
Output:
DeptID | EmployeeCount |
---|---|
101 | 3 |
102 | 2 |
103 | 2 |
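A query that would produce the counts above (a sketch):
SELECT DeptID, COUNT(*) AS EmployeeCount  -- employees per department
FROM Employee
GROUP BY DeptID;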
Q6. Department with highest number of employees
SELECT DeptID, COUNT(*) AS EmployeeCount
FROM Employee
GROUP BY DeptID
ORDER BY EmployeeCount DESC
FETCH FIRST 1 ROW ONLY;
Output:
DeptID | EmployeeCount |
---|---|
101 | 3 |
Q7. Employees with the same salary
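The query is omitted in the source; one that matches the output below (a sketch):
SELECT Salary, COUNT(*) AS EmpCount  -- salaries shared by more than one employee
FROM Employee
GROUP BY Salary
HAVING COUNT(*) > 1;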
Salary | EmpCount |
---|---|
60000 | 3 |
90000 | 2 |
Q8. Employees whose name starts with ‘A’
SELECT EmpName FROM Employee WHERE EmpName LIKE 'A%';
Output:
EmpName |
---|
Alice |
Anita |
Arjun |
Q9. Get the last record from a table
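The query is omitted in the source; assuming “last record” means the row with the highest EmpID (an assumption), one approach is:
SELECT *
FROM Employee
ORDER BY EmpID DESC      -- highest EmpID first
FETCH FIRST 1 ROW ONLY;  -- keep only the last row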
Output
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Q12. Remove duplicate rows (without DISTINCT)
DELETE FROM Employee E1
WHERE ROWID > (
  SELECT MIN(ROWID)
  FROM Employee E2
  WHERE E1.EmpID = E2.EmpID
);
Output
No duplicates in given input → table remains same.
Q13. Find missing numbers in EmpID sequence
SELECT LEVEL AS Missing_ID
FROM dual
CONNECT BY LEVEL <= (SELECT MAX(EmpID) FROM Employee)
MINUS
SELECT EmpID FROM Employee;
Output
Missing_ID |
---|
(None – IDs 1 to 7 are continuous) |
Q14. Display first and last name in single column
(Assuming the EmpName column holds first names only; for illustration, we simulate a last name using DeptID.)
SELECT EmpName || ' ' || DeptID AS FullName FROM Employee;
Output
FullName |
---|
Alice 101 |
Bob 101 |
Charlie 101 |
David 102 |
Anita 102 |
Arjun 103 |
Meena 103 |
Q15. Cumulative sum of salaries
SELECT EmpID, EmpName, Salary,
       SUM(Salary) OVER (ORDER BY EmpID) AS Cumulative_Sum
FROM Employee;
Output
EmpID | EmpName | Salary | Cumulative_Sum |
---|---|---|---|
1 | Alice | 90000 | 90000 |
2 | Bob | 60000 | 150000 |
3 | Charlie | 120000 | 270000 |
4 | David | 60000 | 330000 |
5 | Anita | 75000 | 405000 |
6 | Arjun | 90000 | 495000 |
7 | Meena | 60000 | 555000 |
Q16. Swap two columns (Salary and DeptID)
UPDATE Employee SET Salary = DeptID, DeptID = Salary;
(In Oracle, the right-hand side of each assignment uses the pre-update column values, so a direct reassignment swaps the two columns; the add/subtract trick is unnecessary, and repeating the Salary column in a single UPDATE would raise ORA-00957.)
After swap (just showing EmpID, Salary, DeptID):
EmpID | Salary | DeptID |
---|---|---|
1 | 101 | 90000 |
2 | 101 | 60000 |
3 | 101 | 120000 |
4 | 102 | 60000 |
5 | 102 | 75000 |
6 | 103 | 90000 |
7 | 103 | 60000 |
Q17. Employees whose names contain only vowels
SELECT EmpName FROM Employee WHERE REGEXP_LIKE(EmpName, '^[AEIOUaeiou]+$');
Output
(No employee name in the sample data consists only of vowels, so the result set is empty.)
Q18. Total salary in each department (pivoted)
Output
Dept101 | Dept102 | Dept103 |
---|---|---|
270000 | 135000 | 150000 |
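The Q18 query is not shown in the source; one that would produce the pivoted totals above, as a sketch using Oracle's PIVOT clause:
SELECT *
FROM (SELECT DeptID, Salary FROM Employee)
PIVOT (SUM(Salary) FOR DeptID IN (101 AS Dept101, 102 AS Dept102, 103 AS Dept103));  -- one column per department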
Q19. Employees with highest salary in each department
SELECT EmpID, EmpName, DeptID, Salary
FROM Employee E
WHERE Salary = (
  SELECT MAX(Salary)
  FROM Employee
  WHERE DeptID = E.DeptID
);
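An equivalent window-function version (a sketch; it scans Employee only once, which often performs better on large tables):
SELECT EmpID, EmpName, DeptID, Salary
FROM (
  SELECT e.*,
         DENSE_RANK() OVER (PARTITION BY DeptID ORDER BY Salary DESC) AS rnk  -- rank salaries within each department
  FROM Employee e
)
WHERE rnk = 1;  -- keep the top salary per department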
20. Customers who made multiple purchases on the same day
Input Table – Orders
OrderID | CustID | OrderDate |
---|---|---|
1 | 101 | 2025-08-01 |
2 | 101 | 2025-08-01 |
3 | 102 | 2025-08-02 |
4 | 103 | 2025-08-02 |
5 | 103 | 2025-08-02 |
SELECT CustID, OrderDate, COUNT(*) AS NumOrders
FROM Orders
GROUP BY CustID, OrderDate
HAVING COUNT(*) > 1;
Output:
CustID | OrderDate | NumOrders |
---|---|---|
101 | 2025-08-01 | 2 |
103 | 2025-08-02 | 2 |
Input Tables
Employee
EmpID | EmpName | Salary | ManagerID | DeptID | JoinDate |
---|---|---|---|---|---|
1 | Alice | 90000 | 3 | 101 | 2025-01-10 |
2 | Bob | 60000 | 3 | 101 | 2024-11-05 |
3 | Charlie | 120000 | NULL | 101 | 2023-07-01 |
4 | David | 60000 | 3 | 102 | 2025-06-15 |
5 | Anita | 75000 | 1 | 102 | 2025-05-20 |
6 | Arjun | 90000 | 1 | 103 | 2024-12-01 |
7 | Meena | 60000 | 2 | 103 | 2024-08-18 |
Sales
SaleID | SaleDate | Amount |
---|---|---|
1 | 2025-01-01 | 1000 |
2 | 2025-02-01 | 1500 |
3 | 2025-03-01 | 2000 |
4 | 2025-04-01 | 2500 |
5 | 2025-05-01 | 3000 |
Orders
OrderID | CustID | OrderDate |
---|---|---|
1 | 101 | 2025-01-10 |
2 | 101 | 2025-01-10 |
3 | 102 | 2025-02-05 |
4 | 103 | 2025-03-01 |
5 | 104 | 2025-03-01 |
6 | 104 | 2025-03-01 |
21. Moving Average of Sales (Last 3 Months)
SELECT SaleDate, Amount,
       AVG(Amount) OVER (ORDER BY SaleDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg
FROM Sales;
Output
SaleDate | Amount | MovingAvg |
---|---|---|
2025-01-01 | 1000 | 1000 |
2025-02-01 | 1500 | 1250 |
2025-03-01 | 2000 | 1500 |
2025-04-01 | 2500 | 2000 |
2025-05-01 | 3000 | 2500 |
22. Rank Employees by Salary in Each Department
SELECT DeptID, EmpName, Salary,
       RANK() OVER (PARTITION BY DeptID ORDER BY Salary DESC) AS RankInDept
FROM Employee;
Output
DeptID | EmpName | Salary | RankInDept |
---|---|---|---|
101 | Charlie | 120000 | 1 |
101 | Alice | 90000 | 2 |
101 | Bob | 60000 | 3 |
102 | Anita | 75000 | 1 |
102 | David | 60000 | 2 |
103 | Arjun | 90000 | 1 |
103 | Meena | 60000 | 2 |
23. Employees with More Than One Manager
SELECT EmpName, COUNT(DISTINCT ManagerID) AS ManagerCount
FROM Employee
WHERE ManagerID IS NOT NULL
GROUP BY EmpName
HAVING COUNT(DISTINCT ManagerID) > 1;
Output
In our data, none has multiple managers, so result = empty set.
24. Most Frequent Order Date
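The query is omitted in the source; one that matches the output below (a sketch):
SELECT OrderDate, COUNT(*) AS OrderCount  -- count orders per date
FROM Orders
GROUP BY OrderDate
ORDER BY OrderCount DESC
FETCH FIRST 1 ROW ONLY;                   -- keep the most frequent date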
OrderDate | OrderCount |
---|---|
2025-03-01 | 3 |
25. Compare Two Tables (Mismatched Records)
(SELECT * FROM Employee_2024
 MINUS
 SELECT * FROM Employee_2025)
UNION
(SELECT * FROM Employee_2025
 MINUS
 SELECT * FROM Employee_2024);
(The parentheses matter: set operators evaluate left to right with equal precedence, so without them the query would not return both directions of the difference.)
Output
Shows rows present in one table but not the other.
26. Difference Between Consecutive Rows
SELECT SaleDate, Amount,
       Amount - LAG(Amount) OVER (ORDER BY SaleDate) AS DiffFromPrev
FROM Sales;
Output
SaleDate | Amount | DiffFromPrev |
---|---|---|
2025-01-01 | 1000 | NULL |
2025-02-01 | 1500 | 500 |
2025-03-01 | 2000 | 500 |
2025-04-01 | 2500 | 500 |
2025-05-01 | 3000 | 500 |
27. Pivot Table Data Dynamically
SELECT *
FROM (SELECT DeptID, EmpName FROM Employee)
PIVOT (COUNT(EmpName) FOR DeptID IN (101, 102, 103));
Note: the IN list above is static; a truly dynamic pivot needs the column list built with dynamic SQL or Oracle's PIVOT XML (see the sketch after the output below).
Output
DeptID_101 | DeptID_102 | DeptID_103 |
---|---|---|
3 | 2 | 2 |
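For a department list that is not known in advance, Oracle's PIVOT XML variant accepts ANY; note that it returns the result as a single XMLTYPE column, so fully relational dynamic pivots usually require building the IN list with dynamic SQL. A sketch:
SELECT *
FROM (SELECT DeptID, EmpName FROM Employee)
PIVOT XML (COUNT(EmpName) FOR DeptID IN (ANY));  -- pivots on every distinct DeptID; result is XML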
28. Delete Every Alternate Row
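The solution is omitted in the source; a common Oracle approach, assuming “alternate” means even-numbered EmpIDs (an assumption), is:
DELETE FROM Employee
WHERE MOD(EmpID, 2) = 0;  -- remove rows with even EmpID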
29. First Purchase Date per Customer
Output
CustID | FirstPurchase |
---|---|
101 | 2025-01-10 |
102 | 2025-02-05 |
103 | 2025-03-01 |
104 | 2025-03-01 |
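A query that would produce the output above (a sketch):
SELECT CustID, MIN(OrderDate) AS FirstPurchase  -- earliest order per customer
FROM Orders
GROUP BY CustID
ORDER BY CustID;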
30. Running Total of Sales per Month
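The query is omitted in the source; a window-function version that matches the output below (a sketch):
SELECT SaleDate, Amount,
       SUM(Amount) OVER (ORDER BY SaleDate) AS RunningTotal  -- cumulative sum by sale date
FROM Sales;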
SaleDate | Amount | RunningTotal |
---|---|---|
2025-01-01 | 1000 | 1000 |
2025-02-01 | 1500 | 2500 |
2025-03-01 | 2000 | 4500 |
2025-04-01 | 2500 | 7000 |
2025-05-01 | 3000 | 10000 |
At Learnomate Technologies, we don’t just teach tools, we train you with real-world, hands-on knowledge that sticks. Our Azure Data Engineering training program is designed to help you crack job interviews, build solid projects, and grow confidently in your cloud career.
- Want to see how we teach? Hop over to our YouTube channel for bite-sized tutorials, student success stories, and technical deep-dives explained in simple English.
- Ready to get certified and hired? Check out our Azure Data Engineering course page for full curriculum details, placement assistance, and batch schedules.
- Curious about who’s behind the scenes? I’m Ankush Thavali, founder of Learnomate and your trainer for all things cloud and data. Let’s connect on LinkedIn—I regularly share practical insights, job alerts, and learning tips to keep you ahead of the curve.
And hey, if this article got your curiosity going…
Explore more on our blog where we simplify complex technologies across data engineering, cloud platforms, databases, and more.
Thanks for reading. Now it’s time to turn this knowledge into action. Happy learning and see you in class or in the next blog!
Happy Vibes!
ANKUSH