Apache Spark Interview Questions Set 3

1. Difference between coalesce and repartition?

repartition() can increase or decrease the number of partitions. It redistributes the data into roughly equal-sized partitions, which requires a full shuffle.

coalesce() is used to decrease the number of partitions. It merges existing partitions instead of performing a full shuffle, so far less data moves, but the resulting partitions can be uneven in size.
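The difference shows up in the partition counts. The snippet below is a minimal sketch, assuming a local SparkSession and a made-up dataset from spark.range; the application name and the partition numbers are illustrative only. It can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, just for illustration.
val spark = SparkSession.builder()
  .appName("coalesce-vs-repartition")
  .master("local[4]")
  .getOrCreate()

// Hypothetical sample data, explicitly spread over 8 partitions.
val df = spark.range(0, 1000).repartition(8)

val more  = df.repartition(16) // full shuffle, can increase the partition count
val fewer = df.coalesce(2)     // merges existing partitions, no full shuffle
val same  = df.coalesce(32)    // asking coalesce for more partitions keeps the current count

println(more.rdd.getNumPartitions)  // 16
println(fewer.rdd.getNumPartitions) // 2
println(same.rdd.getNumPartitions)  // 8
```

A common pattern is to call coalesce() just before writing, to cut down the number of output files without paying for a shuffle.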

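2. Advantages of the Parquet file format in Spark?

Parquet is Spark's default data source format, and snappy is the default compression codec Spark uses when writing it. Because Parquet is columnar, Spark can read only the columns a query needs. Each Parquet file also carries its schema and other metadata in its footer, so the schema travels with the data; depending on the writer settings, separate metadata files may also appear in the same output directory.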
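A short round trip illustrates the point about metadata. This is a sketch under stated assumptions: the sample rows, the column names and the temporary output directory are all made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, just for illustration.
val spark = SparkSession.builder()
  .appName("parquet-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data and a temporary output directory.
val people = Seq(("Asha", 31), ("Ravi", 27)).toDF("name", "age")
val outDir = Files.createTempDirectory("parquet-sketch").resolve("people").toString

people.write
  .option("compression", "snappy") // explicit here, although snappy is the default codec
  .parquet(outDir)

// No schema is supplied on read: Spark recovers it from the Parquet file footers.
val readBack = spark.read.parquet(outDir)
readBack.printSchema()

// Selecting one column lets Spark skip the other column on disk.
readBack.select("name").show()
```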

3. When to use the Avro data file format?

Avro is a good choice when the schema is expected to change over time, because it supports schema evolution.

4. Difference between DAG and lineage?

The DAG is the graphical representation of a Spark program, where each vertex represents an operation and each edge represents a dependency between operations.

Lineage is the part of the DAG that records the chain of one or more operations from which a given RDD was created. Spark uses it to recompute lost partitions after a failure.
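The lineage of an RDD can be inspected with toDebugString. The following is a minimal sketch, assuming a local SparkSession; the RDD names and the data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, just for illustration.
val spark = SparkSession.builder()
  .appName("lineage-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Each transformation creates a new RDD that depends on the previous one.
val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Prints the lineage of `squares`: the chain of parent RDDs it was built from.
println(squares.toDebugString)

// Nothing has run so far; the action below submits the DAG for execution.
println(squares.collect().mkString(", ")) // 4, 16, 36, 64, 100
```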

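5. Why is the DAG an acyclic graph?

A DAG is a directed graph without cycles: every edge points in one direction only, so if you start from any vertex and follow the edges you can never arrive back at that vertex. In Spark this holds because RDDs are immutable and every transformation produces a new RDD, so an RDD can depend only on RDDs created before it.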

6. If a Spark application writes duplicate records into a target file, how can the duplicate rows be removed?

Duplicate rows can be removed with the distinct() or dropDuplicates() function.

distinct() removes rows whose values match in every column, whereas dropDuplicates() can take a list of columns and removes rows that share the same values in just those columns, keeping one row per combination. Called with no arguments, dropDuplicates() behaves like distinct().

val disDF = df.distinct()

val dropdupDF = df.dropDuplicates("column_name1", "column_name2")
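To see how the two calls differ on real rows, here is a small sketch assuming a local SparkSession and a hypothetical three-column DataFrame (id, name, city); none of these names come from a real dataset.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, just for illustration.
val spark = SparkSession.builder()
  .appName("dedup-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: row 2 repeats row 1 exactly; row 3 repeats only id and name.
val df = Seq(
  (1, "Asha", "Pune"),
  (1, "Asha", "Pune"),
  (1, "Asha", "Delhi"),
  (2, "Ravi", "Pune")
).toDF("id", "name", "city")

// Only the exact duplicate is dropped.
val disDF = df.distinct()
println(disDF.count()) // 3

// One row is kept per (id, name) pair; which city survives for id 1 is not guaranteed.
val dropdupDF = df.dropDuplicates("id", "name")
println(dropdupDF.count()) // 2
```

Writing the deduplicated DataFrame back out, rather than the original one, is what keeps the duplicates out of the target.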

Happy learning!
