ITechShree-Data-Analytics-Technologies

Apache Spark Interview Questions Set 3

1. Difference between coalesce and repartition?

 

repartition() is used to increase or decrease the number of partitions. It performs a full shuffle and spreads the data roughly evenly across the new partitions.

coalesce() is used to decrease the number of partitions. It merges existing partitions instead of doing a full shuffle, which minimizes the amount of data moved, but the resulting partitions can be uneven in size.
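
As a rough illustration of the difference, here is a minimal sketch that can be pasted into spark-shell. The session settings, the DataFrame, and the partition counts are assumptions chosen only for the demo.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[4]").appName("partition-demo").getOrCreate()

// Start from a DataFrame with 4 partitions (the numbers here are arbitrary).
val df = spark.range(0, 1000, 1, 4).toDF("id")

// repartition triggers a full shuffle and can raise or lower the partition count.
val increased = df.repartition(8)

// coalesce merges existing partitions without a full shuffle, so it can only lower the count.
val decreased = df.coalesce(2)

// Asking coalesce for more partitions than exist leaves the count unchanged.
val unchanged = df.coalesce(16)

println(increased.rdd.getNumPartitions)
println(decreased.rdd.getNumPartitions)
println(unchanged.rdd.getNumPartitions)
```

A common use is to call coalesce() just before writing, to reduce the number of output files without paying for a full shuffle.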

 

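2. Advantages of Parquet file format in Spark?

Parquet is the default data source format in Spark, and Parquet with Snappy compression is a well-optimized choice for most Spark applications. It is a columnar format, so Spark can read only the columns a query needs. Each Parquet file carries its schema and metadata in its footer, so the files are self-describing; depending on configuration, summary metadata files may also be written in the same directory.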
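
The sketch below shows one way to write and read Parquet with Snappy compression. The sample data, column names, and temporary path are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("parquet-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data and a temporary output location.
val orders = Seq((1, "book", 12.5), (2, "pen", 1.2)).toDF("id", "item", "price")
val path = Files.createTempDirectory("parquet-demo").resolve("orders").toString

// Snappy is already Spark's default Parquet codec; it is set explicitly here for clarity.
orders.write.option("compression", "snappy").parquet(path)

// The schema comes from the Parquet footer, so none has to be supplied on read.
val loaded = spark.read.parquet(path)
loaded.printSchema()

// Selecting a subset of columns lets Spark skip the other columns on disk.
loaded.select("item").show()
```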

3. When to use the Avro data file format?

 

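When the schema changes over time, Avro is a good choice. The schema is stored together with the data, and Avro supports schema evolution, so fields can be added (with default values) or removed while older files remain readable.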
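
A small sketch of schema evolution with Avro follows. It assumes the external spark-avro module is on the classpath (for example via --packages), and the data, path, and the extra "country" field are hypothetical. The record name "topLevelRecord" is used because that is the default name Spark gives to the records it writes.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("avro-demo").getOrCreate()
import spark.implicits._

// Write data with the "old" schema: only id and name.
val users = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val path = Files.createTempDirectory("avro-demo").resolve("users").toString
users.write.format("avro").save(path)

// An evolved reader schema that adds a new field with a default value (assumed for this demo).
val evolvedSchema =
  """{"type":"record","name":"topLevelRecord","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"name","type":["string","null"]},
    |  {"name":"country","type":"string","default":"unknown"}
    |]}""".stripMargin

// The avroSchema option asks Spark to read the old files using the evolved schema.
val evolved = spark.read.format("avro").option("avroSchema", evolvedSchema).load(path)
evolved.show()
```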

 

4. Difference between DAG and lineage?

A DAG is the graphical representation of a Spark program, in which the vertices represent RDDs and the edges represent the operations applied to them.

Lineage is the part of the DAG that records the chain of operations and parent RDDs from which a given RDD was created. Spark uses it to recompute lost partitions.
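
One way to look at the lineage of an RDD is toDebugString. This is only a sketch; the RDD names and values are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lineage-demo").getOrCreate()
val sc = spark.sparkContext

// Each transformation creates a new RDD that remembers its parent.
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)
val large = doubled.filter(_ > 10)

// toDebugString prints the chain of parent RDDs that `large` was derived from.
println(large.toDebugString)
```

The full DAG of a job, including stages, can be viewed in the Spark web UI once an action has run.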

 

5. Why is a DAG an acyclic graph?

 

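A DAG is a directed graph without cycles: every edge points in one direction only, so starting from any vertex and following the edges, you can never come back to the same vertex. In Spark, each transformation produces a new RDD from existing ones, so an RDD can never depend on itself.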
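
The one-way direction can be checked on a tiny example, sketched here with made-up RDD names. A child RDD points back to its parent, but the parent holds no reference forward to the child.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("acyclic-demo").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 5)
val doubled = numbers.map(_ * 2)

// The child RDD depends on its parent: the dependency refers to `numbers` itself.
val parentOfDoubled = doubled.dependencies.head.rdd
println(parentOfDoubled eq numbers)

// The source RDD has no dependencies, so no edge leads from it back to `doubled`.
println(numbers.dependencies.isEmpty)
```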

 

6. If a Spark application writes duplicate records to a target file, how can the duplicate rows be removed?

 

Duplicate rows can be removed by applying the distinct() or dropDuplicates() function to the DataFrame before it is written to the target.

distinct() removes rows that have the same values in all columns, whereas dropDuplicates() can also remove rows that have the same values in a chosen subset of columns.

 

val disDF = df.distinct()

 

val dropdupDF = df.dropDuplicates("column_name1", "column_name2")
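
To see the two functions side by side, here is a small sketch with made-up data and column names.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample: the first two rows are identical, the third differs only in "tag".
val events = Seq(
  (1, "a", "x"),
  (1, "a", "x"),
  (1, "a", "y")
).toDF("id", "name", "tag")

// distinct() compares all columns, so only the exact duplicate row is dropped.
val fullyDistinct = events.distinct()

// dropDuplicates on (id, name) keeps a single row per key combination.
val byKey = events.dropDuplicates("id", "name")

println(fullyDistinct.count())
println(byKey.count())
```

Note that when dropDuplicates() is given a subset of columns, which of the matching rows is kept is not guaranteed.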

 


Happy learning!
