1. Difference between coalesce and repartition?
Repartition is used to increase or decrease the number of partitions. It redistributes the data into roughly equal-sized partitions, which requires a full shuffle.
Coalesce is used to decrease the number of partitions. It merges existing partitions instead of redistributing every row, which minimizes the amount of data that is shuffled, but the resulting partitions can be uneven in size.
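A minimal sketch of the two calls, assuming a local SparkSession. The object name PartitionDemo, the helper partitionCounts and the generated sample data are made up for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  // Returns the partition counts after repartition(8) and after coalesce(2).
  def partitionCounts(spark: SparkSession): (Int, Int) = {
    // Sample data generated on the fly (an assumption for this sketch).
    val df = spark.range(0, 100000).toDF("id")

    // repartition: full shuffle, can increase or decrease the partition count.
    val wide = df.repartition(8)

    // coalesce: merges existing partitions, avoids a full shuffle.
    val narrow = wide.coalesce(2)

    (wide.rdd.getNumPartitions, narrow.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[4]") // local mode, for illustration only
      .getOrCreate()

    val (afterRepartition, afterCoalesce) = partitionCounts(spark)
    println(s"repartition: $afterRepartition partitions, coalesce: $afterCoalesce partitions")

    spark.stop()
  }
}
```

A common pattern is to call coalesce just before writing, so the job produces fewer output files without paying for another full shuffle.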
2. Advantages of Parquet file format in Spark?
Parquet is Spark's default file format, and Parquet with snappy compression is a well-optimized choice for Spark applications. Each Parquet file carries its schema and other metadata in its footer, so the files are self-describing. Depending on the writer configuration, summary metadata files may also appear in the same output directory.
When the schema is expected to change over time, Avro is often the better choice because of its support for schema evolution.
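As a rough sketch, writing and reading Parquet with snappy compression could look like this. The object name ParquetDemo, the helper roundTrip, the sample rows and the temporary output path are assumptions made for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetDemo {
  // Writes a tiny DataFrame as snappy-compressed Parquet and reads it back.
  def roundTrip(spark: SparkSession, dir: String): DataFrame = {
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    df.write
      .mode("overwrite")
      .option("compression", "snappy") // snappy is also Spark's default Parquet codec
      .parquet(dir)

    // No schema is supplied here: it is read from the Parquet footers.
    spark.read.parquet(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-demo")
      .master("local[2]") // local mode, for illustration only
      .getOrCreate()

    val dir = Files.createTempDirectory("parquet-demo").resolve("people").toString
    roundTrip(spark, dir).printSchema()

    spark.stop()
  }
}
```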
4. Difference between DAG and lineage?
DAG is the graphical representation of a Spark program, where each vertex is an operation and each edge is a dependency between operations.
Lineage is the part of the DAG that records the chain of operations used to create an RDD from its parent RDDs. Spark uses it to recompute lost partitions.
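One way to look at the lineage of an RDD is toDebugString. The sketch below assumes a local SparkSession. The object name LineageDemo, the helper lineageOf and the word-count data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  // Builds a small RDD pipeline and returns its lineage as text.
  def lineageOf(spark: SparkSession): String = {
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a")) // source RDD
    val pairs  = words.map(w => (w, 1))             // narrow dependency
    val counts = pairs.reduceByKey(_ + _)           // wide dependency (shuffle)

    // Describes counts and every parent RDD it was derived from.
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-demo")
      .master("local[2]") // local mode, for illustration only
      .getOrCreate()

    println(lineageOf(spark))

    spark.stop()
  }
}
```

The printed text lists the final RDD first and its parents below it, which is the chain Spark would replay to rebuild a lost partition.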
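5. Why is a DAG an acyclic graph?
A DAG is a graph that is directed and has no cycles. Every edge goes in one direction only, so following the edges from any vertex can never lead back to that vertex. In Spark this holds because RDDs are immutable: each transformation creates a new RDD from its parents, so dependencies only ever point back to earlier RDDs.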
6. If a Spark application writes data containing duplicate records to a target file, how do we remove the duplicate rows in the target?
Duplicate rows can be removed by using the distinct() or dropDuplicates() function before writing the data.
distinct() removes rows that have the same values in all columns, whereas dropDuplicates() can take a list of columns and removes rows that have the same values in just those columns. Called without arguments, dropDuplicates() behaves like distinct().
val disDF = df.distinct()
val dropdupDF = df.dropDuplicates("column_name1", "column_name2")
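To make the difference concrete, here is a small sketch assuming a local SparkSession. The object name DedupDemo, the helper dedupCounts, the column names and the sample rows are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DedupDemo {
  // Returns (row count after distinct, row count after dropDuplicates on id and name).
  def dedupCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical sample rows.
    val df = Seq(
      (1, "alice", "NY"),
      (1, "alice", "NY"), // exact duplicate of the first row
      (1, "alice", "LA")  // same id and name, different city
    ).toDF("id", "name", "city")

    // distinct compares all columns, so only the exact duplicate is dropped.
    val disDF = df.distinct()

    // dropDuplicates compares only the listed columns, so one row per (id, name) is kept.
    val dropdupDF = df.dropDuplicates("id", "name")

    (disDF.count(), dropdupDF.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-demo")
      .master("local[2]") // local mode, for illustration only
      .getOrCreate()

    println(dedupCounts(spark))

    spark.stop()
  }
}
```

Note that when dropDuplicates() is given a subset of columns, which of the matching rows is kept is not guaranteed.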
Happy learning!