+1 vote
in Azure by
What are some performance-tuning techniques for Mapping Data Flow activity?

1 Answer

0 votes
by

We could consider the below set of parameters for tuning the performance of a Mapping Data Flow activity we have in a pipeline.

i) We should leverage partitioning in the source, sink, or transformation whenever possible. Microsoft, however, recommends using the default partition (size 128 MB) selected by the Data Factory as it intelligently chooses one based on our pipeline configuration.

Still, one should try out different partitions and see if they can have improved performance.

ii) We should not use a data flow activity for each loop activity. Instead, we have multiple files similar in structure and processing needs. In that case, we should use a wildcard path inside the data flow activity, enabling the processing of all the files within a folder.

iii) The recommended file format to use is ‘. parquet’. The reason being the pipeline will execute by spinning up spark clusters, and Parquet is the native file format for Apache Spark; thus, it will generally give good performance.

iv) Multiple logging modes are available: Basic, Verbose, and None.

We should only use verbose mode if essential, as it will log all the details about each operation the activity performs. e.g., It will log all the details of the operations performed for all our partitions. This one is useful when troubleshooting issues with the data flow.

The basic mode will give out all the necessary basic details in the log, so try to use this one whenever possible.

v)  Try to break down a complex data flow activity into multiple data flow activities. Let’s say we have several transformations between source and sink, and by adding more, we think the design has become complex. In this case, try to have it in multiple such activities, which will give two advantages:

All activities will run on separate spark clusters, decreasing the run time for the whole task.

The whole pipeline will be easy to understand and maintain in the future. 

Related questions

+1 vote
asked Mar 14, 2023 in Azure by Robindeniel
0 votes
asked May 25, 2023 in Informatica by Robindeniel
...