Saturday, March 14, 2020

Einstein Analytics: Dataflow Performance Best Practice

Performance is critical for Einstein Analytics dataflows. For example, an optimized dataflow may run in only 10 minutes, while the same dataflow with a poor design may take 1 hour (including sync setup). Without well-architected dataflows, it will be hard to maintain and sustain Einstein Analytics as a whole as the company evolves.

Here are a few items based on my personal findings and experience. If you have additional input or a different perspective, feel free to reach out to me.

1. Combine all computeExpression nodes whenever possible



In image-1, the calcURI node contains one computed field that returns a Numeric, and the calURI2 node contains another computed field that also returns a Numeric. The two nodes together (calcURI1 + calURI2) took a total of 3:41 sec.

In image-2, we combined both computed fields into the calcURI node, and it took only 2:0 sec.
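As a rough sketch of what the combined node might look like in the dataflow definition JSON: the source node name (Extract_Opportunity), the field names and the SAQL expressions below are made up for illustration, and only the node name calcURI comes from the example above.

```json
{
  "calcURI": {
    "action": "computeExpression",
    "parameters": {
      "source": "Extract_Opportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Example_Numeric_1",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "'Amount' * 0.1"
        },
        {
          "name": "Example_Numeric_2",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "'Amount' * 0.2"
        }
      ]
    }
  }
}
```

Both fields sit in one computedFields array, so the stream is processed by a single computeExpression node instead of two.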

2. Do compute as early as possible, and augment as late as possible

The rationale is that a compute node placed before an augment processes fewer fields, because an augment always adds fields to the stream. The exception is when the computation needs a field that comes from the augment node.
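A minimal sketch of this ordering, assuming hypothetical nodes Extract_Opportunity and Extract_Account that are defined elsewhere in the same dataflow. The compute node reads straight from the digest, and the augment takes the compute node as its left source, so the Account fields are only added afterwards. All node and field names here are illustrative.

```json
{
  "calcDiscount": {
    "action": "computeExpression",
    "parameters": {
      "source": "Extract_Opportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Example_Discount",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "'Amount' * 0.05"
        }
      ]
    }
  },
  "augmentAccount": {
    "action": "augment",
    "parameters": {
      "left": "calcDiscount",
      "left_key": ["AccountId"],
      "relationship": "Account",
      "right": "Extract_Account",
      "right_key": ["Id"],
      "right_select": ["Name", "Industry"]
    }
  }
}
```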

3. Remove all unnecessary fields

In my experience, a dataflow usually serves a dashboard or a clone of a dashboard. The more fields each node handles, the more processing power and time it needs, so slice out fields that are not needed in the dashboard or lens.


Notice that calcURI3 in image-1 and image-2 took around 2:08 sec. In image-3, we added a slice node before calcURI3 to remove unnecessary fields. This reduces the number of fields processed in calcURI3, so it took only 1:55 sec.
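The slice step could be sketched roughly as follows. The node name sliceFields, its source augmentAccount, and the field list are assumptions for illustration; only calcURI3 comes from the example above. With "mode": "select" the node keeps only the listed fields, while "mode": "drop" would remove the listed fields instead.

```json
{
  "sliceFields": {
    "action": "sliceDataset",
    "parameters": {
      "source": "augmentAccount",
      "mode": "select",
      "fields": [
        { "name": "Id" },
        { "name": "Amount" },
        { "name": "StageName" }
      ]
    }
  },
  "calcURI3": {
    "action": "computeExpression",
    "parameters": {
      "source": "sliceFields",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Example_Numeric_3",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "'Amount' * 0.3"
        }
      ]
    }
  }
}
```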

4. Combine all sfdcDigest nodes of the same object into one node, if sync is not enabled

For some reason, your org may not have sync enabled. This does not mean you must enable it straight away. Please DO NOT enable it without a complete analysis, as this may cause data filtering issues.

You should combine all sfdcDigest nodes of the same object into one node. Imagine you have 10 million Opportunity rows and each sfdcDigest node takes 10 minutes (as an example): if the dataflow designer adds 3 sfdcDigest nodes for Opportunity, data retrieval alone will take 30 minutes.
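As an illustration, a single digest node for Opportunity might look like the sketch below. The node name and field list are hypothetical. The idea is that one node extracts the union of the fields that the separate digest nodes were pulling, and every downstream branch uses this node as its source.

```json
{
  "Extract_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "Amount" },
        { "name": "StageName" },
        { "name": "CloseDate" },
        { "name": "AccountId" },
        { "name": "OwnerId" }
      ]
    }
  }
}
```

If a branch needs only a subset of these fields, a slice or filter node after the digest can narrow the stream without extracting the object again.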

5. Do not perform a null check in a filter node

A filter node with a null check takes a lot of time when the dataflow runs. So instead of having something like 'Check.Id' is null in a SAQL filter, create a computeExpression node with a Yes/No computed field, then filter with CheckIdIsNull:EQ:Yes.
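A minimal sketch of the two nodes, assuming a hypothetical upstream node named augmentCheck that brings in the 'Check.Id' field. The node names are illustrative, while the field name CheckIdIsNull and the filter expression are the ones mentioned above.

```json
{
  "flagCheckIdIsNull": {
    "action": "computeExpression",
    "parameters": {
      "source": "augmentCheck",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "CheckIdIsNull",
          "type": "Text",
          "saqlExpression": "case when 'Check.Id' is null then \"Yes\" else \"No\" end"
        }
      ]
    }
  },
  "filterCheckIdIsNull": {
    "action": "filter",
    "parameters": {
      "source": "flagCheckIdIsNull",
      "filter": "CheckIdIsNull:EQ:Yes"
    }
  }
}
```

The null test runs once in the compute node, and the filter node then only compares a plain Text value.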
