Here are a few items noted based on my personal findings and experience. If you have additional input or a different perspective, feel free to reach out to me.
1. Combine all computeExpression nodes whenever possible
In image-1, the calcURI1 node contains one computed field returning Numeric, and the calcURI2 node likewise contains one other computed field returning Numeric; run as separate nodes, calcURI1 + calcURI2 took a total of 3:41 (3 minutes 41 seconds).
In image-2, we combined both computed fields into a single calcURI node, and it took only 2:00.
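As an illustration, a combined node could look like the following sketch in the dataflow JSON. The source node (getOpportunity) and the computed-field names and expressions here are hypothetical placeholders, not the actual fields from the images:

```json
{
  "calcURI": {
    "action": "computeExpression",
    "parameters": {
      "source": "getOpportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Amount_K",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount / 1000"
        },
        {
          "name": "Amount_Weighted",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount * Probability / 100"
        }
      ]
    }
  }
}
```

Each entry in computedFields is one computed field, so what used to be two nodes collapses into a single pass over the stream.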
2. Do compute as early as possible, and augment as late as possible
The rationale behind this is that a compute node placed before an augment processes fewer fields, since an augment always adds fields to the stream. The exception is when the computation needs a field that only the augment node provides; see the sketch below.
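A hypothetical ordering sketch (all node and field names are mine, for illustration only): the computeExpression sources directly from the Opportunity digest and runs over its narrow field list, and the augment joins the Account fields in afterwards:

```json
{
  "getOpportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "AccountId" },
        { "name": "Amount" }
      ]
    }
  },
  "getAccount": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "Industry" }
      ]
    }
  },
  "calcAmountK": {
    "action": "computeExpression",
    "parameters": {
      "source": "getOpportunity",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Amount_K",
          "type": "Numeric",
          "precision": 18,
          "scale": 2,
          "saqlExpression": "Amount / 1000"
        }
      ]
    }
  },
  "augmentAccount": {
    "action": "augment",
    "parameters": {
      "left": "calcAmountK",
      "left_key": [ "AccountId" ],
      "right": "getAccount",
      "right_key": [ "Id" ],
      "relationship": "Account",
      "right_select": [ "Name", "Industry" ]
    }
  }
}
```

If calcAmountK were placed after augmentAccount instead, it would have to carry the extra Account.Name and Account.Industry columns through the computation for no benefit.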
3. Remove all unnecessary fields
In most of my experience, a dataflow usually backs a dashboard or a clone of a dashboard. The more fields each node handles, the more processing power and time it needs, so slice out fields that are not used in the dashboard or lens.
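The cheapest fix is to not list unneeded fields in the sfdcDigest at all. For fields that are needed mid-flow (join keys, intermediate values) but not in the final dataset, a sliceDataset node can drop them; a sketch with hypothetical node and field names:

```json
{
  "dropWorkFields": {
    "action": "sliceDataset",
    "parameters": {
      "source": "augmentAccount",
      "mode": "drop",
      "fields": [
        { "name": "AccountId" },
        { "name": "Probability" }
      ]
    }
  }
}
```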
4. Combine all sfdcDigest nodes of the same object into one node, if sync is not enabled
For whatever reason, your org may not have sync enabled. That does not mean you must enable it straight away; please DO NOT enable it without a complete analysis, as doing so may cause data filtering issues.
Instead, combine all sfdcDigest nodes of the same object into a single node. Imagine you have 10 million rows of Opportunity and every sfdcDigest node takes 10 minutes (as an example); if the dataflow designer adds 3 sfdcDigest nodes for Opportunity, the data retrieval alone will take 30 minutes.
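The consolidated pattern, sketched with hypothetical names: a single sfdcDigest pulls the superset of Opportunity fields that every branch needs, and each downstream node points its source at that one digest, so the 10-minute extraction happens once instead of three times:

```json
{
  "getOpportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "AccountId" },
        { "name": "StageName" },
        { "name": "Amount" }
      ]
    }
  },
  "filterWon": {
    "action": "filter",
    "parameters": {
      "source": "getOpportunity",
      "saqlFilter": "StageName == \"Closed Won\""
    }
  },
  "filterOpen": {
    "action": "filter",
    "parameters": {
      "source": "getOpportunity",
      "saqlFilter": "StageName != \"Closed Won\" && StageName != \"Closed Lost\""
    }
  }
}
```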