Notes

Sunday, May 22, 2022

Improve Azure Data Explorer (Kusto) query performance with shuffle strategy

The arg_max operator in Kusto is great to filter out duplicate data during query. But it can be quite resource intensive and result in very long query time compared to not filtering out duplicate data.

table
| summarize arg_max(ingestion_time(), *) by SomeKey, OtherKey

If the group by key has a high cardinality, using the shuffle strategy can yield better performance.

table
| summarize hint.strategy = shuffle arg_max(ingestion_time(), *) by SomeKey, OtherKey

It is also possible to specify the suffle key for a multi-key group by using hint.shufflekey = <key>

table
| summarize hint.shufflekey = OtherKey arg_max(ingestion_time(), *) by SomeKey, OtherKey

I also tried to use “Progressive Mode” for Kusto queries, hoping it would allow me to get a stream of the results an process them on the fly. But that’s not really how it works. With “Progressive Mode” on, the results are send in multiple frames, but each frame could append to or replace completely the data received previously. So You still need to wait until the query is over to get the final result.

Handle duplicate data in Azure Data Explorer

shuffle query documentation