Overview
This document summarises our discussion about the role of EXPLAIN in running InterMine queries and how it should be implemented. The purpose of running EXPLAIN is to obtain estimated information about the time to run a query and the number of rows it will return allowing us to limit what is run in the database. We also want to make estimated (and/or computed) statistics about a query available to client applications prior to execution. Information available from EXPLAIN
EXPLAIN displays the execution plan that the planner produces for a given SQL statement. The plan details how tables will be scanned and what join strategies will be used. In addition some estimated statistics are produced for the query:
- The estimated cost of running a query in terms of disk page fetches, two figures are displayed:
- estimated time for the first row to be returned
- estimated time for completion of entire query
- An estimated number of rows that the query will return, based on estimated selectivity of where clauses.
- The average width (bytes) of rows produced by the query
When we need to run EXPLAIN
Our purpose in running EXPLAIN for all queries is twofold:
- to provide users with an estimated number of rows a given query will produce.
- to prevent queries that would take more than a threshold amount of time from being run in the database.
Our implementation of paging dictates that these are two distinct operations. When a query is run the Results object defines a LIMIT and OFFSET to restrict the number of rows returned at one time (e.g the first 1000). It is when when this limited query is run that we need to restrict the amount of time it can take. To estimate the total number of rows a query will return we obviously need to EXPLAIN it without a LIMIT.
Thus, we need to run EXPLAIN on the original query and on each 'page' with OFFSET and LIMIT defined. This is done as follows:
- ObjectStore.estimate(query) method that, given a query, will return a ResultsInfo object containing information about the estimated number of rows and time to run. The Results object has a similar method (getInfo()) which will include information such as any knowledge the Results object has on the number of rows. This is called as required by the client application and is not necessary for normal query execution.
- ObjectStoreInterMineImpl.execute(query, start, limit) performs an explain on the batch to be fetched, and decides if the query is faster or slower that the given threshold. If the query is too slow an Exception is thrown, otherwise the query runs as normal. The threshold is provided in the properties file, so that users could set their own maximum execution time (such that it is less than our global threshold).
