When compiling with a local repository it would be useful to avoid reparsing ECL, especially the dependencies, in order to speed up syntax checks.
The proposal is to allow an additional parallel directory containing files with cached parse information from parsing the attributes. The cached parse files are in JSON format and contain the following information:
- any meta information
- any warnings and errors
- a list of all the dependencies, along with the timestamp of the meta file for each dependency
- a simplified representation of the ECL - which preserves the type information, but not the values.
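As a rough illustration, a cached parse file might look something like the following. This is a hypothetical sketch only: the field names and layout are invented, not a defined schema.

```json
{
  "meta": { "source": "A/B/x.ecl", "generatedBy": "eclcc" },
  "errors": [
    { "severity": "warning", "line": 12, "message": "Implicit conversion from STRING to INTEGER" }
  ],
  "dependencies": [
    { "name": "A.B.y", "timestamp": "2023-05-01T10:00:00Z" }
  ],
  "simplified": "EXPORT x := DATASET([], { STRING20 name, UNSIGNED4 id });"
}
```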
For each source file A/B/x.ecl there may be two corresponding files:
- one for the last good parse of the attribute (A/B/x.parsed)
- if it contains errors, one for the current definition of the attribute (A/B/x.error)
When a file is syntax checked, the associated parse file is checked first.
A parse file is assumed to be up to date if:
- it has a modification date later than that of the associated source file
- each of its direct dependencies is up to date
- its timestamp is >= the timestamp of each of the direct dependencies' parse files
- [Previously the idea was that the timestamps of each direct dependency's parse file had to match the timestamps recorded in this parse file.]
If the parse file is up to date then the simplified definition is parsed instead of the main definition.
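The up-to-date check above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual implementation; `parse_file_for` and the `read_dependencies` callback are invented names for the purpose of the example.

```python
import os

def parse_file_for(source_path):
    # Hypothetical mapping: A/B/x.ecl -> A/B/x.parsed in the parallel cache directory
    base, _ = os.path.splitext(source_path)
    return base + ".parsed"

def is_up_to_date(source_path, read_dependencies, checked=None):
    """Return True if the cached parse file for source_path can be trusted.

    read_dependencies(parse_path) is assumed to return the list of direct
    dependency source paths recorded in that parse file.
    """
    if checked is None:
        checked = {}
    if source_path in checked:          # memoise so shared dependencies are walked once
        return checked[source_path]
    parse_path = parse_file_for(source_path)
    if not os.path.exists(parse_path):
        checked[source_path] = False
        return False
    parse_time = os.path.getmtime(parse_path)
    # 1. The parse file must be newer than the associated source file
    if parse_time <= os.path.getmtime(source_path):
        checked[source_path] = False
        return False
    ok = True
    for dep in read_dependencies(parse_path):
        # 2. Each direct dependency must itself be up to date
        if not is_up_to_date(dep, read_dependencies, checked):
            ok = False
            break
        # 3. This parse file must be at least as new as each dependency's parse file
        if parse_time < os.path.getmtime(parse_file_for(dep)):
            ok = False
            break
    checked[source_path] = ok
    return ok
```

The memoisation dictionary also prevents repeated work when many attributes share the same dependencies, which matters for very large repositories.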
NOTE: It might be possible for the error parse file to be up to date; in that case should it be used? (An example is legacy import rules with a symbol that happens to match a global module but is not used. Does the current code ignore errors in the global definition?)
- Add an option to eclcc which specifies a cache directory.
- Add an option to eclcc to allow a compound cache file to be provided.
- Add an option to indicate that cache entries should be created.
Remaining implementation steps:
- Design an interface so that the parse information can be cleanly accessed from an IEclSource. Ensure the logic is in the correct place.
- When syntax checking and resolving an identifier, if the parse file is valid then check for a valid simplified definition, and use it if present.
- Design the interface for cleanly writing the parse information. (Existing code may be sufficient.)
- When a file is compiled or syntax checked ensure the following information is created in the meta file:
o Dependencies and meta information first
o The code needs to cleanly allow updates to parse files and a compound parse file - with clean interfaces!
o The code needs to set different entries depending on whether the syntax check succeeded or not.
- Add code to generate simplified definitions of the exported symbols.
o Possibly introduce a new keyword
o Datasets (including sort order, grouping and distribution).
o Rows, transforms
o Function definitions
o Function definitions with nasty parameters
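To make the idea concrete, a simplified definition for an exported dataset might look something like this. This is purely hypothetical ECL: `_SIMPLIFIED_` stands in for the possible new keyword mentioned above, and the exact representation is still to be designed.

```
// Original definition
EXPORT r := { STRING20 name, UNSIGNED4 id };
EXPORT people := SORT(DATASET('~demo::people', r, FLAT), name);

// Hypothetical simplified form: preserves the type, record layout and
// sort order, but not the value
EXPORT people := _SIMPLIFIED_(SORTED(DATASET([], r), name));
```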
- Option to syntax check the world
o How would this be implemented? Restart when it runs low on memory?
o Note: Syntax check should be faster and use less memory on subsequent calls since it can use the simplified definitions.
- Create code to walk the dependency information and generate a reverse dependency tree.
Possibly store in another json file in the directory.
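Walking the dependency information to build the reverse map could be as simple as the following illustrative Python sketch (the input shape is assumed, not fixed):

```python
def build_reverse_dependencies(deps):
    """deps: mapping of attribute -> list of attributes it depends on.

    Returns the reverse mapping: attribute -> list of attributes that use it,
    which is what is needed to find everything affected by a change.
    """
    reverse = {}
    for attr, direct in deps.items():
        reverse.setdefault(attr, [])        # ensure every attribute has an entry
        for dep in direct:
            reverse.setdefault(dep, []).append(attr)
    return reverse
```

The result could be serialised to the extra JSON file mentioned above, and regenerated whenever the parse files change.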
- Create code to generate an archive for an attribute from the dependency information - without parsing the query. (Ensure parse file interface is clean to avoid duplicated code.)
- Optimize creating an archive when some of the parse information is out of date (use a variation of the syntax check - but ensure the original (with all dependencies), rather than the simplified ECL is added to the archive).
Possibilities to aid testing:
- Add an option to indicate all metadata should be generated to a compound file.
- Add an option to read cache information from a compound file.
- What happens if the attributes/modules are added or deleted while eclcc is running? (Can occur if a git checkout takes place while a syntax check is running.)
- Are there any situations where the code parsing the query needs to take account of parse files that have just been written? (I can't think of any, and if not it simplifies things significantly.)
- Change the regression suites so compound parse files are always generated, so we can monitor for changes and verify correctness (and no longer generate dependencies).
- Run on all examples in the regression suite: generate a compound cache file, then check that all entries in the compound cache file syntax check.
- Ideally also check that parsed attributes are compatible with their simplified versions. Add an eclcc option to verify that when they are created.
- Test boil the world on some large repositories (e.g., BOCA)
- Test multiple eclcc running at the same time (e.g., all boiling the ocean).
Other requirements and considerations:
- Simplified definitions should only be generated if we are 100% confident they are correct.
- Macros never create a definition file.
- Give warnings if simplified definitions cannot be created.
- Modules containing macros could be nasty (although they could possibly contain the macros as-is)
- Updates to disk files must be atomic - multiple eclcc processes may be running and updating the files at the same time.
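One standard way to get atomic updates is to write to a temporary file in the same directory and then rename it into place. The sketch below assumes a POSIX-style filesystem, where a rename within a directory is atomic:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically: concurrent readers (e.g. other eclcc
    processes) see either the old contents or the new contents, never a
    partially written file."""
    dirname = os.path.dirname(path) or "."
    # The temporary file must be on the same filesystem for the rename to be atomic
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # ensure the data hits disk before the rename
        os.replace(tmp, path)          # atomic on POSIX; replaces any existing file
    except BaseException:
        os.unlink(tmp)                 # clean up the temporary file on failure
        raise
```

A last-writer-wins rename also sidesteps locking: if two eclcc processes regenerate the same parse file, both renames succeed and the surviving copy is a complete, valid file.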
- Need to be careful about generating imports if they are required by the simplified definitions.
- Must cope with very large repositories
- If it crashes it must leave the directories in a consistent state.
- Must continue where it left off if called on a partially indexed result.
- Avoid even syntax checking again if known to be up to date
May need to create different simplified expressions for constant and non-constant expressions.
There may be issues with NAMEOF(index) if the index is simplified, and with similar constructs.