Introduce a new environment-wide error log, stored in dali (or cassandra), for critical errors. These can then be presented to the user in eclwatch as an icon on the banner, with more detail on the main activity page.
Some examples of items that should be logged:
- Any component that restarts when it didn't close down as expected.
- OOM errors.
- Running out of disk space.
- Critical syslog errors that are currently
Ideally the message would include information about which node and component had the problem and which workunit was running at the time (if the information is available).
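As a rough illustration of what one log entry might carry, here is a minimal Python sketch. Every name in it (`ErrorCategory`, `CriticalErrorEntry`, the field names, the category strings) is hypothetical and chosen only for this example; none of it is an existing platform API, and the real storage format in dali or cassandra would likely differ. The node, component and workunit fields are optional because that information may not always be available.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    # Hypothetical categories mirroring the examples listed above.
    UNEXPECTED_RESTART = "unexpected-restart"
    OUT_OF_MEMORY = "oom"
    DISK_FULL = "disk-full"
    SYSLOG_CRITICAL = "syslog-critical"


@dataclass(frozen=True)
class CriticalErrorEntry:
    # Illustrative record layout; all field names are assumptions.
    category: ErrorCategory
    message: str
    node: Optional[str] = None       # host or IP, if known
    component: Optional[str] = None  # component name, if known
    wuid: Optional[str] = None       # workunit running at the time, if known
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def summary(self) -> str:
        """Return a one-line description suitable for a banner tooltip."""
        where = "/".join(p for p in (self.node, self.component) if p) or "unknown"
        text = f"[{self.category.value}] {where}: {self.message}"
        if self.wuid:
            text += f" (workunit {self.wuid})"
        return text
```

A short `summary()` string could back the banner icon, while the full record could feed the more detailed view on the activity page.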
The log should be stored so that old log entries can be periodically removed (if desired). A sasha process could do that.
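The periodic cleanup could be as simple as dropping entries older than a configurable age. The sketch below is an assumption about how that might look, not a description of any existing housekeeping code: the function name `prune_old_entries`, the `(timestamp, message)` tuple layout and the retention period are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

# Hypothetical minimal entry shape: (timestamp, message).
LogEntry = Tuple[datetime, str]


def prune_old_entries(
    entries: Iterable[LogEntry],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[LogEntry]:
    """Return only the entries whose timestamp is within max_age of now."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - max_age
    # Keep entries at or after the cutoff; everything older is dropped.
    return [entry for entry in entries if entry[0] >= cutoff]
```

Passing `now` explicitly keeps the function deterministic for testing; a scheduled job would normally leave it unset and use the current time.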