System error handling
Short summary
This system will track errors on all machines and projects, including errors that are otherwise hard to detect or see. The first version will track only system errors and store them for easier reporting and detection. After that, we will add a way to set rules for cron jobs, mainly to enforce a maximum execution time. From there we can build further features, such as reporting issues to Jira automatically and notifying people.
Who will be using this system
This is mainly for the development team. The support team should have restricted access: they will use it to set rules for the cron jobs and, when required, to check for errors. The developers will use the full feature set to track any reported bugs and issues and to see whether something is wrong with a specific project or machine.
How it will be used
This system will run on a separate machine and use the new interface, but it will include none of the options currently in the Hemi software.
- We will have a list of all reports, classified by machine and type (critical, warning, notice, etc.). An issue that occurred more than once will be grouped and will have a timestamp for every occurrence. Issues will be listed by folder name, not by project name, because the same project can have several folders. Every issue and machine will have an archive button, which will clear the log and store its issues in the database as reported. Later on we will add a button to report to Jira.
- We will have another list of rules for the cron jobs. The rules will be split by machine, folder, cron name and parameters. For now, each rule will have a max execution time and a max number of instances, which by default will be none. Because a cron can run with multiple parameters, it is better to keep the number of instances enforced the way it is now, but the rule will let us check for this kind of error (a cron running multiple times despite what is in its code).
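The grouping of repeated issues in the report list can be sketched as follows. This is only a minimal Python illustration of the data shape, not the project's code (the real system would be built on the existing MVC code base). The `IssueGroup` class, the `group_occurrences` function, the dictionary keys and the choice of grouping key (machine, folder, type, message) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IssueGroup:
    """One row in the report list: an issue plus every time it occurred."""
    machine: str
    folder: str          # folder name, not project name
    error_type: str      # e.g. critical, warning, notice
    message: str
    timestamps: list = field(default_factory=list)


def group_occurrences(occurrences):
    """Group raw occurrences that share machine, folder, type and message.

    `occurrences` is a list of dicts with the hypothetical keys
    machine, folder, type, message and timestamp.
    """
    groups = {}
    for occ in occurrences:
        key = (occ["machine"], occ["folder"], occ["type"], occ["message"])
        if key not in groups:
            groups[key] = IssueGroup(*key)
        # Keep a timestamp for every occurrence, as the list requires.
        groups[key].timestamps.append(occ["timestamp"])
    return list(groups.values())
```

With this shape, an error that appears twice in the Export folder on one machine becomes a single entry with two timestamps, while the same message in another folder stays a separate entry.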
Important notes about the system
The tables and enumerations in the database should not follow the form ids and specific ids of the Hub system. Only the main ones may be kept for easier use. This will be a separate environment that uses the same code, because we don't need to write it twice.
How it will work
To implement the first step we need:
- Set the default error log directory for the PHP environment on every server. It will store every exception/error thrown by PHP. The location will be decided and implemented together with the system administrators. Some MVC-based projects have their own Error handler; we should update them to use the constant that will be set in the PHP config. This applies to all MVC projects and the Exports project.
- Create a project based on MVC that will have a different database structure but keep the key features (like the analytics project). This project will work as the interface on the systems handling machine. It should also have an executable file that will be used to kill cron jobs and perform other functions. This will be used in later steps; for now we just need to set up the structure.
- Create the database structure for the project. We will need the users/forms from the Hemi software, and a table for the reported errors with columns for the machine name, error type, error message, folder name, file name, timestamp, reported and archive_id. Most fields will be varchar, with text for the longer ones, boolean for reported, and long int for the timestamp (this is what the field type that works with it uses). archive_id should be linked to the archives table and default to 0. Its behavior is explained in later steps.
- In the systems project we will have a cron that scans all error log directories. To know which machines to connect to and how, it can use the same machine structure as the deployer project. On every machine it will read the error file set in that machine's PHP config, group the errors by machine name, folder name and text, and store them in the database so that they can be previewed. It will also send an email if an error occurs more than a specific number of times after it is logged. Both the email addresses and the number of occurrences allowed before reporting by email should be stored in a settings table.
- The errors table needs to be visible, so we need to set it up with a predefined filter that the archive_id field is 0. This shows only the errors that are not archived.
- We will have an archive action for each machine. For it we will need a database table for archives, based on machine and date. It will also need a status field (pending, done). We don't need an error status, because if an error occurs it will be reported if it is not seen at the moment. The action will be a button in the interface, at the top right of the block for the table. It will create a record with status pending in this table. For the interface we can create it as a component and hardcode it where we need it.
- We will need a cron that checks all pending archives, connects to each of their machines and deletes the error log file. Once this succeeds, it should update the status from pending to done and store the archive id on all reported errors for this machine whose archive_id is 0.
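The archive flow from the last three steps can be sketched as below. This is a hedged illustration in Python with an in-memory SQLite database, not the intended implementation. The column names come from the list above, but the SQLite types, the `created_at` column name, the helper function names and the `delete_log` callback (which stands in for connecting to the machine and deleting its error log file) are assumptions.

```python
import sqlite3

# Assumed schema: column names follow the steps above, types are SQLite stand-ins.
SCHEMA = """
CREATE TABLE errors (
    id INTEGER PRIMARY KEY,
    machine_name TEXT, error_type TEXT, error_message TEXT,
    folder_name TEXT, file_name TEXT,
    timestamp INTEGER,
    reported INTEGER DEFAULT 0,
    archive_id INTEGER DEFAULT 0      -- 0 means "not archived"
);
CREATE TABLE archives (
    id INTEGER PRIMARY KEY,
    machine_name TEXT,
    created_at INTEGER,               -- hypothetical name for the archive date
    status TEXT DEFAULT 'pending'     -- pending or done
);
"""


def visible_errors(db):
    """Predefined filter of the errors table: only rows with archive_id = 0."""
    return db.execute(
        "SELECT id, machine_name, error_message FROM errors "
        "WHERE archive_id = 0 ORDER BY id"
    ).fetchall()


def request_archive(db, machine, now):
    """What the archive button does: create a pending archive record."""
    cur = db.execute(
        "INSERT INTO archives (machine_name, created_at) VALUES (?, ?)",
        (machine, now),
    )
    return cur.lastrowid


def process_pending_archives(db, delete_log):
    """What the archive cron does for every pending archive.

    `delete_log(machine)` is a placeholder that must return True once the
    error log file on that machine has been deleted.
    """
    pending = db.execute(
        "SELECT id, machine_name FROM archives WHERE status = 'pending'"
    ).fetchall()
    for archive_id, machine in pending:
        if not delete_log(machine):
            continue  # leave it pending so the next run retries
        db.execute("UPDATE archives SET status = 'done' WHERE id = ?", (archive_id,))
        db.execute(
            "UPDATE errors SET archive_id = ? "
            "WHERE machine_name = ? AND archive_id = 0",
            (archive_id, machine),
        )
    db.commit()
```

Leaving a record on pending when the delete fails is a design choice of this sketch: it matches the decision not to have an error status, since the next cron run simply tries again.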
Step 2 will be the cron killer. For it we will need:
- A basic interface and a database table for cron rules with the columns machine_name, folder_name, cron_name, cron_parameters, max_execution_time and max_number_of_instances. The machine_name, folder_name, cron_name and cron_parameters columns are varchar fields that can be 0. A value of 0 means all machines/folders/crons/parameters.
- Create an executable file that kills a specific cron by name and optional id, and does nothing else. It should search by the given name and kill all crons with this name. If the id is provided, it should kill only the cron with the given id and name, not all crons with that name.
- For this to work in the Exports and Imports projects we need one more thing there: a registered shutdown callback that checks whether the execution was killed or finished properly. In case of a kill, it should change the status of the import/export log to failed.
- Create a cron that will run on the systems handling machine. It will check all machines that have rules set for them, read the start time of each running cron, compare it to the current time, and run the kill file on the machine to kill the crons that break the rules. The more specific rule should be applied first. A rule’s power is determined by whether it specifies the machine, then the folder, then the cron: the more data provided, the higher the rule’s power. Parameter rules should be applied only if the cron name is specified; otherwise they should be ignored.
For example, suppose there are three rules. First rule: for all machines, all crons have a max execution time of 1 hour. Second rule: for arcade, all crons have a max execution time of 2 hours. Third rule: for all machines, all crons in the folder Export have a max execution time of 24 hours. On the arcade machine, all Export crons should then have a max execution time of 24 hours, not 2.
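One way to make the rule power concrete is sketched below in Python. It is an interpretation, not a settled design: it assumes each specified field adds a weight (machine 1, folder 2, cron 4), so that a folder rule outranks a machine rule as in the example above. The `Rule` class, the function names, the use of seconds for max_execution_time and the weights themselves are assumptions. Parameter rules and max_number_of_instances are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional

ANY = "0"  # in the rules table, 0 means all machines/folders/crons


@dataclass(frozen=True)
class Rule:
    machine_name: str
    folder_name: str
    cron_name: str
    max_execution_time: int  # seconds (the unit is an assumption)


def matches(rule, machine, folder, cron):
    """A rule applies when every field is either 'all' or an exact match."""
    pairs = (
        (rule.machine_name, machine),
        (rule.folder_name, folder),
        (rule.cron_name, cron),
    )
    return all(want in (ANY, got) for want, got in pairs)


def power(rule):
    """Assumed weighting: machine=1, folder=2, cron=4; higher wins."""
    return (
        (rule.machine_name != ANY) * 1
        + (rule.folder_name != ANY) * 2
        + (rule.cron_name != ANY) * 4
    )


def effective_limit(rules, machine, folder, cron) -> Optional[int]:
    """Return the max execution time of the most powerful matching rule."""
    candidates = [r for r in rules if matches(r, machine, folder, cron)]
    if not candidates:
        return None
    return max(candidates, key=power).max_execution_time


def should_kill(rules, machine, folder, cron, started_at, now):
    """True when the cron has run longer than its effective limit allows."""
    limit = effective_limit(rules, machine, folder, cron)
    return limit is not None and now - started_at > limit


HOUR = 3600
EXAMPLE_RULES = [
    Rule(ANY, ANY, ANY, 1 * HOUR),        # all machines, all crons
    Rule("arcade", ANY, ANY, 2 * HOUR),   # arcade, all crons
    Rule(ANY, "Export", ANY, 24 * HOUR),  # all machines, folder Export
]
```

With `EXAMPLE_RULES`, `effective_limit(EXAMPLE_RULES, "arcade", "Export", "some_cron")` returns 86400 (24 hours), which matches the example, while a cron in any other folder on arcade gets 7200 (2 hours). If the team prefers a different ordering, only `power` needs to change.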