heart
Heartbeat monitoring of an Erlang runtime system.
This module contains the interface to the heart process. heart sends
periodic heartbeats to an external port program, which is also named
heart. The purpose of the heart port program is to check that the
Erlang runtime system it is supervising is still running. If the port
program has not received any heartbeats within HEART_BEAT_TIMEOUT
seconds (defaults to 60 seconds), the system can be rebooted. Also, if
the system is equipped with a hardware watchdog timer and is
running Solaris, the watchdog can be used to supervise the entire
system.
An Erlang runtime system to be monitored by a heart program
is to be started with command-line flag -heart (see also erl(1)).
The heart process is then started automatically:
% erl -heart ...
If the system is to be rebooted because of missing heartbeats,
or a terminated Erlang runtime system, environment variable
HEART_COMMAND
must be set before the system is started.
If this variable is not set, a warning text is printed but
the system does not reboot. However, if the hardware watchdog is
used, it still triggers a reboot HEART_BEAT_BOOT_DELAY
seconds later (defaults to 60 seconds).
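Because a missing HEART_COMMAND only produces a warning, it can be useful to check at startup whether the variable is visible to the runtime system. The following is a minimal sketch, not part of the heart module; the module name heart_env_check and the function reboot_configured/0 are hypothetical, and only os:getenv/1 is a real library call.

```erlang
-module(heart_env_check).
-export([reboot_configured/0]).

%% Hypothetical helper: reports whether HEART_COMMAND is set in the
%% environment of this runtime system.
%% os:getenv/1 returns false when the variable is not set.
reboot_configured() ->
    case os:getenv("HEART_COMMAND") of
        false -> {warning, heart_command_not_set};
        Cmd   -> {ok, Cmd}
    end.
```

An application could call heart_env_check:reboot_configured() during startup and log the warning case, since in that case heart does not reboot the system.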
To reboot on Windows, HEART_COMMAND
can be
set to heart -shutdown
(included in the Erlang delivery)
or to any other suitable program that can activate a reboot.
The hardware watchdog is not started under Solaris if
environment variable HW_WD_DISABLE
is set.
The environment variables HEART_BEAT_TIMEOUT
and
HEART_BEAT_BOOT_DELAY
can be used to configure the heart
time-outs; they can be set in the operating system shell before Erlang
is started or be specified at the command line:
% erl -heart -env HEART_BEAT_TIMEOUT 30 ...
The value (in seconds) must be in the range 10 < X <= 65535.
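The range above excludes 10 and includes 65535. As an illustration only, the rule can be written as a guard; the module heart_timeout and the function valid_timeout/1 are hypothetical names and do not exist in OTP.

```erlang
-module(heart_timeout).
-export([valid_timeout/1]).

%% Hypothetical helper mirroring the documented range 10 < X =< 65535.
%% Returns true only for integers strictly greater than 10 and at
%% most 65535.
valid_timeout(X) when is_integer(X), X > 10, X =< 65535 -> true;
valid_timeout(_) -> false.
```

With this sketch, valid_timeout(10) is false and valid_timeout(11) is true.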
Notice that if the system clock is adjusted by
more than HEART_BEAT_TIMEOUT seconds, heart
times out and tries to reboot the system. This can occur, for
example, if the system clock is adjusted automatically by use of the
Network Time Protocol (NTP).
If a crash occurs, an erl_crash.dump
is not
written unless environment variable
ERL_CRASH_DUMP_SECONDS
is set:
% erl -heart -env ERL_CRASH_DUMP_SECONDS 10 ...
If a regular core dump is wanted, let heart know by setting
the kill signal to abort using environment variable
HEART_KILL_SIGNAL=SIGABRT. If unset, or not set to SIGABRT,
the default behavior is a kill signal using SIGKILL:
% erl -heart -env HEART_KILL_SIGNAL SIGABRT ...
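The selection rule described above (SIGABRT only when explicitly requested, SIGKILL otherwise) can be sketched as a small function. This only mirrors the documented rule from inside Erlang; it is not how the heart port program is implemented, and the module heart_kill_signal and function effective_signal/0 are hypothetical.

```erlang
-module(heart_kill_signal).
-export([effective_signal/0]).

%% Hypothetical helper: returns the signal that, per the description
%% above, would be used for the current value of HEART_KILL_SIGNAL.
%% Any value other than "SIGABRT" (including unset) yields sigkill.
effective_signal() ->
    case os:getenv("HEART_KILL_SIGNAL") of
        "SIGABRT" -> sigabrt;
        _Other    -> sigkill
    end.
```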
Furthermore, ERL_CRASH_DUMP_SECONDS has the
following behavior on heart:
ERL_CRASH_DUMP_SECONDS=0
Suppresses the writing of a crash dump file entirely, thus rebooting the runtime system immediately. This is the same as not setting the environment variable.
ERL_CRASH_DUMP_SECONDS=-1
Setting the environment variable to a negative value does not reboot the runtime system until the crash dump file is completely written.
ERL_CRASH_DUMP_SECONDS=S
heart waits for S seconds to let the crash dump
file be written. After S seconds, heart reboots the
runtime system, whether the crash dump file is written or not.
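The three cases listed above can be summarized as one function over the integer value of ERL_CRASH_DUMP_SECONDS. This is an illustrative sketch only; the module heart_dump_policy, the function policy/1, and the returned atoms are hypothetical names chosen for this example.

```erlang
-module(heart_dump_policy).
-export([policy/1]).

%% Hypothetical classifier for the documented ERL_CRASH_DUMP_SECONDS
%% cases:
%%   0        -> no crash dump is written, reboot immediately
%%   negative -> reboot only after the crash dump is completely written
%%   S > 0    -> wait at most S seconds for the crash dump, then reboot
policy(0) ->
    no_dump_reboot_immediately;
policy(S) when is_integer(S), S < 0 ->
    wait_until_dump_written;
policy(S) when is_integer(S), S > 0 ->
    {wait_at_most_seconds, S}.
```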
In the following descriptions, all functions fail with reason
badarg
if heart
is not started.
Functions
set_cmd(Cmd) -> ok | {error, {bad_cmd, Cmd}}
Cmd = string()
Sets a temporary reboot command. This command is used if
a HEART_COMMAND other than the one specified with
the environment variable is to be used to reboot
the system. The temporary command applies only to the current
runtime system: the new Erlang runtime system started by the reboot
uses (if it misbehaves) environment variable HEART_COMMAND
to reboot.
Limitations: Command string is sent to the heart
program as an ISO Latin-1 or UTF-8 encoded binary,
depending on the filename encoding mode of the emulator (see
file:native_name_encoding/0).
The size of the encoded binary must be less than 2047 bytes.
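The size limit applies to the encoded binary, not to the length of the string, so a caller may want to check it before calling set_cmd/1. The following is a sketch under stated assumptions: the module heart_cmd and function safe_set_cmd/1 are hypothetical, and the whereis(heart) test assumes that the heart process is registered under the name heart.

```erlang
-module(heart_cmd).
-export([safe_set_cmd/1]).

%% Hypothetical wrapper around heart:set_cmd/1.
%% Encodes Cmd the same way the limitation above describes (Latin-1
%% or UTF-8, depending on file:native_name_encoding/0) and rejects
%% commands whose encoded size is not below 2047 bytes.
safe_set_cmd(Cmd) when is_list(Cmd) ->
    Enc = file:native_name_encoding(),          % latin1 | utf8
    case unicode:characters_to_binary(Cmd, unicode, Enc) of
        Bin when is_binary(Bin), byte_size(Bin) < 2047 ->
            %% Assumption: heart registers itself as `heart`.
            case whereis(heart) of
                undefined -> {error, heart_not_started};
                _Pid      -> heart:set_cmd(Cmd)
            end;
        Bin when is_binary(Bin) ->
            {error, too_long};
        _ErrorOrIncomplete ->
            %% Cmd cannot be represented in the native encoding.
            {error, not_encodable}
    end.
```

On a node started with -heart, heart_cmd:safe_set_cmd("heart -shutdown") would pass the command on to heart:set_cmd/1 and return its result.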
clear_cmd() -> ok
Clears the temporary reboot command. If the system terminates,
the normal HEART_COMMAND is used to reboot.
get_cmd() -> {ok, Cmd}
Cmd = string()
Gets the temporary reboot command. If the command is cleared, the empty string is returned.
set_callback(Module, Function) ->
ok | {error, {bad_callback, {Module, Function}}}
Module = Function = atom()
This validation callback is executed before any
heartbeat is sent to the port program. For the validation to
succeed, the callback must return the value ok.
An exception within the callback will be treated as a validation failure.
The callback will be removed if the system reboots.
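A callback module could look like the following sketch. It assumes that heart calls the callback as Module:Function() with no arguments, which the text above does not state explicitly; the module name heart_check, the function names, and the memory threshold are all hypothetical.

```erlang
-module(heart_check).
-export([install/0, validate/0]).

%% Hypothetical threshold: 4 GiB of total allocated memory.
-define(MAX_BYTES, (4 * 1024 * 1024 * 1024)).

%% Validation callback. Assumed to be called with no arguments
%% before each heartbeat. It must return ok for the heartbeat to be
%% sent; any other return value or an exception counts as a failure.
validate() ->
    case erlang:memory(total) < ?MAX_BYTES of
        true  -> ok;
        false -> {error, memory_limit_exceeded}
    end.

%% Registers validate/0 as the validation callback.
%% Fails if heart is not started (node not started with -heart).
install() ->
    heart:set_callback(?MODULE, validate).
```

The check itself should be cheap and must not block, because a callback that does not return ok in time withholds the heartbeat and eventually leads to a reboot.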
clear_callback() -> ok
Removes the validation callback that is called before heartbeats are sent.
get_callback() -> {ok, {Module, Function}} | none
Module = Function = atom()
Gets the validation callback. If the callback is cleared, none
is returned.
set_options(Options) -> ok | {error, {bad_options, Options}}
Options = [heart_option()]
Valid options for set_options are:
check_schedulers
If enabled, a signal is sent to each scheduler to check its responsiveness. The system check occurs before any heartbeat is sent to the port program. If any scheduler is not responsive enough, the heart program does not receive its heartbeat and thus eventually terminates the node.
Returns with the value ok
if the options are valid.
get_options() -> {ok, Options} | none
Options = [atom()]
Returns {ok, Options}
where Options
is a list of current options enabled for heart.
If no options are enabled, none
is returned.
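As a usage illustration, the scheduler check can be enabled and then read back. The module heart_opts and function enable_scheduler_check/0 are hypothetical, and the whereis(heart) test assumes that the heart process is registered under the name heart.

```erlang
-module(heart_opts).
-export([enable_scheduler_check/0]).

%% Hypothetical helper: enables the check_schedulers option and
%% returns the result of heart:get_options/0 afterwards.
enable_scheduler_check() ->
    %% Assumption: heart registers itself as `heart`; without this
    %% guard the heart functions fail when heart is not started.
    case whereis(heart) of
        undefined ->
            {error, heart_not_started};
        _Pid ->
            ok = heart:set_options([check_schedulers]),
            heart:get_options()
    end.
```

On a node started with -heart, the call is expected to return {ok, Options} with check_schedulers among the enabled options.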