Discussion:
rasdaemon and abrt
Junliang Li
2013-09-27 06:46:50 UTC
Hi all,

Fedora 19 ships a tool named rasdaemon. I understand from its maintainer,
Mauro, that rasdaemon and ABRT are meant to work together, but I don't know
much about how. Could anyone explain how rasdaemon and ABRT interact?
Jiri Moskovcak
2013-09-27 07:29:02 UTC
Hi,
Denys is responsible for the rasdaemon and ABRT integration, so I'm adding
him to the loop.

Regards,
Jirka
Denys Vlasenko
2013-10-01 14:05:45 UTC
IIUC, rasdaemon does not send its data to abrt yet.

The rasdaemon developers are working on a way to prevent
floods of error reports: it's semi-trivial to generate
a single report about an isolated ECC error on a PCIe bus,
but what if there are thousands of them per second?

We (the abrt team) have provided the documentation needed
to use abrt's "create problem data" API.

We are ready to help the rasdaemon people if they have
questions or proposals for changes in abrt.
Some of them (Petr Holasek) are colocated with
the abrt team and can just walk over and talk with us.
--
vda
Petr Holasek
2013-10-01 20:15:13 UTC
Hello all,

To be honest, I still haven't found time to dig into implementing an abrt
hook for rasdaemon, and we are also still waiting for the Intel guys, who are
implementing code to reduce floods of errors in some reasonable manner. So
I've created an RFE,
https://bugzilla.redhat.com/show_bug.cgi?id=1013567, that can be taken by
anybody interested in it.

However, I'd be happy to help on the rasdaemon side. The main idea is to
create a more generic output interface, not only console output and the DB.

regards,
Petr
Denys Vlasenko
2013-10-02 10:32:27 UTC
Post by Petr Holasek
Hello all,
To be honest, I still haven't found time to dig into implementing an abrt
hook for rasdaemon, and we are also still waiting for the Intel guys, who are
implementing code to reduce floods of errors in some reasonable manner.
How about reporting the first detected error to abrt right away, then,
if more errors happen, holding on for a few seconds and
batch-reporting them as one problem? ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem.)
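
A minimal sketch of what such a batch record could look like, purely for
illustration. The struct and the batch_init(), batch_add() and
batch_summary() functions are hypothetical names, not part of abrt or
rasdaemon, and the start time is formatted in UTC as an assumption:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical record of one batch of identical errors. */
struct error_batch {
    unsigned long count;   /* errors collected so far */
    time_t first_seen;     /* timestamp of the first error */
    time_t last_seen;      /* timestamp of the most recent error */
    const char *kind;      /* e.g. "PCIe parity" */
    const char *device;    /* e.g. "FOO" */
};

/* Start an empty batch for one kind of error on one device. */
static void batch_init(struct error_batch *b, const char *kind,
                       const char *device)
{
    memset(b, 0, sizeof(*b));
    b->kind = kind;
    b->device = device;
}

/* Account for one more error seen at time `when`. */
static void batch_add(struct error_batch *b, time_t when)
{
    if (b->count == 0)
        b->first_seen = when;
    b->last_seen = when;
    b->count++;
}

/* Format the one-line summary into buf.
 * Returns 0 on success, -1 if formatting fails or buf is too small. */
static int batch_summary(const struct error_batch *b, char *buf, size_t size)
{
    char hhmm[6];
    const struct tm *tm;
    int n;

    assert(b->count > 0);
    tm = gmtime(&b->first_seen);   /* UTC is an assumption of this sketch */
    if (tm == NULL || strftime(hhmm, sizeof(hhmm), "%H:%M", tm) == 0)
        return -1;
    n = snprintf(buf, size,
                 "%lu %s errors happened at %s during %ld seconds on the device %s",
                 b->count, b->kind, hhmm,
                 (long)difftime(b->last_seen, b->first_seen),
                 b->device);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The caller would call batch_add() for every error that arrives while the
batch is open, then batch_summary() once when the hold-off period ends.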

Increase the cooldown period if errors keep coming, with a cap.
We have something like this elsewhere in abrt:

    unsigned cooldown_sec = 5;
    ...
    cooldown_sec *= cooldown_sec;      /* square the delay each time */
    if (cooldown_sec > 15 * 60)
        cooldown_sec = 15 * 60;        /* cap at 15 minutes */

With formulas like the one above, the cooldown rises quickly, resulting in
just a few problem reports even with a constant flood of error events;
yet it does not grow to astronomical values: "collect PCIe errors
for the next 27 hours and report them as one"
is obviously a bad idea too.
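
To make the progression concrete, here is the same formula wrapped in a
small function. This is only a sketch: next_cooldown() and the two macros
are hypothetical names, not the actual abrt code.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
#define COOLDOWN_START_SEC 5u
#define COOLDOWN_CAP_SEC   (15u * 60u)

/* Square the current cooldown and clamp the result to the cap.
 * The starting value must be greater than 1, otherwise squaring
 * never makes it grow. */
static unsigned next_cooldown(unsigned cooldown_sec)
{
    assert(cooldown_sec > 1 && cooldown_sec <= COOLDOWN_CAP_SEC);
    cooldown_sec *= cooldown_sec;   /* at most 900 * 900, no overflow */
    if (cooldown_sec > COOLDOWN_CAP_SEC)
        cooldown_sec = COOLDOWN_CAP_SEC;
    return cooldown_sec;
}
```

Starting from COOLDOWN_START_SEC, successive calls give 5, 25, 625, 900,
900, ... seconds, so the cap of 15 minutes is reached on the third step.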
--
vda
Junliang Li
2013-10-09 03:26:19 UTC
On Wed, 2013-10-02 at 12:32 +0200, Denys Vlasenko wrote:
Post by Denys Vlasenko
How about reporting the first detected error to abrt right away, then,
if more errors happen, holding on for a few seconds and
batch-reporting them as one problem? ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem.)
Increase the cooldown period if errors keep coming, with a cap.
    unsigned cooldown_sec = 5;
    ...
    cooldown_sec *= cooldown_sec;
    if (cooldown_sec > 15 * 60)
        cooldown_sec = 15 * 60;
With formulas like the one above, the cooldown rises quickly, resulting in
just a few problem reports even with a constant flood of error events;
yet it does not grow to astronomical values: "collect PCIe errors
for the next 27 hours and report them as one"
is obviously a bad idea too.
A cooldown period is a good idea. Letting sysadmins customize their report
threshold in rasdaemon would be OK. Maybe we just need to add a plugin to
rasdaemon that customizes the threshold and works as the abrt hook.
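
A rough sketch of what I mean by a customizable threshold, assuming the
threshold is "number of errors in one collection window before a report is
sent" and that it arrives as a text setting. parse_threshold() and
should_report() are hypothetical names; rasdaemon has no such functions
today.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a sysadmin-supplied threshold (a positive decimal number).
 * Returns `fallback` when the text is missing, malformed, zero,
 * or out of range. */
static unsigned long parse_threshold(const char *text, unsigned long fallback)
{
    char *end = NULL;
    unsigned long value;

    assert(fallback > 0);            /* the default itself must be usable */
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return fallback;
    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno == ERANGE || *end != '\0' || value == 0)
        return fallback;
    return value;
}

/* Report only once the window has collected at least `threshold` errors. */
static int should_report(unsigned long errors_in_window,
                         unsigned long threshold)
{
    return errors_in_window >= threshold;
}
```

With a threshold of 1 every error window is reported, which matches the
behaviour discussed above; a larger value lets a sysadmin silence isolated
events.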

Regards,
Junliang Li
