rasdaemon and abrt

Junliang Li

2013-09-27 06:46:50 UTC

Hi, all

Fedora 19 has a tool named rasdaemon. I know from its maintainer, Mauro,
that rasdaemon and ABRT will work together, but I don't know much about
it. Could anyone explain how rasdaemon and ABRT fit together?

Jiri Moskovcak

2013-09-27 07:29:02 UTC

Post by Junliang Li
Hi, all
Fedora 19 has a tool named rasdaemon. I know from its maintainer, Mauro,
that rasdaemon and ABRT will work together, but I don't know much about
it. Could anyone explain how rasdaemon and ABRT fit together?

Hi,
Denys is responsible for rasdaemon&ABRT integration, so I'm adding him
to the loop.

Regards,
Jirka

Denys Vlasenko

2013-10-01 14:05:45 UTC

Post by Junliang Li
Fedora 19 has a tool named rasdaemon. I know from its maintainer, Mauro,
that rasdaemon and ABRT will work together, but I don't know much about
it. Could anyone explain how rasdaemon and ABRT fit together?

Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to the loop.

IIUC rasdaemon does not yet send its data to abrt.

The rasdaemon developers are working on a way to prevent
floods of error reports: it's semi-trivial to generate
a single report about an isolated ECC error on the PCIe bus,
but what if there are thousands of them per second?

We (the abrt team) have provided the documentation needed
to use abrt's "create problem data" API.

We are ready to help the rasdaemon people if they have
questions or proposals for changes in abrt.
Some of them (Petr Holasek) are colocated with the
abrt team and can just walk over and talk with us.

--
vda

Petr Holasek

2013-10-01 20:15:13 UTC

Post by Denys Vlasenko

Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to the loop.

IIUC rasdaemon does not yet send its data to abrt.
The rasdaemon developers are working on a way to prevent
floods of error reports: it's semi-trivial to generate
a single report about an isolated ECC error on the PCIe bus,
but what if there are thousands of them per second?
We (the abrt team) have provided the documentation needed
to use abrt's "create problem data" API.
We are ready to help the rasdaemon people if they have
questions or proposals for changes in abrt.
Some of them (Petr Holasek) are colocated with the
abrt team and can just walk over and talk with us.

Hello all,

to be honest, I still haven't found time to dig into implementing an abrt
hook for rasdaemon, and we are also still waiting for the Intel folks who are
implementing code to reduce floods of errors in some reasonable manner. So I've
created an RFE, https://bugzilla.redhat.com/show_bug.cgi?id=1013567, that can
be taken by anybody interested in it.

However, I'd be happy to help on the rasdaemon side. The main idea is to create
a more generic output interface, not only console output and DB.
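
To make the "more generic output interface" idea concrete, here is a minimal sketch in C. It is not rasdaemon code: the names ras_event, ras_output, ras_output_register and ras_output_dispatch are all hypothetical, and the counting backend merely stands in for whatever an abrt hook might eventually do.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical event record; real rasdaemon events carry far more detail. */
struct ras_event {
    const char   *source;   /* e.g. "pcie" */
    const char   *message;  /* human-readable description */
    unsigned long count;    /* how many occurrences this event represents */
};

/* One output backend: console, DB, or (some day) an abrt hook. */
struct ras_output {
    const char *name;
    int (*report)(const struct ras_event *ev, void *ctx); /* 0 on success */
    void *ctx;                                             /* backend state */
};

#define MAX_OUTPUTS 8
static struct ras_output outputs[MAX_OUTPUTS];
static size_t n_outputs;

/* Add a backend to the registry; returns 0 on success, -1 on failure. */
static int ras_output_register(const struct ras_output *out)
{
    if (n_outputs == MAX_OUTPUTS || out->report == NULL)
        return -1;
    outputs[n_outputs++] = *out;
    return 0;
}

/* Hand one event to every backend; returns how many accepted it. */
static size_t ras_output_dispatch(const struct ras_event *ev)
{
    size_t ok = 0;
    for (size_t i = 0; i < n_outputs; i++)
        if (outputs[i].report(ev, outputs[i].ctx) == 0)
            ok++;
    return ok;
}

/* Console backend: ctx is an optional FILE *, defaulting to stdout. */
static int console_report(const struct ras_event *ev, void *ctx)
{
    FILE *fp = ctx ? (FILE *)ctx : stdout;
    return fprintf(fp, "%s: %s (x%lu)\n",
                   ev->source, ev->message, ev->count) < 0 ? -1 : 0;
}

/* Counting backend: ctx points to an unsigned long running total. */
static int count_report(const struct ras_event *ev, void *ctx)
{
    *(unsigned long *)ctx += ev->count;
    return 0;
}
```

With a table like this, the event-decoding code would not need to know which outputs exist; adding an abrt backend would mean registering one more report callback.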

regards,
Petr

Denys Vlasenko

2013-10-02 10:32:27 UTC

Post by Petr Holasek

Post by Denys Vlasenko

Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to the loop.

IIUC rasdaemon does not yet send its data to abrt.
The rasdaemon developers are working on a way to prevent
floods of error reports: it's semi-trivial to generate
a single report about an isolated ECC error on the PCIe bus,
but what if there are thousands of them per second?
We (the abrt team) have provided the documentation needed
to use abrt's "create problem data" API.
We are ready to help the rasdaemon people if they have
questions or proposals for changes in abrt.
Some of them (Petr Holasek) are colocated with the
abrt team and can just walk over and talk with us.

How about reporting the first detected error to abrt right away, then,
if more errors happen, holding off for a few seconds and
batch-reporting them as one problem ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem).

Increase the cooldown period if errors keep coming, with a cap.
We have something like this elsewhere in abrt:

    unsigned cooldown_sec = 5;
    ...
    /* square the cooldown after each report, capped at 15 minutes */
    cooldown_sec *= cooldown_sec;
    if (cooldown_sec > 15 * 60)
        cooldown_sec = 15 * 60;

With a formula like the one above, the cooldown rises quickly (5 s, then
25 s, then 625 s, then the 15-minute cap), resulting in just
a few problem reports even with a constant flood of error events;
yet it does not grow to astronomical values: "collect PCIe errors
for the next 27 hours and report them as one"
is obviously a bad idea too.
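
The whole scheme (report the first error at once, count the rest during a hold-off window, grow the window while the flood continues) could be sketched roughly as follows. This is only an illustration under assumptions: struct error_batcher, batcher_on_error and batcher_poll are hypothetical names, not existing abrt or rasdaemon code, and the caller is assumed to call batcher_poll() from some periodic timer.

```c
#include <assert.h>
#include <time.h>

#define COOLDOWN_START_SEC 5u
#define COOLDOWN_CAP_SEC   (15u * 60u)

/* Hypothetical per-device state; not a rasdaemon or abrt structure. */
struct error_batcher {
    unsigned      cooldown_sec; /* length of the current hold-off window */
    time_t        window_end;   /* errors arriving before this are only counted */
    unsigned long pending;      /* errors counted since the last report */
};

static void batcher_init(struct error_batcher *b)
{
    b->cooldown_sec = COOLDOWN_START_SEC;
    b->window_end = 0;
    b->pending = 0;
}

/* Square the cooldown and clamp it to the cap. */
static unsigned next_cooldown(unsigned cooldown_sec)
{
    assert(cooldown_sec <= COOLDOWN_CAP_SEC); /* keeps the square from overflowing */
    cooldown_sec *= cooldown_sec;
    if (cooldown_sec > COOLDOWN_CAP_SEC)
        cooldown_sec = COOLDOWN_CAP_SEC;
    return cooldown_sec;
}

/*
 * Call for every detected error. Returns how many errors the caller
 * should report right now as one problem, or 0 if the error was held.
 */
static unsigned long batcher_on_error(struct error_batcher *b, time_t now)
{
    if (now < b->window_end) {
        b->pending++;               /* inside the window: just count it */
        return 0;
    }
    unsigned long n = b->pending + 1;
    /* Held errors mean the flood continues; otherwise start over. */
    b->cooldown_sec = b->pending ? next_cooldown(b->cooldown_sec)
                                 : COOLDOWN_START_SEC;
    b->pending = 0;
    b->window_end = now + b->cooldown_sec;
    return n;
}

/*
 * Call periodically. Returns the size of the batch to report once the
 * window has expired, or 0 if there is nothing to report yet.
 */
static unsigned long batcher_poll(struct error_batcher *b, time_t now)
{
    if (b->pending == 0 || now < b->window_end)
        return 0;
    unsigned long n = b->pending;
    b->cooldown_sec = next_cooldown(b->cooldown_sec);
    b->pending = 0;
    b->window_end = now + b->cooldown_sec;
    return n;
}
```

Note that squaring only grows the window because the starting value is greater than 1; a starting cooldown of 1 second would stay at 1 forever.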

--
vda

Junliang Li

2013-10-09 03:26:19 UTC

On Wed, 2013-10-02 at 12:32 +0200, Denys Vlasenko wrote:

Post by Denys Vlasenko

Post by Petr Holasek

Post by Denys Vlasenko

Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to the loop.

IIUC rasdaemon does not yet send its data to abrt.
The rasdaemon developers are working on a way to prevent
floods of error reports: it's semi-trivial to generate
a single report about an isolated ECC error on the PCIe bus,
but what if there are thousands of them per second?
We (the abrt team) have provided the documentation needed
to use abrt's "create problem data" API.
We are ready to help the rasdaemon people if they have
questions or proposals for changes in abrt.
Some of them (Petr Holasek) are colocated with the
abrt team and can just walk over and talk with us.

How about reporting the first detected error to abrt right away, then,
if more errors happen, holding off for a few seconds and
batch-reporting them as one problem ("1234 PCIe parity errors happened
at 12:34 during 4 seconds on the device FOO" would be a nice way to report
such a problem).
Increase the cooldown period if errors keep coming, with a cap.
    unsigned cooldown_sec = 5;
    ...
    cooldown_sec *= cooldown_sec;
    if (cooldown_sec > 15 * 60)
        cooldown_sec = 15 * 60;
With a formula like the one above, the cooldown rises quickly, resulting in just
a few problem reports even with a constant flood of error events;
yet it does not grow to astronomical values: "collect PCIe errors
for the next 27 hours and report them as one"
is obviously a bad idea too.

A cooldown period is a good idea. Letting sysadmins customize the report
threshold in rasdaemon would be fine. Maybe we just need to add a plugin to
rasdaemon that applies the configured threshold and works as the abrt hook.
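
As a rough illustration of a sysadmin-tunable threshold, the plugin could validate a configured value along these lines. This is a sketch only: parse_threshold is a hypothetical helper, and where the text comes from (a config file, an environment variable, a command-line option) is left open.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a sysadmin-supplied threshold in seconds.
 * Returns `fallback` for missing, malformed, zero or out-of-range input,
 * and clamps valid values to `max`.
 */
static unsigned parse_threshold(const char *text, unsigned fallback, unsigned max)
{
    /* Require a leading digit: strtoul() would otherwise accept "-1". */
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return fallback;

    errno = 0;
    char *end;
    unsigned long value = strtoul(text, &end, 10);

    /* Reject overflow, trailing junk such as "12x", and a zero threshold. */
    if (errno != 0 || *end != '\0' || value == 0)
        return fallback;

    return value > max ? max : (unsigned)value;
}
```

Falling back to a built-in default on bad input, instead of failing, keeps error reporting working even when the configuration is wrong.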

Regards,
Junliang Li