Any (developer) option to override log quarantine?

Question

Created 1w

We recently migrated our entire product to Apple Unified Logging because of the various benefits it provides. However, we immediately started hitting the "log quarantine" problem ("QUARANTINED DUE TO HIGH LOGGING VOLUME"). This is partly because we are indeed over-logging in a few cases (which we have to fix), but also because it's a complicated product with potentially hundreds of libraries, and some of the code can legitimately be very busy. For example, we have a system extension that is both a NetworkExtension client and an EndpointSecurity client. If we log enough information about each network or file system event to troubleshoot something, those logs are bound to be high volume.

Now, when our app is running in a normal user environment, this is not a problem. We can disable certain heavy log levels, or at least disable persisting for certain logs. (One of the benefits of Apple Unified Logging we really like is its flexible controls: the log config command, OSLogPreferences, and configuration profiles, so we can use whichever suits a specific case.) But ultimately, the question is: what if we end up with a troubleshooting case where we don't know exactly where the problem is, so we need the full logs at debug level? And not only enabled: because we might not know when the issue will happen either, we also need to persist the full set of logs for as long as possible. We will start hitting log quarantine again. Granted, this is a very extreme case, but if worst comes to worst, how can we even do that with Apple Unified Logging? Is there an option that allows us to override the quarantine, if but temporarily?

I've searched a few relevant forum posts. Some of them describe log quarantine, but none mention a solution for it (besides logging less from the app, and as I explained, we do have legitimate cases where log volume can still be huge). I've also read The Eskimo's "Your Friend the System Log" and browsed some of the troubleshooting config profiles provided by Apple, hoping to discover some hidden payloads, but found none so far.

There is an OSLogRateLimit environment variable that I noticed when I run launchctl print system/<a-launch-daemon-label>, and it's usually 64. Is this relevant? And knowing Apple, it's probably something that can't be tampered with?

Answer 1

DTS Engineer OP

Apple

6d

I’ve been looking for an excuse to explore log quarantine in more depth, so thanks for asking this (-:

The logging subsystem definitely has knobs you can twiddle here, but it’s not clear how many of them actually work on the public releases of macOS. I’m gonna do some research and get back to you.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Answer 2

DTS Engineer OP

Apple

5d

Accepted Answer

Recommended

Well, that was fun.

Lemme start with some disclaimers.

It’s impossible to talk about logging at this level without discussing implementation details, that is, information about how the system works today but which isn’t considered API. This stuff has changed in the past and could easily change again in the future. Don’t build knowledge of this into a product that you ship to a wide range of users.

Also, the limits imposed by the system log are not arbitrary. They represent a trade-off between convenience — when debugging problems that come in from the field, more logging is always better — and cost. There are three specific costs of concern:

Logging consumes CPU cycles, which leaves less available for real computation and also takes energy.
Persistent logging consumes I/O bandwidth and even more energy.
Persistent logging can contribute to SSD wear.

The system log is a shared resource and it’s important to Apple that it remain useful for debugging a wide range of problems. I touch on this in Your Friend the System Log and I recently went into more detail in this post.

All of the above puts a limit on what I can actually talk about here. I don’t mind straying a little into the world of implementation details, but I’m not going to fully describe everything.

And with that out of the way, let’s return to the actual issue.

Lemme start with a very high-level description of how quarantine works:

The system log regularly rotates its log files [1].
When it rotates a file, the system calculates how fast it filled up.
If it filled up too quickly, that’s a sign of problems, so it takes a deeper look at the cause.
If it finds that a process logged too much, it quarantines that process.
Once quarantined, the process stays that way until it terminates.
That causes its log entries to be dropped.

IMPORTANT This process is based on log entries that persist. Non-persistent log entries aren’t a factor here.

I’m being deliberately vague about what constitutes “too quickly” and “too much”. Sorry.

So, coming back to your direct question:

Is there an option that allows us to override the quarantine, if but temporarily?

AFAICT there’s no good [2] way to do this on a public release of macOS.

Now, normally this is the point where I suggest filing an enhancement request. However, your current situation is largely theoretical: You’re concerned that you might encounter a problem in the future where debugging that problem requires you to enable persistent logging for all your log points. I’m not confident that an ER based on that theoretical concern will get traction.

So I guess my advice here is:

Structure your logging subsystems and categories to give you flexibility in how you enable and persist your log entries.
Deploy this in the real world.
If you hit a concrete case where you can’t get the logging you need, file an enhancement request with those details.
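
As a rough sketch of the first point, the script below prints log config commands that enable debug level for every category but persist only the quiet one. The subsystem and category names are hypothetical, and the comma-separated mode string is an assumption about the accepted syntax, so check man log before relying on it. The commands are printed rather than run so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical subsystem name; substitute your own reverse-DNS identifier.
SUBSYSTEM="com.example.product.sysext"

# Print (rather than run) the `log config` command for one category,
# so the commands can be reviewed before being executed with sudo.
# Usage: log_mode_cmd <category> <level> <persist>
log_mode_cmd() {
    printf 'sudo log config --subsystem %s --category %s --mode "level:%s,persist:%s"\n' \
        "$SUBSYSTEM" "$1" "$2" "$3"
}

# Busy per-event categories (hypothetical names): debug enabled,
# but kept out of the persistent store.
log_mode_cmd network-events debug off
log_mode_cmd file-events debug off
# Low-volume category (hypothetical name): persisted at debug.
log_mode_cmd lifecycle debug debug
```

Splitting the busy event paths into their own categories is what makes this possible: only persisted entries count towards quarantine, so the noisy categories can stay enabled without persisting.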

One other thing to note is that log snapshots (log collect) capture ephemeral log entries. So, you can take a page out of DTrace’s speculative logging approach and log non-persistently until you see a problem and then trigger a log snapshot to go ‘back in time’.
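
A minimal sketch of that "back in time" trigger, assuming some external check (a watchdog, a support script) decides when a problem has occurred. The default window and output path are illustrative assumptions, and how far back the non-persistent entries actually reach is not something this thread specifies.

```shell
#!/bin/sh
# Print the `log collect` command that snapshots recent log entries.
# Usage: snapshot_cmd [window] [output-path]
# The defaults below are illustrative assumptions, not recommended values.
snapshot_cmd() {
    window="${1:-10m}"
    out="${2:-/tmp/incident.logarchive}"
    printf 'sudo log collect --last %s --output %s\n' "$window" "$out"
}

# Emit a command with a timestamped archive name; run the printed
# command as soon as the problem is detected.
snapshot_cmd 10m "/tmp/incident-$(date +%Y%m%d-%H%M%S).logarchive"
```

The resulting .logarchive can then be opened in Console or inspected with log show.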

Oh, and speaking of DTrace, that’s still available on macOS. You have to disable SIP [3], but that’s generally acceptable when you’re dealing with the really gnarly problems.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] This isn’t as simple as traditional Unix-y logs, where there’s a single text file that gets rotated, but the general concept of log file rotation applies.

[2] And by “good” I mean something that I’m comfortable sharing here. Honestly, I’m not sure if this qualifier is even required. The controls that I uncovered to disable quarantine are not available on public releases of macOS.

[3] At least partially (-:

https://stackoverflow.com/questions/60908765/mac-osx-using-dtruss

Answer 3

qb_s OP

2d

While this isn't the answer I had been wishing for it does give me a lot of food for thought. Thanks The Eskimo for looking into it!

When it rotates a file, the system calculates how fast it filled up.

So wrt this note, I understand it's more complicated than just rotating a single file, but it still sounds like everyone writing into the Unified Logging system can contribute to this. It strikes me that at least one strategy (if not a preferable one) is to disable logging for all other processes/subsystems (even some of the native OS processes) that tend to be noisy too? Of course this is meant as a temporary and desperate measure, but in theory it should raise our survival chance?

And another thought, what if we disable persisting, but then have an external party help redirect our logs to be persisted?

I first thought the xctrace record --template "Logging" might be a good candidate for this task but later found out the "deferred", or "windowed" log tracing seemed to be taking only from the snapshot too. If we'd run a deferred log tracing for an hour, it would likely only save the last 5 minutes out of the snapshot. Only the "live" (i.e. the "immediate") log tracing can help persist log events it has intercepted. But conceivably live log tracing would have an even worse performance penalty since it attempts to remodel the events in real time? I've seen errors reported by the Instruments app when doing a live log tracing that due to the sheer number of events it had to drop some of them from time to time.

And what of log stream ... > persisted_logs.log? Even if performance impact introduced by live log intercepting and file writing can be deemed acceptable (as long as this is only, again, a desperate measure), would this one also suffer from dropping events?

And lastly, I filed an improvement request anyway (FB21839588). I think this problem is practical enough, in the sense that if we do a persist:debug for our heaviest component, we deterministically get quarantined. That part in itself is not a hypothetical question (but I do agree the hypothetical part is that we don't exactly know if there's a concrete case that would actually require us to enable persist:debug in the first place). I attached a logarchive example and its log stats output in the FB ticket to demonstrate our current use case and why we hit log quarantine seemingly easily.

Answer 4

DTS Engineer OP

Apple

1d

disable logging for all other processes/subsystems … that tend to be noisy too?

Yeah, I could see how that might actually help in your situation.

I filed an improvement request anyway (FB21839588).

Fair enough. I’ll be interested to see how that pans out (-:

And what of log stream …?

Honestly, I’m not sure what’ll happen in that case. Live streaming of the full system log has a significant impact on system performance, to the point that we even added a little warning about it in Console (“Stream log messages will impact system performance.”). My concern here is that, regardless of whether log entries get dropped or not [1], that performance impact could perturb the system to the point where the bug you’re hunting no longer reproduces.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] And I believe they will. I’ve seen the log stream command output a “Messages dropped during live streaming” message [2]. But having said that, I’ve never actually looked at what triggers that, or whether there are other places it can occur.

[2] You can disable that message with the --ignore-dropped option, but that only disables the message, not the dropping.
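
If you do try the log stream route, a small sketch like the following narrows the stream to one subsystem and checks the capture for drop notices afterwards. The subsystem name is hypothetical, and it is an assumption that the notice contains exactly the text quoted above and ends up in the captured file (hence the 2>&1 in the printed command).

```shell
#!/bin/sh
# Print a `log stream` command limited to one subsystem.
# Usage: stream_cmd <subsystem> <output-file>
stream_cmd() {
    printf "log stream --level debug --predicate 'subsystem == \"%s\"' > %s 2>&1\n" "$1" "$2"
}

# Count drop notices in a captured stream file.
# Assumption: each notice contains the text quoted in the thread.
count_drop_notices() {
    grep -c 'Messages dropped during live streaming' "$1"
}

# Hypothetical subsystem; the file name follows the earlier question.
stream_cmd com.example.product.sysext persisted_logs.log
```

A non-zero count means the capture has gaps, so it should not be treated as a complete record.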

Answer 5

qb_s OP

5h

Apologies, I meant to ask about activity tracing as well (it's related to this topic, but if another forum post would be better I can start one). At a certain point this also became one of the directions we wanted to explore, but we've only uncovered more questions.

The related documentation says something about this, but only vaguely:

https://developer.apple.com/documentation/os/generating-log-messages-from-your-code?language=objc#Choose-the-Appropriate-Log-Level-for-Each-Message

If an activity object exists, the system captures information for the related process chain

We had hoped that this would play into the "speculative logging" approach we touched upon, in the sense that if we log an error or fault within an activity, the system would capture and persist the other logs on the activity chain even though they were not originally meant to be persisted. Unfortunately, in our tests it didn't seem to behave that way. So our question is, if we may ask: what exactly is the additional information the system captures "for the related process chain"?

Are activities always persisted and would they also contribute to log quarantine?

Our thoughts were that if we can, we'd like to create an independent activity for each event we receive from the system so the logs on every event are automatically correlated (please refer to the same use case described in FB21839588). However as demonstrated in the FB ticket, that would mean A LOT of events and hence a lot of activities. I noticed we do have a flag Signpost-Persisted to control persisting for signposts, but there isn't a control for activities. My assumption is that they (the activities themselves) are always to be persisted so they would indeed contribute to quarantine in the worst case, is that correct? (Although from log stats it looks like each individual activity is tiny in terms of storage size, so maybe they are not a big concern themselves?)

We noticed the control flag Propagate-with-Activity used by some logging configuration files from Apple. We didn't find any official Apple documentation, but some third-party MDM vendors describe it like this:

Propagate-with-Activity: Messages attached to the activity tree. Messages are attached to the activity tree in Console and crash dumps.

Right now, based on our observation, messages are always attached to the activity tree in Console regardless of this flag, and we can't find any connection between activities and crash dumps. Maybe this flag is obsolete?