Improvements #13676: Self-deleting Http::Client - Wt - Redmine

Actions

Copy link

Improvements #13676

open

Self-deleting Http::Client

Added by Steven Köhler 4 months ago. Updated about 1 month ago.

Status:

Review

Priority:

Normal

Assignee:

Target version:

4.12.1

Start date:

03/29/2025

Due date:

% Done:

Estimated time:

Description

I am in a situation where I need a Http::Client to perform a web request (a payment process to be precise), which is triggered from within an application by the user. The request sends relevant data to an external web server and receives a response. As you can imagine, it is important that the response is properly handled. Due to the asynchronous nature of the web request, this raises some questions regarding the lifetime of the Http::Client. Storing it within the application is not viable, since the application (and thus the client) might be destroyed before the response is received. Storing it in the server itself is possible, but raises the question how and when to delete the client, since keeping it as long as the server is running seems unreasonable. Since there seems to be no direct way to check whether a client is done, a server-side polling solution seems unviable. The only possibility would be to connect a handler to the done signal to set a flag that the client is done, and regularly checking this flag, but this intoduces race conditions, since the server (due to threading issues) might detect the set flag and delete the client before the handler finished.

Long story short: When trying to find a solution, I came up with the idea of a self-deleting Http::Client, that can be used like this:

auto client = std::make_shared<Http::Client>();
client->done().connect([client](AsioWrapper::error_code ec, Http::Response& r) mutable
{
  /* do something */
  client.reset();
});
client->post(...);

The idea is to pass a shared_ptr owning the client to the connected event handler, which ultimately deletes the client at the end of the handler. This way, the client is ensured to live as long as the request is pending and either succeeds or gets a timeout - both triggers the done signal. This used to work until Wt 4.10.1 for the following reasons: An object is allowed to delete itself within a member function if the function does not access/modify the object after deleting it, since that would cause an access violation. Emitting the done signal in Client::emitDone is the last command in that function. As by the documentation, "the signal may be deleted at any time, in particular also while it is being emitted". Thus it is protected against self deletion and the Client object is no longer accessed as soon as the signal is being emitted, making it safe for self-deletion, too.

Unfortunately, WT-11408 changed that by introducing the (perfectly fine and reasonable) lock_guard/mutex to the Client::emitDone function, since the lock_guard is destroyed at the end of the function, which has to release the mutex and thus needs to access the client again, resulting in an access violation if the client already deleted itself from a connected handler. The same would happend, if the deletion happens anywhere else before the emitDone function is completely finished.

The proposed solution/improvement for this issue is to release the lock right after its job to protect the access to the _impl pointer is done, right before emitting the done signal. Thus, the client is no longer accessed as soon as emitting the signal starts, restoring the old behavior. I will create a pull request containing the change shortly. I also attached a small demo app to demonstrate the access violation with the current implementation.

I would be happy if you could merge that change into Wt. I am aware that self deleting objects may pose a risk if not handled with proper care, but it is also a powerful tool to handle the lifetime of asynchronous objects.

Best,
Steven

Files

Download all files

main.cpp (1.07 KB) main.cpp		Steven Köhler, 03/29/2025 01:41 PM
HttpClient.patch (945 Bytes) HttpClient.patch		Romain Mardulyn, 06/16/2025 03:33 PM

Actions

Copy link

Updated by Steven Köhler 4 months ago

I have created the corresponding pull request #230.

Actions

Copy link

Updated by Matthias Van Ceulebroeck 2 months ago · Edited

Target version set to 4.12.1

Hello Steven,

from a glance, this looks fine. This maintains the intended behavior of the locks, and allows for some flexibility.
I have scheduled this for 4.12.1, as I will of course need to verify the constraints of #11408 still are matched.

Actions

Copy link

Updated by Romain Mardulyn about 2 months ago

Status changed from New to InProgress
Assignee set to Romain Mardulyn

Actions

Copy link

Updated by Romain Mardulyn about 2 months ago

Status changed from InProgress to Review
Assignee deleted (~~Romain Mardulyn~~)

Actions

Copy link

Updated by Romain Mardulyn about 2 months ago

Status changed from Review to InProgress
Assignee set to Romain Mardulyn

Actions

Copy link

Updated by Romain Mardulyn about 2 months ago

Hello Steven,

The problem with the fix you proposed is that it can cause a race condition. While a signal can be deleted while being emitted, this is only safe when done from inside the same thread. The issue with Http::Client is that a different thread could delete the Client object while Client::emitDone is executing in another thread. This leads to the signal being emitted and deleted concurrently, which is not thread-safe.

We don't plan to make signals thread-safe due to the complexity that this would introduce. Therefore, the emission of that signal must be guarded by a lock, as it currently is.

I'm sorry this breaks your self-deleting Http::Client. I hope you'll be able to find an alternative solution to your issue.

Best,
Romain

Actions

Copy link

Updated by Steven Köhler about 1 month ago

Hello Romain,

thanks for your reply. That's not exactly what I have hoped for...

Your statement, that Client::emitDone and deleting a Client might happen concurrently, made me skeptical. So I did a quick check and I'm afraid, there is a much bigger problem at hand, if those calls can happen concurrently. I'll try to elaborate. Assume we have two threads A and B, where A ist calling emitDone and B is destroying the Client and lets further assume, that they are called concurrently. Now there are two options:

A grabs the implementationMutex_, B has to wait: In this case emitting the done signal is finished before the mutex is unlocked and the destructor can grab it, which works fine - albeit by luck.
B grabs the implementationMutex_, A has to wait: This case is highly problematic, since the mutex is only grabbed in the abort function, which is called first in the destructor. As soon as abort is done the mutex is unlocked - before the rest of the destructor runs. At this point there is nothing to prevent A from grabbing the mutex. This induces several issues. The first one is the concurrent call to `impl_.lock() in the destructor and the call to impl_.reset() in emitDone. If that works by chance, the destructor begins to destroy the Client member objects like the done_ signal, while emitDone will try to call it. Depending on what happens first or how the thread scheduler interleaves it, this might have various ill behaviors. If by chance the destructor manages to finish before A continues, A will try to lock the (by now) deleted mutex, which is undefined behavior. If, by any chance, that works in our favor, accessing the impl_, redirectCount_ or done_ members will wreak havoc, since they have been deleted at that point.

Summarizing: There already is a race condition in Client. Even if signals were thread safe, it would provide no protection against it. Explicitly locking the mutex in the destructor will not prevent that, since it has to be unlocked at the end of the destructor body - before the members are deleted. Not unlocking it would mean undefined behavior in the mutex destructor. What my change does is merely allowing the race condition to surface while emitting the done signal, since the mutex is unlocked earlier - but it does not cause the underlying problem.

Since there is no way for an object to check if it was deleted, the described 'use after free' issue if emitDone executes after the destructor finished cannot be solved inside the Client class, but needs proper external synchronisation.

Best,
Steven

PS: I added a conversation with ChatGPTs o4-mini-high model to double check my suspicion. https://chatgpt.com/share/68494307-dc60-8005-bfbc-908c1fc413b8

Actions

Copy link

Updated by Romain Mardulyn about 1 month ago

Hello Steven,

Thanks for your detailed response. After a closer look, I agree with your finding: there was indeed a race condition because impl_.lock() wasn't guarded by implementationMutex_. Fortunately, simply protecting that specific line with implementationMutex_ is enough to fix that issue.

Regarding your broader concern about the Http::Client object's destruction while Client::emitDone() is running, there's a crucial distinction. Client::emitDone() is guarded by Impl::updateMutex_, which Impl::removeClient() (involved in client destruction) also acquires. This design effectively prevents the Http::Client instance from being destroyed while its impl_ object is actively calling one of its methods.

However, the problem with Client::emitDone() lies in its specific action: it resets the impl_ pointer. When this happens, the Impl::updateMutex_ is no longer protecting the Http::Client's lifetime from external destruction. This is precisely why implementationMutex_ must remain locked until the very end of Client::emitDone(). It effectively takes over the role of blocking the Http::Client's destruction once impl_ has been reset, ensuring the client object remains valid for the duration of the function's execution.

Best,
Romain

Actions

Copy link

#10

Updated by Steven Köhler about 1 month ago

Hello Romain,

thanks again for your reply. I took another, more detailed look at the whole implementation. I found the clientMutex_ you mentioned and it indeed is locked in Client::Impl::emitDone and Client::Impl::removeClient, thus preventing the Client destructor to continue beyond the impl->removeClient(); line in Client::~Client() while Client::emitDone runs - assuming impl_ was not yet reset, which indeed could be ensured with properly locking the mutex.

However, since Client::impl_ is a std::weak_ptr, it does not own the Client::Impl instance. Resetting it in Client::emitDone does not delete the Client::Impl instance, but only loses access to it. Thus, clientMutex_ is not affected by resetting the impl_ ptr, which leads to Client::emitDone is completey protected by clientMutex_ - especially since Client::Impl::emitDone properly locks the mutex and is the only place where Client::emitDone is called. Thus, your statement that clientMutex_ cannot fully protect Client::emitDone seems false.

Interestingly, the Client::Impl instance cannot be deleted from inside of Client, since it has no shared_ptr to it, but only a weak_ptr. There are, however, serveral strand_.wrap(std::bind(...)) calls in Client::Impl that pass a copy of the actual Client::Impl shared_ptr, so its lifetime extends until the last of those function objects is deleted, after it has been executed. No lifetime issues there. But this also means that Client::Impl can potentially outlive Client. That causes the race issues, because the longer living Client::Impl might try to access the shorter living Client.

A rather straight forward solution would be to wrap all of Clients relevant data in a helper struct, let's say Client::Data and have Client hold a shared_ptr to it, as well as Client::Impl. That way, the lifetime of all signals and other data would be ensured to match the Client::Impl lifetime, resolving all potential race conditions from prematurely deleting Client at once. There are still a few minor details to work out, like maybe adding helper functions to the Data struct, that are called from Client as well as from Client::Impl to ensure they are still available after Client was destructed, but I think you get the idea.

What do you think?

Best,
Steven

Actions

Copy link

#11

Updated by Romain Mardulyn about 1 month ago

Hello Steven,

The scenario you described isn't possible. The Impl::emitDone call, which causes the Client to reset its impl pointer, only happens once the request has either completed or failed. This means there won't be any impl_ function calls remaining in the strand_ (with the possible exception of Impl::stop).

Regarding your proposed solution, having the done_ signal potentially emitted after the Client's destruction is problematic and goes against the fundamental design of a signal. A signal is meant to be emitted by its associated object; therefore, it doesn't make sense for done_ to be emitted after the Client object no longer exists.

While this does break your self-deleting Client, the underlying problem it aimed to solve can now be effectively addressed. As you initially suggested, you can regularly check for a flag set by the done_ signal. This approach is now viable since the Client cannot be deleted anymore before the signal handler has finished executing.

Best,
Romain

Actions

Copy link

#12

Updated by Steven Köhler about 1 month ago

Hi Romain,

thanks for your reply and sorry for taking up so much of your time. I just wanted to point out that the Impls lifetime is acutually bound to the strand, and not the Client instance itself, and hence the lifetime issues occur. Your are, of course, absolutely right that emittig the signal after the Client is deleted is not reasonable - my intention was merely to prolong its lifetime until emitDone is finished (if it already started) - after that, impl_ is destroyed within the strand and everything would be cleanud up nicely. And since Impl checks the client_pointer before actually calling any of its functions, deleting the client somewhat earlier would result in client_ == nullptr and thus skip the calls to Client - just as intended.

I'm still not convinced that guarding the Client destructor with the mutex is enough to resolve the race issue, but I don't want to waste your time either. Would you mind sharing a preview of your intended fix with me (a small patch or something like this)? Then I could test it more thoroughly and either find that it works as intended and subsequently try to integrate it into a solution for my initially described issue - which I'm sure you'll understand is important to me to work properly - or get proof that the issue is still there, in which case another solution might need to be considered.

Best,
Steven

Actions

Copy link

#13

Updated by Romain Mardulyn about 1 month ago

File HttpClient.patch HttpClient.patch added

Hi Steven,

Here's the patch I've put together that I believe resolves the issue. Please note that this is still subject to change, as Matthias needs to validate it.

Let us know if you find the issue is not completely solved or if this patch introduces any new problems.

Thanks again for your valuable input on this; it's been very helpful.

Best,
Romain

Actions

Copy link

#14

Updated by Romain Mardulyn about 1 month ago

Status changed from InProgress to Review

Actions

Copy link

#15

Updated by Romain Mardulyn about 1 month ago

Assignee deleted (~~Romain Mardulyn~~)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Wt