Common Error Patterns - Case Studies

Describes a few common error patterns that were found while testing code designed for use in an SMP environment.

During development and testing, a number of areas of code were identified that demonstrate some of the common SMP error patterns. The case studies below illustrate how these errors can be identified and show typical code fixes.

Relying on linear execution

The ETEL client API makes use of an unusual implementation class called CPtrHolder. Each ETEL client sub-session (derived from RTelSubSessionBase) owns a CPtrHolder. This class stores a dynamic array of descriptors, each of which corresponds to the IPC data used for a specific subset of ETEL client/server messages. This allows the server to read/write data from/to valid client descriptors when there are multiple outstanding messages from a particular client sub-session.

A problem occurred when the same client sub-session sent two asynchronous messages of the same type (therefore using the same CPtrHolder descriptor) to the ETEL server in a short space of time. The problem did not manifest itself in a single core environment because the ETEL server always ran at a higher priority than the ETEL client, and so the first message was always completed before the client sent the second message. In an SMP environment, however, the server runs on a different CPU from the client and may not have started processing the first message by the time the client sends the second message. The CPtrHolder descriptor was then updated with data from the second message while the server was still expecting to read data from the first message. As a result, the first message was completed by the server with a KErrCorrupt error code.

The solution was to identify which messages could suffer from this problem and send a synchronous EEtelFlushInterfaceQueue message to the server immediately after the asynchronous request. This guaranteed that the server would have read the data from the first asynchronous message by the time the second asynchronous message was sent.
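The flush pattern can be sketched in standard C++ (the ETEL APIs themselves are not shown; MiniServer, its job queue, and SendTwoRequests() are invented for illustration). A worker thread stands in for the server, a shared string stands in for the reused CPtrHolder descriptor, and a synchronous Flush() blocks until the server has drained its queue, so the buffer can safely be rewritten:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

class MiniServer {                       // stands in for the ETEL server thread
public:
    MiniServer() : iWorker([this] { Run(); }) {}
    ~MiniServer()
        {
        { std::lock_guard<std::mutex> g(iLock); iStop = true; }
        iCond.notify_all();
        iWorker.join();
        }
    void PostAsync(std::function<void()> aJob)   // asynchronous request
        {
        std::lock_guard<std::mutex> g(iLock);
        iQueue.push_back(std::move(aJob));
        iCond.notify_all();
        }
    void Flush()                                 // synchronous "flush" message
        {
        std::unique_lock<std::mutex> g(iLock);
        iDrained.wait(g, [this] { return iQueue.empty() && !iBusy; });
        }
private:
    void Run()
        {
        std::unique_lock<std::mutex> g(iLock);
        for (;;)
            {
            iCond.wait(g, [this] { return iStop || !iQueue.empty(); });
            if (iStop)
                return;
            std::function<void()> job = std::move(iQueue.front());
            iQueue.pop_front();
            iBusy = true;
            g.unlock();
            job();                               // server reads the shared buffer here
            g.lock();
            iBusy = false;
            iDrained.notify_all();
            }
        }
    std::mutex iLock;
    std::condition_variable iCond, iDrained;
    std::deque<std::function<void()>> iQueue;
    bool iStop = false, iBusy = false;
    std::thread iWorker;
};

// Two same-type requests share one buffer, like one CPtrHolder descriptor.
// Without the Flush() between them, the second assignment could race with
// the server's read of the first value.
inline std::string SendTwoRequests()
    {
    std::string sharedBuffer;                    // the reused "descriptor"
    std::string seen;
    MiniServer server;
    sharedBuffer = "first";
    server.PostAsync([&] { seen += sharedBuffer; });
    server.Flush();                              // guarantees "first" was read
    sharedBuffer = "second";
    server.PostAsync([&] { seen += sharedBuffer; });
    server.Flush();
    return seen;
    }
```

With the flush in place, the server always observes "first" then "second", mirroring the guarantee that EEtelFlushInterfaceQueue provides.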

Relying on thread suspension when making asynchronous calls

In a way, the "Non-synchronized IPC calls" category is a subset of this category. Whenever code makes assumptions, based on thread priority, about the order in which threads execute, there is a potential problem. In most cases, it is best to think of thread priority as nothing more than a hint to the scheduler. In an environment where multiple processors are being activated and deactivated dynamically, it is not safe to make any assumptions about which threads will run at a given time.

Consider the case below where a mutex is acquired by two threads with different priorities:

// High priority thread (T1)    
iMutex.Wait();
DoFirstThing();         
iMutex.Signal();         

// Low priority thread (T2)
iMutex.Wait();
DoSecondThing();
iMutex.Signal();

Even on a single processor system it is unwise to assume that T1 would acquire the lock before T2. For example, T1 could be suspended because of a page fault. While it is suspended, the scheduler could run T2, causing it to acquire the lock and execute DoSecondThing(), which is not the desired behaviour. In an SMP context both threads could be running concurrently, making it even harder to predict which thread will acquire the lock first.

While a mutex may be used to prevent concurrent access to a shared resource, it cannot be used to force two threads to perform actions in a certain sequence. The solution is to use an RSemaphore initialised with a value of 0, allowing one thread to suspend until another thread signals that it is safe to proceed. We would use this in the above example as follows:

// T1 
DoFirstThing();
iSemaphore.Signal();

// T2
iSemaphore.Wait();
DoSecondThing();

Now, T2 waits for explicit notification from T1. We are completely agnostic about the relative thread priorities. Should T2 execute first, the semaphore would be decremented and the thread blocked before executing DoSecondThing(). It can only proceed once T1 has executed DoFirstThing() and signalled the semaphore.
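The same ordering guarantee can be demonstrated in standard C++. The Semaphore class below is a hand-rolled analogue of RSemaphore (built on std::condition_variable so it does not require C++20 <semaphore>), and RunInOrder() is an invented harness that shows the order is deterministic regardless of which thread is scheduled first:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

class Semaphore {                     // analogue of RSemaphore
public:
    explicit Semaphore(int aCount) : iCount(aCount) {}
    void Wait()
        {
        std::unique_lock<std::mutex> g(iLock);
        iCond.wait(g, [this] { return iCount > 0; });
        --iCount;
        }
    void Signal()
        {
        { std::lock_guard<std::mutex> g(iLock); ++iCount; }
        iCond.notify_one();
        }
private:
    std::mutex iLock;
    std::condition_variable iCond;
    int iCount;
};

// Returns the order in which the two actions ran; with the semaphore
// initialised to 0 it is always "first,second", whatever the scheduler does.
inline std::string RunInOrder()
    {
    Semaphore sem(0);                 // created with a count of 0
    std::string log;
    std::mutex logLock;
    std::thread t2([&] {              // "low priority" thread (T2)
        sem.Wait();                   // blocks until T1 signals
        std::lock_guard<std::mutex> g(logLock);
        log += ",second";             // DoSecondThing()
        });
    std::thread t1([&] {              // "high priority" thread (T1)
        { std::lock_guard<std::mutex> g(logLock); log += "first"; }  // DoFirstThing()
        sem.Signal();
        });
    t1.join();
    t2.join();
    return log;
    }
```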

Case study: Thread EwSrv.exe panics with E32USER-CBase 46 on H4.

This is another case of a bad assumption about thread priorities, but with a different result. The problem manifested itself in an SMP environment when keys were pressed repeatedly within an edit window. The edit window cursor is an animation that periodically generates redraw events within the window server. The scheduling of these redraw events was altered by the events generated by the key presses, which led to a stray event panic (E32USER-CBase 46) in the window server main thread. As this is a critical thread, the kernel then faulted. Like many stray event panics, this problem was difficult to diagnose but easy to fix.

The window server has a mechanism for redrawing animations involving a separate low-priority thread (called the kickback thread) encapsulated within an active object that controls that thread, called CKickBack. Kickbacks from the window server main thread are executed through this function:

void CWindowServer::CDefaultAnimationScheduler::CKickBack::RequestKickBack()
    {
    if (!IsActive())
        {
    TRequestStatus * status = &iIdleStatus;
        iIdleThread.RequestComplete(status, KErrNone);
        iStatus = KRequestPending;
        SetActive();
        }
    }

The kickback thread is signalled using a handle to that thread (iIdleThread) and the iIdleStatus semaphore. The CKickBack active object is then set active and waits for the kickback thread to signal back to the main thread using the iStatus semaphore.

After construction, the kickback thread spends its whole time in an endless loop, waiting for iIdleStatus to be signalled from the window server main thread:

void CWindowServer::CDefaultAnimationScheduler::CKickBack::Loop()
    {
    FOREVER
        {
        User::WaitForRequest(iIdleStatus);
        TRequestStatus * status = &iStatus;
        iWservThread.RequestComplete(status,KErrNone);
        }
    }

Once iIdleStatus is completed, the kickback thread resumes execution, immediately signals completion of the iStatus semaphore back to the window server main thread (via the iWservThread RThread handle), then waits for the next kickback.

Back in the window server thread, the CKickBack active object is run, which simply calls back a function that was supplied at construction time:

void CWindowServer::CDefaultAnimationScheduler::CKickBack::RunL()
       {
       iCallBack.CallBack();
       }

The time between the initial RequestKickBack() and CKickBack::RunL() should be very short but there is an opportunity for higher priority threads (or active objects in the window server) to get some CPU time in-between.

The problem occurs after the call to RequestComplete() in RequestKickBack(). In a single core environment, the subsequent assignment iStatus=KRequestPending will always happen before the kickback thread can resume execution (because the main thread has a higher priority than the kickback thread). In an SMP environment, however, the kickback thread can resume execution in parallel with the main thread, on a different CPU. It's possible for the RequestComplete() on iStatus in the Loop() function of the kickback thread to occur before the iStatus=KRequestPending assignment in the main thread. If this happens then the KErrNone value of iStatus will be overwritten by KRequestPending, and the CKickBack active object will end up being active but with an iStatus of KRequestPending. The active scheduler will then receive the signal from the kickback thread and iterate through its active objects, looking for one that is active but not pending. It will get to the end of the list without finding a valid active object, causing a stray event panic.

The simple solution is to bring forward the assignment of iStatus in RequestKickBack() to before the call to RequestComplete().

void CWindowServer::CDefaultAnimationScheduler::CKickBack::RequestKickBack()
    {
    if (!IsActive())
        {
        iStatus = KRequestPending;
        SetActive();
        TRequestStatus * status = &iIdleStatus;
        iIdleThread.RequestComplete(status, KErrNone);
        }
    }

Stray event panics can be particularly difficult to track down because the call stack at the point it panics does not show the root cause of the problem.
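The general rule behind this fix, that all bookkeeping on shared state must be finished before the other thread is signalled, can be sketched in standard C++ (RequestAndWait() and the Status enum are invented for illustration; the real code uses TRequestStatus and thread request semaphores):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

enum class Status { Idle, Pending, Complete };

// Correct ordering, as in the fixed RequestKickBack(): mark the request
// pending first, and only then wake the helper thread. If the order were
// reversed, the helper's Complete write could be overwritten by a late
// Pending write when the two threads run on different CPUs.
inline Status RequestAndWait()
    {
    std::atomic<Status> status{Status::Idle};
    std::mutex lock;
    std::condition_variable cond;
    bool kicked = false;

    std::thread helper([&] {              // plays the role of the kickback thread
        std::unique_lock<std::mutex> g(lock);
        cond.wait(g, [&] { return kicked; });
        status = Status::Complete;        // analogue of RequestComplete(iStatus)
        });

    status = Status::Pending;             // 1. iStatus = KRequestPending first
    {                                     // 2. only then signal the helper
        std::lock_guard<std::mutex> g(lock);
        kicked = true;
    }
    cond.notify_one();
    helper.join();
    return status;                        // never left as Pending
    }
```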

Unsafe assumption of Publish and Subscribe message ordering

The Publish and Subscribe mechanism is one of the standard IPC mechanisms available in Symbian platform. There are unique SMP error patterns associated with its use.

Case study: t_InitialiseLocale fails on X86 SMP Hardware

The Generic OS Services utility initialiselocale.exe uses a Publish and Subscribe property to indicate to subscribers that locale initialisation is complete. It also uses the Rendezvous mechanism to synchronise with threads that are waiting for initialiselocale.exe to start. The problem manifested itself in the test program t_initialiselocale.exe, which first rendezvouses with initialiseLocale.exe and then uses the P&S property to check whether initialisation has completed. Here are the respective code fragments:

From os/ossrv/lowlevellibsandfws/apputils/initlocale/src/initialiselocale.cpp:

void CExtendedLocaleManager::ConstructL()
    {
    TUid LocalePropertyUid;
    LocalePropertyUid.iUid = KUidLocalePersistProperties ;
    ...
    //now call RProcess::Rendezvous to indicate initialiseLocale is completed with no error
    RProcess().Rendezvous(KErrNone);

    // Set property to indicate initialisation complete
    User::LeaveIfError(RProperty::Set(LocalePropertyUid, EInitialisationState, ELocaleInitialisationComplete));

    // All ready to go! Start the active scheduler.
    CActiveScheduler::Start();
    }

From os/ossrv/lowlevellibsandfws/apputils/initlocale/test/t_initialiselocale.cpp:

void TestLocaleChanges()
    {
    RProcess process;
    ...
    TInt r = process.Create(KInitialiseLocaleExeName, KNullDesC);
    if(r == KErrNone)
        {
        TRequestStatus stat;
        process.Rendezvous(stat);
        process.Resume(); // Start the process going
        //wait for the locale initialisation to complete first before testing
        User::WaitForRequest(stat);
        TEST2((stat.Int()==KErrNone)||(stat.Int()==KErrAlreadyExists),ETrue);
        TInt flag;
        TUid initLocaleUid=TUid::Uid(KUidLocalePersistProperties);
        r=RProperty::Get(initLocaleUid,(TUint)EInitialisationState,flag);
        TEST2(r,KErrNone);
        //check for the P&S flag that indicates localeinitialisation complete
        TEST2(flag,ELocaleInitialisationComplete); //!!! Test failed !!!
        ...

The problem manifested itself in an SMP environment because the t_initialiselocale main thread reached the test marked "Test failed" before the property set by the InitialiseLocale main thread had been published, so the test failed. The solution was simply to move the setting of the KUidLocalePersistProperties property in InitialiseLocale to before the call to Rendezvous(). It could then be guaranteed that the property would have been published by the time t_initialiselocale had rendezvoused with InitialiseLocale.

void CExtendedLocaleManager::ConstructL()
    {
    TUid LocalePropertyUid;
    LocalePropertyUid.iUid = KUidLocalePersistProperties ;
    ...
    // Set property to indicate initialisation complete
    User::LeaveIfError(RProperty::Set(LocalePropertyUid, EInitialisationState, ELocaleInitialisationComplete));
    //now call RProcess::Rendezvous to indicate initialiseLocale is completed with no error
    RProcess().Rendezvous(KErrNone);
    // All ready to go! Start the active scheduler.
    CActiveScheduler::Start();
    }
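The fix amounts to "publish before you signal readiness". A minimal sketch in standard C++, using a std::promise as a stand-in for RProcess::Rendezvous and an atomic int as a stand-in for the P&S property (RendezvousThenRead() is invented for illustration):

```cpp
#include <atomic>
#include <future>
#include <thread>

// Returns the property value observed after the rendezvous completes.
inline int RendezvousThenRead()
    {
    std::atomic<int> property{0};                 // the P&S property; 0 = not initialised
    std::promise<void> rendezvous;                // analogue of RProcess::Rendezvous
    std::future<void> met = rendezvous.get_future();

    std::thread initialiser([&] {
        property = 1;                             // 1. publish "initialisation complete"
        rendezvous.set_value();                   // 2. only then signal the rendezvous
        });

    met.wait();                                   // test side: wait for the rendezvous...
    int flag = property.load();                   // ...then read the property
    initialiser.join();
    return flag;                                  // always 1 with this ordering
    }
```

Because the property write is sequenced before set_value(), any thread released by the rendezvous is guaranteed to observe the completed initialisation, which is exactly the guarantee the fixed ConstructL() provides.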

Relying on thread priority for execution order

Case study: Priority dependence in the file server in T_CFSSIMPLE

The file server has a main thread, a thread for each drive and a disconnect thread (plus some other ones not relevant to this discussion). All requests are received by the main thread and each drive has a lock associated with it.

When an RFs session is closed, the main thread passes the request to the disconnect thread, and the disconnect thread eventually ends up running TFsSessionDisconnect::DoRequestL(). Closing the session is asynchronous within the file server even though it appears to be synchronous to the client. There may be outstanding requests on any number of drives so every drive thread needs to be told to cancel any pending requests before the session can be entirely deleted by the file server.

Before the fix, TFsSessionDisconnect::DoRequestL() looped over the drive array and did the following for each drive:

  1. Lock the drive.

  2. Check whether the drive exists.

  3. If it does exist, send a request to the drive thread to clean up any pending requests that were going to be handled for the session being closed.

  4. Unlock the drive.

  5. Wait for the cleanup request to complete.

  6. Check whether write caching is enabled on that drive.

  7. If it is enabled, send a request to the drive thread to flush any dirty data.

  8. Wait for the flush request to complete.

The problem occurred when the T_CFSSIMPLE test application (after performing its tests) unmounted a drive just before closing its RFs session. Unmounting a drive is also asynchronous from the file server's perspective - the main thread has to send a request to the drive thread so that it shuts itself down cleanly. While unmounting a drive, the drive is locked.

The disconnect thread is normally the highest priority thread in the file server, even higher than the main thread, so usually nothing else gets to issue any new requests while this is happening. However, in an SMP environment (or when the crazy scheduler is being used), it's possible for an unmount request to happen at the same time as a session disconnect.

If one of the drives is unmounted at the same time as the session disconnect, then the unmount sets the drive as not present and kills the associated thread. Step 3 above is safe because it acquires the lock on the drive before modifying it. The disconnect thread may get there first and send the cleanup request; then the drive thread unmounts itself (since the lock is released) and the cleanup request completes with KErrNotReady (which is ignored by the code). Alternatively, the unmount request may get there first; then the disconnect thread never thinks the drive is there in the first place so does not issue the cleanup request.

The problem was in step 7 - this part of the code didn't lock the drive or check whether it existed. It assumed that because the drive existed earlier in the loop iteration, it would continue to exist. Since step 4 unlocked the drive, this was not necessarily true - there was an opportunity for the drive to be unmounted and a flush request to be sent to a non-existent drive, resulting in a panic.

The solution was to protect step 7 with a lock and a "check whether drive exists" operation like in step 3. Here is the actual code change:

Before:

From os/kernelhwsrv/userlibandfileserver/fileserver/sfile/sf_ses.cpp

if (TFileCacheSettings::Flags(i) & (EFileCacheWriteEnabled | EFileCacheWriteOn))
    {
    // Flush dirty data
    pR->Set(FlushDirtyDataOp,aRequest->Session());
    pR->SetDriveNumber(i);
    pR->Status()=KRequestPending;
    pR->Dispatch();
    User::WaitForRequest(pR->Status()); // check request completed or cancelled (by file system dismount)
    }
__ASSERT_ALWAYS(pR->Status().Int()==KErrNone||pR->Status().Int()==KErrCancel,Fault(ESessionDisconnectThread2));
__THRD_PRINT2(_L("Flush dirty data on drive %d r=%d"),i,pR->Status().Int());
}

After:

if (TFileCacheSettings::Flags(i) & (EFileCacheWriteEnabled | EFileCacheWriteOn))
    {
    FsThreadManager::LockDrive(i);
    if(!FsThreadManager::IsDriveAvailable(i,EFalse)||FsThreadManager::IsDriveSync(i,EFalse))
        {
        FsThreadManager::UnlockDrive(i);
        continue;
        }
    // Flush dirty data
    pR->Set(FlushDirtyDataOp,aRequest->Session());
    pR->SetDriveNumber(i);
    pR->Status()=KRequestPending;
    pR->Dispatch();
    FsThreadManager::UnlockDrive(i);
    User::WaitForRequest(pR->Status());
    }
// check request completed or cancelled (by file system dismount which completes requests with KErrNotReady)
__ASSERT_ALWAYS(pR->Status().Int()==KErrNone||pR->Status().Int()==KErrNotReady,Fault(ESessionDisconnectThread2));
__THRD_PRINT2(_L("Flush dirty data on drive %d r=%d"),i,pR->Status().Int());
}
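The essence of this fix, re-take the lock and re-check that the resource still exists before every use after the lock has been dropped, can be shown in a condensed standard C++ sketch (Drive, SafeFlush() and the two scenario functions are invented stand-ins for the drive array, the FsThreadManager lock and the flush dispatch):

```cpp
#include <mutex>

struct Drive {
    std::mutex iLock;
    bool iMounted = true;
};

// Returns true if the flush was dispatched, false if the drive had vanished.
inline bool SafeFlush(Drive& aDrive)
    {
    std::lock_guard<std::mutex> g(aDrive.iLock);  // 1. re-lock the drive
    if (!aDrive.iMounted)                         // 2. re-check it still exists
        return false;                             //    (the "continue" in the fix)
    return true;                                  // 3. safe to dispatch the flush
    }

inline bool FlushWhileMounted()
    {
    Drive d;
    return SafeFlush(d);                          // drive present: flush dispatched
    }

inline bool FlushAfterUnmount()
    {
    Drive d;
    {                                             // another thread unmounts the
        std::lock_guard<std::mutex> g(d.iLock);   // drive while we are not
        d.iMounted = false;                       // holding its lock
    }
    return SafeFlush(d);                          // drive gone: flush safely skipped
    }
```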

Case study: Unexpected running of low priority active object

There is a common pattern used in test code (and possibly also production code) where:

  1. A low priority active object is used and immediately set "ready to run". (Typically CAsyncOneShot is used for the active object.)

  2. The active scheduler is nested.

  3. One or more high priority active objects are started (probably several times), which all make requests on a server.

  4. The high priority active object(s) are serviced and immediately restarted. This continues until there is no more work for the high priority active object(s) to do.

  5. The low priority active object runs and un-nests the active scheduler.

In a single core environment, the server always processes the client requests before the client can run any other code (because the server runs in a higher priority thread). When the client thread runs, the high priority active objects are already "ready to run" and get serviced before the low priority active object.

This pattern breaks under SMP conditions because the client thread can continue to run on a different CPU at the same time that the server is processing the client requests. At this point, the only active object in the client thread that is "ready to run" is the low priority active object. This object gets serviced and the active scheduler is stopped earlier than expected.

For example, the window server test code function WaitForRedrawsToFinish() suffered from the problem described above:

EXPORT_C void WaitForRedrawsToFinish()
    {
    CStopTheScheduler* ps=new CStopTheScheduler(ETlibRedrawActivePriority-1);
    if(ps)
        {
        ps->Call();
        CActiveScheduler::Start();
        delete ps;
        }
    }

This creates a low priority active object and then nests the active scheduler. The low priority active object used was defined like this:

class CStopTheScheduler : public CAsyncOneShot
    {
public:
    CStopTheScheduler(TInt aPriority);
    //Pure virtual function from CActive
    void RunL();
    };

CStopTheScheduler::CStopTheScheduler(TInt aPriority) : CAsyncOneShot(aPriority)
    {}

void CStopTheScheduler::RunL()
    {
    CActiveScheduler::Stop();
    }

This just un-nests the active scheduler when it runs. Before WaitForRedrawsToFinish() is called, there is already an outstanding request on the window server for it to notify the test thread that there is a redraw event ready. If there is a redraw ready, then a high priority active object called CTRedraw is serviced, which processes the redraw event and immediately requests notification of the next redraw from the server.

EXPORT_C void CTRedraw::RunL()
    {
    TWsRedrawEvent redraw;
    iWs->GetRedraw(redraw);
    if (redraw.Handle()!=0 && redraw.Handle()!=ENullWsHandle)
        ((CTWin*)redraw.Handle())->Redraw(redraw.Rect());
    iWs->RedrawReady(&iStatus);
    SetActive();
    }

This is a simplified version of the code.

The intention was that CTRedraw::RunL() would run many times, as long as there were outstanding redraw events, before CStopTheScheduler::RunL() was able to run. When there were no more outstanding events, CStopTheScheduler::RunL() would run, which then allowed the code after WaitForRedrawsToFinish() to run.

In an SMP environment though, every time WaitForRedrawsToFinish() was called, only one window was redrawn, even though redraw events for several windows were waiting. CTRedraw::RunL() only ran once before CStopTheScheduler::RunL() managed to run.

There are at least two ways to fix this problem:

  1. After the call to RedrawReady(), add another synchronous call to the window server. This solution is undesirable as the additional IPC compromises the speed of the test code.

  2. Modify the class CStopTheScheduler so that in its RunL() it does:

    void CStopTheScheduler::RunL()
        {
        //Here a synchronous call to WSERV is made to make sure the asynchronous call has finished being processed
        if (iClient->RedrawHandler()->iStatus==KRequestPending)    //Check to see if the redraw active object is now ready to run
            CActiveScheduler::Stop();    //If it is not ready to run un-nest the active scheduler
        else
            Call(); //If it is ready to run set this active object ready to run again
        }
    

    This was the solution that was chosen.
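The chosen fix can be modelled in standard C++ with a toy two-band scheduler (MiniScheduler and DrainRedraws() are invented for illustration; they stand in for the active scheduler, the CTRedraw handler and CStopTheScheduler): the stop task only stops the loop when no completed high-priority work is waiting, otherwise it re-queues itself behind that work.

```cpp
#include <deque>
#include <functional>

struct MiniScheduler {
    std::deque<std::function<void()>> iHigh, iLow;  // two priority bands
    bool iStopped = false;
    void Run()
        {
        while (!iStopped && (!iHigh.empty() || !iLow.empty()))
            {
            auto& q = iHigh.empty() ? iLow : iHigh; // high priority first
            auto task = q.front();
            q.pop_front();
            task();
            }
        }
};

// Simulates WaitForRedrawsToFinish(): aRedraws completed redraw events are
// queued; the stop task re-posts itself until all of them have been handled.
inline int DrainRedraws(int aRedraws)
    {
    MiniScheduler s;
    int handled = 0;
    std::function<void()> redraw = [&] { ++handled; };       // CTRedraw::RunL()
    std::function<void()> stop = [&] {                       // CStopTheScheduler::RunL()
        if (s.iHigh.empty())              // analogue of iStatus==KRequestPending
            s.iStopped = true;            // no completed work: un-nest the loop
        else
            s.iLow.push_back(stop);       // work waiting: run again later
        };
    for (int i = 0; i < aRedraws; ++i)
        s.iHigh.push_back(redraw);
    s.iLow.push_back(stop);
    s.Run();
    return handled;                       // all aRedraws events get handled
    }
```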

Premature recreation of named objects

This category of problems relates to the asynchronous cleanup of objects owned by the kernel. On the user-side this involves classes derived from RHandleBase, corresponding to a DObject in the kernel. These problems take the following form:

  • 1a. An RHandleBase derived object is created user-side.

  • 1b. A corresponding DObject is created kernel-side.

  • 2. Something is done with that object.

  • 3a. All user-side handles to that object are destroyed.

  • 3b. Deletion of the corresponding DObject is initiated.

  • 4a. A second RHandleBase derived object with the same name is created user-side.

  • 4b. A second corresponding DObject with the same name is created kernel-side.

The problem is that all DObject names must be unique and 3b is not guaranteed to complete before 4b starts, therefore 4b (and 4a) may fail with KErrAlreadyExists.

This most commonly manifests itself during thread cleanup and so this case warrants more detailed discussion. When a thread either exits or is killed externally, the thread itself runs an exit handler inside the kernel. This runs at a slightly higher priority than normal application threads and does part of the cleanup, including completing log-ons. The last thing the exit handler does is to queue a DFC which gets executed by the supervisor thread inside the kernel. This completes the cleanup, including freeing the thread stacks and closing the DThread object itself. If no other threads or processes have open handles on the DThread, it is completely deleted at this point.

On a single processor system, the following code would always work provided that the thread running it has a lower priority than the supervisor thread and there are no undertakers in the system (which is true for a lot of test code):

RThread t;
TInt r = t.Create(KName, ...);
assert(r==KErrNone);
TRequestStatus s;
t.Logon(s);
assert(s.Int()==KRequestPending);
t.Resume();
User::WaitForRequest(s);
// read exit information for thread and do something with it
t.Close(); // thread actually disappears here
r = t.Create(KName,...); // recreate with same name
assert(r==KErrNone); 

The exit handler and supervisor thread would both run ahead of the calling thread, so by the time the t.Close() line is executed there would only be one handle remaining on the DThread. If there are undertakers in the system, the exit processing in the supervisor thread would open a handle on the exiting thread for each undertaker. The thread would not then go away until all those handles were closed, and the code above would potentially fail.

On an SMP system the code above can fail regardless of thread priorities or undertakers: the supervisor thread could run on a separate CPU from the thread calling Close(), so there is no guarantee that the thread has been destroyed on return from Close().

A similar issue exists where, instead of recreating the thread with the same name, a kernel heap check is performed to ensure all kernel memory has been cleaned up.

There are three possibilities to fix this problem:

  1. Recreate the thread with a different name.

  2. Wait a "little bit" before recreating the thread (or performing the heap check).

    This is almost always a bad idea as there is no way of knowing how long to wait and this may be different on any future system.

  3. Wait for the object concerned to be completely destroyed.

In general, option 1 is preferred. For code that is likely to encounter the problem in practice, APIs should be provided to aid with creating objects with a unique name. For example, the TDynamicDfcQue and Kern::DynamicDfcQCreate APIs are provided to help with creating DFC threads with unique names.

Using a different name is easy in the above code example, but it's not possible when checking for kernel memory leaks (as in a lot of test code). In this case, solution 3 can be employed, with the aid of RHandleBase::NotifyDestruction (in e32cmn.h) and CLOSE_AND_WAIT (in e32test.h).

This solution is acceptable in a closed system, where there is control over all handles to the object being recreated, but in general there is no way of knowing whether some component has an open handle to a thread (or what component it might be). Any component can open a thread by name and hold the handle forever, jamming the system while it waits for the handle to be closed. Therefore, in practice you should always attempt to recreate the object with a different name (option 1).
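The CLOSE_AND_WAIT idea can be sketched in standard C++ (NameRegistry is a hypothetical stand-in for the kernel's container of DObject names; Create, CloseAsync and WaitForDestruction loosely mirror RThread::Create, Close and RHandleBase::NotifyDestruction): creation with a duplicate name fails, destruction happens asynchronously on another thread, so the caller must block until the name is really gone before recreating it.

```cpp
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <thread>

class NameRegistry {
public:
    bool Create(const std::string& aName)          // false = "KErrAlreadyExists"
        {
        std::lock_guard<std::mutex> g(iLock);
        return iNames.insert(aName).second;
        }
    void CloseAsync(const std::string& aName)      // deletion done by another thread
        {
        std::thread([this, aName] {
            std::lock_guard<std::mutex> g(iLock);
            iNames.erase(aName);
            iGone.notify_all();
            }).detach();
        }
    void WaitForDestruction(const std::string& aName)
        {
        std::unique_lock<std::mutex> g(iLock);
        iGone.wait(g, [&] { return iNames.count(aName) == 0; });
        }
private:
    std::mutex iLock;
    std::condition_variable iGone;
    std::set<std::string> iNames;
};

// The CLOSE_AND_WAIT pattern: recreation with the same name always succeeds.
inline bool RecreateSameName()
    {
    NameRegistry reg;
    reg.Create("worker");
    reg.CloseAsync("worker");          // like t.Close(): the object may linger...
    reg.WaitForDestruction("worker");  // ...so wait until it is really gone
    return reg.Create("worker");       // true: no "KErrAlreadyExists"
    }
```

Without the WaitForDestruction() call, the final Create() could run before the detached cleanup thread has removed the old name, which is precisely the premature-recreation failure described above.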

Case study: t_NamedPlugins fails on X86 SMP Hardware

This is a classic example of the thread recreation problem, with solution 3 employed as the fix.

From os/ossrv/lowlevellibsandfws/apputils/tsrc/t_namedplugins.cpp:

LOCAL_C void NamedPluginsPanicConditionsThread()
    {
    ...
    _LIT(KName, "NamedPlugins_Panic_Thread");
    RThread thread;
    ...
    // Test EBafPanicBadResourceFileFormat
    TInt rc = thread.Create(KName,...);
    TEST(rc==KErrNone);
    ...
    thread.Close();
    // Test EBafPanicBadArrayPosition
    rc = thread.Create(KName,...);
    TEST(rc==KErrNone); //!!! Test failed !!!
    ...

The simple solution was to change the thread.Close() call to CLOSE_AND_WAIT(thread) instead. This is acceptable for test code because it is known that there will be no other threads holding on to a handle to the NamedPlugins thread.

Related concepts
SMP Developer Tips