Tuesday, 7 August 2012

How batch processing works under the hood AX2009


In this article I am going to explain how batch processing in AX2009 works, I don't mean
how to set up a batch group or any of that kind of thing that you find in the manual,
what I mean is what each AOS is doing in the background to decide how and when to pick
up batches and process and complete them. Understanding this background can help in
advanced batch troubleshooting or development scenarios.
In AX2009 batch processing changed. Now we have AOSes which can run batch processes
directly, if you want to see what's happening with a batch process, it can be more difficult than
 in AX3 or AX4 as there is no client sitting there running to look at.
What happens now is that each AOS has a dedicated thread which checks for batches, basically
 all this does is calls Classes\BatchRun.ServerGetTask() once every 60 seconds (timing is not 
configurable) and if there is any work for that AOS to do then the AOS will pick up a task from here.

I'll give an example of an end-to-end batch process to show what happens where and when:
- A report is sent to batch by a user, it goes into the batch queue in BATCHJOB (header) and 
BATCH (the batch tasks).
- Once every 60 seconds each AOS that has been configured for batch processing 
(in administration->setup->server configuration) will call the X++ method - 
Classes\BatchRun.serverGetTask()
- In serverGetTask() the logic is exposed in X++ so we can all see what happens, this is the
 main place that we decide what to pick up for batch processing. Basically it checks if there 
is any tasks in the BATCH table waiting for this AOS - based on the batch groups that this 
AOS is configured to process, and based on the time that the records in BATCH are due to
 be processed (i.e. something processes at 21:00 each day then it won't get picked up until 
21:00 despite the fact that the AOS polls every 60 seconds). There are a few stages to this method:
 1. First we check if there is a task (a task is a record in BATCH table) ready for us, the query for 
this is like this:
select firstonly pessimisticlock RecId, CreatedBy, ExecutedBy, StartDateTime, Status,
        SessionIdx,SessionLoginDateTime, Company, ServerId, Info
    from batch
    where batch.Status == BatchStatus::Ready
    && batch.RunType == BatchRunType::Server
    && (Session::isServer() || batch.CreatedBy == user)
    join Language from userInfo
        where userInfo.Id == batch.CreatedBy
        && userInfo.Enable == true
    exists join batchServerGroup
        where batchServerGroup.ServerId == serverId
            && batch.GroupId == batchServerGroup.GroupId;
 2. If a task is returned in step 1 then there's nothing more to do and we start processing that task.
 If no task is returned then we look to see if any batch jobs need to be started, the query for this is like this:
 update_recordset batchJob setting
                Status = BatchStatus::Executing,
                StartDateTime = thisDate
            where batchJob.Status == BatchStatus::Waiting
                &&  batchJob.OrigStartDateTime   <= thisDate
            exists join batch
                where batch.BatchJobId == batchJob.RecId
            exists join batchServerGroup
                where batch.GroupId == batchServerGroup.GroupId
                && batchServerGroup.ServerId == serverId;

 
3. After step 2 we will run Classes\batchRun.serverProcessDependencies(). In here something 
interesting happens - we see that we use this table "BatchGlobal", this is used as a focal point,
 because we might have several AOSes running batch processing in the same environment,
 and so for some operations we look to this table to see if another AOS has already done
 something, to decide whether the current AOS needs to do it as well or not. For dependencies 
we just make sure that another AOS is not doing this in the same second. So if we continue here, 
the queries we run to set more tasks (again tasks are just records in the BATCH table) ready for 
processing are below - you can see in the queries how we update the status on the BATCH
 table records, checking that we only do it for records which are ready and do not have any 
constraints that are not completed yet:
   
//There are no more available tasks and the user is asking for any task.
Search for more tasks with
    //dependencies
    update_recordset batch setting Status = BatchStatus::Ready
    where batch.Status == BatchStatus::Waiting
        && batch.ConstraintType == BatchConstraintType::Or
    exists join batchJob
        where batchJob.Status == BatchStatus::Executing
            && batch.BatchJobId == batchJob.RecId
    exists join constraintsOr
        where constraintsOr.BatchId == batch.RecId
    exists join batchDependsOr
      where
     (
        ((batchDependsOr.Status == BatchStatus::Finished
            && (constraintsOr.ExpectedStatus == BatchDependencyStatus::Finished
                || constraintsOr.ExpectedStatus == BatchDependencyStatus::FinishedOrError))
        || (batchDependsOr.Status == BatchStatus::Error
            && (constraintsOr.ExpectedStatus == BatchDependencyStatus::Error
                || constraintsOr.ExpectedStatus == BatchDependencyStatus::FinishedOrError)))
        && constraintsOr.DependsOnBatchId == batchDependsOr.RecId
     );

    update_recordset batch setting Status = BatchStatus::Ready
    where batch.Status == BatchStatus::Waiting
        && batch.ConstraintType == BatchConstraintType::And
    exists join batchJob
        where batchJob.Status == BatchStatus::Executing
            && batch.BatchJobId == batchJob.RecId
    notexists join constraintsAnd exists join batchDependsAnd
    where
     (
        constraintsAnd.DependsOnBatchId == batchDependsAnd.RecId
        && constraintsAnd.BatchId == batch.RecId
        && ((batchDependsAnd.Status != BatchStatus::Finished && batchDependsAnd.Status != BatchStatus::Error)
            || (constraintsAnd.ExpectedStatus == BatchDependencyStatus::Finished
                && batchDependsAnd.Status == BatchStatus::Error)
                || (constraintsAnd.ExpectedStatus == BatchDependencyStatus::Error
                && batchDependsAnd.Status == BatchStatus::Finished))
     );
4. When this serverProcessDependencies() is complete in step 3 we call again to serverGetOneTask() (same as in step 1), if there were some more tasks set to "ready" in step 3 then we might pick up a task to work on here. Of course if not tasks were "ready" in step 3 then we won't find a task and we'll just do nothing.
- So our report which we sent to batch, if in the steps numbered 1-4 above, we found this record was ready to process, and we picked it up, what happens next inside the AOS kernel is that we start a worker session, which can be thought of a bit like a client session, just without a client, it will have it's own session ID and you'll see the ID recorded against the record in the Batch table. From this point it calls BatchRun.runJobStatic() and actually runs the batch process - this is just normal X++ running the process here. When this runJobStatic() completes we call BatchRun.ServerFinishTask(), which just sets the status of the record in BATCH to either "finished" or "error" (if it failed for some reason).
- Now our batch task is finished - the record in the BATCH table. But the header for this batch, the Tables\BatchJob record is not set to finished yet. For this part there is another background process running every 60 seconds on each AOS which just calls into BatchRun.serverProcessFinishedJobs(). Now we can see in this X++ method what it does - we use this BatchGlobal table again, to make sure that between all AOSes we only check for finished jobs a maximum of once every 60 seconds, if it has been 60 seconds then we will run a whole load of queries (too many to copy here but you can check there to see them) to create the batch history (various tables), set the BatchJob record to finished and delete the completed tasks and constraints.
There are a couple of other background things that happen in the AOS kernel for batch processing:
1. Every 5 minutes it will call to BatchRun.serverCleanUpDeadTasks() - again we use the BatchGlobal table, so that we'll only run this once every 5 minutes between all AOSes. This just sets tasks back to "ready" if the session ID for the worker session (I mentioned this earlier - we create this worker session when we start processing a task) is no longer a valid session - basically if a task fails with an X++ exception, or something like that, then the worker session will end, and if you have configured this batch task to allow some retries, then it's this method which will reset the task for it to have a retry.
2. Every 5 minutes each AOS will check the server settings, to see if it's supposed to process the same batch groups - or if it's not supposed to be a batch server any more, all those settings.

No comments:

Post a Comment