Identifying batches of consecutive dates in SQL

Disclaimer: This is a solution done in T-SQL aka Microsoft SQL. It may be possible to recreate in other database systems, but I haven’t explored that option yet.

This is a solution to a very specific problem I’ve encountered, which might prove useful to others. A while ago I started looking into the code of our solution for overdue payments (in regards to subcriptions, e.g. the customers are billed monthly but not always paying), which appeared to have some issues. I found an oddly constructed procedure where I couldn’t for the life of me understand why it would’ve been built that way. My first instinct was to try to build it from scratch and see what the major showstopper would be.

The biggest hurdle was identifying each specific cluster of undisrupted payments, as that’s not quite straightforward when looking at the data.

To take you through this process, let’s start off by defining the table we’re working with. I’ll exclude fields that aren’t relevant, keeping this explanation as simple as possible.

SubscriptionID (bigint – Identifier for the subscription whose payment is overdue)AmountDue (float/money)
AmountPaid (float/money)
DueDate (date)

This should be enough to get us going.
Now, on to the previously mentioned hurdle; How do we split the data into groups for each series of completed (or missing) payments?

Our first query gives a list of all the missing payments. This is where we start.

FROM Invoices i
WHERE i.AmountPaid = 0 /* let's only look at the completely unpaid ones for now */
AND i.DueDate < GETDATE()

Now comes the clever part – a trick I came across somewhere on Stack Overflow. We want to group each series of recurring payments into its own group, and figure out how long the break lasted and how much was due. There isn’t any obvious variable to group this data by, but we can create one! We know the time series will be increasing by 1 month for each row, and so any deviation from that formula should mean a break in the pattern (and thus a new series).

You may be familiar with the MSSQL function ROW_NUMBER(), which can enumerate rows based on your given criteria. This will give a steadily rising number for the rows we’re already looking at. It won’t care whether there was a break in the dates or not, so we need something more. By calculating the number of months (for a monthly recurring payment – if the intervals are different then this method would need to be adjusted for that) that have passed since a given point in time, we’ll be closing in on our goal.

ROW_NUMBER() OVER( ORDER BY i.SubscriptionID, i.DueDate ) AS rn,
DATEDIFF(month, '1900-01-01' , i.DueDate) AS mnth

These two combined will give us what we need. The number of months minus row number will give the same result for each recurring record, giving us a variable we can start grouping them by!

ROW_NUMBER() OVER( PARTITION BY SubscriptionID, Batch  ORDER BY DueDate) AS IntervalsMissed,
DueDate, SubscriptionID,
ROW_NUMBER() OVER (PARTITION BY SubscriptionID ORDER BY InvoiceStartDate DESC) AS rn -- the row with rn=1 would be the most recent

DATEDIFF(month, '1900-01-01' , i.DueDate)- ROW_NUMBER() OVER( ORDER BY i.SubscriptionID, i.DueDate ) AS Batch
FROM Invoices i
WHERE i.AmountPaid = 0
AND i.DueDate < GETDATE()
) a

Here we go, the resulting data set gives us a list of every cluster of missed payments our customers have, complete with how many recurring months they missed their payments for each of them. In our case we use this to identify those who are currently in a miss-streak, and ignore the historical data. There’s quite the number of things one can do with such data, but that’s a topic for another day.

Until next time!