Batch Regression Analysis in R

Tying into my previous post (regarding the data warehouse and aggregated tables), here’s a more technical description of how I put together a script for simple correlation discovery in R.

Bear in mind – this is using basic Pearson Correlation and is not at all on par with “proper” data exploration methods such as gradient boosting and the likes. I’m using the methods cor.test() and cor(), and I catch the “significant” data which has p-value below 0.05.

This test can only work with integers – this means that I can’t use textual fields from the database directly. A way to handle that is to categorize the significant values as binary choices, e.g. if you’ve got the customer categories “individual”, “company” and “school” then you’d set up case-based values named is_individual, is_company and is_school.

The query would go something like this:

SELECT 
CASE WHEN category LIKE 'individual' THEN 1 ELSE 0 END AS is_individual, 
CASE WHEN category LIKE 'company' THEN 1 ELSE 0 END AS is_company
FROM sometable

With this data in hand, we’re ready to run an analysis. I’ve built this script to find correlations in datasets of any size, but I’ll just include a couple of rows in the sample script (my real queries tend to be up to 100 lines, and do not add anything to this example). Instead of describing textually how I’ve built the script, I’ve added comments throughout the code to better explain what I’m doing where I’m doing it.

Unfortunately, it seems I’m unable to recreate my indentation in this post, but it only makes the loop a bit less readable. Bear with me.

library(RODBC)
dbhandle = odbcDriverConnect('driver={SQL Server};server=mydatabaseserver;database=warehouse01;trusted_connection=true') # trusted_connection does authentication through your logged-in AD credentials

query = "
SELECT
MonthlyAmount, 
DATEDIFF(day, FirstInstalmentPaymentDate, GETDATE()) AS DaysSinceFirstPayment, 
NoOfMissedInstalmentsPast2Years, 
NoOfAdHocPaymentsPast2Years,
Age,
DATEDIFF(day, LastContacted, GETDATE()) AS DaysSinceLastContact,
FROM warehouse01.dbo.customer;"

res = sqlQuery(dbhandle, query)

# these two arrays/vectors are used to match column names to values inside the loop
columnnames = colnames(res)
columns = seq_along(res)

results = vector()

# iterate through all columns
for(i in columns){ 

# iterate through the columns after the active one,
# to avoid duplicate matches
inner_columns = columns[i:length(columns)]

# only handle outer row if it is numeric
if(is.numeric(res[[i]])){

# same numeric check for the inner ones
for(j in inner_columns){
if(is.numeric(res[[j]])){

# correlation testing
tmp = cor.test(res[[i]], res[[j]], na.rm = TRUE)$p.value
coef = cor(res[[i]], res[[j]], use="complete")

# if significant find, add to a results-array
if( (tmp < 0.05) && (tmp > 0) ){
results = c(results, columnnames[i], columnnames[j], coef)
}
}
}
}
}

# make the results-vector a matrix for readability
results = matrix(results, ncol = 3, byrow=T)

odbcClose(dbhandle)

This script leaves me with a table (or matrix, as it’s called in R-land) consisting of three columns: The first two are the pair of columns tested and the last is the correlation coefficient (strength). RStudio allows for sorting and filtering, which makes it easy to find the strongest correlations.

That’s it for now.