Tying into my previous post (regarding the data warehouse and aggregated tables), here’s a more technical description of how I put together a script for simple correlation discovery in R.
Bear in mind – this uses basic Pearson correlation and is not at all on par with “proper” data exploration methods such as gradient boosting and the like. I’m using the functions cor.test() and cor(), and I keep the “significant” pairs – those with a p-value below 0.05.
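For reference, here’s what those two calls look like on a couple of made-up vectors (the numbers are invented, purely for illustration):

x = c(1, 2, 3, 4, 5, 6)
y = c(2, 4, 5, 4, 5, 7)
cor.test(x, y)$p.value # significance of the relationship
cor(x, y)              # strength and direction, between -1 and 1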
Pearson correlation only works on numeric values – this means that I can’t use textual fields from the database directly. A way to handle that is to recode the categories as binary dummy variables, e.g. if you’ve got the customer categories “individual”, “company” and “school”, you’d set up case-based values named is_individual, is_company and is_school.
The query would go something like this:
SELECT
    CASE WHEN category = 'individual' THEN 1 ELSE 0 END AS is_individual,
    CASE WHEN category = 'company' THEN 1 ELSE 0 END AS is_company,
    CASE WHEN category = 'school' THEN 1 ELSE 0 END AS is_school
FROM sometable
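If you’d rather keep the SQL plain and do the recoding on the R side instead, a small sketch could look like this (assuming the text column comes back in a result set called res, as in the script further down):

res$is_individual = as.integer(res$category == "individual")
res$is_company = as.integer(res$category == "company")
res$is_school = as.integer(res$category == "school")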
With this data in hand, we’re ready to run an analysis. I’ve built this script to find correlations in datasets of any size, but I’ll only include a handful of columns in the sample query (my real queries tend to run up to 100 lines, and they wouldn’t add anything to this example). Instead of describing in prose how I’ve built the script, I’ve added comments throughout the code to explain what I’m doing where I’m doing it.
library(RODBC)

# trusted_connection does authentication through your logged-in AD credentials
dbhandle = odbcDriverConnect('driver={SQL Server};server=mydatabaseserver;database=warehouse01;trusted_connection=true')

query = "
SELECT
    MonthlyAmount,
    DATEDIFF(day, FirstInstalmentPaymentDate, GETDATE()) AS DaysSinceFirstPayment,
    NoOfMissedInstalmentsPast2Years,
    NoOfAdHocPaymentsPast2Years,
    Age,
    DATEDIFF(day, LastContacted, GETDATE()) AS DaysSinceLastContact
FROM warehouse01.dbo.customer;"

res = sqlQuery(dbhandle, query)

# these two vectors are used to match column names to values inside the loop
columnnames = colnames(res)
columns = seq_along(res)
results = vector()

# iterate through all columns
for(i in columns){
    # only handle the outer column if it is numeric
    if(is.numeric(res[[i]])){
        # iterate through the columns after the active one,
        # to avoid duplicate and self-matches
        inner_columns = columns[columns > i]
        for(j in inner_columns){
            # same numeric check for the inner ones
            if(is.numeric(res[[j]])){
                # correlation testing (cor.test drops incomplete pairs on its own)
                tmp = cor.test(res[[i]], res[[j]])$p.value
                coef = cor(res[[i]], res[[j]], use = "complete.obs")
                # if significant find, add to the results vector
                if(tmp < 0.05){
                    results = c(results, columnnames[i], columnnames[j], coef)
                }
            }
        }
    }
}

# make the results vector a matrix for readability
results = matrix(results, ncol = 3, byrow = TRUE)

odbcClose(dbhandle)
This script leaves me with a table (or matrix, as it’s called in R-land) consisting of three columns: The first two are the pair of columns tested and the last is the correlation coefficient (strength). RStudio allows for sorting and filtering, which makes it easy to find the strongest correlations.
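One caveat: since c() coerces everything it’s given to character, the coefficients in that matrix are stored as text. If you want to sort in code rather than in RStudio’s viewer, a small sketch like this converts them back and orders by strength:

results_df = data.frame(first = results[, 1],
                        second = results[, 2],
                        coef = as.numeric(results[, 3]),
                        stringsAsFactors = FALSE)
results_df = results_df[order(-abs(results_df$coef)), ] # strongest correlations first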
That’s it for now.