Sunday, November 9, 2014

SQL Server, R, and Logistic Regression

This post demonstrates retrieving data from SQL Server, performing logistic regression in R to predict the probability of an event, and saving the results back to SQL Server for storage, reporting, or additional analysis. The motivation is to extend the capabilities of your average SQL Server developer who may be looking to expand their skills with predictive analytics and R.

Logistic regression returns the probability of an event; in my example, an event is represented by a 1 and a non-event by a 0. Before we start predicting probabilities, we should follow the basic protocol of breaking our dataset into two groups: training and testing. For example, we might have 100 cases (rows) of known outcomes along with explanatory variables (those used to predict the resulting event). We would split this data, for which we already know the outcome, into 50 cases of training data and 50 cases of testing data. This allows us to not only build our predictive model, but then run that model against another set of known outcomes to see how it performs. In the example data I will use, the test and train sets are identical, for the sake of simplicity. (keep it stupid, simple.)
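
If you would rather draw a random split than hard-code one, a minimal sketch might look like the following. Here df is a hypothetical data frame holding your 100 known-outcome cases; the sample data below instead just labels each row with a dataset column.

 ## A random 50/50 split, assuming a hypothetical data frame df that  
 ## already holds your 100 known-outcome cases.  
 set.seed(42)                               # make the split reproducible  
 train_idx <- sample(nrow(df), size = nrow(df) / 2)  
 traindata <- df[train_idx, ]               # 50 cases to fit the model  
 testdata  <- df[-train_idx, ]              # 50 held-out cases to evaluate it  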

SQL Server Sample Data
1.) In SQL Server, create a global temporary table with sample data:

 -- test data for logistic regression  
 -- Find the probability of an event  
 if object_id('tempdb..##testlogit') is not null begin drop table ##testlogit end;  
 create table ##testlogit (  
      rowid int identity(1,1)  
      , [a] int not null   
      , [b] int not null   
      , [event] int null   
      , dataset varchar(5) not null  
 )  
 -- Insert the model training data.  
 -- We will use this to train our predictive prowess.  
 insert into ##testlogit ([a], [b], [event], dataset)  
 values (1, 0, 1, 'train')  
      , (1, 0, 1, 'train')  
      , (1, 0, 0, 'train')  
      , (1, 1, 1, 'train')  
      , (1, 1, 1, 'train')  
      , (0, 0, 0, 'train')  
      , (0, 0, 0, 'train')  
      , (0, 1, 0, 'train')  
      , (0, 1, 1, 'train');  
 -- Insert the model testing data. We will predict the event in this set.  
 insert into ##testlogit ([a], [b], [event], dataset)  
 values (1, 0, 1, 'test')  
      , (1, 0, 1, 'test')  
      , (1, 0, 0, 'test')  
      , (1, 1, 1, 'test')  
      , (1, 1, 1, 'test')  
      , (0, 0, 0, 'test')  
      , (0, 0, 0, 'test')  
      , (0, 1, 0, 'test')  
      , (0, 1, 1, 'test');  

2.) The resulting table should contain the following:


Copy the data to R for Logistic Regression

1.) Connect to the data source and pull the data into a variable called data. I've used the RODBC library, though I've seen others use rsqlserver. I hear rsqlserver is faster for data movement, but when you're working with statistical models in R it's actually pretty rare to move massive amounts of data around. R itself imposes a limitation here, since it holds your data in objects in memory, so large datasets need to take available hardware into account as well. Frankly, "large amount of data" is a subjective phrase anyway; I've seen people who consider 1 GB large, and some who won't use the words until they see a terabyte. Nonetheless, for the incredibly tiny sample data we're using here, and for everything I have ever done with R, I've found RODBC to be plenty fast.

 ## Load the needed libraries  
 library(RODBC)  
 library(data.table)  
 library(ggplot2)  
 ## For the ConfusionMatrix you'll need the following  
 ##install.packages("caret")  
 ##install.packages("e1071")  
 library(caret)  
 ## Set the working directory  
 setwd("c:/personal/r")  
 ## Connect to and pull your sample data from SQL Server.  
 driver <- "Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;"  
 localdb <- odbcDriverConnect(driver)  
 data <- data.table(sqlQuery(localdb, "select * from ##testlogit"))  

Be sure to replace YourServerName with your actual server name and YourDatabaseName with your actual database name. The database I'm using I've lovingly called CommunalToilet - it's where I put all my garbage data. :) Also note that the query reads from the global temporary table ##testlogit, so keep the SQL Server session that created it open.
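
Once connected, a quick sanity check confirms the sample rows made the trip:

 ## Quick sanity check on what came across the wire.  
 nrow(data)   # the sample table above has 18 rows  
 str(data)    # rowid, a, b, event, and dataset columns  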

2.) Subset your data into training and testing data sets.

 ## Subset the data into Training and Testing datasets.  
 traindata <- subset(data, dataset == 'train')  
 testdata <- subset(data, dataset == 'test')  
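
With the sample data above, each subset should come back with nine rows:

 ## Each subset should hold the 9 rows tagged with that label above.  
 nrow(traindata)   # expect 9  
 nrow(testdata)    # expect 9  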

3.) Create the glm (generalized linear model) with family binomial for logistic regression. At this point we're working with data for which we already know the outcome. Because we will have data in the future where the outcome is unknown but we'd like to predict it, we can save our logistic regression model to an rda file; the next time this script runs against data with no known events, the model does not need to be re-trained. Finally, view the summary of the model.

 ## Create the logistic regression model using family binomial link=logit  
 ## Using the training dataset.  
 ## If the model was already built, then simply reload it.  
 if(file.exists("my_testglm.rda")) {  
   ## load model  
   load("my_testglm.rda")  
 } else {  
   ## (re)fit the model  
   testglm <- glm(data = traindata, event ~ a + b + a * b, family="binomial")  
   save(testglm, file = "my_testglm.rda")  
 }  
 ## View summary of the model  
 summary(testglm)  

The glm function (like lm for linear regression) is looking in particular for the formula. The formula (event ~ a + b + a * b) puts your dependent variable (the event you're predicting) on the left of the ~ and the explanatory variables (the variables that explain how the prediction is made) on the right. In this case, I'm using a and b as my explanatory variables, which might represent, for example, age and income. I also included an interaction of a*b. While this doesn't help my model in this case, it is a powerful way of adding predictive ability when you determine that one variable interacts directly with another - their combined weight may be significant. For example, you might find it more telling when someone has previously had both a heart attack -and- has high blood pressure, compared to simply one or the other.
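
As a side note on the syntax, R expands a * b in a formula to the main effects plus their interaction (a + b + a:b), so the following two calls fit the same model as the one above:

 ## In R's formula syntax, a * b is shorthand for the main effects plus  
 ## the interaction, so these two calls fit the same model:  
 glm(event ~ a + b + a:b, data = traindata, family = "binomial")  
 glm(event ~ a * b,       data = traindata, family = "binomial")  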


4.) Using the model, predict the probability of an event, round it to a predicted outcome, and add both columns to the data sets.

 ## Create prediction based on the model parameters.  
 traindata$prob_of_event <- predict(testglm, newdata=traindata, type="response")  
 testdata$prob_of_event <- predict(testglm, newdata=testdata, type="response")  
 ## round the predicted probabilities to 1 and 0  
 traindata$prob_of_event_round <-round(traindata$prob_of_event, digits = 0)  
 testdata$prob_of_event_round <- round(testdata$prob_of_event, digits = 0)  
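
If you prefer to make the classification threshold explicit rather than relying on round(), effectively the same step could be written with a 0.5 cutoff:

 ## The same decision rule written as an explicit 0.5 cutoff,  
 ## which some find easier to read than round().  
 traindata$prob_of_event_round <- ifelse(traindata$prob_of_event > 0.5, 1, 0)  
 testdata$prob_of_event_round  <- ifelse(testdata$prob_of_event > 0.5, 1, 0)  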



5.) Create and view the confusionMatrix results. The confusion matrix shows your predictive power by cross-tabulating the actual events against your predicted events. This breaks the predictions into true positives, false positives, true negatives, and false negatives. A true positive means you predicted an event and the event actually occurred; a false positive means you predicted an event even though it did not occur; a false negative means you predicted a non-event even though the event actually happened.

 ## View the accuracy of the model through a confusion matrix.  
 ## confusionMatrix expects factors, so convert the 0/1 columns first.  
 confusionMatrix(factor(traindata$prob_of_event_round, levels = c(0, 1)),  
                 factor(traindata$event, levels = c(0, 1)), positive = "1")  
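
If you just want the raw counts without caret's additional statistics, a plain cross-tabulation produces the same 2x2 matrix:

 ## The same counts, cross-tabulated by hand with predictions in the  
 ## rows and actual outcomes in the columns.  
 table(predicted = traindata$prob_of_event_round, actual = traindata$event)  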


In the confusionMatrix output above, you can see I've got a couple of misclassifications - cases where the rounded prediction disagrees with the actual outcome. Not bad, considering the data we've put into the model contains two explanatory-variable combinations where the outcome isn't clear-cut: the data leans towards being an event, but some actual cases turned out to be non-events (0). Nonetheless, the model says that, given those variables, there's a 66% probability of it being an event - it just so happens that 66% is not 100%. You can see these scenarios in the traindata output above.

Merge and copy the data back to SQL Server
1.) Merge the training and testing data sets into one data table, and then plot the resulting predictions.

 ## Merge your training and testing sets back together.  
 ## this is the data.table that we will copy back to SQL Server.  
 mergeddata <- data.table(rbind(traindata, testdata))  
 ## Plot the results of your predicted probabilities.  
 ggplot(data=mergeddata, aes(x=rowid, y=prob_of_event, color=dataset)) +  
   geom_point(size = 4)  



2.) Now copy the data back to SQL Server. The first step checks whether the destination table already exists, and if it does, drops it before the data is saved.

 ## First test to see if the table already exists. If so, drop it.  
 sql <- paste("if object_id('CommunalToilet.dbo.testlogit_results') is ",   
       "not null begin drop table CommunalToilet.dbo.testlogit_results end")  
 destroyObject <- sqlQuery(localdb, sql)  
 ## Save the data set to SQL Server.  
 sqlSave(localdb, mergeddata, tablename ="testlogit_results", append = FALSE, rownames = F)  
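
When you're done moving data back and forth, it's good practice to close the ODBC connection:

 ## Close the ODBC connection opened earlier with odbcDriverConnect().  
 odbcClose(localdb)  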

3.) Finally, query the resulting data in SQL Server.

 select * from dbo.testlogit_results  



Where to from here

Obviously you'll want to use some real data with some actual explanatory variables and start predicting the future. The most common scenario I've seen around the web uses the AdventureWorks database to predict whether someone is a likely bike buyer. They usually pull in multiple customer variables like income, commute distance, number of vehicles, etc., and then predict whether the person is likely to buy a bike - only so that they can spam that person with emails or postal mail. I prefer to think of better things to spend my predictive prowess on than marketing. :\

Once you've gathered new data, open and run your R script. Even though the new rows have no known outcomes yet, the saved model will load and predict the probability of an event for each one.
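
A minimal sketch of that scoring step, assuming your new rows live in a data frame called newdata with the same a and b columns (but no event column):

 ## A hypothetical scoring run against rows with unknown outcomes,  
 ## assuming newdata has the same a and b columns but no event column.  
 load("my_testglm.rda")        # restores the fitted object testglm  
 newdata$prob_of_event <- predict(testglm, newdata = newdata, type = "response")  
 newdata$prob_of_event_round <- round(newdata$prob_of_event, digits = 0)  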

Good luck, best wishes, etc., etc.,
-Parker
