Saturday, November 22, 2014

SQL Server, Predictive Analytics, Reducing Bias In Train and Test Data

SQL Server, Predictive Analytics, Reducing Bias In Train and Test Data

When training your statistical model to predict something, we never do so without our hold out data - our test data. This allows us to review the performance of our model against data that was not used during training, but for which we still know the outcome. However, the method through which you choose to divide your data in training and testing data can sometimes introduce bias and skew your predictive prowess. Were you to simply divide the data in half, you might have an uneven distribution of the population. For instance, you may have the data ordered by total sales. Splitting this in half might mean you trained your model on an unevenly heavy-handed sales total. 

To that end, we avoid ordering our data on the SQL Server side before moving it to our statistical software of choice (it better be R). However, this still isn't enough. Depending how your table is constructed, or was populated, it may be naturally ordered in some way that will create some sort of population bias. Tables with clustered indexes are great, but it also means your data is pre-ordered by that clustering key. That key might be negatively influencing the distribution. Never select the top 50% either - the data in the table is likely ordered in the table itself - you may still be getting biased sets.

To avoid this biased data selection, we introduce random number generation to our prepared data, and then split it. I show here the method for random number generation in SQL Server, and the method for doing so in R. While an analyst would quickly hop to the R version, which is incredibly easy, the SQL Server developer who wants to automate processes and data preparations, wants to be sure this kind of information is already stored in the database before the analyst or automated procedures need it, reducing the future workload for all involved.


Generating Random Numbers in SQL Server


Create the following view
 create view [dbo].[v_getRandomNum]   
 as   
 select   
      Random_ZeroOrOne = abs(cast(newid() as binary(6)) %2) -- between 0 and 1   
      , Random_Num = abs(cast(newid() as binary(6)) %1000) + 1 -- between 1 and 1000  
      , Random_Big = abs(cast(newid() as binary(6)) %999999)  
      , aGUID = newid()  
 ;  

Test it with some sample data. 
 -- Create Temp Table to hold our Sample Data  
 if object_id('tempdb..##SampleRandomized','u') is not null begin drop table ##SampleRandomized end;  
 create table ##SampleRandomized (  
      RowID               int identity(1,1)  
      , rnd_ZeroOrOne     tinyint  
      , rnd_Int           int  
 );  


 -- Create Sample Dataset with Randomized Numbers from SQL Server  
 -- Insert 1000 rows of sample numbers (using sys.objects as the numerator)  
 insert into ##SampleRandomized (rnd_ZeroOrOne, rnd_Int)   
 select top 100   
      rnd_OneOrZero = random.Random_ZeroOrOne  
      , rnd_Int     = random.Random_Num  
 from sys.objects, dbo.v_getRandomNum as random  
 ;  
 select * from ##SampleRandomized  


Now you can use the random number generating view by joining it to your data query. In the example above I've joined it to sys.objects simply as a means of row generation. The distribution of the random numbers will tend to be evenly distributed, given enough rows. 

In the sample I have only pulled together 100 rows, but if we look at the number of cases that are 1, compared to the number of cases that are 0 we will see it's pretty evenly split near 50%. The same distribution is applicable to our random integer, between 1 and 1000, we find about 50% are below 500, and 50% are at or above 500. Each time you generate the random numbers you should see slightly different distributions.


 select MinInt = min(rnd_Int)  
      , MaxInt = max(rnd_int)  
      , CountCases = count(1)  
      , CountOne = sum(case when rnd_ZeroOrOne =1 then 1 else 0 end)   
      , CountZero = sum(case when rnd_ZeroOrOne =0 then 1 else 0 end)   
      , CountBelow = sum(case when rnd_int < 500 then 1 else 0 end)   
      , CountAbove = sum(case when rnd_int >= 500 then 1 else 0 end)   
 from ##SampleRandomized  



Using this to break our data into a training and testing set is now a trivial matter.


 -- Now we can easily break the set in half, based on Random values, decreasing bias.'  
 if object_id('tempdb..##FullDataset','u') is not null begin drop table ##FullDataset end;  
 select *   
      , dataset_byInt = case when rnd_int < 500 then 'train' else 'test' end  
      , dataset_byBit = case when rnd_ZeroOrOne = 0 then 'train' else 'test' end  
 into ##FullDataset  
 from ##SampleRandomized  


Random Number Generation in R


If you plan on using R for your predictive analytics, you'll likely be using the runif command. runif stands for Random Uniform and generates a uniform distribution of random numbers between 0 and 1.


 ## Load the needed libraries  
 library(data.table)  
 library(RODBC)  
 library(ggplot2)  

 ## Get your sample randomized data  
 url1 <- "Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;"  
 dsr1 <- odbcDriverConnect(url1) ## Using RODBC  
 data1 <- data.table(sqlQuery(dsr1, paste("select * from ##FullDataset")))  


 ## Transform rnd_ZeroOrOne into a factor  
 data1$rnd_ZeroOrOne <- factor(data1$rnd_ZeroOrOne)  


 ## Random Number Generation via R  
 RowsInData <- nrow(data1)  
 data1$rnd_FromR <- runif(RowsInData)  

After implementing the runif based on nrows (number of rows), we can view the results of that vector.




Now we can break up our data into different sets, and mark them as train or test.



 ## Create rowcounts, by random methods, for both our test and train sets.  
 ## this can be done in compressed language with aggregate or ddply grouping  
 ## however, this is the more readily accessible to beginners.  
 rCountBelow <- data.table(provider = 'R', dataset = 'train', cases = nrow(subset(data1, rnd_FromR < .5)))  
 rCountAbove <- data.table(provider = 'R', dataset = 'test', cases = nrow(subset(data1, rnd_FromR >= .5)))  
 sCountBelow <- data.table(provider = 'SQL', dataset = 'train', cases = nrow(subset(data1, rnd_Int < 500)))  
 sCountAbove <- data.table(provider = 'SQL', dataset = 'test', cases = nrow(subset(data1, rnd_Int >= 500)))  
 

## union all the row counts  
 caseCounts <- rbind(rCountBelow, rCountAbove, sCountBelow, sCountAbove)  



To show the distribution, we can view(caseCounts)



A very uniform distribution indeed, bias reduced through random numbers.

To demonstrate visually, here's a gpplot of the distribution differences (there is none) between our SQL Server random numbers and the random numbers generated in R.


 ## View the distribution of Random row selection from R  
 ggplot(data = caseCounts,   
     aes(x = factor(provider), y=cases, fill = factor(dataset))) +  
  geom_bar(stat="identity") +  
  scale_y_continuous(limits=c(0,100))  




Enjoy,
-P

No comments:

Post a Comment