Sunday, November 9, 2014

SQL Server and Univariate Linear Regression

SQL Server and Univariate Linear Regression

This is a method for evaluating the correlation coefficient of a univariate linear regression model by simply calling a function from within SQL Server. The motivation here was that I wanted to be able to determine a possible linear correlation without having to leave my SQL Server Management Studio environment.

To start you'll need to create the following;

1.) Create a new table data type.

 create type dbo.XandY as table(  
      [x] float null  
      , [y] float null  
 );  

2.) Create the linear regression function;

 create function dbo.UnivariateLinearRegression (@XandY dbo.XandY READONLY)   
 returns @ReturnThisTable table (  
      correlation_coefficient float  
      , slope float  
      , intercept float   
      , r_squared float   
      , standard_estimate_error float  
 )  
 as  
 begin  
      declare @total_points int   
      , @intercept float   
      , @slope float   
      , @r_squared float   
      , @standard_estimate_error float   
      , @correlation_coefficient float   
      , @average_x float   
      , @average_y float   
      , @sumX float   
      , @sumY float   
      , @sumXX float   
      , @sumYY float   
      , @sumXY float   
      , @Sxx float   
      , @Syy float   
      , @Sxy float   
      , @stdevX float  
      , @stdevY float;  
      Select   
           @total_points = count(*)  
           , @average_x = avg(x)  
           , @average_y = avg(y)  
           , @sumX = sum(x)  
           , @sumY = sum(y)  
           , @sumXX = sum(x*x)  
           , @sumYY = sum(y*y)  
           , @sumXY = sum(x*y)  
           , @stdevX = stdev(x)  
           , @stdevY = stdev(y)  
      from @XandY a  
      ;  
      set @Sxx = @sumXX - (@sumX * @sumX) / @total_points;  
      set @Syy = @sumYY - (@sumY * @sumY) / @total_points;  
      set @Sxy = @sumXY - (@sumX * @sumY) / @total_points;  
      set @correlation_coefficient = @Sxy / sqrt(@Sxx * @Syy);  
      set @slope = (@stdevY / @stdevX) * @correlation_coefficient;  
      set @intercept = @average_y - @slope * @average_x;  
      set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points);  
      select @standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points) from @XandY;  
      insert into @ReturnThisTable (  
           correlation_coefficient  
           , slope  
           , intercept  
           , r_squared  
           , standard_estimate_error  
      )  
      select correlation_coefficient = @correlation_coefficient  
           , slope = @slope  
           , intercept = @intercept  
           , r_squared = @r_squared  
           , standard_estimate_error = @standard_estimate_error  
      ;  
      return;  
 end  

Try it out

1.) Generate some sample data

 if object_id('tempdb..##example','u') is not null   
      begin drop table ##example end;  
 -- Sample Dataset  
 create table ##example (  
      TheDay int  
      , TheValue float  
      , TheUnits float  
 );  
 insert into ##example (TheDay, TheValue, TheUnits)   
 select 1, 10.2, 4 union all select 2, 11.8, 5 union all   
 select 3, 19.2, 8 union all select 4, 10.2, 4 union all   
 select 5, 12.4, 5 union all select 6, 13.2, 6 union all   
 select 7, 15.2, 7 union all select 8, 17.2, 8 union all   
 select 9, 16.2, 7 union all select 10, 25.2, 16 union all   
 select 11, 12.7, 10 union all select 12, 14.2, 11 union all   
 select 13, 15.9, 13 union all select 14, 13.6, 6 union all   
 select 15, 19.2, 9;  

2.) View your sample data, with bonus code to view the 5 day moving average.

 select a.TheDay, a.TheValue, a.TheUnits  
      , c.FiveDayMovingAverageValue  
 from ##example a   
 outer apply (  
      select FiveDayMovingAverageValue = avg(TheValue)   
      from ##example b   
      where b.TheDay between a.TheDay - 4 and a.TheDay   
      having count(TheValue) > 4  
 ) c;  

Which should return this sample data

3.) Now use your new data type and function to generate the correlation coefficient;

 declare @XandY as dbo.XandY;  
 insert into @XandY (x, y) select TheUnits, TheValue from ##example;  
 select * from dbo.UnivariateLinearRegression(@XandY);  

The results of the function returned are;

As you can see, from our manufactured values, we have a strong correlation coefficient of .76
The additional results can help with plotting, or determining the quality of fit with the r_squared, also known as the coefficient of determination.

Validation through R

The values were compared to the results of performing the linear regression analysis on the same data in R.

 library(RODBC)  
 library(data.table)  
 library(ggplot2)  

 localdb <- odbcDriverConnect("Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;")  
 data <- data.table(sqlQuery(localdb, paste("select * from ##example")))
  
 testlm <- lm(data = data, TheValue ~ TheUnits)  

 summary(testlm)  
 cor(data$TheValue, data$TheUnits)  
 coef(testlm)  

 ggplot(data = data, aes(y=TheValue, x=TheUnits)) +  
   geom_point() +  
   geom_smooth(method=lm)  

Using the standard lm function, I generated the univariate linear regression model. Next I ran three diagnostic information queries: summary, cor, and coef.


As you can see from R, the summary shows a multiple R-Squared of .5786
The cor function shows the correlation coefficient at .7606
The coef function shows the slope and intercept, also mirroring the results from SQL Server.

Finally the ggplot output is shown below;


In a subsequent post I'll demonstrate the same capability but with the added ability of performing the correlation over a grouping value, which returns the coefficient for each grouped subset.

Till then...

-Parker

1 comment: