Sunday, December 28, 2014

Generate a Wordcloud from a URL using R

How to generate a wordcloud image from a website URL in just three lines of R code.

I recently found a very groovy wordcloud package for R, written by Ian Fellows, that I then extended to pass in a URL. Once you've got everything setup, that I'll outline here, you can then generate a wordcloud on the fly by simply pasting in the web address. You should have a basic understanding of loading R packages in order to set this up.

Required Libraries


You'll need to install the following libraries;

library(data.table)   # Used in filter and subsetting the words.
library(httr)            # used to access the webpage URL
library(XML)         # Used to extract the body tag content from the URL.
library(tm)              # Text mining features for stopword and punctuation removal
library(wordcloud)    # Used to generate the wordcloud
library(RColorBrewer) # Coloring the wordcloud

Create or download pageCloudify.R file


pageCloudify.R is my solution script file for loading the URL, scraping its contents, and generating the wordcloud image. You can paste the script into a new file called pageCloudify.R or simply download it from here.

1:  ## pageCloudify.R  
2:  ## Creates a wordcloud for one or more webpage URLs.  
3:  ## Most stopwords (ie. "the", "and") are removed by default.  
4:  ##---------------------------------  
5:  ## Function: pageCloudify()  
6:  ## Parameters:  
7:  # TheURLS: a string or vector of strings containing valid URL(s) http://starwars.wikia.com/wiki/Star_Wars  
8:  # freqReq: an integer value representing the minimum number of word repetition to be included in the cloud.  
9:  # toPNG:  Output wordClouds to PNG files. The default is False which sends the wordCloud to the screen device.  
10:  ## Invalid URL: The webpage html must contain a <body> tag. This is the portion of content used   
11:  ## to build the word cloud.  
12:  ## Warnings: Warnings will be generated for clouds containing more words than can fit in the cloud.  
13:  ## Try increasing the freqReq parameter to decrease the number of words in the cloud.  
14:  ## Many english stopwords are removed by default. To view the stopwords removed by the wordCloud package:  
15:  ## stopwords(kind = "en")  
16:  ## Libraries required: They will be attempted to be loaded for you when you call the source R file.  
17:  ## However, they do need to be already installed.  
18:  # library(data.table)  # Used in filter and subsetting the words.  
19:  # library(httr)     # used to access the webpage URL  
20:  # library(XML)     # Used to extract the body tag content from the URL.  
21:  # library(tm)      # Text mining features for stopword and punctuation removal  
22:  # library(wordcloud)  # Used to generate the wordcloud  
23:  # library(RColorBrewer) # Coloring the wordcloud  
24:  ##### Examples #####  
25:  ## Copy the pageCloudify.R file to your working directory.  
26:  ## Set your working directory.  
27:  # setwd("c:/personal/r")  
28:  ## Load the pageCloudify.R source file.  
29:  # source("pageCloudify.R")  
30:  ## Example usage with single URL  
31:  # pageCloudify(TheURLs="http://en.wikipedia.org/wiki/Star_Wars", freqReq = 15, toPNG = F)  
32:  ## Example usage with multiple URLs, to PNG files  
33:  ## each png file will go to the working directory  
34:  ## each png file will be named mywordcloud_X.png where X is an ordinal integer from 1 to the number of URLs.  
35:  # TheURLs <- c("http://incrediblemouse.blogspot.com/2014/11/sql-server-r-and-facebook-sentiment.html",   
36:  #       "http://incrediblemouse.blogspot.com/2014/11/sql-server-predictive-analytics.html",   
37:  #       "http://incrediblemouse.blogspot.com/2014/11/sql-server-r-and-logistic-regression.html",   
38:  #       "http://incrediblemouse.blogspot.com/2014/11/sql-server-and-univariate-linear.html")  
39:  # pageCloudify(TheURLs, freqReq = 5, toPNG = T)  

40:  library(data.table)  
41:  library(httr)  
42:  library(XML)  
43:  library(wordcloud)  
44:  library(tm)  
45:  library(RColorBrewer)  
46:  pageCloudify <- function(theURLs, freqReq = 10, toPNG=F){  
47:    y = 1  
48:    for(x in theURLs){  
49:      ## Using XML to pull the html  
50:      parsedHTML2 <- htmlTreeParse(x, useInternalNodes=T)  
51:      PostEntry <- xpathSApply(parsedHTML2, "//body",simplify=T)  
52:      ## Munging for Wordcloud  
53:      z <- data.table(xmlSApply(PostEntry[[1]], xmlValue))   
54:      z <- data.table(words = tolower(gsub("\n", " ", z$V1, fixed = TRUE)))  
55:      z <- data.table(words = tolower(gsub("[^[:alnum:] ]", "", z$words)))  
56:      z <- data.table(words = unlist(strsplit(z$words," ")))  
57:      z$words <- removeWords(z$words, stopwords(kind = "en"))  
58:      z <- subset(z, words != "" | words != " ")  
59:      z <- data.table(words = removePunctuation(z$words), int = 1)  
60:      z <- z[,(wordcount = sum(int)), by=words]  
61:      z <- subset(z, nzchar(z$words) == T & nchar(z$words) > 2 & nchar(z$words) < 26 & z$V1 > freqReq)  
62:      ## Word Cloud Color Palette  
63:      pal <- brewer.pal(6,"Dark2")  
64:      ##pal <- pal[-(1)]  
65:      if (toPNG) {  
66:        pngFileName = paste("mywordcloud_", as.character(y), ".png", sep="")  
67:        png(file=pngFileName, bg="transparent")  
68:        wordcloud(z$words, z$V1, scale = c(4,.8), min.freq=2, random.color = T, colors=pal)  
69:        dev.off()  
70:      } else {  
71:        wordcloud(z$words, z$V1, scale = c(4,.8), min.freq=2, random.color = T, colors=pal)  
72:      }  
73:      y = y + 1  
74:    }  


The script provides a new function called pageCloudify()
Parameters:
theURLS: a string or vector of strings containing valid URL(s)
freqReq: an integer value representing the minimum number of word repetition to be included in the cloud.
toPNG: Output wordclouds to PNG files. The default is False which sends the wordCloud to the screen device.

Plot a WordCloud


Now that've got your packages installed, and the above script saved to a memorable location, you can now generate a wordcloud in 3 lines of code. Be sure to change your working directory to the same location where you stored the pageCloudify.R file.

1:  setwd("c:/personal/r")  # Change this to the stored location of pageCloudify.R  
2:  source("pageCloudify.R")  
3:  pageCloudify(theURLs="http://en.wikipedia.org/wiki/Star_Wars", freqReq = 25, toPNG = F)  

Optionally run the code again to get a different variation on the placement  and coloring of the words.

You can even pass in multiple URLs all at once, and output them to PNG files.

1:  ## Example usage with multiple URLs, to PNG files  
2:  ## each png file will go to the working directory  
3:  ## each png file will be named mywordcloud_X.png where X is an ordinal integer from 1 to the number of URLs.  
4:  theURLs <- c("http://incrediblemouse.blogspot.com/2014/11/sql-server-r-and-facebook-sentiment.html",   
5:         "http://incrediblemouse.blogspot.com/2014/11/sql-server-predictive-analytics.html",   
6:         "http://incrediblemouse.blogspot.com/2014/11/sql-server-r-and-logistic-regression.html",   
7:         "http://incrediblemouse.blogspot.com/2014/11/sql-server-and-univariate-linear.html")  
8:  pageCloudify(theURLs, freqReq = 5, toPNG = T)  

Modify the function for even more cool features, likes different color palettes. Simply pass a different brewer.pal color palette name.

The first cloud is from Issac Newton's wikipedia page. I mean, who's a better candidate for a colorful wordcloud than the man who figured out that light contains all the visible colors.


While this is awesome, from a programmer standpoint, there are great sites where you can simply paste in some text, and generate highly artistic wordclouds - You might be interested in wordle.net

Enjoy!

-P


Sunday, November 23, 2014

SQL Server, R, and Facebook Sentiment Analysis

Apparently, if you're talking about congress, you're probably not using flowery adjectives.


I love the integration of SQL Server and R for data analysis and storage. I often jump back and forth between platforms depending on what I want to accomplish and the efficiency at which I know I can get the job done. Granting that you can perform all of this analysis, and to a much larger statistical extent, inside of R, I still lean on SQL Server for my data storage engine.

I demonstrate here the use of R for accessing Facebook public posts, SQL Server for storage, manipulation, and sentiment analysis of that data. Again, you could achieve it all in R, but my background is in data architecture, so I have a penchant for storing data responsibly for future needs (I wish I did the same with my money).

The above chart, aggregated in SQL Server and prettied up with Excel, scored word sentiment found in 200 recent public facebook posts, for each search term category.

Software Needed

R can be downloaded from CRAN and R Studio can be downloaded from RStudio.
SQL Server Express 2014, capable of everything here, can be downloaded from Microsoft. I highly recommend the version: Express with Advanced Services (SQLEXPRADV), since it comes with full text search as well (not needed here, but is a very efficient text search engine you might one day want).

The process of the whole project is as follows;

1.) Obtain your facebook developer API access token.
2.) Configure your R environment for access to the API
3.) Prepare a SQL Server table for facebook data.
4.) Retrieve data from Facebook in R and save it to SQL Server.
5.) Prepare a SQL Server function for word parsing.
6.) Create a WordScorer lookup table.
7.) Score the facebook data, by search term, providing the average sentiment score for each.

Obtain your facebook developer API access token


To begin, you'll need to create an app at https://developers.facebook.com
I show here the basic configuration needed. The Site URL may need to be slightly different, but we'll get to that shortly.


Make note of your App ID and App Secret. You'll need them both for use in R.

Configure your R environment for access to the API


You'll need to install and load a few R libraries ahead of time. These will cover the bases for rFacebook, OAuth, SQL Server access, and even future plotting for commands we will using here. Some are just useful for extended features we'll not be using here, but that you should have on hand. All of the facebook API interaction in R is done via the awesome Rfacebook library, hosted on Cran and written by Pablo Barbera.

 library(Rfacebook);  
 library(data.table);  
 library(RCurl);  
 library(rjson);  
 library(RODBC);  
 library(ROAuth);  
 library(ggplot2);  
 
## Also set your working directory.
## Your Facebook OAuth credentials will be saved 
## here for future use.
setwd("C:/Personal/R")


Now you'll authenticate your OAuth connection using your App ID and App Secret from the facebook application basic settings page, as noted previously.

 ## This is only performed once.  
 fb_oauth <- fbOAuth("YourAppIDHere", "YourAppSecretHere", extended_permissions = TRUE)  
 save(fb_oauth, file="fb_oauth")
  
 ## Now we only need to do this for all future OAuth loading.
 load("fb_oauth")  

When you run the fbOAuth, your R console will provide further instructions. These instructions contain the URL you should apply to your facebook application settings for the Site URL.

Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/ 
When done, press any key to continue...

After applying the Site URL, press any key in the console, as the instructions state. Your web browser will open, access facebook, and tell you when it is completed. Afterwards, close your browser.

That's the one-time setup (until the token expires). From this point forward, you'll only need to load("fb_oauth")

Prepare a SQL Server table for your facebook data.


In SQL Server we'll create a table to hold all of the variables the Rfacebook function searchFacebook returns. We could simply pass the sqlSave command and let RODBC create our table for us, however, it's not all that savvy when it comes to data types. It tends to take the path of least resistance, making most numbers a float, and all character vectors as varchar(255). Since one of the vectors in the results, [message], containing the contents of the facebook post can contain a great number of characters, RODBC will truncate this data before inserting it. However, if we specify the table data types more explicitly in advance, we can get greater control. I've chosen a varchar(4000) to hold the message body. Some posts can even be longer, but for this simplistic sentiment analysis it's plenty large enough, even if we get some truncation. I would have preferred to use SQL Server's [text] data type, but was unable to get RODBC to stop complaining about it. I've tabled that for another rainy day.

It's also important to note that your table should mirror, column for column, what you plan on having in your R data table. If you want to insert records to an existing table with additional columns, you should really insert it to a temporary staging table, and migrate the data via trigger or otherwise. Yet another topic for a different rainy day.

 create table dbo.facebooksearchresults (  
      from_id bigint null  
      , from_name varchar(255) null  
      , [message] varchar(2000) null  
      , created_time varchar(255) null  
      , [type] varchar(255) null  
      , link varchar(255) null  
      , id varchar(255) null  
      , likes_count integer null  
      , comments_count integer null  
      , shares_count integer null  
      , searchstring varchar(255) null  
      , dateadded date  
 );  

Retrieve data from Facebook in R and save it to SQL Server.


Now, in R, we can utilize the searchFacebook function for specific terms, and save the results back to SQL Server. I've repeated my search (and data save), pulling down 200 posts, for each search term I was interested in. The example below searches for posts containing the term ebola as an example. You might consider a vector of terms, and looping over them, but for simplicity and training, I've left that out of this article.

1:  ## Search Facebook Public Posts, and save data to SQL Server.  
2:  searchString = "ebola"  
3:  data <- searchFacebook(string=searchString, token=fb_oauth, n = 200, since = NULL, until = NULL)  

4:  ## Convert to a data table.  
5:  datatbl <- data.table(data)  

6:  ## Add a new column to our table with the search string.  
7:  ## This field will allow us to append this data to SQL Server   
8:  ## and can act as the key to filtering for this specific search.  
9:  ## Now we can search multiple terms, and keep loading data.  
10:  datatbl$searchstring <- searchString  

11:  ## Add the dateadded, in case you want to compare a search on one day, to another.  
12:  datatbl$dateadded <- as.character(Sys.Date())  


The last two fields in our SQL Server table, searchString and dateadded, are not returned by the searchFacebook function, but we'll add those to our R data table so they can be stored with our results.

Your console should indicate that x number of posts were retrieved. We can now save the data to our SQL Server table

1:  ## Connect to your SQL Server.  
2:  url1 <- "Driver={SQL Server};Server=YourServerAndInstanceName;Database=YourDatabaseName;Trusted_Connection=Yes;"  
3:  localdb <- odbcDriverConnect(url1)  
4:    
5:  ## For very specific colnames and data type mapping, roll your own.  
6:  varTypes <- c("bigint", "varchar", "varchar", "varchar", "varchar",   
7:         "varchar", "varchar", "integer", "integer", "integer", "varchar", "date")  
8:  names(varTypes) <- c("from_id", "from_name", "message", "created_time",   
9:            "type", "link", "id", "likes_count", "comments_count",   
10:            "shares_count", "searchstring", "dateadded")  
11:    
12:  ## Save the data back to SQL Server.  
13:  sqlSave(localdb, datatbl, varTypes=varTypes, tablename ="facebookSearchResults", append = T, rownames = F)  


Prepare a SQL Server function for word parsing.


In order to parse the large [message] post contents into individual words, we need to delineate the words by spaces. There are a great number of ways to splice strings into words, but I have a favorite function for SQL Server I found a long time ago, that I still use today. It's efficient mostly because of its use of an Itzik Ben-Gan style of exponentially recursive CTEs.

One time process: Create the following function, enabling us to split strings.

 USE [YourDatabaseName]  
 GO  
   
create function [dbo].[delimitedsplit8k]  
 --===== define i/o parameters  
     (@pstring varchar(8000), @pdelimiter char(1))  
 --warning!!! do not use max data-types here! it will kill performance!  
 returns table with schemabinding as  
  return  
 --===== "inline" cte driven "tally table" produces values from 1 up to 10,000...  
    -- enough to cover varchar(8000)  
  with e1(n) as (  
          select 1 union all select 1 union all select 1 union all  
          select 1 union all select 1 union all select 1 union all  
          select 1 union all select 1 union all select 1 union all select 1  
         ),             --10e+1 or 10 rows  
     e2(n) as (select 1 from e1 a, e1 b), --10e+2 or 100 rows  
     e4(n) as (select 1 from e2 a, e2 b), --10e+4 or 10,000 rows max  
  ctetally(n) as (--==== this provides the "base" cte and limits the number of rows right up front  
            -- for both a performance gain and prevention of accidental "overruns"  
          select top (isnull(datalength(@pstring),0)) row_number() over (order by (select null)) from e4  
         ),  
 ctestart(n1) as (--==== this returns n+1 (starting position of each "element" just once for each delimiter)  
          select 1 union all  
          select t.n+1 from ctetally t where substring(@pstring,t.n,1) = @pdelimiter  
         ),  
 ctelen(n1,l1) as(--==== return start and length (for use in substring)  
          select s.n1,  
             isnull(nullif(charindex(@pdelimiter,@pstring,s.n1),0)-s.n1,8000)  
           from ctestart s  
         )  
 --===== do the actual split. the isnull/nullif combo handles the length for the final element when no delimiter is found.  
  select itemnumber = row_number() over(order by l.n1),  
     item    = substring(@pstring, l.n1, l.l1)  
   from ctelen l  
 ;  

Create a WordScorer lookup table


We'll need one final ingredient before we can start scoring the words by search term. We need a lookup table that assigns a sentiment value to words. Naturally we're not going to use the whole dictionary, but we can get started with a pretty straight-forward approach.

Not every possible word needs a score. We need to ignore "a", "and", "the", and all those other words out that don't generally indicate a users sentiment, like the word "chair". The word chair doesn't give us any idea what kind of mood the user is in when they wrote it, or the sentiment behind those words.

My WordScorer table has 2477 words commonly used to score sentiment. They include adjectives and verbs that can generally be considered to be positive or negative influencers on sentiment. For instance, "abusive", "lol", and "jackass".

You'll need to create the following table and import this data (currently in google doc spreadsheet format) into the table.

 create table dbo.wordscorer(  
      wordid int identity(1,1) not null  
      , word varchar(50) not null  
      , score int not null  
      , scoreplusone int not null  
  constraint pk_wordscorer primary key clustered (wordid asc)  
 );  

Score the facebook data, by search term, providing the average sentiment score for each.


Finally, the peas of resistance! We can now query our posts, parse the words, score them, and report the average sentiment for each search string.

1:  with TheWords as (  
2:       select r.SearchString, z.item   
3:       from dbo.facebooksearchresults r   
4:       cross apply (  
5:            select *   
6:            from dbo.DelimitedSplit8K (r.[message],'')   
7:       ) z   
8:       where r.[message] is not null   
9:  )  
10:  , TheWordsScored as (  
11:       select SearchString, item, w.word, w.score   
12:       from TheWords i   
13:       join [dbo].[WordScorer] w   
14:            on i.item = w.word  
15:  )  
16:  select SearchString, AverageSentimentScore = avg(cast(score as float))   
17:  from TheWordsScored   
18:  group by searchString   
19:  order by avg(cast(score as float)) desc  


Additional Data Considerations


There is more analysis and cleaning that should be considered when performing sentiment analysis on these posts. First of all, the search terms are searched individually, and never combined. Therefore, a two word search term will yield inconsistent results. After retrieving the data, you'll have to filter that set where both are found, eliminating most if not all of the results you pulled. So, in my search terms, "Taylor Swift" will return posts that merely mention the word swift. This means the sentiment result for the two word term is erroneous, unless filtered where both are present.

Secondly, the search is performed by facebook over the message content and the comment content. So if you're looking for just the message contents containing the terms, you'll again have to filter for this on the backend after retrieval.

Where to from here


At this point, I'd be very interested in automating a regular pull of the data to see how that sentiment moves and changes over time. Facebook doesn't let you go very far back in time, even with the date parameters. It must be nice to be Facebook and have all this data in their hands already. That's my problem - I've got more side projects of personal interest than I know what to do with.


Enjoy,
-P

Have a great and fantastically wonderfully cheerful day!
(+9 positive sentiment bagged right there)

I am turned into a sort of machine for observing facts and grinding out conclusions.
Charles Darwin

Saturday, November 22, 2014

SQL Server, Predictive Analytics, Reducing Bias In Train and Test Data

SQL Server, Predictive Analytics, Reducing Bias In Train and Test Data

When training your statistical model to predict something, we never do so without our hold out data - our test data. This allows us to review the performance of our model against data that was not used during training, but for which we still know the outcome. However, the method through which you choose to divide your data in training and testing data can sometimes introduce bias and skew your predictive prowess. Were you to simply divide the data in half, you might have an uneven distribution of the population. For instance, you may have the data ordered by total sales. Splitting this in half might mean you trained your model on an unevenly heavy-handed sales total. 

To that end, we avoid ordering our data on the SQL Server side before moving it to our statistical software of choice (it better be R). However, this still isn't enough. Depending how your table is constructed, or was populated, it may be naturally ordered in some way that will create some sort of population bias. Tables with clustered indexes are great, but it also means your data is pre-ordered by that clustering key. That key might be negatively influencing the distribution. Never select the top 50% either - the data in the table is likely ordered in the table itself - you may still be getting biased sets.

To avoid this biased data selection, we introduce random number generation to our prepared data, and then split it. I show here the method for random number generation in SQL Server, and the method for doing so in R. While an analyst would quickly hop to the R version, which is incredibly easy, the SQL Server developer who wants to automate processes and data preparations, wants to be sure this kind of information is already stored in the database before the analyst or automated procedures need it, reducing the future workload for all involved.


Generating Random Numbers in SQL Server


Create the following view
 create view [dbo].[v_getRandomNum]   
 as   
 select   
      Random_ZeroOrOne = abs(cast(newid() as binary(6)) %2) -- between 0 and 1   
      , Random_Num = abs(cast(newid() as binary(6)) %1000) + 1 -- between 1 and 1000  
      , Random_Big = abs(cast(newid() as binary(6)) %999999)  
      , aGUID = newid()  
 ;  

Test it with some sample data. 
 -- Create Temp Table to hold our Sample Data  
 if object_id('tempdb..##SampleRandomized','u') is not null begin drop table ##SampleRandomized end;  
 create table ##SampleRandomized (  
      RowID               int identity(1,1)  
      , rnd_ZeroOrOne     tinyint  
      , rnd_Int           int  
 );  


 -- Create Sample Dataset with Randomized Numbers from SQL Server  
 -- Insert 1000 rows of sample numbers (using sys.objects as the numerator)  
 insert into ##SampleRandomized (rnd_ZeroOrOne, rnd_Int)   
 select top 100   
      rnd_OneOrZero = random.Random_ZeroOrOne  
      , rnd_Int     = random.Random_Num  
 from sys.objects, dbo.v_getRandomNum as random  
 ;  
 select * from ##SampleRandomized  


Now you can use the random number generating view by joining it to your data query. In the example above I've joined it to sys.objects simply as a means of row generation. The distribution of the random numbers will tend to be evenly distributed, given enough rows. 

In the sample I have only pulled together 100 rows, but if we look at the number of cases that are 1, compared to the number of cases that are 0 we will see it's pretty evenly split near 50%. The same distribution is applicable to our random integer, between 1 and 1000, we find about 50% are below 500, and 50% are at or above 500. Each time you generate the random numbers you should see slightly different distributions.


 select MinInt = min(rnd_Int)  
      , MaxInt = max(rnd_int)  
      , CountCases = count(1)  
      , CountOne = sum(case when rnd_ZeroOrOne =1 then 1 else 0 end)   
      , CountZero = sum(case when rnd_ZeroOrOne =0 then 1 else 0 end)   
      , CountBelow = sum(case when rnd_int < 500 then 1 else 0 end)   
      , CountAbove = sum(case when rnd_int >= 500 then 1 else 0 end)   
 from ##SampleRandomized  



Using this to break our data into a training and testing set is now a trivial matter.


 -- Now we can easily break the set in half, based on Random values, decreasing bias.'  
 if object_id('tempdb..##FullDataset','u') is not null begin drop table ##FullDataset end;  
 select *   
      , dataset_byInt = case when rnd_int < 500 then 'train' else 'test' end  
      , dataset_byBit = case when rnd_ZeroOrOne = 0 then 'train' else 'test' end  
 into ##FullDataset  
 from ##SampleRandomized  


Random Number Generation in R


If you plan on using R for your predictive analytics, you'll likely be using the runif command. runif stands for Random Uniform and generates a uniform distribution of random numbers between 0 and 1.


 ## Load the needed libraries  
 library(data.table)  
 library(RODBC)  
 library(ggplot2)  

 ## Get your sample randomized data  
 url1 <- "Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;"  
 dsr1 <- odbcDriverConnect(url1) ## Using RODBC  
 data1 <- data.table(sqlQuery(dsr1, paste("select * from ##FullDataset")))  


 ## Transform rnd_ZeroOrOne into a factor  
 data1$rnd_ZeroOrOne <- factor(data1$rnd_ZeroOrOne)  


 ## Random Number Generation via R  
 RowsInData <- nrow(data1)  
 data1$rnd_FromR <- runif(RowsInData)  

After implementing the runif based on nrows (number of rows), we can view the results of that vector.




Now we can break up our data into different sets, and mark them as train or test.



 ## Create rowcounts, by random methods, for both our test and train sets.  
 ## this can be done in compressed language with aggregate or ddply grouping  
 ## however, this is the more readily accessible to beginners.  
 rCountBelow <- data.table(provider = 'R', dataset = 'train', cases = nrow(subset(data1, rnd_FromR < .5)))  
 rCountAbove <- data.table(provider = 'R', dataset = 'test', cases = nrow(subset(data1, rnd_FromR >= .5)))  
 sCountBelow <- data.table(provider = 'SQL', dataset = 'train', cases = nrow(subset(data1, rnd_Int < 500)))  
 sCountAbove <- data.table(provider = 'SQL', dataset = 'test', cases = nrow(subset(data1, rnd_Int >= 500)))  
 

## union all the row counts  
 caseCounts <- rbind(rCountBelow, rCountAbove, sCountBelow, sCountAbove)  



To show the distribution, we can view(caseCounts)



A very uniform distribution indeed, bias reduced through random numbers.

To demonstrate visually, here's a gpplot of the distribution differences (there is none) between our SQL Server random numbers and the random numbers generated in R.


 ## View the distribution of Random row selection from R  
 ggplot(data = caseCounts,   
     aes(x = factor(provider), y=cases, fill = factor(dataset))) +  
  geom_bar(stat="identity") +  
  scale_y_continuous(limits=c(0,100))  




Enjoy,
-P

Windows Tip: God Mode for Windows

Oh my gawd. You have the power. Well, it's not nearly as awesome as god mode used to be when playing the original Quake - but it's interesting anyway.

Check out this windows tip for accessing all those windows god mode features through one control panel-like interface.

What's the secret? Simply right click anywhere on your desktop and create a new shortcut, but name it specifically this:

GodMode.{ED7BA470-8E54-465E-825C-99712043E01C}

Now open up your god panel!!!
(it's really just an expanded control panel with administrative shortcuts)




Enjoy,
-P

A Peek Inside an XLSX Document

Wants to see what's under the hood of an XLSX file? Probably not, but just for giggles, this is a little bit interesting. Did you know that Excel xlsx files are actually zip files? Change the file extension to .zip and [poof], you can view the underlying supporting documents. You'll notice the format of the xlsx file is supported by pretty straightforward xml files.

Make File Extensions Visible

If you don't have file extensions visible - set this folder option first.
Option your control panel, and select "Folder Options".
Then remove the checkmark for "Hide extensions for known file types".


Rename the xlsx file

Using Windows File Explorer, choose any xlsx file and rename the extension from xlsx to zip.



Naturally, say yes to the prompt.


Now open the .zip file, and view the contents.


Oh my.. Look at that xml file. 

Digging inside there you'll notice it's straight forward xml with values, except that excel is using an index lookup to the sharedStrings.xml file (also located in a higher leve inside the zip file) for cell values. So in the Sheet1.xml, a <v>0</v> would correspond to the first value in the sharedStrings file. Yes, you can edit the files, save it back, rename the zip back to an xlsx file, and it still works.

Programmatically, this makes for pretty easy interaction with excel documents.

Anyways, enjoy.
-P

Sunday, November 9, 2014

SQL Server, R, and Logistic Regression

SQL Server, R, and Logistic Regression

This demonstrates the retrieval of data from SQL Server, performing logistic regression to predict the probability of an event and saving the data back to SQL Server for storage, reporting or additional analysis. The motivation here is the desire to extend the capabilities of your average SQL Server developer who may be looking to expand their skills with predictive analytics and R.

Logistic regression returns the probabilty of an event, and in the case of my example, the event is represented by 1, and a non-event is represented by a 0. In order to start predicting probabilities we should follow the basic protocol of breaking our dataset into two groups, test and train. For example, we might have 100 cases (rows) of events with explanatory variables (those that are used to predict the resulting event). We would break up this data, for which we already know the outcome, into 50 cases of training data, and 50 cases of testing data. This allows us to not only build our predictive model, but then put the model up against another set of known outcomes to compare how our model is performing. In the example data I will use, the test and train data are identical, for the sake of simplicity. (keep it stupid, simple.)

SQL Server Sample Data
1.) In SQL Server, create a global temporary table with sample data;

 -- test data for logistic regression  
 -- Find the probability of an event  
 if object_id('tempdb..##testlogit') is not null begin drop table ##testlogit end;  
 create table ##testlogit (  
      rowid int identity(1,1)  
      , [a] int not null   
      , [b] int not null   
      , [event] int null   
      , dataset varchar(5) not null  
 )  
 -- Insert the model training data.   
 -- We will use this to train our predictive prowess.  
 insert into ##testlogit select 1, 0, 1, 'train'  
 insert into ##testlogit select 1, 0, 1, 'train'  
 insert into ##testlogit select 1, 0, 0, 'train'  
 insert into ##testlogit select 1, 1, 1, 'train'  
 insert into ##testlogit select 1, 1, 1, 'train'  
 insert into ##testlogit select 0, 0, 0, 'train'  
 insert into ##testlogit select 0, 0, 0, 'train'  
 insert into ##testlogit select 0, 1, 0, 'train'  
 insert into ##testlogit select 0, 1, 1, 'train'  
 -- Insert the model testing data. We will predict the event in this set.  
 insert into ##testlogit select 1, 0, 1, 'test'  
 insert into ##testlogit select 1, 0, 1, 'test'  
 insert into ##testlogit select 1, 0, 0, 'test'  
 insert into ##testlogit select 1, 1, 1, 'test'  
 insert into ##testlogit select 1, 1, 1, 'test'  
 insert into ##testlogit select 0, 0, 0, 'test'  
 insert into ##testlogit select 0, 0, 0, 'test'  
 insert into ##testlogit select 0, 1, 0, 'test'  
 insert into ##testlogit select 0, 1, 1, 'test'  

2.) The resulting table should contain the following;


Copy the data to R for Logistic Regression

1.) Connect to the data source and pull the data into a variable called data. I've used the library RODBC, though I've seen others use rsqlserver. I hear that rsqlserver is faster for data movement, however, when you're working with statistical models in R, it's actually pretty rare that you're moving mass amounts of data around. This is even a limitation of R, in that it stores your data in objects in memory - meaning that large datasets need to take into account the available hardware limitations as well. Frankly, the words "large amount of data" is a subjective opinion, in my view. I've seen people who consider 1GB large, and some won't use the word until they see a terabyte. Nonetheless, for the incredibly tiny sample data we're using here, and everything I have ever done with R, I've found RODBC to be stunningly fast enough.

 ## Load the needed libraries  
 library(RODBC)  
 library(data.table)  
 library(ggplot2)  
 ## For the ConfusionMatrix you'll need the following  
 ##install.packages("caret")  
 ##install.packages("e1071")  
 library(caret)  
 ## Set the working directory  
 setwd("c:/personal/r")  
 ## Connect to and pull your sample data from SQL Server.  
 driver <- "Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;"  
 localdb <- odbcDriverConnect(driver)  
 data <- data.table(sqlQuery(localdb, paste("select * from ##testlogit")))  

Be sure to replace the YourServerName with your actual server name, and YourDatabase name with your actual database name. The database I'm using I've lovingly called CommunalToilet - it's where I put all my garbage data. :)

2.) Subset your data into training and testing data sets.

 ## Subset the data into Training and Testing datasets.  
 traindata <- subset(data, dataset == 'train')  
 testdata <- subset(data, dataset == 'test')  

3.) Create the glm (generalized linear model) using family binomial for logistic regression. At this point we're playing with data in which we already know the outcome. Because we will have data in the future in which we do not know the outcome, but would like to predict it, we can save our logistic regression model to an rda file. The next time I use this script with data that has no known events, I do not need to re-train my model. Finally, I viewed the summary of that model.

 ## Create the logistic regression model using family binomial link=logit  
 ## Using the training dataset.  
 ## If the model was already built, then simply reload it.  
 if(file.exists("my_testglm.rda")) {  
   ## load model  
   load("my_testglm.rda")  
 } else {  
   ## (re)fit the model  
   testglm <- glm(data = traindata, event ~ a + b + a * b, family="binomial")  
   save(testglm, file = "my_testglm.rda")  
 }  
 ## View summary of the model  
 summary(testglm)  

The glm (lm linear regression) model function is looking particularly for the formula. The formula (event ~ a + b + a * b) is your dependent variable (the event you're predicting) ~ over the explanatory variables (the variables that explain how the prediction is made). In this case, I'm using a and b as my explanatory variables, which might represent, for example, age and income. I also included an interaction of a*b. While this doesn't help my model in in this case, it is a powerful way of adding predictive ability when you determine that one variable has direct interactivity with another - their combined weight may be significant. For example, you might find it more significant when someone has previously had both a heart attack -and- has high blood pressure, compared to simply one or the other.


4.) Using your model, predict the probability, the rounded probability, and put those into our data sets.

 ## Create prediction based on the model parameters.  
 traindata$prob_of_event <- predict(testglm, newdata=traindata, type="response")  
 testdata$prob_of_event <- predict(testglm, newdata=testdata, type="response")  
 ## round the predicted probabilities to 1 and 0  
 traindata$prob_of_event_round <-round(traindata$prob_of_event, digits = 0)  
 testdata$prob_of_event_round <- round(testdata$prob_of_event, digits = 0)  



5.) Create and view the confusionMatrix results. The confusion matrix shows your predictive power by creating a matrix of the actual events to your predicted events. This results in predictions called true positives, false positives, true negatives, and false negatives. Thus, a true positive would indicate you predicted true in an actual event, and a false positive would indicate you've predicted false even though the event actually occurred.

 ## View the accuracy of the model through a confusion matrix  
 confusionMatrix(traindata$prob_of_event_round, traindata$event, positive="1")  


In the confusionMaztrix above, you can see I've got 2 false negatives. I falsely predict they are events (1) even though they are not actual events (0) in the data. Not bad, considering the data we've put into the model, there are two explanatory variable scenarios where the predictive value isn't perfect. The data leans towards being an event, but I have actual cases that turned out to be (0) non-events. Nonetheless, the model says that, given these variables, there's a 66% probability of it being an event. It just so happens, that 66% is not 100%. You can see these scenarios in the traindata output above.

Merge and copy the data back to SQL Server
1.) Merge the training and testing data sets into one data table, and then plot the resulting predictions.

 ## Merge your training and testing sets back together.  
 ## this is the data.table that we will copy back to SQL Server.  
 mergeddata <- data.table(rbind(traindata, testdata))  
 ## Plot the results of your predicted probabilities.  
 ggplot(data=mergeddata, aes(x=rowid, y=prob_of_event, color=dataset)) +  
   geom_point(size = 4)  



2.) Now copy the data back to SQL Server. The first step will check to see if the table I'm copying the data into already exists and if it does then we will remove the data.

 ## First test to see if the table already exists. If so, drop it.  
 sql <- paste("if object_id('CommunalToilet.dbo.testlogit_results') is ",   
       "not null begin drop table CommunalToilet.dbo.testlogit_results end")  
 destroyObject <- sqlQuery(localdb, sql)  
 ## Save the data set to SQL Server.  
 sqlSave(localdb, mergeddata, tablename ="testlogit_results", append = FALSE, rownames = F)  

3.) Finally query your resulting data in SQL Server.

 select * from dbo.testLogit_result  



Where to from here

Obviously you'll want to use some real data with some actual explanatory variables, and start predicting the future. The most common scenario I've seen around the web is through the use of the adventure works database and predicting if someone is a biker buyer. They usually pull in multiple customer variables like income, commute distance, number of vehicles, etc., and then predict if the person is a likely bike buyer - only so that they can they spam that person with emails or postal mail. I prefer to think of better things to spend my predictive prowess on than marketing. :\

Once you've gathered the data, open and run your R script. Even though you don't have events predicted yet, the model will load and predict events based on that.

Good luck, best wishes, etc., etc.,
-Parker

SQL Server and Univariate Linear Regression

SQL Server and Univariate Linear Regression

This is a method for evaluating the correlation coefficient of a univariate linear regression model by simply calling a function from within SQL Server. The motivation here was that I wanted to be able to determine a possible linear correlation without having to leave my SQL Server Management Studio environment.

To start you'll need to create the following;

1.) Create a new table data type.

 create type dbo.XandY as table(  
      [x] float null  
      , [y] float null  
 );  

2.) Create the linear regression function;

 create function dbo.UnivariateLinearRegression (@XandY dbo.XandY READONLY)   
 returns @ReturnThisTable table (  
      correlation_coefficient float  
      , slope float  
      , intercept float   
      , r_squared float   
      , standard_estimate_error float  
 )  
 as  
 begin  
      declare @total_points int   
      , @intercept float   
      , @slope float   
      , @r_squared float   
      , @standard_estimate_error float   
      , @correlation_coefficient float   
      , @average_x float   
      , @average_y float   
      , @sumX float   
      , @sumY float   
      , @sumXX float   
      , @sumYY float   
      , @sumXY float   
      , @Sxx float   
      , @Syy float   
      , @Sxy float   
      , @stdevX float  
      , @stdevY float;  
      Select   
           @total_points = count(*)  
           , @average_x = avg(x)  
           , @average_y = avg(y)  
           , @sumX = sum(x)  
           , @sumY = sum(y)  
           , @sumXX = sum(x*x)  
           , @sumYY = sum(y*y)  
           , @sumXY = sum(x*y)  
           , @stdevX = stdev(x)  
           , @stdevY = stdev(y)  
      from @XandY a  
      ;  
      set @Sxx = @sumXX - (@sumX * @sumX) / @total_points;  
      set @Syy = @sumYY - (@sumY * @sumY) / @total_points;  
      set @Sxy = @sumXY - (@sumX * @sumY) / @total_points;  
      set @correlation_coefficient = @Sxy / sqrt(@Sxx * @Syy);  
      set @slope = (@stdevY / @stdevX) * @correlation_coefficient;  
      set @intercept = @average_y - @slope * @average_x;  
      set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points);  
      select @standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points) from @XandY;  
      insert into @ReturnThisTable (  
           correlation_coefficient  
           , slope  
           , intercept  
           , r_squared  
           , standard_estimate_error  
      )  
      select correlation_coefficient = @correlation_coefficient  
           , slope = @slope  
           , intercept = @intercept  
           , r_squared = @r_squared  
           , standard_estimate_error = @standard_estimate_error  
      ;  
      return;  
 end  

Try it out

1.) Generate some sample data

 if object_id('tempdb..##example','u') is not null   
      begin drop table ##example end;  
 -- Sample Dataset  
 create table ##example (  
      TheDay int  
      , TheValue float  
      , TheUnits float  
 );  
 insert into ##example (TheDay, TheValue, TheUnits)   
 select 1, 10.2, 4 union all select 2, 11.8, 5 union all   
 select 3, 19.2, 8 union all select 4, 10.2, 4 union all   
 select 5, 12.4, 5 union all select 6, 13.2, 6 union all   
 select 7, 15.2, 7 union all select 8, 17.2, 8 union all   
 select 9, 16.2, 7 union all select 10, 25.2, 16 union all   
 select 11, 12.7, 10 union all select 12, 14.2, 11 union all   
 select 13, 15.9, 13 union all select 14, 13.6, 6 union all   
 select 15, 19.2, 9;  

2.) View your sample data, with bonus code to view the 5 day moving average.

 select a.TheDay, a.TheValue, a.TheUnits  
      , c.FiveDayMovingAverageValue  
 from ##example a   
 outer apply (  
      select FiveDayMovingAverageValue = avg(TheValue)   
      from ##example b   
      where b.TheDay between a.TheDay - 4 and a.TheDay   
      having count(TheValue) > 4  
 ) c;  

Which should return this sample data

3.) Now use your new data type and function to generate the correlation coefficient;

 declare @XandY as dbo.XandY;  
 insert into @XandY (x, y) select TheUnits, TheValue from ##example;  
 select * from dbo.UnivariateLinearRegression(@XandY);  

The results of the function returned are;

As you can see, from our manufactured values, we have a strong correlation coefficient of .76
The additional results can help with plotting, or determining the quality of fit with the r_squared, also known as the coefficient of determination.

Validation through R

The values were compared to the results of performing the linear regression analysis on the same data in R.

 library(RODBC)  
 library(data.table)  
 library(ggplot2)  

 localdb <- odbcDriverConnect("Driver={SQL Server};Server=YourServerName;Database=YourDatabaseName;Trusted_Connection=Yes;")  
 data <- data.table(sqlQuery(localdb, paste("select * from ##example")))
  
 testlm <- lm(data = data, TheValue ~ TheUnits)  

 summary(testlm)  
 cor(data$TheValue, data$TheUnits)  
 coef(testlm)  

 ggplot(data = data, aes(y=TheValue, x=TheUnits)) +  
   geom_point() +  
   geom_smooth(method=lm)  

Using the standard lm function, I generated the univariate linear regression model. Next I ran three diagnostic information queries: summary, cor, and coef.


As you can see from R, the summary shows a multiple R-Squared of .5786
The cor function shows the correlation coefficient at .7606
The coef function shows the slope and intercept, also mirroring the results from SQL Server.

Finally the ggplot output is shown below;


In a subsequent post I'll demonstrate the same capability but with the added ability of performing the correlation over a grouping value, which returns the coefficient for each grouped subset.

Till then...

-Parker

The Wretched

I glance down at the steaming cup of coffee, as I sit back into my recliner.
It was set there, as it always is, right on time. The aroma, tells me it's
exactly the way I like it and it's making me sick already.

I know she's there, without even looking. She's always there, somewhere,
right where she should be, at just the right time, doing just the right things.
The thought of her makes me even more sickened. I know, without looking,
the arc of her wretched hunched back will be there, just a couple feet
away in her rocking chair. I try pretending she isn't there, but the clickity-
clack of her knitting has started. It's deafening. This day, like any other,
is making me insane. I can smell her and my coffee. I can smell the detergent
she uses. I can hear her breathing. My head swims.

I can't resist but to glance, one eye squinting in disgust. The beast, and her
wrinkled skin, doesn’t notice me watching. It's 6am. I'm ready to leap out of
my chair and twist her fucking head right off. I imagine her perfectly crafted
hair, white and gray, now soaking in a pool of blood. How’s that for perfect?
If I didn't need this walker to get from my chair to hers, she'd be dead
already.

I'll have to catch her in passing. Maybe lean my walker out just enough to trip
her ass. Her face would flatten to the floor, knocking her perfectly brushed
teeth right in. She’s smacking her lips now. She does that when she's really
into her damned knitting. Every time I hear the sound, another nail drives into
my forehead. 63 years of this shit.

I would have to make it look like an accident though, she knows every damned
neighbor for 10 miles. Hugging and kissing and blessing their damned hearts at
every eager conversation. It sickens me. Must she really stick her nose into
everything? She starts, as usual, rattling off gossip. It's something about Edna.
She was already at her house this morning, dropping off some crap for some
damned recipe. I have no interest whatsoever. As she's waving her hands
around I notice her sagging arms. I'm cringing.

Her breasts are in her lap. She's dressed to the nines, in some flowery printed
crap. She's already been somewhere, done something, talked to someone,
planned some event, learned some ridiculous gossip, is spreading the news
around, and telling me that Susan is coming over in 15 minutes. This’ll have to
happen fast. She's going to want to get out of that squeaking chair soon. She'll
want to start making something in the kitchen, no doubt. Whatever it is, it'll
be wrapped up in a bow, have some unpronounceable foreign ingredients, and
have some fancy name. Who the hell eats that shit.

She's bitching at me now because I didn't take a shower and get dressed yet.
For Christ sake it's 6 god-damned-AM wench! I don't hear the words anymore,
just the tone of her irritatingly subtle, non-stop, nagging voice. I use my foot
and situate the walker a little closer in front of me. I'll just casually start to get
up when she passes, and my foot will accidentally kick the walker on its side.
Yes. I've had enough. I better put my glasses on, I don't want to miss.

I start pretending to watch TV, my eyes straight ahead. My mind though,
working in perfect harmony with my ears. Waiting. Waiting to hear the sound
of her chair as she rises. It's still slowly rocking, creaking, and cracking. I'm
sweating. My foot jerks uncontrollably. I almost blew the whole damned
thing. Get a hold of yourself. Stay calm. Focus. It must be done.

3 excruciating silent minutes later, my leg is so tense, I'm about to burst!
Suddenly, my leg spasms and I kick the walker onto its side! Son of a bitch I
ruined it! I glance over and see Martha is leaning forward in her chair, one arm
stretched outward, holding her cane out in front of herself at a perfect
angle to the floor.

The Contract

My name is Vincent Carrea. I'm 37 years old, single, and not even remotely looking for a relationship. There are far more important things to accomplish. Besides, my life is not ordinary. I don't think anyone else would be quite as accepting of who I am. Who else can you trust in this world, but yourself? Character? I've got character. Many may not agree with the defining attributes of my character, yet there's no denying it, I've got character. I revere a persons' core character as undeniable and inevitably inherent. While you can mask some of these traits, you will never escape them. They are definitive for life. Accept, embrace, and fulfill your destiny. Am I just establishing a method for a guiltless evasion of responsibilities for my actions? I've noticed now, dwelling on and debating these thoughts, the body laying on the floor has started turning new shades of grey. I take a deep breath as I close my eyes and attempt to store an image of her, a silent frame of beauty, into my mind. The slow, short drip of blood at the corner of her slightly open mouth is the only piece of the image that has any color. I will remember her. Thirty eight minutes have passed. I've taken it all in, appreciated and verified every detail. I am done here. With a slight tug on her scarf, I lift it to my nose before folding it and placing it in my pocket. I retrieve the contract out from under my coat, and slide the signed brown parchment under her head. Tilting my hat to her, I walk to the door. As I step out, a cold breeze from the Grady River brushes over my face, and I barely smile. This is me, there's no denying it - and I love it. I arrive at my home, and heat up something to eat. While I'm waiting, I start sketching the burned image from my mind. I sense it before he arrives. The shadow, darkening the hallway, closing nearer my desk. I raise my head from the stock of newspapers around me. He is ready for another soul. I accept the parchment he carries. His gruff tone is calm and commanding, as it fades out, and the shadow recesses to nowhere. I put my hat and coat back on - no rest for the weary or the sinners of the world. He’s back. I sense it. I halt - as I notice the name on the document: Vincent Carrea.