Thursday, January 12, 2012

R generating Data

Copy from http://personality-project.org/r/distributions.html

Simulations of distributions

The central limit theorem is perhaps the most important concept in statistics. For any distribution with finite mean and standard deviation, samples taken from that population will tend towards a normal distribution around the mean of the population as sample size increases. Furthermore, as sample size increases, the variation of the sample means will decrease.

The following examples use the R stats program to show this graphically. The first example uses a uniform (rectangular) distribution. An example of this case is of a single die with the values of 1-6. The second example is of two dice with totals ranging from 2-12. Notice that although one die produces a rectangular distribution, two dice show a distribution peaking at 7. The next set of examples show the distribution of sample means for samples of size 1 .. 32 taken from a rectangular distribution.

This figure was produced using the following R code.

#distributions of a single six sided die #generate a uniform random distribution from min to max  numcases <- 10000                           #how many cases to generate min <- 1                                  #set parameters max <- 6 x <- as.integer(runif(numcases,min,max+1) )        #generate random uniform numcases numbers from min to max                                         #as.integer truncates, round converts to integers, add .5 for equal intervals  par(mfrow=c(2,1))                        #stack two figures above each other hist(x,main=paste( numcases," roles of a single die"),breaks=seq(min-.5,max+.5,1))          #show the histogram boxplot(x, horizontal=TRUE,range=1)              # and the boxplot title("boxplot of a uniform random distribution") #end of first demo 

Distribution of two dice

Distribution of two dice. The sum of two dice is not rectangular, but is peaked at the middle (hint, how many ways can you get a 2, a 3, ... a 7, .. 12.).
The following R code produced this figure.

#generate a uniform random distribution from min to max for numcases samples of size 2 numcases <- 10000                            #how many cases to generate min <- 0                                    #set parameters max <- 6 x <- round(runif(numcases,min,max)+.5)+round(runif(numcases,min,max)+.5) par(mfrow=c(2,1))                        #stack two figures above each other hist(x,breaks=seq(1.5,12.5),main=paste( numcases," roles of a pair of dice"))          #show the histogram                                  boxplot(x, horizontal=TRUE,range=1)      # and the boxplot title("boxplot of samples of size two taken from a uniform random distribution")  #end of second demo  

Samples from a continuous uniform random distribution

We can generalize the case of 1 or two dice to the case of samples of varying size taken from a continuous distribution ranging from 0-1. This next simulation shows the distribution of samples of sizes 1, 2, 4, ... 32 taken from a uniform distribution. Note, for each sample, we are finding the average value of the sample, rather than the sum as we were doing in the case of the dice.
##show distribution of sample means of varying size samples    numcases <- 10000               #how many samples to take?   min <- 0                        #lowest value   max <- 1   ntimes <- 6  op<-  par(mfrow=c(ntimes,1))        #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat n times    {  sample=rep(0,numcases)        #create a vector       k=0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +runif(numcases,min,max)         k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))   #print(out,digit=3)   hist(x, xlim=range(0,1),prob=T ,main=paste( "samples of size",  k ),col="black")   i2 <- 2*i2   }    #end of i loop         #do the same thing, but this time show a boxplot    numcases <- 10000               #how many samples to take?   min <- 0                        #lowest value   max <- 1   ntimes <- 6   par(mfrow=c(ntimes,1))            #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat 5 times    {  sample <- 0 ; k <- 0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +runif(numcases,min,max)         k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))  boxplot(x, horizontal=TRUE, ylim=c(0,1), range=1,notch=T)              # and the boxplot   i2 <- 2*i2}     #Demonstration of the effect of sample size on distributions #each sample is then replicated numcases times    filename <- "distributions2.pdf" pdf(filename,width=6, height=9)   numcases <- 10000               #how many samples to take?   min <- 0                        #lowest value   max <- 1   ntimes <- 3   par(mfrow=c(2*ntimes,1))            #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat 5 times    {  sample <- 0 ; k <- 0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +runif(numcases,min,max)         k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))   print(out,digit=3)   hist(x, xlim=range(0,1),prob=T ,main=paste( "samples of size",  k ))  boxplot(x, horizontal=TRUE, ylim=c(0,1), range=1,notch=T)              # and the boxplotl  titles("Samples of sise ",   i2 <- 8*i2}        dev.off() #€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€   #choose a different type of distribution (binomial) filename <- "distributions3.pdf" pdf(filename,width=6, height=9)     numcases <- 1000               #how big is the population   min <- 0                        #lowest value   max <- 1   ntimes <- 6   par(mfrow=c(ntimes,1))            #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat 5 times    {  sample <- 0 ; k <- 0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +  rbinom(numcases, 1, .5)        k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))   #print(out,digit=3)   hist(x, xlim=range(0,1),prob=T ,main=paste( "samples of size",  k )) # boxplot(x, horizontal=TRUE, ylim=c(0,1), range=1,notch=T)              # and the boxplot   i2 <- 2*i2}       dev.off()     ####### #choose a different type of distribution (binomial) filename <- "distributions4.pdf" pdf(filename,width=6, height=9)     numcases <- 1000               #how big is the population   min <- 0                        #lowest value   max <- 1   ntimes <- 6   samplesize <- 2   prob <- .5   par(mfrow=c(ntimes,1))            #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat 5 times    {  sample <- 0 ; k <- 0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +  rbinom(numcases, samplesize, prob)            #samples of size 2        k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))   #print(out,digit=3)   hist(x, xlim=range(0,samplesize),prob=T ,main=paste(numcases," samples of size",  k," with prob = ", prob)) # boxplot(x, horizontal=TRUE, ylim=c(0,samplesize), range=1,notch=T)              # and the boxplot   i2 <- 2*i2}     dev.off()     ###### #choose a different type of distribution (binomial) filename <- "distributions5.pdf" pdf(filename,width=6, height=9)     numcases <- 1000               #how big is the population   min <- 0                        #lowest value   max <- 1   ntimes <- 6   samplesize <- 10   prob <- .2   par(mfrow=c(ntimes,1))            #stack ntimes graphs on top of each other  i2 <- 1                     #initialize counters     for (i in 1:ntimes)               #repeat 5 times    {  sample <- 0 ; k <- 0                #start off with an empty set of counters     for (j in 1:i2)          #  inner loop      {     sample <- sample +  rbinom(numcases, samplesize, prob)            #samples of size         k <- k+1  }   x <- sample/k   out <- c(k,mean(x),sd(x))   #print(out,digit=3)   hist(x, xlim=range(0,samplesize),prob=T ,main=paste( numcases," samples of size",  k," with prob = ", prob)) # boxplot(x, horizontal=TRUE, ylim=c(0,samplesize), range=1,notch=T)              # and the boxplot   i2 <- 2*i2}     dev.off()        ########  #choose a different type of distribution (binomial)  # Set some parameters   numcases <- 1000               #how big is the population   ntimes <- 6   samplesize <- 1                 #initial sample size   prob <- .5                      #probability of outcome   par(mfrow <- c(ntimes,1))       #stack ntimes graphs on top of each other   i2 <- 1                     #initialize counters       for (i in 1:ntimes)               #repeat ntimes         {                               #begin the loop      x <- rbinom(numcases, samplesize, prob)            #samples of size       x <- x/samplesize               #normalize to the 0-1 range       out <- c(samplesize,mean(x),sd(x))     print(out,digit=3)     hist(x, xlim=range(0,1),prob=T ,main=paste( numcases," samples of size",  samplesize ," with prob = ", prob)) # boxplot(x, horizontal=TRUE, ylim=c(0,samplesize), range=1,notch=T)              # and the boxplot   samplesize <- samplesize*2         #double the sample size   }                               #end of loop      
Some simple measures of central tendency
 #######   x=c(1,2,4,8,16,32,64)    #enter the x data   summary(x)   boxplot(x)         x <- c(1,2,4,8,16,32,64)    #enter the x data   y <- c(10,11,12,13,14,15,16) #enter the y data                     data <- data.frame(x,y)      #make a dataframe   data                      #show the data   summary(data)             #descriptive stats   boxplot(x,y)              #same as boxplot(data)          

Sunday, January 8, 2012

Advanced R - mysql

Install MySQL

http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL

2 down vote accepted

Found solution with help of ran2, who gave me link to common question. The basic process is described here, but there are several hints, So I will describe the whole solution (please change the R version and paths if needed):

  1. Install latest RTools from here
  2. install MySQL or header and library files of mysql
  3. create or edit file C:\Program Files\R\R-2.12.1\etc\Renviron.site and add line like MYSQL_HOME=C:/mysql (c:\PROGRA~1\MySQL\MYSQLS~1.5 for version 5.5)
  1. copy libmysql.lib from mysql/lib to mysql/lib/opt to meet dependencies.
  1. copy libmysql.dll to C:\Program Files\MySQL\MySQL Server 5.5\bin
  1. run install.packages('RMySQL',type='source') and wait while compilation will end.

Thanks to all who tried to answer.

Basic R - Command

Save/Load temorary result

save.image('Temp.RData')
load('Temp.RData')
Envionment


ls() # show the current contents of the directory
rm(a) # remove the object "a"

Summary Data

str()/summary()

R classification - KNN

knn

function (train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)


library(class)
r.knn<-knn(zipTrainData,zipTestData, zipTrainLable)

Basic R - Packages

install.packages("class")
remove.packages("class")

Basic R - Loading Data

Here is the steps to loading data into R.

Get Zip data from http://www-stat.stanford.edu/~tibs/ElemStatLearn/ Data -> Zip

  • getwd() - to get current directory
  • setwd("path") - to set current directory

Load Data
  • read.table
  • read.csv
help(read.table) / ?read.table



setwd("c:\\data")
zipTraining = read.table("zip.train")

summary(zipTraining)
head(a)
dim(zipTraining)
str(zipTraining)



Format data
zipTest<-read.table("zip.test")
zipTraining<-read.table("zip.train")

zipTrainData<-as.matrix(zipTraining[,2:257])
zipTrainLable<-as.factor(zipTraining[,1])

zipTestData<-as.matrix(zipTest[,2:257])
zipTestLable<-as.factor(zipTest[,1])

Learning
zip.lm <- lm(zipTrain~zipLable)

Monday, January 2, 2012

ETW events

http://blogs.msdn.com/b/ntdebugging/archive/2009/08/27/etw-introduction-and-overview.aspx

Improve Debugging And Performance Tuning With ETW