if (!require("hflights")) { install.packages("hflights"); library(hflights) }
if (!require("MASS")) { install.packages("MASS"); library(MASS) }
data(hflights)
# Let's look at the variables ActualElapsedTime and Distance
#x <- hflights$ArrDelay
x <- hflights$ActualElapsedTime
y <- hflights$Distance
df <- data.frame(x,y)
# there are some NAs in elapsed time, let's remove those rows now.
df <- df[complete.cases(df),]
#df <- subset(df, x>0)
# confirm skew for at least one var
par(mfrow=c(1,2))
hist(df$x)
hist(df$y)
We’ll check a number of the requested conditional probabilities involving X and Y given the quartile values x and y. To do this, we’ll subset our df according to each condition, then calculate the probability from the remaining observations of X or Y.
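Each of these conditional probabilities is just a ratio of row counts; a minimal sketch of the general pattern (the cond_prob helper name is only for illustration):
# hypothetical helper: estimate P(A | B) from two logical vectors of equal length
cond_prob <- function(A, B) sum(A & B) / sum(B)
# e.g. P(X > 3rd quartile | Y > median):
# cond_prob(df$x > quantile(df$x, .75), df$y > quantile(df$y, .5))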
quantile(df$x)
0% 25% 50% 75% 100%
34 77 128 165 575
quantile(df$y)
0% 25% 50% 75% 100%
79 393 809 1043 3904
yQ2 <- quantile(df$y, .5) #809
xQ3 <- quantile(df$x, .75) #165
all <- length(df$x)
# Let's check manually, and review whether the events are independent or dependent
# x = 3rd quartile, y = 2nd quartile
# a) P(X > x | Y > y); if independent this would be (.25 * .50) / .50 = .25
gy <- df[df$y > yQ2,]
pgy <- round(nrow(gy) / all, 3)
xgyg <- gy[gy$x > xQ3,]
pgxy <- round(nrow(xgyg) / all, 3)
pxgyg <- round(pgxy / pgy, 3)
pxgyg
[1] 0.501
# is a) consistent with independence? Under independence P(X > x | Y > y) = P(X > x), i.e. about 0.25, well below the 0.501 observed above
gx <- df[df$x > xQ3, ]
pgx <- round(nrow(gx) / all, 3)
(pgx * pgy) / pgy
[1] 0.249
# b) P(X > x, Y > y)
pxy <- round(pgx * pgy, 3)
pxy
[1] 0.123
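Note that 0.123 is the product of the marginal probabilities, i.e. the joint probability we would expect under independence; the empirical joint probability computed above (pgxy) is noticeably larger, an early hint that the two events are not independent:
pgxy  # empirical P(X > x and Y > y), about 0.248 from the counts above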
# c) P(X < x | Y > y)
xlyg <- gy[gy$x <= xQ3, ]
plxgy <- round(nrow(xlyg) / all, 3)
pxlyg <- round(plxgy / pgy, 3)
pxlyg
[1] 0.499
# d) build the full contingency table of the X and Y quartile splits
# we already have the Y > y cells from a) and c); now compute the Y <= y cells
ly <- df[df$y <= yQ2,]
ply <- round(nrow(ly) / all, 3)
xlyl <- ly[ly$x <= xQ3,]
plxgly <- round(nrow(xlyl) / all, 3)
pxlyl <- round(plxgly / ply, 3)
pxlyl
[1] 0.998
xgyl <- ly[ly$x > xQ3,]
pgxgly <- round(nrow(xgyl) / all, 3)
pxgyl <- round(pgxgly / ply, 3)
pxgyl
[1] 0.002
cname <- c("<=2d quartile", ">2d quartile", "Total")
rname <- c("<=3d quartile", ">3d quartile")
c1 <- c(nrow(xlyl), nrow(xgyl))
c2 <- c(nrow(xlyg), nrow(xgyg))
result <- data.frame(c1, c2, row.names = rname)
result[ ,3] <- rowSums(result)
result["Total", ] <- colSums(result)
colnames(result) <- cname
result
<=2d quartile >2d quartile Total
<=3d quartile 112826 55353 168179
>3d quartile 251 55444 55695
Total 113077 110797 223874
Does splitting the data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(AB) = P(A)P(B)? Check mathematically, and then evaluate by running a chi-square test for association.
A <- df[df$x > xQ3, ]
PA <- round(length(A$x)/all, 3)
B <- df[df$y > yQ2, ]
PB <- round(length(B$y)/all, 3)
AgB <- B[B$x > xQ3, ]
PAgB <- round(length(AgB$x)/all, 3)  # dividing by all gives the joint probability P(AB), not P(A|B)
PAgB
[1] 0.248
round(PA * PB, 3)
[1] 0.123
# now let's set up our result df for a chi-squared test
adjresults <- result[1:2,]
adjresults <- adjresults[, 1:2]
chisq.test(adjresults)
Pearson's Chi-squared test with Yates' continuity correction
data: adjresults
X-squared = 74318, df = 1, p-value < 2.2e-16
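The same 2x2 table and test could also be produced more compactly from the logical conditions themselves; a minimal sketch using table(), assuming the df, xQ3, and yQ2 objects defined above:
# cross-tabulate the two splits directly and test for association
tab <- table(X_above_Q3 = df$x > xQ3, Y_above_Q2 = df$y > yQ2)
chisq.test(tab)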
# double check with the full df
chisq.test(df)
Pearson's Chi-squared test
data: df
X-squared = 967288, df = 223873, p-value < 2.2e-16
Since the p-value is so small, we reject the null hypothesis of independence. All signs suggest that when these two variables are split along these quartiles and compared, they are dependent, which matches the mathematical check above (P(AB) of about 0.248 versus P(A)P(B) of about 0.123). The double check on the full df points the same way, though it should be read with caution: chisq.test() there treats the raw data values as a table of counts rather than a proper contingency table.
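One way to make the dependence concrete is to compare the observed cell counts with the counts expected under independence, which chisq.test() returns in its expected component; a minimal sketch:
# expected counts if the X and Y splits were independent, for comparison with the observed table above
round(chisq.test(adjresults)$expected)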
#Provide univariate descriptive statistics and appropriate plots.
summary(df)
x y
Min. : 34.0 Min. : 79.0
1st Qu.: 77.0 1st Qu.: 393.0
Median :128.0 Median : 809.0
Mean :129.3 Mean : 789.7
3rd Qu.:165.0 3rd Qu.:1043.0
Max. :575.0 Max. :3904.0
plot(density(df$x))
plot(density(df$y))
#Provide a scatterplot of the two variables.
plot(df$x, df$y)
#Provide a 95% CI for the difference in the mean of the variables.
t.test(df$y, df$x)
Welch Two Sample t-test
data: df$y and df$x
t = 682.5907, df = 231508.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
658.5146 662.3071
sample estimates:
mean of x mean of y
789.7346 129.3237
#Derive a correlation matrix for two of the quantitative variables you selected.
CorMatrix <- cor(df)
CorMatrix
x y
x 1.0000000 0.9743551
y 0.9743551 1.0000000
#Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
cor.test(df$x, df$y, conf.level=0.99)
Pearson's product-moment correlation
data: df$x and df$y
t = 2048.821, df = 223872, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
0.9740780 0.9746293
sample estimates:
cor
0.9743551
Our correlation test shows that these two variables are very highly correlated. My understanding of modeling is still in its infancy, but if we were building a model that used both variables as predictors, the near-perfect collinearity would make that difficult. We would likely need to drop one of them or combine them into a single statistic.
#Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
InvMatrix <- solve(CorMatrix)
InvMatrix
x y
x 19.75030 -19.24381
y -19.24381 19.75030
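For context, with only two variables the diagonal entries of the precision matrix reduce to 1/(1 - r^2), the familiar variance inflation factor; a quick check against the value above:
# VIF for a pair of variables: 1 / (1 - r^2), which should match the ~19.75 on the diagonal
r <- CorMatrix["x", "y"]
1 / (1 - r^2)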
#Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
CorMatrix %*% InvMatrix
x y
x 1 0
y 0 1
InvMatrix %*% CorMatrix
x y
x 1 0
y 0 1
#As an advanced option, conduct principal components analysis and interpret. Discuss.
PCA <- prcomp(df, center=TRUE, scale.=TRUE)
PCA
Standard deviations:
[1] 1.4051175 0.1601403
Rotation:
PC1 PC2
x 0.7071068 -0.7071068
y 0.7071068 0.7071068
plot(PCA$x[,1], PCA$x[,2])
Since there are only two variables in our dataset, we can see the relationship between them fairly well through many of the previously used techniques. While it’s interesting to see a PCA’s proportional and cumulative variance figures, reducing dimensionality to better understand relationships may be unnecessary here. The resulting scatter plot looks strikingly similar to our simple x, y plot.
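For reference, the proportion of variance and cumulative proportion figures come from summary(); a minimal sketch:
# proportion of variance and cumulative proportion explained by each component
summary(PCA)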
#For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function.
# min(df$x) is 34, already above zero, so no shift is needed before fitting
distr <- fitdistr(df$x, densfun="exponential")
#Find the optimal value of l for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, l)).
l <- distr$estimate
xsamp <- rexp(1000,l)
#Plot a histogram and compare it with a histogram of your original variable.
par(mfrow=c(1,2))
hist(xsamp)
hist(df$x)
#Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
qexp(0.95, l); qexp(0.05, l)
[1] 387.4193
[1] 6.633441
qnorm(0.95, mean(df$x), sd(df$x)); qnorm(0.5, mean(df$x), sd(df$x))  # note: the second call uses 0.5 and so just returns the mean; 0.05 was presumably intended for the lower tail
[1] 226.8403
[1] 129.3237
#Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(df$x, .95); quantile(df$x, .05)
95%
231
5%
51
These exercises show what happens when we approximate our skewed x variable with a fitted exponential distribution and how the resulting percentile estimates compare with those from the original data. There are similarities in each, but only the right tail remains reasonably consistent across the comparisons. Since our ‘X’ here is total elapsed time, and very few flights were in the air for less than 60 minutes, the .05 estimates look quite different from one another.
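As a closing reference, the three 5th/95th percentile estimates can be lined up directly; a minimal sketch using the objects defined above:
# exponential-fit quantiles vs normal-approximation quantiles vs empirical quantiles
rbind(exponential = qexp(c(.05, .95), l),
      normal      = qnorm(c(.05, .95), mean(df$x), sd(df$x)),
      empirical   = quantile(df$x, c(.05, .95)))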