if (!require("hflights")) { install.packages("hflights"); library(hflights) }
if (!require("MASS")) { install.packages("MASS"); library(MASS) }
data(hflights)
# Let's look at the variables ActualElapsedTime and Distance
#x <- hflights$ArrDelay
x <- hflights$ActualElapsedTime
y <- hflights$Distance
df <- data.frame(x,y)
# there are some NAs in elapsed time, let's remove those rows now.
df <- df[complete.cases(df),]
#df <- subset(df, x>0)
# confirm skew for at least one var
par(mfrow=c(1,2))
hist(df$x)
hist(df$y)
We’ll check a number of the requested conditional probabilities involving X and Y given the quartile values x and y. To do this, we’ll subset our df according to each condition, then calculate the probability from the remaining observations of X or Y.
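Each of these conditional probabilities is just a ratio of row counts; a minimal sketch of the general pattern (the cond_prob helper name is only for illustration):
# hypothetical helper: estimate P(A | B) from two logical vectors of equal length
cond_prob <- function(A, B) sum(A & B) / sum(B)
# e.g. P(X > 3rd quartile | Y > median):
# cond_prob(df$x > quantile(df$x, .75), df$y > quantile(df$y, .5))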
quantile(df$x)
0% 25% 50% 75% 100%
34 77 128 165 575
quantile(df$y)
0% 25% 50% 75% 100%
79 393 809 1043 3904
yQ2 <- quantile(df$y, .5) #809
xQ3 <- quantile(df$x, .75) #165
all <- length(df$x)
# Let's check manually, and review whether the events are independent or dependent
# x = 3rd quartile, y = 2nd quartile
# a) P(X > x | Y > y); if independent this would be (.25 * .50) / .50 = .25
gy <- df[df$y > yQ2,]
pgy <- round(nrow(gy) / all, 3)
xgyg <- gy[gy$x > xQ3,]
pgxy <- round(nrow(xgyg) / all, 3)
pxgyg <- round(pgxy / pgy, 3)
pxgyg
[1] 0.501
# is a) consistent with independence? Under independence P(X > x | Y > y) = P(X > x), i.e. about 0.25, well below the 0.501 observed above
gx <- df[df$x > xQ3, ]
pgx <- round(nrow(gx) / all, 3)
(pgx * pgy) / pgy
[1] 0.249
# b) P(X > x, Y > y)
pxy <- round(pgx * pgy, 3)
pxy
[1] 0.123
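Note that 0.123 is the product of the marginal probabilities, i.e. the joint probability we would expect under independence; the empirical joint probability computed above (pgxy) is noticeably larger, an early hint that the two events are not independent:
pgxy  # empirical P(X > x and Y > y), about 0.248 from the counts above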
# c) P(X < x | Y > y)
xlyg <- gy[gy$x <= xQ3, ]
plxgy <- round(nrow(xlyg) / all, 3)
pxlyg <- round(plxgy / pgy, 3)
pxlyg
[1] 0.499
# d) build the full contingency table of the X and Y quartile splits
# we already have the Y > y cells from a) and c); now compute the Y <= y cells
ly <- df[df$y <= yQ2,]
ply <- round(nrow(ly) / all, 3)
xlyl <- ly[ly$x <= xQ3,]
plxgly <- round(nrow(xlyl) / all, 3)
pxlyl <- round(plxgly / ply, 3)
pxlyl
[1] 0.998
xgyl <- ly[ly$x > xQ3,]
pgxgly <- round(nrow(xgyl) / all, 3)
pxgyl <- round(pgxgly / ply, 3)
pxgyl
[1] 0.002
cname <- c("<=2d quartile", ">2d quartile", "Total")
rname <- c("<=3d quartile", ">3d quartile")
c1 <- c(nrow(xlyl), nrow(xgyl))
c2 <- c(nrow(xlyg), nrow(xgyg))
result <- data.frame(c1, c2, row.names = rname)
result[ ,3] <- rowSums(result)
result["Total", ] <- colSums(result)
colnames(result) <- cname
result
<=2d quartile >2d quartile Total
<=3d quartile 112826 55353 168179
>3d quartile 251 55444 55695
Total 113077 110797 223874
Does splitting the data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(AB) = P(A)P(B)? Check mathematically, and then evaluate by running a chi-square test for association.
A <- df[df$x > xQ3, ]
PA <- round(length(A$x)/all, 3)
B <- df[df$y > yQ2, ]
PB <- round(length(B$y)/all, 3)
AgB <- B[B$x > xQ3, ]
PAgB <- round(length(AgB$x)/all, 3)  # dividing by all gives the joint probability P(AB), not P(A|B)
PAgB
[1] 0.248
round(PA * PB, 3)
[1] 0.123
# now let's set up our result df for a chi-squared test
adjresults <- result[1:2,]
adjresults <- adjresults[, 1:2]
chisq.test(adjresults)
Pearson's Chi-squared test with Yates' continuity correction
data: adjresults
X-squared = 74318, df = 1, p-value < 2.2e-16
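The same 2x2 table and test could also be produced more compactly from the logical conditions themselves; a minimal sketch using table(), assuming the df, xQ3, and yQ2 objects defined above:
# cross-tabulate the two splits directly and test for association
tab <- table(X_above_Q3 = df$x > xQ3, Y_above_Q2 = df$y > yQ2)
chisq.test(tab)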
# double check with the full df
chisq.test(df)
Pearson's Chi-squared test
data: df
X-squared = 967288, df = 223873, p-value < 2.2e-16
Since the p-value is so small, we reject the null hypothesis of independence. All signs suggest that when these two variables are split along these quartiles and compared, they are dependent, which matches the mathematical check above (P(AB) of about 0.248 versus P(A)P(B) of about 0.123). The double check on the full df points the same way, though it should be read with caution: chisq.test() there treats the raw data values as a table of counts rather than a proper contingency table.
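One way to make the dependence concrete is to compare the observed cell counts with the counts expected under independence, which chisq.test() returns in its expected component; a minimal sketch:
# expected counts if the X and Y splits were independent, for comparison with the observed table above
round(chisq.test(adjresults)$expected)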
#Provide univariate descriptive statistics and appropriate plots.
summary(df)
x y
Min. : 34.0 Min. : 79.0
1st Qu.: 77.0 1st Qu.: 393.0
Median :128.0 Median : 809.0
Mean :129.3 Mean : 789.7
3rd Qu.:165.0 3rd Qu.:1043.0
Max. :575.0 Max. :3904.0
plot(density(df$x))
plot(density(df$y))
#Provide a scatterplot of the two variables.
plot(df$x, df$y)
#Provide a 95% CI for the difference in the mean of the variables.
t.test(df$y, df$x)
Welch Two Sample t-test
data: df$y and df$x
t = 682.5907, df = 231508.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
658.5146 662.3071
sample estimates:
mean of x mean of y
789.7346 129.3237
#Derive a correlation matrix for two of the quantitative variables you selected.
CorMatrix <- cor(df)
CorMatrix
x y
x 1.0000000 0.9743551
y 0.9743551 1.0000000
#Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
cor.test(df$x, df$y, conf.level=0.99)
Pearson's product-moment correlation
data: df$x and df$y
t = 2048.821, df = 223872, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
0.9740780 0.9746293
sample estimates:
cor
0.9743551
Our correlation test shows that these two variables are very highly correlated. My understanding of modeling is still in its infancy, but if we were building a model that used both variables as predictors, the near-perfect collinearity would make that difficult. We would likely need to drop one of them or combine them into a single statistic.
#Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
InvMatrix <- solve(CorMatrix)
InvMatrix
x y
x 19.75030 -19.24381
y -19.24381 19.75030
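For context, with only two variables the diagonal entries of the precision matrix reduce to 1/(1 - r^2), the familiar variance inflation factor; a quick check against the value above:
# VIF for a pair of variables: 1 / (1 - r^2), which should match the ~19.75 on the diagonal
r <- CorMatrix["x", "y"]
1 / (1 - r^2)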
#Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
CorMatrix %*% InvMatrix
x y
x 1 0
y 0 1
InvMatrix %*% CorMatrix
x y
x 1 0
y 0 1
#As an advanced option, conduct principal components analysis and interpret. Discuss.
PCA <- prcomp(df, center=TRUE, scale.=TRUE)
PCA
Standard deviations:
[1] 1.4051175 0.1601403
Rotation:
PC1 PC2
x 0.7071068 -0.7071068
y 0.7071068 0.7071068
plot(PCA$x[,1], PCA$x[,2])
Since there are only two variables in our dataset, we can see the relationship between them fairly well through many of the previously used techniques. While it’s interesting to see a PCA’s proportional and cumulative variance figures, reducing dimensionality to better understand relationships may be unnecessary here. The resulting scatter plot looks strikingly similar to our simple x, y plot.
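For reference, the proportion of variance and cumulative proportion figures come from summary(); a minimal sketch:
# proportion of variance and cumulative proportion explained by each component
summary(PCA)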
#For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function.
# min(df$x) is 34, already above zero, so no shift is needed before fitting
distr <- fitdistr(df$x, densfun="exponential")
#Find the optimal value of l for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, l)).
l <- distr$estimate
xsamp <- rexp(1000,l)
#Plot a histogram and compare it with a histogram of your original variable.
par(mfrow=c(1,2))
hist(xsamp)
hist(df$x)
#Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
qexp(0.95, l); qexp(0.05, l)
[1] 387.4193
[1] 6.633441
qnorm(0.95, mean(df$x), sd(df$x)); qnorm(0.5, mean(df$x), sd(df$x))  # note: the second call uses 0.5 and so just returns the mean; 0.05 was presumably intended for the lower tail
[1] 226.8403
[1] 129.3237
#Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(df$x, .95); quantile(df$x, .05)
95%
231
5%
51
These exercises show what happens when we approximate our skewed x variable with a fitted exponential distribution and how the resulting percentile estimates compare with those from the original data. There are similarities in each, but only the right tail remains reasonably consistent across the comparisons. Since our ‘X’ here is total elapsed time, and very few flights were in the air for less than 60 minutes, the .05 estimates look quite different from one another.
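As a closing reference, the three 5th/95th percentile estimates can be lined up directly; a minimal sketch using the objects defined above:
# exponential-fit quantiles vs normal-approximation quantiles vs empirical quantiles
rbind(exponential = qexp(c(.05, .95), l),
      normal      = qnorm(c(.05, .95), mean(df$x), sd(df$x)),
      empirical   = quantile(df$x, c(.05, .95)))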