The need occasionally arises to produce a set of random observations on a pair of variables which have a precise degree of correlation between them. Most commonly the need for such data arises in the context of simulations to determine the effect on some outcome (e.g., value of a stock portfolio) of varying degrees of correlation among input or predictor variables. Other uses include creating graphic illustrations of specified degrees of bivariate correlation, examining the effect of predictor collinearity on regression weights and explained variance, and simulating the effects of correlated predictors on the outcomes of employment selection decisions.
Several efforts have been made to offer a quick and easy tool to produce the needed sets of observations, but the tools that have been offered to date all contain a common serious flaw: they fail to account for the random level of correlation that inevitably occurs between the initial observations randomly generated for a pair of variables. Consequently, any procedure subsequently applied to impose the desired correlation will produce a correlation that deviates from the desired level by the amount of the originally occurring chance correlation. One fairly recent article even went so far as to assert that applying the Box-Muller normality transformation to the variables eliminates any chance correlation between them, which is absurd.
The Correlated Random Variable Generator Excel macro offered on the "Stats Tools" tab of this website (see also the link below) is the only standalone tool currently available to produce normally distributed random bivariate data sets of any number of observations which have the precise (to 3 decimal places) degree of correlation between them that the user specifies. In addition, the user may specify the means and standard deviations of the two variables, and may limit the values of each variable to fall within specified upper and lower boundaries.
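For readers who want to see the underlying logic, here is a minimal Python sketch (not the macro's internal code) of one standard way to hit a target correlation exactly: residualize the second random variate against the first, which strips out the very chance correlation that the flawed tools ignore, and then recombine. The sketch omits the macro's optional upper and lower boundary limits.

import numpy as np

def correlated_pair(n, rho, mean_x=0.0, sd_x=1.0, mean_y=0.0, sd_y=1.0, seed=None):
    """Generate two normal samples whose sample correlation equals rho exactly."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)

    # Strip out the chance correlation between the initial draws by
    # residualizing z on x (the step the flawed tools omit).
    xc = x - x.mean()
    b = (xc * (z - z.mean())).sum() / (xc ** 2).sum()
    e = z - z.mean() - b * xc

    # Standardize and recombine so the sample correlation is exactly rho,
    # then rescale to the requested means and standard deviations.
    xs = xc / xc.std(ddof=1)
    es = e / e.std(ddof=1)
    ys = rho * xs + np.sqrt(1.0 - rho ** 2) * es
    return mean_x + sd_x * xs, mean_y + sd_y * ys

x, y = correlated_pair(200, 0.600, mean_x=50, sd_x=10, mean_y=100, sd_y=15, seed=1)
print(round(float(np.corrcoef(x, y)[0, 1]), 3))   # prints 0.6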
I should also mention that many researchers encounter the need for random data to be generated for 3 or more variables with a specified multivariate correlational structure, either with or without sampling error. I have developed a standalone compiled program for producing such data sets called Monte Carlo/PC. This program sells for $150 and may be purchased from me. Requests for information about purchasing it may be sent to inquiry@prostatservices.com or made by calling Professional Statistical Services at the number on the Contact tab of www.prostatservices.com.
The Correlated Random Variable Generator can be downloaded by clicking this link: Correlated Random Variable Generator (Excel macro)
The Breslow-Day and Tarone tests are the standard methods for testing the homogeneity of odds ratios. The Tarone test is actually a modification of the Breslow-Day test which produces slightly more accurate p-values, although the difference usually does not become evident until the 3rd decimal place. These tests assess whether the odds ratios for the 2-way contingency tables of two or more groups differ significantly from the common odds ratio. Although a well-known use of these tests is to evaluate whether the Mantel-Haenszel test's assumption of such homogeneity is satisfied, the applicability of these tests extends to the much broader issue of whether the relationship between two binary variables is homogeneous across groups.
Up to now these tests have been inaccessible to many who sought to use them, because they have been available only in several very expensive or very technically challenging statistical software systems (i.e., SAS, SPSS, Stata, and R). I have developed an Excel macro worksheet for calculating these tests and made it available for download under the Stats Tools tab of the Prostatservices.com website, free of charge to anyone. It is currently set up with a limit of 20 strata. Any users who need to test the odds ratios of more strata than this may contact me (Jeffrey Kane) through this website to arrange the necessary modifications.
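As an aside for those who do work in Python, the free statsmodels library implements both tests, which provides a convenient way to cross-check the macro's results. A minimal sketch, with made-up counts for three strata:

import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2x2 table per stratum (rows = group, columns = outcome);
# the counts are made up for illustration.
tables = [np.array([[20, 14], [10, 24]]),
          np.array([[15, 12], [11, 25]]),
          np.array([[ 8, 15], [ 5, 19]])]

st = StratifiedTable(tables)
bd = st.test_equal_odds(adjust=False)   # Breslow-Day
ta = st.test_equal_odds(adjust=True)    # Tarone's adjustment
print(f"Breslow-Day: chi2 = {bd.statistic:.3f}, p = {bd.pvalue:.4f}")
print(f"Tarone:      chi2 = {ta.statistic:.3f}, p = {ta.pvalue:.4f}")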
Suppose you have computed 2 or 3 linear regression equations and you wish to view the plots of the regression lines. This is often useful for making a quick determination of whether the lines intersect within the range of Y values of interest. It can be done, with some work, in some of the more expensive or technically challenging statistical systems (e.g., SPSS, Stata, R), but for people without the funds or skills to access these capabilities through such high-level systems, there has been no readily accessible, easy-to-use tool for producing such graphs. This macro responds to that need. Just fill in the information requested and your graph is generated. The macro also computes the line intersections and reports whether they occur within the range of Y encompassed by the graph.
Note that this tool works with multiple as well as simple linear regression models. To use it with multiple regression models, you will have to compute the predicted values for your models, and then compute the simple regression of the observed Y values on the predicted Y values for each model. Use the slopes, intercepts, and minimums and maximums from these models to generate the graphs (the predicted Y variable will be the X variable in these analyses).
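For those who happen to work in Python, the same job can be sketched in a few lines with matplotlib. The intercepts and slopes below are made up for illustration; each pair of lines is solved for its intersection by setting a1 + b1·x = a2 + b2·x:

import itertools
import numpy as np
import matplotlib.pyplot as plt

# (intercept, slope) for each fitted line -- values made up for illustration
lines = {"Model 1": (2.0, 0.50), "Model 2": (4.5, 0.15), "Model 3": (1.0, 0.80)}
x = np.linspace(0, 10, 100)                # X range of interest

fig, ax = plt.subplots()
for name, (a, b) in lines.items():
    ax.plot(x, a + b * x, label=name)
ax.legend()

y_lo, y_hi = ax.get_ylim()                 # Y range encompassed by the graph
for (n1, (a1, b1)), (n2, (a2, b2)) in itertools.combinations(lines.items(), 2):
    if b1 == b2:
        continue                           # parallel lines never intersect
    xi = (a2 - a1) / (b1 - b2)             # solve a1 + b1*x = a2 + b2*x
    yi = a1 + b1 * xi
    where = "inside" if y_lo <= yi <= y_hi else "outside"
    print(f"{n1} x {n2}: intersection at ({xi:.2f}, {yi:.2f}), {where} the Y range")

plt.show()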
Analysis of covariance (ANCOVA) is intended to provide a way to exclude the influence of an extraneous variable from the comparison of means on a response measure between groups defined by the levels of an independent variable (e.g., treatment vs. control, different educational levels, different occupations). The method was first presented in print by Day and Fisher (1938). It has survived to the present, along with its regression-based incremental F-test version, as the only method available for excluding the influence of extraneous covariates from group means comparisons. It is widely used, especially in the analysis of data from classic experimental designs consisting of pre- and post-treatment assessments of treatment and control groups. In this situation, the pre-test is usually correlated with the post-test, and it is entirely appropriate to seek to exclude the post-test variance that is due to the pre-test.
Unfortunately, ANCOVA can achieve its intended purpose only in the very rare case where the slopes of the regressions of the dependent variable on the covariate are identical in all of the groups being compared. This restriction on its applicability is conventionally stretched to allow slope differences up to the point where they become statistically significant. In fact, even such nonsignificant slope differences will result in the retention of extraneous covariate variance within each group that will reduce the power of the test of group differences by enlarging the error term. In a large proportion of studies (25-50% in my experience), especially those involving 50 or fewer subjects per group, the assumption of homogeneity of within-group regression slopes is violated to a significant degree, rendering ANCOVA formally inapplicable. Until now, the conclusion that data fail to satisfy this "homogeneity of regression slopes" assumption has left analysts with no alternative methodology for excluding covariate variance from the comparison of group means.
I have recently submitted an article for review in a major journal that proposes a new method which can be used in place of ANCOVA to exclude variance due to one or more extraneous covariates from the dependent variable on which group means are to be compared. The proposed method is called Analysis of Covariate Residuals, or ANCOVRES. The applicability of this method is unaffected by any degree of difference between the slopes of the within-group regressions of the dependent variable on the covariate (i.e., lack of homogeneity of regression). It achieves complete exclusion of covariate influence on the dependent variable within each group being compared, thereby maximizing the power of the comparison for the given sample. The adjusted data is quite easily computed, especially in the case of one covariate, and is subsequently analyzed with an ordinary ANOVA or t-test.
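To give a rough sense of the computation in the one-covariate case, here is a simplified Python sketch of the general idea of analyzing covariate residuals (an illustration only, not the full method as submitted for review):

import numpy as np
from scipy import stats

def ancovres_adjust(y, x, groups):
    """Illustrative sketch only: within each group, replace each Y score
    with its residual from the within-group regression of Y on the
    covariate X, then restore the group mean so group means remain
    comparable while within-group covariate variance is removed."""
    y, x, groups = map(np.asarray, (y, x, groups))
    y_adj = np.empty_like(y, dtype=float)
    for g in np.unique(groups):
        m = groups == g
        slope, intercept, *_ = stats.linregress(x[m], y[m])
        resid = y[m] - (intercept + slope * x[m])   # covariate variance removed
        y_adj[m] = resid + y[m].mean()              # group mean restored
    return y_adj

# The adjusted scores are then analyzed with an ordinary ANOVA or t-test.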
I recently had a client who had conducted a study of employee turnover using two different measures of intention to quit. We regressed these measures separately on measures of extrinsic and intrinsic satisfaction. We then sought to address the question of whether there was a difference in the degree to which the two measures of intention to quit could be predicted from the two satisfaction measures. This problem boils down to testing for the difference between the two R²s (or Rs) resulting from the separate regressions.
This is clearly not a case of testing for the difference between two correlations from independent samples. Both of the regressions were computed on the same sample. However, the two correlations were non-overlapping, which means that no variables were common to both correlations. This is apparent in the case of the separate intention to quit measures. However, it is less apparent but equally true that the composite score based on the regression weights derived for the prediction of each intention to quit measure constitutes a completely separate variable from the composite score based on the regression weights derived for the prediction of the other intention to quit measure. Thus, we have a case of correlations between dependent (because they are computed for the same sample) but non-overlapping variables.
The proper procedure for testing the difference between correlations fitting the above description was not fully resolved until the publication of a paper in Psychological Methods by Raghunathan, Rosenthal, and Rubin (1996). That paper describes a modification of the Pearson-Filon (PF) method that substitutes Fisher z transformations for the correlations appearing in the original PF formula. The resulting formula for this revised test is as follows:
ZPF = (zr12 − zr34) × √[(N − 3) / (2 − 2k̄)]

where:
zr12 = Fisher z transformation of r12
zr34 = Fisher z transformation of r34
k̄ = k / [2(1 − r12²)(1 − r34²)]
k = (r13 − r23r12)(r24 − r23r34) + (r14 − r13r34)(r23 − r13r12) + (r13 − r14r34)(r24 − r14r12) + (r14 − r12r24)(r23 − r24r34)
N = the number of cases on which all of the correlations were computed
The value of ZPF is referenced to the Z distribution to obtain its p-value.
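For those who would rather script the computation, here is a direct Python transcription of the formula above (worth verifying against the Weaver and Wuensch syntax mentioned below before relying on it):

import numpy as np
from scipy import stats

def zpf(r12, r34, r13, r14, r23, r24, n):
    """Modified Pearson-Filon (ZPF) test for the difference between the
    dependent, non-overlapping correlations r12 and r34 (Raghunathan,
    Rosenthal, & Rubin, 1996). Returns the Z statistic and two-tailed p."""
    z12, z34 = np.arctanh(r12), np.arctanh(r34)          # Fisher z transforms
    k = ((r13 - r23 * r12) * (r24 - r23 * r34)
         + (r14 - r13 * r34) * (r23 - r13 * r12)
         + (r13 - r14 * r34) * (r24 - r14 * r12)
         + (r14 - r12 * r24) * (r23 - r24 * r34))
    kbar = k / (2 * (1 - r12 ** 2) * (1 - r34 ** 2))
    z = (z12 - z34) * np.sqrt((n - 3) / (2 - 2 * kbar))
    return z, 2 * stats.norm.sf(abs(z))

For example, zpf(.55, .35, .40, .30, .25, .45, 100) returns the Z value and its two-tailed p-value for a sample of 100 cases (the input correlations here are made up for illustration).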
The ZPF formula is clearly not one that the typical researcher, or even the typical statistician, wants to hammer out on a calculator each time the need for it arises. Weaver and Wuensch (2013) published an excellent article that explains the intricacies of comparing correlations and regression coefficients, and provides SPSS and SAS syntax programs for conducting the necessary calculations (the full manuscript of this article is available here: http://core.ecu.edu/psyc/wuenschk/W&W/Weaver&Wuensch_2013.pdf).
However, not everyone has access to either SPSS or SAS, and it's often much easier to perform calculations on summary data (e.g., already-computed correlations and regression coefficients) in Excel than to get into the syntax runs of SPSS or SAS. As far as I can determine, there has not been a set of Excel calculator worksheets that correctly computes all the tests of the differences between correlations and regression coefficients within each of the categories of independent and dependent, overlapping and non-overlapping, and their combinations. In response to the absence of such an Excel-based set of calculators, I offer the Correlation and Slope Comparator. It is available for download at the Stats Tools tab of this website, and is described below.
The Correlation and Slope Comparator
This set of tools is provided in the form of a collection of worksheet calculators within an Excel workbook. I cannot take full credit for this tool. In the course of searching for an Excel-based calculator for the test of the difference between dependent non-overlapping correlations, I ran across a worksheet that contained a version of the test I was seeking along with tests of two other types of correlation pairs and the correction for attenuation. I updated the test of the difference between dependent non-overlapping correlations to reflect the latest version of the formula from the Raghunathan, et al. (1996) article. I also added the tests for the difference of a correlation from a hypothesized value and for the difference between slopes. Finally, I cleaned up the formatting and the appearance of the worksheets prior to data entry.
I have been unable to relocate the source from which I downloaded the spreadsheet that became the basis for this set of tools, or to identify the originator of the spreadsheet I adapted. However, if he or she stumbles upon this site and recognizes his or her work as the basis for this tool, I wish to acknowledge this person's good work and contributions to this further evolution of the original set of calculators.
The Correlation and Slope Comparator contains the following calculation tabs:
Dependent overlapping correlations: Tests for the significance of the difference between two correlations in the situation where the two correlations share a common variable (e.g., r1,2 and r1,3) and both correlations were computed on the same cases.
Dependent non-overlapping correlations: Tests for the significance of the difference between two correlations in the situation where there is no variable in common between the two correlations (e.g., r1,2 and r3,4), and both correlations were computed on the same cases.
Independent samples correlations: Tests for the significance of the difference between two correlations in the situation where each correlation was computed on a different sample of cases. [Note: The example invariably used in this case is the correlation between the same two variables in different samples (i.e., complete overlap). There potentially are hidden and as yet unexplored complications for comparisons involving 50% and zero overlap between the variables correlated in separate samples.]
Difference from hypothesized correlation: Tests for the significance of the difference between an observed correlation and the hypothesized value of the correlation. The hypothesized value may be zero or any other value between -1.0 and +1.0.
Difference between slopes: Tests for the significance of the difference between two raw-score slopes (also commonly referred to as b weights or raw-score regression coefficients) from a regression equation. The slopes may reference the same x and y variables in the same or different samples, or different x variables for regression equations computed for different samples. It's difficult to imagine a need to compare slopes for regressions of different y variables in either the same or different samples, but there are no indications in the literature that the computation of the pooled standard error of the difference in such slopes would be any different than in the more conventional situations.
Disattenuation of correlation: This computes three different ways of correcting a correlation for unreliability in the variables being correlated: correcting only for unreliability in the x variable, correcting only for unreliability in the y variable, and correcting for unreliability in both of the variables.
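To give a flavor of the computations involved, here is a minimal Python equivalent of the disattenuation tab's classic correction (a sketch of the standard formula, not the worksheet's code): divide the observed correlation by the square root of the product of the reliabilities being corrected for.

import numpy as np

def disattenuate(r_xy, r_xx=None, r_yy=None):
    """Correct a correlation for unreliability. Pass r_xx to correct for
    unreliability in x only, r_yy for y only, or both to correct for both."""
    denom = np.sqrt((r_xx if r_xx is not None else 1.0)
                    * (r_yy if r_yy is not None else 1.0))
    return r_xy / denom

print(round(disattenuate(0.40, r_xx=0.80, r_yy=0.70), 3))   # corrects both: 0.535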
The Correlation and Slope Comparator can be downloaded by clicking this link:
Today I was scanning the results of a Google search on something statistical and happened to click on the site of one of my competitors, another statistical consultant. I landed on one of his website pages promoting his expertise with statistical software, and was annoyed to read his characterization of SPSS as just a “simple statistical package”, compared to “world class” packages like “SAS or R” that offer “high powered statistical analysis”. His implication was that SPSS is little more than a toy. Stata and Minitab are apparently unmentionable in this person’s view.
I encounter this view from time to time, usually indirectly through students whose professors are compelling them to use R or SAS. This propensity to derogate statistical systems that have been made more accessible to a wide range of potential users, and which are designed to expedite the conduct of lengthy analyses, seems to stem from at least three different motives. One source of such behavior is a weak ego. It is unfortunately not rare to find people who can feel good about themselves only by putting other people down. Since far fewer people make the effort to learn R, or have the resources to acquire access to SAS, a noticeable proportion of the people attracted to these software systems are motivated by the desire to gain some bragging rights over others.
Another source of this “put down” behavior is crass commercialism. If you can convince the public that your relatively rare skill can solve their problems better than a skill that is held more widely or that even could be acquired by the end-user him-/herself, you can charge more for your skill.
Least reprehensible, but no less damaging, is the belief that these harder to acquire software packages (where the barrier to acquisition is either the learning curve or the cost of the package) actually do everything better than the more widely accessible packages.
We can dismiss those driven by the ego-inflation motive as just your garden variety jerk who will always be with us. In my 23 years as an academic I encountered more of them ensconced in university faculties than I care to even think about. More importantly, my purpose here is not to try to change the views and behavior of any of these people who seek to put down more accessible statistical packages. Instead, my purpose is to give the rest of you an accurate understanding of the relative merits of the most widely used statistical packages.
The fact of the matter is that no statistical package is “world class” with regard to all of the criteria by which such packages can be judged, and practically all of the packages are “world class” in some respects. Let’s consider what these criteria are in relation to widely used, all-purpose statistical software packages. Here is my list (feel free to write in to add more):
Ease of use
Learning curve
Depth of menu-driven procedures
Range, quality, and ease of use of statistical procedures offered
Modifiability of analytical output specifications
Ease of transforming table output to formatting conventions (e.g., APA)
Range of graphical output offered
Speed of handling large data sets
Ease and flexibility of data importation
Ease of results exportation
Thoroughness and interpretability of results output
Ease and flexibility of data set manipulation
Pricing for individuals
Thoroughness and informativeness of documentation
I have written a review of the top 5 statistical software systems (i.e., SPSS, SAS, R, Minitab, and Stata) that evaluates these systems against each of the above criteria. It is available as an article in the Articles section of this website (A Review of the Top Five Statistical Software Systems).
I'm sure many of you have searched (e.g., on Google) for information on a research topic or method and succeeded in finding the articles or book chapters that are the prime sources of such information, only to discover upon clicking the search listing that obtaining the needed article would require paying some outrageous price for the single article, or an even more outrageous price to subscribe to the journal's archives. This is not a problem for faculty and students enrolled at most larger colleges and universities, or for those who work at big companies and research institutes, all of which have online journal access. But what about the rest of us -- all those students who graduate and no longer have access to their university's online services, people in small businesses who can’t afford the fees for articles or online journal access (which includes hundreds of thousands, if not millions, of independent consultants), and millions of private citizens who may simply want to pursue an interest in a subject? The fees for these services have become so exorbitant that many smaller colleges, technical institutes, for-profit universities, and non-profit foundations have had to cancel or drastically limit their journal access subscriptions.
Has knowledge become something that is disseminated on the basis of wealth? It certainly seems so from my vantage point. This basis for access to the repositories of knowledge seems to undermine the democratic principles of equal opportunity and the free flow of ideas and information. Moreover, much of the research reported in these journal articles was conducted with government funding. How is it that such research ends up being a commodity that is apportioned on the basis of having the ability to pay for it? This is a good example of where the interests of free enterprise and capitalism run counter to the public interest.
I would be very interested in hearing your views on what can be done to expand public access to scientific and professional journals, and to documents in repositories. There are a few examples of freely accessible repositories of journal articles, such as ERIC and, in part, JSTOR. However, the coverage of these repositories is spotty and limited, undoubtedly due to the threat they pose to the profits of the pay-to-read services. Could public funding be provided to expand the coverage of these free services? Another possibility might be to enable the Library of Congress, which has access to all publications, to offer free online access to its entire collection. This could probably be done for the cost of, say, one aircraft carrier, and might do more to advance the cause of peace than an armada of such ships.
As of the date of this post, I am launching this new blog feature to increase my accessibility to people out there who need help with statistics, research design, quantitative methods in business, psychometrics, survey design and analysis — in short, just about the whole gamut of quantitative methods. I see this blog as a place where I can update you on recent developments in statistics, on new ways to use statistical software to get at vexing analysis problems, and on new software tools for conducting various types of data analysis. I will also use it as a channel through which to express my views on larger issues bearing on the conduct of research and the accessibility of research findings.
I am structuring this blog so that you can leave a comment or question in response to any of the blog entries I make. If you would like me to address a question you have, or to start up a blog thread on any particular topic that I haven't addressed, please send me an email with details to inquiry@prostatservices.com. I will attempt to prepare a blog entry that responds to your request, or at least post your interest or need for information so that others can reply.
Don't worry that your request for information might seem too elementary or simple -- we all started learning these subjects at ground zero, and those of us who have some knowledge are standing on the shoulders of others who took the time to help us understand what probably seemed to them to be simple things. Too many of the teachers and professionals in this field seem to want to make its concepts and methods sound difficult, apparently in an effort to elevate their own stature. I've also run into a considerable number of experts who view their own understanding of a statistical concept or method as a competitive edge, not to be shared with others. Quite candidly, I hold people in both of these categories in utter contempt. Knowledge is to be shared, and people seeking help with statistics are much better off being left with understanding rather than in awe.
Lawrence Meyers and his co-authors, Glenn Gamst and A. J. Guarino (2006), have made a valuable contribution to the field of applied statistics by clarifying an issue that has not been adequately addressed by others. I refer to the treatment of interaction effects in factorial ANOVAs involving two or more factors. Meyers et al. make two important points. First, they state that when an interaction effect is found to be significant, understanding the nature of the interaction should become the dominant focus of the analysis. The importance of the main effects of the factors involved in the interaction becomes very much subordinated. This is because, in the presence of a significant interaction, any effort to interpret the main effects of the factors involved will be based on the false premise that differences on one factor exist across all levels of the other factor(s). Meyers et al. state it this way:
“If a significant interaction is obtained, it means that a different relationship is seen for different levels of an independent variable….One implication of obtaining a significant interaction is that a statement of each main effect will not fully capture the results of the study. …The general rule is that when an interaction effect is present, the information it supplies is more enriched—more complete—than the information contained in the outcome of the main effects of those variables composing it. Sometimes … a main effect is moderately representative of the results (although it is still not completely adequate to fully explicate the data). Other times … the main effects paint a nonrepresentative picture of the study's outcome.”
A second important point that Meyers et al. make is that post hoc analyses of the “simple effects” encompassed by an interaction should proceed by pairwise comparison of the levels of each factor within the levels of the other factor(s) in the interaction. For example, if A is a 2-level factor and B is a 3-level factor, pairwise comparisons should be made between the 3 pairs of B levels within each of the two A levels. The Type I error levels of these comparisons should be corrected for family-wise error (e.g., using the Bonferroni or another such procedure). Meyers et al. also recommend that the comparisons be done both ways (i.e., between B levels within each A level, and between A levels within each B level), although they note that others (e.g., Keppel, 1991) suggest that only one of these be chosen on a priori conceptual grounds. In either case, this is quite a different approach than other widely read authors recommend. For example, Howell (2009, pp. 424-426) recommends analyzing whether the overall differences are significant between the levels of a factor within each level of the other factor(s) involved in an interaction. This is adequate when there are only two levels of the factor being compared within each of the other factor’s levels. However, when there are 3 or more levels being compared (e.g., 3 levels of B within each level of A), Howell’s overall-difference approach does not tell us which specific pairs of levels differ significantly within each level of the other factor. In order to fully understand the nature of the interaction, we must use the pairwise comparison approach.
This leads to a final point that this message should cover: How does one obtain pairwise comparisons of interaction category means? I will address this question in relation to the use of SPSS. The menu options in SPSS do not allow for post hoc analyses of the pairwise combinations of interacting factors in factorial ANOVA analyses conducted using the GLM methods. In order to obtain the desired output we need to add a statement to the syntax of the GLM univariate ANOVA command. If we paste the syntax from a simple 2-way between-subjects ANOVA with post hoc tests specified (which the SPSS menu system limits to only the main effects) and the minimum of other options selected, we get the following (assume A = 2 levels, B = 3 levels):
UNIANOVA C BY A B
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /POSTHOC=B(TUKEY)
  /CRITERIA=ALPHA(.05)
  /DESIGN=A B A*B.

In order to obtain the results of the post hoc pairwise comparisons of the interaction categories (i.e., 3 comparisons within each of the 2 levels of A), we need to add the following line to the above syntax:

  /EMMEANS=TABLES(A*B) COMPARE(B) ADJ(BONFERRONI)

This results in the following syntax statement:

UNIANOVA C BY A B
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /POSTHOC=B(TUKEY)
  /EMMEANS=TABLES(A*B) COMPARE(B) ADJ(BONFERRONI)
  /CRITERIA=ALPHA(.05)
  /DESIGN=A B A*B.

The resulting output for the post hoc analyses of the interaction categories would look like this (with simulated data):
A      (I) B    (J) B    Mean Difference (I-J)    Std. Error    Sig.
A1     B1       B2               .030                .039      1.000
                B3               .170*               .039      <.001
       B2       B3               .140*               .039       .001
A2     B1       B2              -.470*               .039      <.001
                B3               .065                .039       .291
       B2       B3               .535*               .039      <.001

* The mean difference is significant at the .05 level.
In the above example of output, I have modified the standard SPSS output to eliminate redundant categories. I have not explored the command syntax requirements necessary to get comparable output from SAS, Stata, or Minitab, but I strongly suspect each has provisions for producing these comparisons.

References

Howell, D. C. (2009). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Wadsworth.

Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.