Does religious preference vary by
region of the country? Regional differences in religious preference are
found in many parts of the world such as India, the former Yugoslavia,
and Ireland. Is religious preference independent of geographical region
in the United States?
In this example, the variables relig
and region4 (from the gss 93 subset file) are used as table
variables to form a table with five rows and four columns. The five religions
are Protestant, Catholic, Jewish, None, and Other; the regions
are Northeast, Midwest, South, and West. This table structure
is called a general R x C table with no ordering across categories of its
variables. Numbers code the categories of both table variables, but SPSS
uses their respective labels in the output. The Pearson chi-square
statistic is requested for testing the independence of table rows and columns, that
is, testing the premise that religious preference and region are independent
of each other. Sometimes this task is expressed as testing equality of
proportions across rows (or columns).
To produce this output, from the
menus choose:
Analyze
Descriptive Statistics
Crosstabs...
Click Reset to restore the dialog box defaults, and then select:
·Row(s): relig
·Column(s): region4
Click Statistics... and select Chi-square.
Case Processing Summary. This panel describes the number of cases used in each table you request. The total number of cases in the gss 93 subset file is 1500, and for 756 of these cases, the values of both relig and region4 are Valid. One or both values are missing for the remaining 744 cases. Thus, only half of the sample is used (50.4%). With so many values missing, you should be concerned that the results might be biased. For example, people from certain groups may feel uncomfortable about stating their religion, so they omit the question. Using the Frequencies procedure to check the number of missing values, you find that relig has fairly complete data but that region4 has many missing values.
Religious Preference by Region Crosstabulation. In this sample of 756 people, 480 are Protestant, 15 are Jewish,
15 are Other, and so on. These counts are totals of the cell frequencies
in their respective rows. The counts along the bottom are totals for each
column. The row and column totals are known as marginals because
they summarize the counts within each table variable independently of the
other variable. The cell counts in the body of the table result from crosstabulating
the two table variables. For example, in the upper left corner there are
54 Protestants who live in the Northeast, 140 who live in the Midwest,
and so on. These counts are the observed number for each cell.
Chi-Square Tests. The null hypothesis
for the Pearson chi-square test is that the row and column variables are
independent of each other. By definition, two table variables are independent
if the probability that a case falls in a specific cell is the product
of its marginal probabilities. Using the probability that a subject is
Protestant (480/756) and the probability that a subject lives in the Northeast
(136/756), the probability for a case falling in the upper left cell is
(480 x 136)/(756^2) = 0.114
This probability is used to estimate
the number of cases expected (under the hypothesis of independence) in
each cell. The expected count is then compared with the observed count.
To compute the expected number of cases, multiply the probability
by the total sample size. This result is the row total multiplied by the
column total divided by the total sample size, or 86.3 cases expected for
this cell.
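The arithmetic above can be checked with a short Python sketch. It uses only the totals quoted in the text (480 Protestants, 136 Northeast residents, 756 valid cases); the variable names are illustrative and are not part of SPSS.

```python
# Marginal totals quoted in the text for the upper left cell.
row_total = 480   # Protestant row total
col_total = 136   # Northeast column total
n = 756           # valid cases in the table

# Under independence, the cell probability is the product of the marginal probabilities.
cell_probability = (row_total / n) * (col_total / n)

# Expected count: probability times sample size ...
expected_count = cell_probability * n

# ... which simplifies to row total * column total / sample size.
shortcut = row_total * col_total / n

print(round(cell_probability, 3))  # 0.114
print(round(expected_count, 1))    # 86.3
```

Both routes give the same expected count, which is why the shortcut form is the one usually quoted.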
The difference between the observed
count of 54 and the expected count of 86.3 is large. Does this gap support
the variables' independence? For an overall test of independence, the Pearson
chi-square statistic repeats this process of comparing the observed number
of cases with the number expected for each cell. After subtracting the
expected count from the actual observed count for each cell, SPSS constructs
the statistic by squaring the difference and dividing the result by the
expected count. Thus, for the Pearson chi-square statistic, these quantities
are summed across all cells:
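chi-square = sum over all cells of (O - E)^2 / E,
where O is the observed count and E is the expected count for a cell.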
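As a rough illustration of the summation just described, the following Python sketch computes the Pearson chi-square statistic for any table of counts. The function names and the small 2 x 2 table are hypothetical and are not taken from the gss 93 data; only the upper left cell values (observed 54, row total 480, column total 136, sample size 756) come from the text.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table, used only to exercise the function.
toy_table = [[10, 20],
             [20, 10]]
toy_chi_square = pearson_chi_square(toy_table)

# Contribution of the upper left cell of the religion-by-region table,
# using the figures quoted in the text.
expected_upper_left = 480 * 136 / 756
upper_left_term = (54 - expected_upper_left) ** 2 / expected_upper_left

print(toy_chi_square)
print(upper_left_term)
```

The upper left cell alone contributes roughly 12 to the total, which gives a sense of how a single large gap between observed and expected counts feeds into the overall statistic.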
When the resulting chi-square statistic
is large, the null hypothesis of independence is rejected. To define large,
the sample statistic is compared to a critical point on the theoretical
chi-square distribution that depends on the number of rows and columns
in the table. This latter information is labeled df for degrees
of freedom. For an R x C table, the degrees of freedom are the number
of rows minus 1 multiplied by the number of columns minus 1, or (r - 1)(c - 1).
For this table, df = (5 - 1)(4 -
1), or 12.
The computed chi-square statistic
for this table is 109.1, with an associated probability (p-value, or
significance level) of less than 0.0005; the probability is small but not 0.
Conventionally, if this probability is small enough (less than 0.05 or
0.01), the hypothesis of independence is rejected. Using these numbers
alone, you could report that there is an association between religious
preference and region.
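The degrees of freedom and the size of the p-value can be checked with a small Python sketch. It relies on a closed form for the upper tail of the chi-square distribution that holds only when the degrees of freedom are even, which is the case here; the function name is hypothetical, and this is not how SPSS computes the value.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the identity exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for even, positive degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut requires even, positive df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


rows, cols = 5, 4
df = (rows - 1) * (cols - 1)      # (5 - 1)(4 - 1)
p_value = chi_square_sf_even_df(109.1, df)

print(df)
print(p_value < 0.0005)
```

The resulting probability is tiny but strictly positive, in line with the statement that the significance level is below 0.0005 without being 0.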
However, if certain assumptions
are not met, this probability can be distorted or misleading. Many researchers
use the guideline that no cell should have an expected value less than 1.0 and
that no more than 20% of the cells should have expected values less than 5 (for 2 x
2 tables, some say that no cell should have an expected value less than 5).
SPSS reports the minimum expected
count, the number of cells with expected count < 5, and the
% of cells with expected count < 5. In this table, the minimum
expected count is 2.7, and eight cells (40%) have expected counts <
5. Clearly, the guideline is violated.
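A check of this guideline can be sketched in Python as follows. The function name and the 3 x 3 table are hypothetical, chosen only to show a sparse row; they are not the gss 93 counts, and the thresholds are the rule of thumb quoted above rather than a fixed standard.

```python
def check_expected_counts(observed):
    """Summarize expected counts and apply the rule of thumb:
    no expected count below 1, and at most 20% of cells below 5."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]

    minimum = min(expected)
    n_small = sum(1 for e in expected if e < 5)
    pct_small = 100.0 * n_small / len(expected)
    guideline_met = minimum >= 1.0 and pct_small <= 20.0
    return minimum, n_small, pct_small, guideline_met


# Hypothetical table with one very sparse middle row.
sparse_table = [[30, 40, 50],
                [2, 3, 1],
                [20, 25, 29]]

minimum, n_small, pct_small, guideline_met = check_expected_counts(sparse_table)
print(minimum, n_small, round(pct_small, 1), guideline_met)
```

In this toy table, all three cells of the sparse row have expected counts below 5, so the guideline fails even though no expected count drops below 1.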
What should you do? Can you see
a way to make the table less sparse? A total of 15 people are in the Other
category. Because it is probably a mixture of religions, you can justify
deleting it. The Jewish category also has very few subjects. If
you delete it, however, you should be careful to indicate that any conclusions
are restricted to the Protestant, Catholic, and None groups.