The movement from theory to the successful operationalization of a concept is not an easy step. One can produce a "brilliant" theory of a phenomenon by conducting a thorough literature review resulting in a system of logically interrelated propositions, assumptions, and conceptualizations; yet it is still possible to fail to provide good empirical indicators for the variable of interest. All too frequently, research methodologists overlook the importance of the process of measurement. In our zeal to provide abstract explanations, formulate a research design, collect data, and analyze the results in search of support for our hypotheses, we very often fail to adequately measure the concepts at hand. Yet no matter how sophisticated our theorizing, or how high-powered our statistical analyses, the results can be erroneous if the variables are improperly operationalized. With tongue in cheek, social scientists refer to this problem as "GIGO" (pronounced "gee-go") -- Garbage In, Garbage Out.
This chapter will examine issues related to operationalization and measurement
of variables. We will attempt to provide the beginning researcher
with an understanding of measurement techniques and familiarize him or
her with tests of validity, reliability, unidimensionality, and reproducibility.
LEVELS OF MEASUREMENT
Quantitative data collection typically involves acquiring information (demographic, attitudinal, evaluative, factual, etc.) from individual respondents. The interview schedule or questionnaire is the most often employed means of obtaining such information or data. As one can imagine, there are many ways of asking a question, and each method will yield different levels of measurement. For example, if one is interested in knowing the 1987 income of a household through the use of a mailed questionnaire, such data could be obtained with the following questions:
What was your 1987 gross annual income, including wages, tips,
interest and other sources? ___________ or
What was your 1987 gross annual income, including wages, tips, interest and other sources?
____ 1) less than $10,000
____ 2) $10,000-$14,999
____ 3) $15,000-$19,999
____ 4) $20,000-$24,999
____ 5) $25,000-$29,999
____ 6) $30,000-$34,999
____ 7) $35,000-$39,999
____ 8) more than $40,000 or
Was your 1987 gross annual income -- including wages, tips, interest,
and other sources -- more than $20,000? ___ 1) yes ___ 2) no
These three questions reveal how one might solicit information concerning the variable of interest--1987 annual income. Each may be equally effective in obtaining the desired information, yet the response categories provided will result in very different levels of measurement.
In our last question, which had the response categories of "yes" and "no", the assessment of income is at the nominal level. Measures at the nominal level assess categories which are mutually exclusive and for which there is no inherent hierarchy among the response options. Gender is another concept measured at the nominal level. When we operationalize the concept of gender, we usually assume that "male" and "female" are the appropriate response categories (although some might argue that a forced choice between "male" and "female" is not a true reflection of gender, and that a continuum might more accurately portray one's gender identity).
Our second question, which provided categories of income arranged in a hierarchy, involves measurement at the ordinal level. The measurement contained within this question is ordinal in nature because the response sets do imply a continuum or hierarchy of income. Questions concerning degrees of attitude or value are typically ordinal level measures -- with response categories implying levels of agreement. For example, if you were asked to state the level to which you agreed with a statement concerning the legalization of marijuana use, you might respond along a continuum of 1=strongly disagree, 2=disagree, 3=agree, 4=strongly agree.
Finally, our first measure of annual income provides us with an example of an interval level response set. Here the respondent can give any numerical value. Hence, Mr. Smith might indicate that his 1987 gross annual income was $20,545 while Ms. Jones might reveal that she earned $41,090 during that year. The point is, this open-ended response set provides the researcher with an interval level measure of income because the distances between units of measurement (in this case numbers of dollars) have real numerical meaning. It can be said that Ms. Jones earned twice as much as Mr. Smith that year. One value can be directly compared to another value and arithmetic understandings can be derived. (Measures with a true zero point, such as income, are sometimes distinguished as a fourth, ratio level of measurement; it is this property that makes statements like "twice as much" meaningful.) Other examples of variables often measured at the interval level include grade point average, age, years of formal education, and IQ.
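The contrast between the open-ended and the bracketed income questions can be sketched in a few lines of Python, using the hypothetical Smith and Jones figures above and the category codes from the bracketed question:

```python
smith_income = 20_545   # open-ended (interval level) responses
jones_income = 41_090

# Interval data: differences and ratios between values are meaningful.
assert jones_income - smith_income == 20_545
assert jones_income / smith_income == 2.0      # "twice as much"

# The same incomes coded with the bracketed (ordinal) question:
# category 4 covers $20,000-$24,999; category 8 is more than $40,000.
smith_code, jones_code = 4, 8

# The codes preserve order (8 > 4), but the ratio 8 / 4 says nothing
# about how much more Jones earned -- only the interval data does.
assert jones_code > smith_code
```

Note how the ordinal codes retain only the ranking of the two respondents; every arithmetic comparison beyond "greater than" is lost.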
The difference between nominal, ordinal, and interval level variables is not difficult to understand. What is more difficult to comprehend is why the levels of measurement are so important in social science research. Statistical measures developed for data at the interval level allow the researcher to make the most refined and accurate assessment of the existence, nature, direction, and strength of the association between two or more variables. However, much of the social world consists of variables which are not measured at the interval level. Therefore, the social scientist must be prepared to utilize the less refined and discriminating data analysis techniques which are appropriate for ordinal and nominal level measures.
The researcher's key objective in providing response categories is to measure variables at their most "productive" level so that the more powerful statistical techniques will be appropriate. However, it is also important that the response categories are understandable and meaningful to the respondent, lest erroneous information be obtained.
Simply to conceptualize a variable and then create empirical indicators will not necessarily guarantee that the variable of interest has been successfully operationalized. In most research investigations, sociologists view a collection of items as a more precise measurement of a concept than any single question. Simply put, several measures or items are combined to arrive at a single scale score.
In this section we will discuss issues of validity, reliability, unidimensionality,
and reproducibility as they relate to the task of scale construction.
These four properties serve as the primary standards for evaluating the
operationalization and measurement of empirical variables.
Reliability refers to the consistency of a measure. That is, whether it measures the same thing, in the same way, time after time. Reliability may be thought of as obtaining the same results after giving the survey (or empirical indicators) to the same group of people more than once. Because social scientists do not typically conduct longitudinal studies, the more common social science usage of the term "reliability" refers to the consistency of measurement. Specifically, do the various items of the scale, which are thought to measure the same thing, actually do so?
Reliability can be statistically assessed. Coefficients of reproducibility
and reliability can be obtained (see later discussion in this chapter for
specific discussion of measurement techniques). Therefore, the methodological
question of reliability can be empirically addressed -- one can ultimately
determine the extent to which the variables are reliable or consistent
measures. However, the issue of validity is more complex and largely theoretical.
Validity refers to whether or not the questions or empirical indicators actually measure what they claim to measure. Statistically speaking, validity cannot be determined. The social scientist may only look at the items and assess whether or not the items seem to be measuring what they intend to measure (face validity). One might also assess just what it is he or she hopes to measure and whether the items seem to adequately represent all aspects or dimensions of the concept being measured (content validity).
While any empirical test for validity is inherently problematic, the sociologist's concern for valid measurement is fundamentally more important than the issue of reliability. It is quite feasible that any given test or scale might consistently measure the variable of interest but in actuality fail to truly measure that variable.
According to Selltiz, Wrightsman, and Cook (1981:197):
The concern with reliability arose from the difficulty of obtaining evidence about validity. However, evidence of high reliability can never substitute for evidence of validity. Reliability can only show that something is being measured dependably but not necessarily the intended concept. A valid measure with low reliability is more useful than a reliable measure of something one does not care to measure.
For example, a researcher could create the following two item scale claiming to measure life satisfaction, and then find that these items produce consistent results (the scores on each question correlate well with each other).
Compared to others my age, I am generally happier with my life.
1 2 3 4
strongly agree agree disagree strongly disagree
On a scale of 1 to 4 with 1 = very satisfied, I would say I am _____ with my life at the present time.
1 2 3 4
very satisfied satisfied not very satisfied not at all satisfied
Simply because respondents tend to score in a consistent manner on each question does not necessarily imply that the scale is valid. It is very possible that these two questions do not measure "life satisfaction" at all. Rather, the researcher may in reality be assessing "relative life satisfaction" -- life satisfaction compared to others and compared to one's own experiences at earlier times in life.
Another situation in which we find empirical indicators with questionable scale validity is when a series of specific questions is used to operationalize somewhat unrelated dimensions of a concept while collectively claiming to measure the global concept. Consider the following two questions which might be used as indicators of general life satisfaction:
Overall, I would rate my health as very good.
1 2 3 4
strongly agree agree disagree strongly disagree
Overall, I would rate my financial situation as very good.
1 2 3 4
strongly agree agree disagree strongly disagree
It stands to reason that good physical and financial well-being might be related to overall life satisfaction. But these specific questions are better indicators for health and financial status than they are for life satisfaction. If enough items such as these were added together to form some global measure for life satisfaction, the researcher needs to be alert to the fact that he or she may not really be measuring "life satisfaction" at all.
Think about the following fictitious commercial for Beatrice Foods as an analogy to illustrate the point we are making: The announcer says: "Do you like crispy pickles? Do you like rich and luscious fudge? Do you like creamy coffee ice cream? Then you'll love our new Jamocca Pickle Fudge Ice Cream!! Buy Kemps -- the ice cream that tastes as good as you think it will." As we have demonstrated, empirical indicators may not be valid measures of concepts, even though they may yield consistent results.
Much discussion in the social sciences centers around the issue of unidimensionality in scale construction. Unidimensionality refers to the property that the items comprising a scale measure one and only one dimension or concept at a time. Typically, rather complex concepts such as religious commitment, feminism, prejudice, death anxiety, marital satisfaction, and a host of other concepts have been measured with scales and not by single questions or empirical indicators.
Researchers are able to quantitatively evaluate scale unidimensionality by correlating the separate scores received on each of the items with the total scale score. If the magnitude of the correlation coefficients is high (>.8), one can assume that each item of the scale is assessing the same dimension. However, it is also possible that the strong relationships between the items arise because different dimensions of the same concept are themselves highly correlated. Therefore, it seems evident that demonstrating scale unidimensionality is problematic at best. In practice, most scales do not assess single dimensions but rather reflect the multidimensional nature of most concepts.
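The item-total check described above can be sketched as follows. The responses are invented for illustration (six respondents, three 4-point Likert items); any statistics package performs the same computation:

```python
# Invented responses: rows are 6 respondents, columns are 3 Likert items.
items = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 2, 1],
    [4, 3, 4],
    [1, 1, 2],
    [2, 3, 2],
]
totals = [sum(row) for row in items]   # each respondent's total scale score

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Correlate each item's scores with the total scale score.
for j in range(3):
    r = pearson_r([row[j] for row in items], totals)
    print(f"item {j + 1}: item-total r = {r:.2f}")
```

For these made-up data every item-total correlation exceeds .8, which would usually be read as evidence of unidimensionality; as the text cautions, it may instead reflect correlated dimensions of the same concept.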
Let's examine the development of a particular scale as a means of illustrating the benefits of scale construction for operationalizing complex concepts. Suppose you were interested in measuring the extent to which college students "loved" their dating partners. Perhaps the simplest means of obtaining this information would be to write the following question:
How much do you love your present dating partner?
___ 1) not at all ___ 2) very little ___ 3) some ___ 4) much
___ 5) very much ___ 6) I currently do not have a dating partner
As you can imagine, this question might easily mean different things to different people. Scott Jones' interpretation of "very much" may not be the same as the way in which Heidi Smith interprets the same response category. Consequently, we may not get a very objective measure of the extent to which a person loves his or her partner.
Also, an important consideration is the method by which the researcher scores the various response categories of the question. If a respondent checks category four -- "much" -- does that imply she loves her dating partner twice as much as the person who checked category two -- "very little"? Or do the values indicate only an ordinal measure of love?
Loving is not an easy concept to measure, particularly if one wishes to assign this variable a single quantitative score. Instead, a scale composed of several indicators might be created which would provide a more complex and precise assessment of the extent to which college students love their dating partners.
Just such a scale was created by Zick Rubin (1970). Rubin began by constructing about 80 questions which broadly reflected an individual's attitudes toward a particular person. Using his own judgment, and the opinion of some "expert" friends, he sorted these statements into "liking" or "loving" sets of items. He then administered these two sets of questions to 198 undergraduates at a midwestern university. Factor analysis was performed -- a statistical technique which sorts questions (based on correlations) into unidimensional clusters or categories. From this analysis, Rubin was able to identify 13 items in each of the two scales which were reliable and most probably valid measures of love and liking.
The love scale seeks to assess the dimension of personal attachment (e.g., "If I were lonely, my first thought would be to seek ____________ out."). It also attempts to measure the dimensions of caring (e.g., "If ____________ were feeling bad, my first duty would be to cheer him or her up.") and intimacy (e.g., "I feel I can confide in ____________ about virtually everything."). Subjects respond to these attitudinal statements by indicating their degree of agreement or disagreement with each while thinking about a selected boyfriend or girlfriend. The complexity of measurement found in Rubin's "Loving Scale" demonstrates that the concept of love is not easily measured, but through careful scale construction one can tap the salient dimensions of the concept and provide an objective assessment of the extent to which an individual "loves" a friend or dating partner.
It is clear from this example that each of the items in Rubin's scale seemed to possess face validity -- the items were selected because each appeared to be measuring some aspect of love. But some items, while initially appearing to be valid indicators, were dropped from the revised scale (remember, Rubin began with 80 items and ended up with two 13-item scales which measured the concepts of loving and liking). The rejected items were excluded because they did not prove to be reliable measures. And while the remaining items in the final scales appear to be reliable, Rubin cannot claim with certainty that the scale possesses content validity. It might well be that consistent responses to different questions are a function of respondents attempting to answer in ways that they believe others would expect them to.
The example of Zick Rubin's creation of the Liking and Loving Scale is also of interest because it illustrates how time consuming and difficult it is to create a good scale. When writing items, one must always refer to the concept of interest and pay particular attention to the many possible dimensions of the concept. Then items must be pretested, revised, retested, and statistically analyzed. The most common forms of data analysis employed in scale construction are Cronbach's alpha and factor analysis (see SPSSX for details on the procedure RELIABILITY). Once the data have been analyzed, items found to be unreliable will have to be deleted from the scale.
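As a rough illustration of the reliability statistic named above, Cronbach's alpha can be computed by hand from a small set of invented responses (six respondents, three 4-point items; the data are made up for this sketch):

```python
from statistics import variance  # sample variance (n - 1 denominator)

# Invented responses: rows are respondents, columns are scale items.
responses = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 2, 1],
    [4, 3, 4],
    [1, 1, 2],
    [2, 3, 2],
]
k = len(responses[0])                       # number of items

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
item_vars = [variance([row[j] for row in responses]) for j in range(k)]
total_var = variance([sum(row) for row in responses])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(f"Cronbach's alpha = {alpha:.2f}")    # about 0.89 for these data
```

Values near 1 indicate that the items vary together (internal consistency); items that drag alpha down are the candidates for deletion.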
This tedious, but essential process, is why we recommend that most beginning
researchers employ preexisting scales if possible. There are many
books which contain scales developed in the social sciences and discuss
the corresponding issues of reliability and validity, as well as the steps
involved in administration and scoring. The following four books will prove
to be very helpful in locating scales and empirical indicators used in
sociological and social psychological variable assessment: Handbook of
Research Design and Social Measurement (Miller, 1977), Sociological Measurement
(Bonjean, et al., 1967), Measures of Social Psychological Attitudes (Robinson
and Shaver, 1973), and Questionnaire Design and Attitude Measurement (Oppenheim, 1966).
Another issue which needs to be considered in the construction and use of scales is reproducibility. The best scales are those for which, given an individual's score, we can feel confident that he or she obtained this score as a result of a particular attitudinal position. It is desirable that the researcher be able to predict, from knowledge of a respondent's scale score, those items with which the respondent most likely agreed and those with which the respondent disagreed. In many research situations, the property of reproducibility is quite difficult to achieve because scales often lack reliability, validity, and unidimensionality.
For example, let us suppose that one was trying to create a 10-item scale measuring attitudes toward abortion. Suppose respondents indicated their level of agreement with each statement using four response categories (1=strongly disagree, 2=disagree, 3=agree, 4=strongly agree), that all items could be interpreted in the same direction, and that the 10 values obtained were added together. Each subject could then obtain a minimum scale score of 10 and a maximum of 40. Low scores could be interpreted as indicating little or no support for the use of abortion, while high scores might indicate strong support. However, for those individuals receiving middle-range scale scores (say 20-30), we have no way of knowing whether these scores were obtained because the subject checked the middle categories (2's and 3's) for all 10 questions, or because the subject responded to some of the statements very positively and to others very negatively (thus creating the impression of moderate support for all statements concerning abortion). Given this problem, many attitudinal scales are not amenable to cumulative scaling.
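The scoring ambiguity just described is easy to demonstrate with two hypothetical respondents on such a 10-item, 4-category scale:

```python
# Two hypothetical response patterns on the 10-item abortion scale.
moderate  = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]   # middle categories throughout
polarized = [4, 4, 4, 4, 4, 1, 1, 1, 1, 1]   # strong agreement mixed with
                                             # strong disagreement

# Both summated scores land in the ambiguous middle range.
assert sum(moderate) == sum(polarized) == 25

# The possible range of summated scores: 10 items scored 1 to 4.
print("score range:", 10 * 1, "to", 10 * 4)  # 10 to 40
```

The identical total of 25 conceals two very different attitude structures, which is exactly why the summated score alone cannot reproduce the underlying responses.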
Fortunately, we can determine whether or not the results of a scale are reproducible with a statistical procedure called Guttman scalogram analysis. In SPSSX, one would use the RELIABILITY subroutine with the GUTTMAN option to obtain a coefficient of reproducibility. Some researchers argue that a value of less than .90 for this coefficient indicates that the scale does not have consistent scoring across items, and that the scale should be revised or rescored (Guttman, 1950). Other researchers feel this is too stringent a test and suggest that reproducibility can be established with lower coefficient values (Green, 1956).
TYPES OF SCALES
Although scales have been used to measure many concepts in the social sciences, they are most often used to measure attitudes. Most attitudinal scales are designed to assign individuals scores relative to their value commitments, beliefs, and feelings. Individual scores are then compared with the standardized scores of others in order to position the individual on some continuum of the attitude being measured. The following section describes the several types of attitude scales which are widely employed in sociological research.
Beyond attempting to deal with the issues of reliability, validity, unidimensionality, and reproducibility, the main purpose for creating attitude scales is to enable the researcher to make better comparisons between individuals with respect to the attitudinal variable being studied. That is to say, if we had only one question on a survey which read:
"I believe the public school system should be permitted to use corporal punishment when needed." ____ yes ____ no
an individual could be assigned only one of two values: 1) they favor corporal punishment in schools, or 2) they do not favor the use of corporal punishment.
It would be much better if the scale consisted of 10 or more items, which would allow the researcher to assign values along a continuum of numerical scores (for example, ten 4-point items would yield scores ranging over 30 points, from 10 to 40). If the scale were indeed a reliable and valid measure of this attitude, receiving a high score would indicate stronger opposition to the use of corporal punishment than receiving a lower scale score.
There are essentially three types of scales -- differential scales, summated scales, and cumulative scales. In the following section we will examine the unique features of each type and the best-known examples in contemporary sociological use.
A differential scale requires that a subject's response specifies where he or she falls along some attitudinal dimension. The best known examples of differential scales are those created by L. L. Thurstone (Thurstone, 1929). Thurstone scales are created in a several-step process. First, an open-ended question is posed to a large number of people. Each response to the question is recorded and then assigned a value by a panel of "expert" judges -- based on the strength of the attitude expressed in the statement. The judges independently assign these values by reading and rank ordering the large number of different responses given to the question. Typically the responses are typed on cards and then independently sorted by each judge into ten numbered piles.
As judges sort the cards, they are supposed to assume that there are equal intervals between each numbered pile. The notion of equal-appearing intervals refers to the assumption that, according to the panel of judges, a person obtaining a value of 10 clearly expresses a stronger attitude than someone scoring a 7. Referring to the earlier discussion of levels of measurement, this is an attempt to give meaningful numerical values to a range of attitudes.
After the judges have sorted all the responses into numbered piles, numerical averages will be assigned to each statement. Those statements receiving high inter-rater reliability (the judges agree as to which pile the card should be placed), will be used as options for the closed-ended questions appearing on the final questionnaire.
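The judging procedure can be sketched as follows. The statements, the five judges' pile numbers, and the cutoff used here for "poor agreement" are all hypothetical; Thurstone's own procedure used medians and more formal agreement criteria:

```python
from statistics import mean, stdev

# statement -> pile numbers (1-10) assigned by five hypothetical judges
judge_piles = {
    "Abortion should never be legal":            [1, 1, 2, 1, 1],
    "Abortion is acceptable in some situations": [5, 6, 5, 6, 5],
    "Abortion should always be available":       [10, 9, 10, 10, 9],
    "Abortion is a complicated question":        [2, 9, 5, 8, 3],
}

for statement, piles in judge_piles.items():
    if stdev(piles) > 2.0:
        # Judges disagree widely (low inter-rater reliability): drop it.
        print(f"dropped (judges disagree): {statement}")
    else:
        # Judges agree: the statement's scale value is the average pile.
        print(f"scale value {mean(piles):.1f}: {statement}")
```

Statements surviving this screen become the fixed options of the final question, each carrying its scale value; a respondent who endorses a statement receives that value as his or her score.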
When this questionnaire is finally administered to the target population, subjects are asked to check the statement that they believe most clearly represents their point of view. The corresponding scale score (as determined by the panel or judges) will be the value assigned to reflect the attitude of the respondent.
In practice, the differential scale method created by Thurstone is not used much today, largely because Thurstone scaling is an extremely time-consuming approach to scale construction. Another problem with Thurstone scales is that they quickly become outdated if there are changes in the society affecting the attitudes being measured. For example, the statement "I believe that a man should help his wife with her work within the house" may, a short time ago, have been perceived as the neutral position on a measure of attitudes toward egalitarian gender roles; today the same statement would likely strike many respondents as quite traditional. Even with these problems, however, Thurstone scales can be used effectively in the construction of new scales.
The second basic type of scale is the summated scale. Here the numerical values assigned to the response categories for each question are simply added to produce a single scale score. The summated scale approach works because persons who are strongly favorable toward some idea will more often select positive response categories, while those who hold more neutral positions will select some positive and some negative categories. Finally, it is assumed that persons who are opposed to the concept being measured will respond by selecting those statements which reflect a negative position.
The most common form of summated scale is the Likert scale, developed by Rensis Likert in 1932. Typically, a number of statements are developed which are thought to reflect positive and negative attitudes toward some concept (e.g., conservatism, feminism, religious orthodoxy, prejudice). Each question is then written with a number of response categories. The most common type is the 4-point Likert scale: (1) strongly agree, (2) agree, (3) disagree, and (4) strongly disagree. An individual's score is computed by adding the values assigned to each of the responses selected for all of the items of the scale.
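One practical detail of summated scoring: because a Likert scale mixes positively and negatively worded statements, one set must be reverse-coded before summing so that a high total consistently means a favorable attitude. A sketch with invented job-satisfaction items:

```python
# With 1=strongly agree ... 4=strongly disagree, agreeing with a positive
# statement and disagreeing with a negative one both reflect satisfaction,
# so positively worded items are reversed (value -> 5 - value) here.
responses = [
    ("My work is satisfying",   1, "positive"),   # strongly agree
    ("I dread going to work",   4, "negative"),   # strongly disagree
    ("I feel valued at my job", 2, "positive"),   # agree
]

score = 0
for statement, value, wording in responses:
    score += (5 - value) if wording == "positive" else value

print(f"summated score = {score}")   # range 3-12; higher = more satisfied
```

This respondent scores 11 of a possible 12; without the reverse-coding step, the positive and negative items would cancel each other and the total would be meaningless.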
The Likert Scale is the most widely used method of scaling in the social sciences today. Perhaps this is because Likert scales are much easier to construct and because they tend to be more reliable than other scales with the same number of items (Tittle and Hill, 1967).
Unlike the Thurstone scale, Likert scales encounter problems related to reproducibility -- there are many ways of achieving the same scale score. If a respondent scores high on several items but low on several others, the summated score will reflect a moderate position. A different respondent could be assigned the same scale score by giving neither strongly positive nor strongly negative responses to any of the items.
Furthermore, since the response categories for each item do not have
equal intervals between the assigned numerical values (nor even equally
"appearing" intervals as in Thurstone scales), Likert or summated scales
can only produce data at the ordinal level of measurement. Consequently, the use of interval level statistical techniques is not appropriate.
The third type of scale is the cumulative scale. Cumulative scales (like the previous two) attempt to position individuals along some attitude continuum by assigning them single scale scores. What is different about a cumulative scale is that the items are sequentially related to each other. If a person responds favorably to item 3, he or she will also have responded favorably to items 1 and 2. Similarly, respondents disagreeing with item 6 would also disagree with items 7, 8, 9, and 10 of a 10-item scale.
In the cumulative scale, the items may be sequentially ordered (with the most favorable items beginning or ending the set), or they might be randomly presented. Because there is no attempt to ensure that values assigned to the items are equally spaced or weighted (as was the case in the Thurstone method), a cumulative scale (like the summated scale) produces ordinal level measurements. Therefore, a score received on a cumulative scale is simply a hierarchically positioned value, and only has relative meaning when compared to other numerical scale scores.
Consider the following 3-item scale. Here respondents were asked to indicate whether they agreed or disagreed with each statement.
1. Older persons are usually outdated in their work skills.
____ Agree (0 points) ____ Disagree (1 point)
2. Older people are usually productive workers.
____ Agree (1 point) ____ Disagree (0 points)
3. Older workers bring skills and expertise to jobs which younger members do not provide.
____ Agree (1 point) ____ Disagree (0 points)
Respondents scoring 1 point would disagree with question one but also with questions two and three; respondents scoring 2 points would disagree with question one and agree with question two; finally, respondents scoring 3 points would disagree with question one but agree with questions two and three. The more items on the scale, the more variance in the scores obtained.
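Enumerating all possible response patterns for this three-item example shows why the cumulative assumption matters: only the extreme scores pin down a unique pattern unless respondents actually answer in the ideal cumulative order. (This enumeration is illustrative, not part of the original example.)

```python
from itertools import product

# Point values for (agree, disagree) on each of the three items above.
points = [(0, 1), (1, 0), (1, 0)]

# Group every possible agree/disagree pattern by its total score.
by_score = {}
for pattern in product(["agree", "disagree"], repeat=3):
    total = sum(pts[0] if answer == "agree" else pts[1]
                for answer, pts in zip(pattern, points))
    by_score.setdefault(total, []).append(pattern)

for total in sorted(by_score):
    print(total, len(by_score[total]), "pattern(s)")
# Scores 0 and 3 each correspond to exactly one pattern; scores 1 and 2
# can each be produced by three different patterns.
```

Only when responses follow the cumulative (Guttman) order does each score identify one pattern, which is what makes such scales reproducible.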
In the 1940s, Louis Guttman (1944) was instrumental in developing scalogram analysis. The resulting Guttman coefficient of reproducibility, introduced earlier in this chapter, is a statistical assessment of the total cumulative effect of a scale. This statistic enables the researcher to test the unidimensionality of a scale.
Guttman Scales are not as widely used as are Likert Scales. This is most likely due to the fact that most scales are not truly unidimensional. However, the ability to derive the Guttman coefficient of reproducibility through statistical packages such as SPSSX makes it a widely used analytic approach in refining and constructing scales.
Having reviewed the merits and problems related to scale construction,
we will now turn our attention to the techniques and procedures involved
in the preparation of questionnaire and interview schedules.
QUESTIONNAIRE AND INTERVIEW SCHEDULES
Both questionnaire and interview schedules are commonly used in survey research. The questionnaire is typically restricted to written responses, while the interview most often elicits a verbal response which is then recorded or transcribed by the interviewer. The interview may be conducted in a face-to-face setting or it may be given over the telephone.
The interview can be structured or unstructured. In a structured interview, the respondent has a very limited number of response set categories and little opportunity for elaboration beyond those options. For example, a telephone interviewer might ask the respondent to select one of the following phrases which best describes their health status: 1) very healthy, 2) fairly healthy, 3) not very healthy, 4) not at all healthy. Respondents would then merely give a number indicating their health status at the time. In an unstructured interview, the interviewer might ask the following question: "Could you tell me a little bit about your health?" This question allows respondents any number of options. They may talk about their health history, they may compare their health status to the health of others, or they might elaborate on a recent impairment and how it restricted their mobility. The key to conducting unstructured interviews is to help the respondent focus on the variables of concern in the research project while at the same time allowing supplemental data to emerge.
The major differences between the questionnaire and the interview relate to the issues of interviewer bias and rapport. The verbal collection of data, through the interview process, demands that the researcher pay strict attention to his or her own objectivity. What we "selectively" hear the respondent saying, and how we choose to probe for more information, is too often a function of what we think we will hear. To put it another way, interviews are often conducted in a fashion that leads the respondent to answer in accordance with the researcher's hypotheses. This issue is further complicated by the fact that the gender of the interviewer might affect the amount, or type, of respondent self-disclosure. In a face-to-face interview, factors such as the age, social class, education, and race of the interviewer may elicit different responses based upon the perceptions that the respondent has concerning the interviewer's values or attitudes.
Conducting interviews, in the context of survey research, is the most difficult, time-consuming, and costly method for collecting data. Interviewing also places a special demand upon the researcher to guard his or her value-free stance in question wording and interpretation.
This is not to imply that interviews are inferior to questionnaires. The questionnaire, although it does allow for greater anonymity, has several major disadvantages as well. The primary problem is that the questionnaire format does not allow for probing or clarification of responses. Another problem is that questionnaires do not typically permit the recording of complex behaviors or attitudes.
It should be obvious at this point that one decides whether to use interviews or questionnaires based on the nature of the research hypotheses, the state of the body of knowledge concerning the variables of interest, the amount of money one has to spend on the research, the amount of time or training one has, and the availability of computer resources for analyzing quantitative or qualitative data.
STEPS IN PREPARING QUESTIONNAIRES AND STRUCTURED INTERVIEWS
The first step is to select empirical indicators which will specifically address the hypotheses which have been derived from the literature review. As we have indicated earlier, we advocate the use of existing scales or measures whenever possible. One may also modify a scale or index to meet the needs of a particular research study -- where existing scales are not totally appropriate. The real advantage in using existing scales is that other scientists have already assessed the scale's reliability and validity. Furthermore, the use of these scales tends to replicate the research of others -- providing greater support for the scientific body of knowledge.
The second step is to decide what type of questions to use. Does one want closed or open-ended response sets? A closed response set refers to those questions which have a limited number of fixed categories for answers, while an open-ended question allows for free responses. Again, this decision will be based on one's ability to provide all possible alternatives to a question, the complexity of the behaviors or attitudes being studied, and the resources available for statistical analysis of the data being collected. We have found Babbie's (1986) The Practice of Social Research and Selltiz, Wrightsman, and Cook's (1981) Research Methods in Social Relations to be very helpful sources of specific information concerning question writing.
The third step is to formulate a layout of the questionnaire or interview schedule. We suggest that questions be placed in an order that makes sense to the respondent. The researcher should try to group questions by topic or area. This tends to help the respondent maintain a mental set and more efficiently complete the survey.
It is also a good idea to place more sensitive questions toward the end of the survey. In doing this the researcher will have the opportunity to build rapport with the respondent before he or she is asked to answer sensitive questions. Also, should the respondent refuse to answer these questions, most of the data will have already been collected.
The fourth step is to carefully edit the questionnaire. Once the survey is completed, let it sit for a day or two. Then return to the rough draft and reexamine it carefully. Pay close attention to ambiguities, omissions, incomplete response sets, incomplete directions for responding or returning the survey, and other such details. We have also found it very helpful to have another social scientist look at the questionnaire and provide feedback which can be used to improve the instrument.
The fifth step is to pretest the questionnaire or interview schedule with a small number of subjects who are similar to those who will actually comprise the sample. The questionnaire should be administered under similar conditions, and the researcher should actually conduct one or two interviews. Questions which are difficult for the respondent to answer, or which appear to be providing unreliable information, should be modified or eliminated. We cannot stress enough the importance of taking time to conduct a pretest. Spotting the ambiguities which inevitably occur in first drafts of surveys greatly enhances the overall quality of the data eventually collected. Remember the problem of GIGO!
The practice of social research is indeed an exciting one, and the construction of valid and reliable surveys is one of the most difficult and rewarding aspects of this research. We would encourage the undergraduate social scientist to try all of the different approaches to collecting survey data -- face-to-face interviews, mailed or distributed questionnaires, telephone interviews, field research, historical research, and direct observation. For it is really only through the "doing" of sociology that we learn its real weaknesses and strengths.
IMPORTANT CONCEPTS COVERED IN THIS CHAPTER
Content Validity, Cumulative Scales, Differential Scales, Face Validity, Guttman Coefficient of Reproducibility, Guttman Scales, Interval Scales, Interview Schedule, Likert Scales, Measurement, Nominal Scales, Operationalization, Ordinal Scales, Questionnaire, Reliability, Reproducibility, Scales, Scalogram Analysis, Structured Interview, Summated Scales, Thurstone Scales, Unidimensionality, Unstructured Interview
IMPORTANT POINTS COVERED IN THIS CHAPTER
1. Measures at the nominal level classify cases into categories which are mutually exclusive, ordinal measures imply a continuum or hierarchy of response categories, and interval level measures have equal distances between units of enumeration.
2. The researcher's key objective in providing response categories is to measure variables at their most "productive" level so that the more powerful statistical techniques will be appropriate.
3. Reliability refers to the consistency of a measure. Reliability may be thought of as obtaining the same results after giving the empirical indicators to the same group of people more than once.
4. Validity refers to whether or not empirical indicators are actually measuring what they claim to measure. Empirical indicators may not be valid measures of concepts, even though they may yield consistent results.
5. Unidimensionality refers to the property that the items comprising a scale measure one and only one dimension or concept at a time.
6. If a scale has the property of reproducibility, the researcher will be able to predict, from a knowledge of a respondent's scale score, those items with which the respondent most likely agreed and those with which the respondent disagreed.
7. Beyond addressing the issues of reliability, validity, unidimensionality, and reproducibility, the main purpose for creating attitude scales is to enable the researcher to make better comparisons between individuals with respect to the attitudinal variable being studied.
8. The questionnaire is typically restricted to written responses, while the interview most often elicits verbal responses which are then recorded or transcribed by the interviewer.
9. In a structured interview, the respondent has a very limited number of response set categories and little opportunity for elaboration beyond those options. An unstructured interview allows respondents to respond in any manner they choose.
10. Because scale construction is a difficult task, we recommend that most beginning researchers employ preexisting scales if at all possible.
11. Deciding whether to use interviews or questionnaires should be based upon the nature of the research hypotheses, the state of the body of knowledge concerning the variables of interest, the amount of money one has to spend on the research, the amount of time or training one has, and the availability of computer resources for analyzing quantitative or qualitative data.
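Point 3, test-retest reliability, can be illustrated with a short computation. One common (though not the only) way to assess it is to correlate respondents' scores from two administrations of the same measure. The sketch below is ours, not the chapter's; the scale scores are hypothetical, and the Pearson correlation is computed from scratch so the arithmetic is visible.

```python
import math

# A minimal sketch of test-retest reliability, assessed here as the
# Pearson correlation between two administrations of the same measure.
# The respondents and scores are hypothetical.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

time1 = [12, 18, 9, 15, 21]   # scale scores at the first administration
time2 = [13, 17, 10, 15, 20]  # same respondents, two weeks later
print(round(pearson_r(time1, time2), 3))  # prints 0.997
```

A correlation near 1.0, as here, indicates that the measure orders respondents almost identically on both occasions -- the consistency that point 3 describes.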
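Point 6 can likewise be made concrete. The sketch below (the function name and response matrix are ours, for illustration) predicts each respondent's item responses from his or her scale score, with items ordered from easiest to hardest to endorse, and computes Guttman's coefficient of reproducibility as one minus the proportion of prediction errors.

```python
# A minimal sketch of Guttman's coefficient of reproducibility,
# using a hypothetical matrix of agree(1)/disagree(0) responses
# to four items ordered from easiest to hardest to endorse.

def coefficient_of_reproducibility(responses):
    """CR = 1 - (prediction errors / total responses).

    A respondent's scale score is the number of items endorsed.
    In a perfect Guttman scale, a score of s means the respondent
    endorsed exactly the first s (easiest) items, so each actual
    response is checked against that predicted pattern.
    """
    errors = 0
    total = 0
    for row in responses:
        score = sum(row)
        predicted = [1] * score + [0] * (len(row) - score)
        errors += sum(1 for a, p in zip(row, predicted) if a != p)
        total += len(row)
    return 1 - errors / total

# Hypothetical data: rows are respondents, columns are items
# (easiest item first). Only the last respondent breaks the pattern.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],  # scale score 2, but the predicted pattern is [1, 1, 0, 0]
]
print(coefficient_of_reproducibility(data))  # prints 0.875
```

By convention, a coefficient of .90 or higher is taken as evidence that the items form an acceptable Guttman scale; the 0.875 here falls just short because one respondent endorsed a harder item while rejecting an easier one.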
REFERENCES
Babbie, Earl. 1986. The Practice of Social Research (4th edition). Belmont, California: Wadsworth Publishers.
Bonjean, Charles, Richard Hill, and Dale McLemore. 1967. Sociological Measurement. San Francisco: Chandler Publishers.
Green, Bert. 1954. "Attitude Measurement," in G. Lindzey (editor), Handbook of Social Psychology. Cambridge, MA: Addison-Wesley.
Guttman, Louis. 1944. "A Basis for Scaling Qualitative Data," American Sociological Review, Volume 9, pages 139-150.
Guttman, Louis. 1950. "The Basis for Scalogram Analysis," in Samuel Stouffer (editor), Measurement and Prediction. Princeton, NJ: Princeton University Press.
Likert, Rensis. 1932. "A Technique for the Measurement of Attitudes," Archives of Psychology, Number 140.
Miller, Delbert. 1964. Handbook of Research Design and Social Measurement. New York: McKay Publishers.
Oppenheim, A. N. 1966. Questionnaire Design and Attitude Measurement. New York: Basic Books.
Robinson, John, and Phillip Shaver. 1973. Measures of Social Psychological Attitudes, Ann Arbor, Michigan: Survey Research Center.
Rubin, Zick. 1970. "Measurement of Romantic Love," Journal of Personality and Social Psychology, Volume 16, pages 265-273.
Selltiz, Claire, Lawrence Wrightsman, and Stuart Cook. 1981. Research Methods in Social Relations (4th edition). New York: Holt, Rinehart and Winston Publishers.
Thurstone, L.L. 1929. "Theory of Attitude Measurement," Psychological Review, Volume 36, pages 222-241.
Tittle, Charles, and Richard Hill. 1967. "Attitude Measurement and Prediction of Behavior: An Evaluation of Conditions and Measurement Techniques," Sociometry, Volume 30, pages 199-213.