Economics and Management 416
Income Equations
IPUMS-CPS is an integrated set of data from 48 years (1962-2009) of the March
Current Population Survey (CPS). The CPS is a monthly U.S. household survey
conducted jointly by the U.S. Census Bureau and the Bureau of Labor Statistics.
Initiated in the 1940s in the wake of the Great Depression, the survey was
designed to measure unemployment. A battery of labor force and demographic
questions, known as the "basic monthly survey," is asked every month. Over time,
supplemental inquiries on special topics have been added for particular months.
Among these supplemental surveys, the March Annual Demographic File and Income
Supplement is the most widely used by social scientists and policymakers, and it
provides the data for IPUMS-CPS.
The Current Population Survey (CPS) is a survey of about 50,000 households. The
sample is scientifically selected to provide a random representation of the
civilian noninstitutional population. Respondents are interviewed to obtain
information about the employment status of each member of the household 15 years
of age and older. However, published data focus on those ages 16 and over. The
sample provides estimates for the nation as a whole and serves as part of
model-based estimates for individual states and other geographic areas.
Estimates obtained from the CPS include employment, unemployment, earnings,
hours of work, and other indicators. They are available by a variety of
demographic characteristics including age, sex, race, marital status, and
educational attainment. They are also available by occupation, industry, and
class of worker. Supplemental questions to produce estimates on a variety of
topics including school enrollment, income, previous work experience, health,
employee benefits, and work schedules are also often added to the regular CPS
questionnaire.
Code book for your data.
We have a sample of individuals from the state of Minnesota collected in 2009.
METRO indicates whether a household was located in a metropolitan area. For households within metropolitan areas METRO = 1.
OWNERSHP indicates whether the household rented or owned its housing unit. Households that acquired their unit with a mortgage or other lending arrangement were understood to "own" their unit even if they had not yet completed repayment.
PUBHOUS indicates whether the house, apartment, or mobile home is part of a
government housing project for people with low incomes, commonly known as a
"public housing project."
Participation in public housing is determined by two factors: program
eligibility and the availability of housing. Income standards for initial and
continuing occupancy vary across local housing authorities, although Federal
guidelines set broad limits. Rental charges define net benefits and cannot
exceed 30 percent of the family's or the individual's net monthly income. A
public housing unit can be occupied by a family of two or more related persons
or an individual who is handicapped, elderly, or displaced by urban renewal or
natural disaster.
FOODSTMP indicates whether one or more members of the household received Food
Stamps during the prior year. The Food Stamp Act of 1977 was enacted to increase
the food purchasing power of eligible households through the use of coupons to
purchase food. The Food and Nutrition Service of the U.S. Department of
Agriculture (USDA) administers the Food Stamp Program through State and local
welfare offices. The Food Stamp Program is the major national income support
program which provides benefits to all low-income and low-resource households,
regardless of the person's characteristics (e.g., sex, age, disability, etc.).
NCHILD counts the number of own children (of any age or marital status)
residing with each individual. NCHILD includes step-children and adopted
children as well as biological children. Persons with no children present are
coded 0.
AGE gives each person's age at last birthday.
SEX gives each person's sex. Males are set to 1.
Racial categories in the CPS have been more consistent than racial categories in the census. Up through 2002, the number of race categories ranged from 3 (white, negro, and other) to 5 (white, black, American Indian/Eskimo/Aleut, Asian or Pacific Islander, and other). Dummy variables of White, Black, Asian, and American Indian have been created. Hispanic is not a racial group and is not separated in this variable.
MARST gives each person's current marital status, including whether the spouse was currently living in the same household.
BPL indicates whether persons were born in the United States.
EDUC99 reports the respondent's highest level of educational attainment. Respondents without high school diplomas were to indicate the highest school grade they had completed, while those with high school diplomas were to indicate the highest diploma or degree they had obtained. A series of dummies have been created to use this variable: no high school diploma, high school diploma, some college or associates, bachelors, advanced degree.
FULLPART indicates whether respondents who were employed during the previous calendar year worked full-time or part-time. Full-time work is defined as thirty-five hours a week or more.
FIRMSIZE indicates the total number of persons who worked for the respondent's employer during the preceding calendar year. A value of 0 indicates a small firm with under 100 employees, and a value of 1 indicates a firm with more than 100 employees.
INCWAGE indicates each respondent's total pre-tax wage and salary income--that is, money received as an employee--for the previous calendar year.
The POVERTY variable is located on the person record, although it treats
respondents who live in families collectively. It compares the family's total
reported income for the previous calendar year with the poverty
threshold. POVERTY is also calculated for adults living as unrelated
individuals. POVERTY is an intervalled poverty variable reported in the original
CPS PUMS.
In accordance with the Office of Management and Budget's Directive 14, the CPS
and the Census Bureau use the definition of poverty originally developed by the
Social Security Administration in 1964, later modified by federal interagency
committees in 1969 and 1980. The core of this definition is the 1961 economy
food plan, the least costly of four nutritionally adequate food plans designed
by the U.S. Department of Agriculture. A 1955 USDA study determined that
families of three or more persons spend about one-third of their income for
food, so the poverty level for these families is set at three times the cost of
the 1961 economy food plan. For smaller families and unrelated individuals, the
multiplier is higher, since these people generally spend a smaller proportion of
their income for food. For a more detailed discussion, see U.S. Bureau of the
Census, Current Population Reports, Series P-60, No. 171, Poverty in the United
States: 1988 and 1989.
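The threshold rule for families of three or more persons can be written as a short calculation. The sketch below is only an illustration: the dollar amounts and the function names are invented for the example and are not official CPS or USDA figures.

```python
# Hypothetical illustration of the "three times the food plan" rule.
# The dollar amounts below are invented; they are not official figures.

FOOD_PLAN_MULTIPLIER = 3  # families of 3+ spend about one-third of income on food


def poverty_threshold(annual_food_plan_cost):
    """Poverty threshold for a family of three or more persons."""
    return FOOD_PLAN_MULTIPLIER * annual_food_plan_cost


def income_to_threshold_ratio(family_income, annual_food_plan_cost):
    """Family income expressed as a multiple of its poverty threshold."""
    return family_income / poverty_threshold(annual_food_plan_cost)


food_cost = 6000        # hypothetical annual cost of the economy food plan
family_income = 27000   # hypothetical total family income

threshold = poverty_threshold(food_cost)
ratio = income_to_threshold_ratio(family_income, food_cost)
print("threshold:", threshold)
print("income / threshold:", ratio)
```

A ratio below 1 means the family's income is under its threshold.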
DISABWRK identifies persons who had "a health problem or a disability which prevents him/her from working or which limits the kind or amount of work." This question was used to determine whether a follow-up question should be asked about the receipt of income as the result of a health problem. Respondents were not supposed to refer to short, acute illnesses (e.g., influenza) or temporary conditions (e.g., pregnancy or broken bones), but this is not specified in the question wording.
This assignment will allow you to practice identifying key aspects of the research design used to gather data. It will be graded as either A or A-. An A is worth 25 points and means that you have completed all of the work and have mistakes on fewer than 1/3 of the questions. An A- is worth 10 points and means that there are mistakes on 1/3 or more of the questions. For this assignment, you will examine the research design used for the
Current Population Survey:
Univariate Statistics Exercise
Select five variables from our Current Population Survey data. Produce and analyze the summary statistics for each. Type a few sentences describing the distribution of each variable.
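For readers who want to check their Excel numbers another way, here is a minimal sketch of the same summary statistics in Python. The ages are made up for illustration, and the helper name summarize is an assumption, not part of the course materials.

```python
import statistics

# Hypothetical ages for eight people; replace with a real column from the data.
ages = [22, 25, 31, 38, 40, 44, 52, 68]


def summarize(values):
    """Return basic summary statistics for one numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

When the mean sits above the median, as it does for these invented ages, a few large values are pulling the mean up, which is one sign of a right-skewed distribution.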
Multiple Regression Exercise
For this assignment, use the labor data provided on the webpage. Type your answers to the questions below and attach a copy of your Excel output. The assignments will be graded as either A or A-. An A is worth 25 points and means that you have completed all of the work and have mistakes on fewer than 1/3 of the questions. An A- is worth 10 points and means that there are mistakes on 1/3 or more of the questions.
Conduct a linear regression for the following variables: Wage/Income on School, Experience, Race, and Sex (using dummy variables for race and sex)
1. Write one hypothesis for each Independent Variable, stating how you think it will affect the Dependent Variable. When you write the hypotheses for nominal level variables, state how you expect the groups represented by each dummy variable will differ from the reference category.
2. What percent of variation in the Dependent Variable can be explained by the model?
3. Is the model statistically significant? Explain your answer.
4. For each Independent Variable, explain whether it is negatively or positively associated with the Dependent Variable and whether the relationship is statistically significant.
5. Which Independent Variable had the strongest effect on the Dependent Variable? Which had the weakest effect? Explain your answer.
6. Which of your hypotheses were supported? Explain your answer.
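The idea of a reference category for nominal variables can be made concrete with a small sketch. The category labels, the choice of "white" as the reference group, and the function name make_dummies are all assumptions made for illustration; use whichever reference group your own hypotheses name.

```python
# Hypothetical category labels; the reference category gets no dummy of its own.
RACE_CATEGORIES = ["white", "black", "asian", "american_indian"]
REFERENCE = "white"


def make_dummies(value, categories=RACE_CATEGORIES, reference=REFERENCE):
    """Return one 0/1 dummy per non-reference category for a single person."""
    return {
        category: int(value == category)
        for category in categories
        if category != reference
    }


people = ["white", "asian", "black", "white"]
rows = [make_dummies(race) for race in people]
for race, row in zip(people, rows):
    print(race, row)
```

A person in the reference group has all dummies equal to 0, so each dummy's coefficient is read as the difference between that group and the reference group. Including a dummy for every category along with the intercept would make the regressors perfectly collinear.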
Data Analysis Project
ESTIMATING THE EARNINGS EQUATION
General instructions:
You may do this project in a group of up to three (3) students. Commands from Excel menus are indicated as follows: File > Open means you should click on File and then select Open from the menu. Variable names are in all upper-case letters. Something you actually have to hand in is indicated in italics.
In this project you will examine data from the March 2009 Current Population Survey (CPS), contained in the following file:
earn2009.xls: Random sample of Minnesota adults (18+) in the labor force who worked full-time, year-round in 2009 and were not self-employed.
The opening paragraphs of this handout give a brief description of the CPS and list the variables in our data set with their definitions.
Getting started
How to download the data from the web
1. Go to the course web page: http://web.mnstate.edu/stutes
Click on Data sets.
2. To download, right-click on the data set you need (earn2009), and you should then be given the option of saving the file somewhere. Save it to your hard disk or a jump drive.
Make sure Excel is set up to do data analysis
In Excel, click on Tools. Near the bottom of the menu you should see an option for Data Analysis. If it is there, you are all set. If it is not there, you need to click on Tools > Add-Ins. You will then be able to add the Analysis ToolPak, which has what you need.
Part 1: Descriptive statistics
(The sections below take you through a sample analysis. While it appears that you have more to submit, you do not. The only hand-in portion of the assignment is above this section.)
1. Open the data set (earn2009.xls) in Excel. Variable names are across the top row, and the variables are in columns. If you forget what a variable is, see the definitions attached to this handout.
2. Scroll to the bottom of the data. How many observations (N) are there in the sample?
3. Take a look at observation (person) #15. Describe this person. What is her or his:
gender
age
race
region of residence
educational attainment
earnings from wages and salary
occupation
marital status
4. Now obtain some basic statistics for the following three variables: age, educ99, and incwage. Throughout this assignment, we will use incwage (annual wage and salary income) as our measure of earnings. For each of these three variables, do the following:
Generate means, etc. using Tools > Data Analysis > Descriptive Statistics. Note that you must specify the Input Range (cells that contain all the data). Include the variable name in the input range and check Labels in First Row.
Note: When given the option for where to put the output, it is convenient to place it on a new worksheet in the same workbook.
Print out your results using the Print button. (Remember that this is not part of the assignment, so you do not really need to print.)
For each of the three variables, compare the mean and median. If they are different, what is the significance of one being bigger than the other?
For each of the three variables, find the 10th, 25th, 50th, 75th, and 90th percentiles. Examine the extremes of the distribution (top and bottom few values). Are there any obvious "outliers"?
5. Use the same descriptive statistics procedure to answer the following questions. (Hint: the proportion of Asians in the sample is given by the mean of the asian variable.) You do not need to hand in the actual printouts for these questions, just the number for each.
What percentage of individuals are white? Black? Asian? American Indian?
What percentage of individuals are female?
What percentage of individuals are married?
What percentage of individuals graduated from high school but had no further formal education?
What percentage of individuals have more than a high-school education?
6. Now examine the subsample of women. The best way to do this in Excel is to sort all your data by female, using either Data > Sort or Data > Filter, then copy the observations with female = 1 into a new sheet.
Using this restricted (female) sample, answer all parts of questions 4 and 5 over again for females only.
7. Repeat all of #6 for the subsample of males only.
8. Using your separate results for men and women (#6 and #7), compare the men and women in the sample. In terms of the variables you have examined, in what ways are men and women similar? In what ways are they different?
Part 2: Regression basics
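Before starting the regressions, the Part 1 descriptive steps (proportions from dummy variables, subsample comparisons, and percentiles) can be cross-checked outside Excel. The sketch below uses an invented six-person sample. The column names follow the handout, and the female column is assumed to exist as the steps imply; if your file carries only SEX (males = 1), female could be built as 1 minus that value.

```python
import statistics

# Hypothetical mini-sample; column names mirror the handout, values are invented.
people = [
    {"female": 1, "asian": 0, "age": 29, "incwage": 41000},
    {"female": 0, "asian": 1, "age": 45, "incwage": 58000},
    {"female": 1, "asian": 0, "age": 52, "incwage": 47000},
    {"female": 0, "asian": 0, "age": 38, "incwage": 39000},
    {"female": 1, "asian": 1, "age": 33, "incwage": 62000},
    {"female": 0, "asian": 0, "age": 60, "incwage": 95000},
]


def column(rows, name):
    """Pull one variable out of the list of person records."""
    return [row[name] for row in rows]


def share(rows, dummy):
    """Proportion of rows with dummy == 1, which equals the dummy's mean."""
    return statistics.mean(column(rows, dummy))


# Subsamples, like copying the female = 1 rows to a new sheet.
women = [row for row in people if row["female"] == 1]
men = [row for row in people if row["female"] == 0]

pct_asian = 100 * share(people, "asian")
pct_female = 100 * share(people, "female")
mean_wage_women = statistics.mean(column(women, "incwage"))
mean_wage_men = statistics.mean(column(men, "incwage"))

# n=4 cuts the data into quartiles: the 25th, 50th, and 75th percentiles.
quartiles = statistics.quantiles(column(people, "incwage"), n=4, method="inclusive")

print("percent asian:", round(pct_asian, 1))
print("percent female:", pct_female)
print("mean incwage, women vs men:", mean_wage_women, mean_wage_men)
print("incwage quartiles:", quartiles)
```

Passing n=10 instead of n=4 returns the nine decile cut points, whose first and last entries are the 10th and 90th percentiles.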
1. Using the earnings data (full data set), create an XY scatter plot of earnings (vertical) against years of school (horizontal). To do this, it helps to have the X variable (educ99) in one column and the Y variable (incwage) in the next column immediately to the right. Then use the chart wizard to create an XY scatter.
Print out the scatter plot and answer the following questions about the scatter plot:
Does the relationship between earnings and schooling appear to be positive or negative?
Are there any obvious outliers?
Sketch in a line that you think fits the data and estimate what its equation is.
2. Estimate a simple regression with earnings (incwage) as the dependent (Y) variable and schooling (educ99) as the independent or X variable. To do this, use Tools > Data Analysis > Regression and proceed as we did in class. Make sure to use the variable labels. Check off the boxes for residuals and residual plots. Put the output on a new worksheet.
Print out the resulting table of results, and answer the following questions:
Write out the formula for the equation you have estimated.
Interpret the R-squared.
Interpret the F-statistic.
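To see where the numbers in a simple regression table come from, the sketch below fits a line by least squares and computes the R-squared and F-statistic by hand. The six data points are invented, schooling is treated as a plain number as the handout does, and the function name simple_ols is an assumption for this example.

```python
# Hypothetical (schooling, earnings) pairs; not taken from earn2009.xls.
schooling = [10, 12, 12, 14, 16, 20]
earnings = [24000, 30000, 34000, 38000, 48000, 66000]


def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return a, b, R-squared, and F."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_resid / ss_total
    # One regressor: F = (explained SS / 1) / (residual SS / (n - 2)).
    f_stat = (ss_total - ss_resid) / (ss_resid / (n - 2))
    return intercept, slope, r_squared, f_stat


intercept, slope, r_squared, f_stat = simple_ols(schooling, earnings)
print(f"earnings = {intercept:.1f} + {slope:.1f} * schooling")
print(f"R-squared = {r_squared:.4f}, F = {f_stat:.1f}")
```

The R-squared is the share of the variation in earnings that the fitted line accounts for, and the F-statistic compares the explained variation with the unexplained variation per degree of freedom.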
Part 3: Multiple regression
We now predict earnings using educ99 and age, where age is a proxy for experience. This variable is often called "years of potential work experience."
1. Run a regression in which the dependent variable is incwage and the independent variables are educ99 and age. (Note that all the X-variables (regressors) must be in adjacent columns.) Print out the results.
Check the general validity of the regression by examining and discussing the R-squared and F-statistic.
Interpret the coefficients on
2. Now run a similar regression, but this time use as the dependent variable the natural log of wage and salary earnings (do NOT take the logs of schooling or age). To do this you need to generate a new column of numbers with the variable name lnincwage (the Excel function LN(..) produces the natural log).
Again, check the general validity of the regression by examining and discussing the R-squared and F-statistic.
Interpret the coefficients on
For the remainder of this Data Analysis Project, continue to use the log of earnings (lnincwage) as your dependent variable.
3. To your regression from #2, now add as a regressor a dummy variable for female (include educ99 and age as regressors too). Controlling for schooling and experience, does the evidence suggest that women earn more than, less than, or about the same as men? Explain (consider both economic and statistical significance in your answer).
4. To your regression from #3, add two interaction terms: the interaction between female and educ99, and the interaction between female and age (include all the regressors you used in #3). This requires creating two new columns of numbers. For instance, the interaction of female and school could be femsch = female*educ99.
Draw a diagram representing the relationship between schooling and log earnings
Draw a diagram representing the relationship between work experience and log earnings
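The Part 3 steps (taking logs, adding a female dummy, and building interaction columns) can be sketched in a few lines of Python. The data here are synthetic: they are generated from coefficients chosen for this example so the fit can be checked, and nothing comes from earn2009.xls. The helper names solve, ols, and regressors, and the column name femage, are assumptions made for illustration.

```python
import math

# Synthetic data built from known coefficients so the fit can be checked.
# Order: constant, schooling, age, female, female*schooling, female*age.
TRUE_COEFS = [9.0, 0.08, 0.01, -0.20, 0.01, -0.002]
people = [  # (female, schooling, age), all invented
    (0, 12, 30), (0, 16, 40), (0, 14, 55), (0, 18, 28),
    (1, 12, 45), (1, 16, 33), (1, 13, 50), (1, 18, 60),
]


def regressors(female, educ, age):
    """Constant, schooling, age, female, and the two interaction columns."""
    femsch = female * educ   # interaction of female and schooling
    femage = female * age    # interaction of female and age
    return [1.0, educ, age, female, femsch, femage]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


x_rows = [regressors(*p) for p in people]
incwage = [math.exp(sum(c * v for c, v in zip(TRUE_COEFS, row))) for row in x_rows]
lnincwage = [math.log(w) for w in incwage]  # same role as Excel's LN(..)

b_const, b_educ, b_age, b_female, b_femsch, b_femage = ols(x_rows, lnincwage)

# Slope of log earnings in schooling: men use b_educ, women add the interaction.
slope_men = b_educ
slope_women = b_educ + b_femsch
pct_men = 100 * (math.exp(slope_men) - 1)  # percent change in earnings per unit
print("schooling slope, men:", round(slope_men, 4))
print("schooling slope, women:", round(slope_women, 4))
print("percent effect of one more unit of schooling, men:", round(pct_men, 2))
```

In a diagram of log earnings against schooling, these two slopes are the slopes of the men's and women's lines, and the female coefficient together with the age terms shifts the women's line up or down. Because the data are synthetic and fit exactly, no standard errors are shown; with real data the regression output's t-statistics would still be needed to judge significance.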