Synopsis
Regression results can have some limited value when carefully interpreted. Inevitable forms of variation will cause the estimated coefficients to shrink substantially toward zero. A better model, one that handles the variation in a more appropriate way, is needed.
(A maximum-likelihood model can be constructed, but it may be impracticable because of the computation required, which involves the numerical evaluation of multidimensional integrals. The numbers of dimensions equal the numbers of students enrolled in the classes.)
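To make that computational obstacle concrete, here is one way such a likelihood might be written down (a sketch only, using the notation introduced in the Analysis section below and assuming the $\varepsilon_i$ and $\delta_{i,j}$ have known continuous distributions):
$$\mathcal{L}(\beta) = \Pr\left[\bigcap_j \left\{s_{k^j_1} + \delta_{k^j_1,j} < s_{k^j_2} + \delta_{k^j_2,j} < \cdots < s_{k^j_{n_j}} + \delta_{k^j_{n_j},j}\right\}\right],$$
where $n_j = |A_j|$ and $(k^j_1, \ldots, k^j_{n_j})$ lists the students of class $j$ in their observed rank order. Evaluating this probability requires integrating the joint density of the $\varepsilon_i$ and $\delta_{i,j}$ over the region determined by the observed rankings, which is what drives the dimensionality noted above.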
Introduction
As a narrative to inform our intuition, imagine that these 38 exams were given in 38 separate courses during one semester at a small school with an enrollment of 200 undergraduates. In a realistic situation, students will have differing abilities and experiences. As surrogate measures of those abilities and experiences we might take, say, scores on the SAT math and verbal tests and year in college (1 through 4).
Typically, students will enroll in courses according to their abilities and interests: freshmen take introductory courses, and introductory courses are populated chiefly by freshmen, while upperclassmen, along with talented freshmen and sophomores, take the advanced and graduate-level courses. This selection partially stratifies the students, so that the innate abilities of the students within any class are typically more homogeneous than the spread of abilities across the whole school.
Thus, the ablest students may find themselves scoring near the bottom of the difficult, advanced classes in which they enroll, while the weakest students may score near the top of the easy introductory classes they take. This can confound any direct attempt to relate exam ranks to the attributes of students and classes.
Analysis
Index the students with $i$ and let the attributes of student $i$ be given by the vector $\mathbf{x}_i$. Index the classes with $j$ and let the attributes of class $j$ be given by the vector $\mathbf{z}_j$. Let $A_j$ denote the set of students enrolled in class $j$.
Assume the "strength" of each student, $s_i$, is a function of their attributes plus some random value, which may as well have zero mean:
$$s_i = f(\mathbf{x}_i, \beta) + \varepsilon_i.$$
We model the exam in class $j$ by adding independent random values to the strength of each student enrolled in the class and converting those to ranks. Whence, if student $i$ is enrolled in class $j$, their relative rank $r_{i,j}$ is determined by their position in the sorted array of values
$$\left(s_k + \delta_{k,j}, k \in A_j\right).$$
This position $r_{i,j}$ is divided by one more than the total class enrolment to give the dependent variable, the percentage rank:
$$p_{i,j} = \frac{r_{i,j}}{1 + |A_j|}.$$
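As a minimal illustration (with made-up scores, not part of the simulation below), here is this rank-to-percentage-rank conversion in R:
scores <- c(2.1, 5.7, 3.3, 4.0)   # hypothetical exam scores in a class of 4
r <- rank(scores)                 # ranks from the bottom: 1 4 2 3
p <- r / (1 + length(scores))     # percentage ranks: 0.20 0.80 0.40 0.60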
I claim that the regression results depend (quite a bit) on the sizes and structure of the random (unobserved) values $\varepsilon_i$ and $\delta_{i,j}$. The results also depend on precisely how students are enrolled in classes. This should be intuitively obvious, but what is not so obvious--and appears difficult to analyze theoretically--is how and how much the unobserved values and the class structures affect the regression.
Simulation
Without too much effort we can simulate this situation to create and analyze some sample data. One advantage of the simulation is that it can incorporate the true strengths of the students, which in reality are not observable. Another is that we can vary the typical sizes of the unobserved values as well as the class assignments. This provides a "sandbox" for assessing proposed analytical methods such as regression.
To get started, let's set the random number generator for reproducible results and specify the sizes of the problem. I use R because it's available to anyone.
set.seed(17)
n.pop <- 200 # Number of students
n.classes <- 38 # Number of classes
courseload <- 4.5 # Expected number of classes per student
To provide realism, create n.classes classes of varying difficulties on two scales (mathematical and verbal, with a negative correlation), conducted at varying academic levels (ranging from 1=introductory to 7=research), and with variable ease. (In an "easy" class, differences among the amounts of student learning may be large and/or the exam may provide little discrimination among the students. This is modeled by random terms $\delta_{i,j}$ that, for class $j$, tend to be large. The exam results will then be almost unpredictable from the student strength data. When the class is not "easy," these random terms are negligibly small and the student strengths can perfectly determine the exam rankings.)
# Class attributes: negatively correlated math and verbal difficulty,
# an academic level, and an "ease" multiplier for the exam noise.
classes <- data.frame(cbind(
  math <- runif(n.classes),
  rbeta(n.classes, shape1=(verbal <- (1-math)*5), shape2=5-verbal),
  runif(n.classes, min=0, max=7),
  rgamma(n.classes, 10, 10)))
rm(math, verbal)
colnames(classes) <- c("math.dif", "verbal.dif", "level", "ease")
# Sort the classes from easiest to hardest overall.
classes <- classes[order(classes$math.dif + classes$verbal.dif + classes$level), ]
row.names(classes) <- 1:n.classes
plot(classes, main="Classes")
The students are spread among the four years and endowed with random values of their attributes. There are no correlations among any of these attributes:
# Student attributes: year (1-4), SAT math and verbal scores clamped
# to the 200-800 range, and an unobservable "ability".
students <- data.frame(cbind(
  as.factor(ceiling(runif(n.pop, max=4))),
  sapply(rnorm(n.pop, mean=60, sd=10), function(x) 10*median(c(20, 80, floor(x)))),
  sapply(rnorm(n.pop, mean=55, sd=10), function(x) 10*median(c(20, 80, floor(x)))),
  rnorm(n.pop)
))
colnames(students) <- c("year", "math", "verbal", "ability")
plot(students, main="Students")
The model is that each student has an inherent "strength" determined partly by their attributes and partly by their "ability," which is the $\varepsilon_i$ value. The strength coefficients beta, which determine the strength in terms of the other attributes, are what the subsequent data analysis will seek to estimate. If you want to play with this simulation, do so by changing beta. The following is an interesting and realistic set of coefficients, reflecting continued student learning throughout college (with a large jump between years 2 and 3), in which 100 points on each part of the SAT are worth about one year of school, and in which about half the variation is due to "ability" values not captured by SAT scores or year in school.
beta <- list(year.1=0, year.2=1, year.3=3, year.4=4, math=1/100, verbal=1/100, ability=2, sigma=0.01)
students$strength <- (students$year==1)*beta$year.1 +
(students$year==2)*beta$year.2 +
(students$year==3)*beta$year.3 +
(students$year==4)*beta$year.4 +
students$math*beta$math +
students$verbal*beta$verbal +
students$ability*beta$ability
students <- students[order(students$strength), ]
row.names(students) <- 1:n.pop
(Bear in mind that students$ability is unobservable: it is an apparently random deviation between the strength predicted from the other observable attributes and the actual strength exhibited on exams. To remove this random effect, set beta$ability to zero. beta$sigma will multiply the ease values: it is essentially the standard deviation of the $\delta_{i,j}$ relative to the range of strengths of the students in a given course. Values around $.01$ to $.2$ or so seem reasonable to me.)
Let the students pick courses to match their abilities. Once they do that, we can compute the class sizes and stash those in the classes dataframe for later use. The value of spread in the assignments <-... line determines how closely the students are sectioned into classes by ability. A value close to $0$ essentially pairs the weakest students with the easiest courses. A value close to the number of classes spreads the students out a little more. Much larger values than that start to become unrealistic, because they tend to put the weakest students into the most difficult courses.
pick.classes <- function(i, k, spread) {
  # i is student strength rank
  # k is number to pick
  p <- pmin(0.05, diff(pbeta(0:n.classes/n.classes, i/spread, (1+n.pop-i)/spread)))
  sample(1:n.classes, k, prob=p)
}
students$n.classes <- floor(1/2 + 2 * rbeta(n.pop, 10, 10) * courseload)
assignments <- lapply(1:n.pop, function(i) pick.classes(i, students$n.classes[i], spread=1))
enrolment <- function(k) length(seq(1, n.pop)[sapply(assignments, function(x) !is.na(match(k, x)))])
classes$size <- sapply(1:n.classes, enrolment)
# Range of student strengths within each class. (The merged dataframe
# `data` does not exist yet, so compute this from the assignments.)
classes$variation <- sapply(1:n.classes, function(k) {
  s <- seq(1, n.pop)[sapply(assignments, function(x) !is.na(match(k, x)))]
  diff(range(students$strength[s]))
})
(As an example of what this step has accomplished, see the figure further below.)
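The figure itself is not reproduced here. A minimal sketch of how a comparable assignment plot could be drawn (my construction, not necessarily the code that produced the original figure):
plot(rep(1:n.pop, students$n.classes), unlist(assignments),
     pch=20, cex=0.5,
     xlab="Student (sorted by strength)", ylab="Class (sorted by level)",
     main="Class Assignments")
Each dot marks one enrolment of a student in a class.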
Now apply the model: the abilities of the students in each class are independently varied--more for easy exams, less for hard (discriminating) exams--to determine their exam scores. These are summarized as ranks and "pranks" (percentage ranks). The pranks for a class of $n$ students range from $1/(n+1)$ through $n/(n+1)$ in increments of $1/(n+1)$. This will later make it possible to apply transformations such as the logistic function (which is undefined at values of $0$ or $1$).
exam.do <- function(k) {
  # Students enrolled in class k:
  s <- seq(1, n.pop)[sapply(assignments, function(x) !is.na(match(k, x)))]
  e <- classes$ease[k]
  # Perturb the strengths and rank the results from the bottom
  # (`rank` pairs each student with their own rank; `order` would not).
  rv <- cbind(rep(k, length(s)), s,
              rank(rnorm(length(s), students$strength[s], sd=e*beta$sigma*classes$variation[k])))
  rv <- cbind(rv, rv[,3] / (length(s)+1))
  dimnames(rv) <- list(NULL, c("Class", "Student", "Rank", "Prank"))
  rv
}
data.raw <- do.call(rbind, lapply(1:n.classes, exam.do))
To these raw data we attach the student and class attributes to create a dataset suitable for analysis:
data <- merge(data.raw, classes, by.x="Class", by.y="row.names")
data <- merge(data, students, by.x="Student", by.y="row.names")
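As a quick sanity check (an added step, not in the original analysis), each record of data corresponds to one (student, class) enrolment pair:
nrow(data)   # 890 here, matching the 883 residual degrees of freedom
             # plus 7 estimated coefficients in the regressions below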
Let's orient ourselves by inspecting a random sample of the data:
> data[sort(sample(1:dim(data)[1], 5)),]
Row Student Class Rank Prank math.dif verbal.dif level ease Size year math verbal ability strength n.classes
118 28 1 22 0.957 0.77997 6.95e-02 0.0523 1.032 22 2 590 380 0.576 16.9 4
248 55 5 24 0.889 0.96838 1.32e-07 0.5217 0.956 26 3 460 520 -2.163 19.0 5
278 62 6 22 0.917 0.15505 9.54e-01 0.4112 0.497 23 2 640 510 -0.673 19.7 4
400 89 10 16 0.800 0.00227 1.00e+00 1.3880 0.579 19 1 800 350 0.598 21.6 5
806 182 35 18 0.692 0.88116 5.44e-02 6.1747 0.800 25 4 610 580 0.776 30.7 4
Record 118, for example, says that student #28 enrolled in class #1 and scored 22nd (from the bottom) on the exam for a percentage rank of 0.957. This class's overall level of difficulty was 0.0523 (very easy). A total of 22 students were enrolled. This student is a sophomore (year 2) with 590 math, 380 verbal SAT scores. Their overall inherent academic strength is 16.9. They were enrolled in four classes at the time.
This dataset comports with the description in the question. For instance, the percentage ranks indeed are almost uniform (as they must be for any complete dataset, because the percentage ranks for a single class have a discrete uniform distribution).
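One quick way to see this uniformity (an added check, using base R graphics):
hist(data$Prank, breaks=20, main="Percentage Ranks", xlab="Prank")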
Remember that, by virtue of the coefficients in beta, this model has assumed a strong connection between examination scores and the variables shown in this dataset. But what does regression show? Let's regress the logistic of the percentage rank against all the observable student characteristics that might be related to their abilities, as well as the indicators of class difficulty:
logistic <- function(p) log(p / (1-p))
fit <- lm(logistic(Prank) ~ as.factor(year) + math + verbal + level, data=data)
summary(fit)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.577788 0.421579 -6.11 1.5e-09 ***
as.factor(year)2 0.467846 0.150670 3.11 0.0020 **
as.factor(year)3 0.984671 0.164614 5.98 3.2e-09 ***
as.factor(year)4 1.109897 0.171704 6.46 1.7e-10 ***
math 0.002599 0.000538 4.83 1.6e-06 ***
verbal 0.002130 0.000514 4.14 3.8e-05 ***
level -0.208495 0.036365 -5.73 1.4e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.48 on 883 degrees of freedom
Multiple R-squared: 0.0661, Adjusted R-squared: 0.0598
F-statistic: 10.4 on 6 and 883 DF, p-value: 3.51e-11
Diagnostic plots (plot(fit)) look fantastic: the residuals are homoscedastic and beautifully normal (albeit slightly short-tailed, which is no problem); there are no outliers; and no observation exerts untoward influence.
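For reference, these are the standard lm diagnostics:
par(mfrow=c(2,2))
plot(fit)   # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow=c(1,1))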
As you can see, everything is highly significant, although the small R-squared might be disappointing. The coefficients all have roughly the correct signs and relative sizes. If we were to multiply them by $3.5$, they would equal $(-9, 1.6, 3.4, 3.9, 0.009, 0.007, -0.7)$. The original betas were $(*, 1, 3, 4, 0.010, 0.010, *)$ (where $*$ stands for a coefficient that was not explicitly specified).
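That comparison can be reproduced directly (the factor $3.5$ is the eyeballed rescaling just described):
round(coef(fit) * 3.5, 3)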
Notice the high significance of level, which is an attribute of the classes, not of the students. Its effect is pretty large: the class levels range from near $0$ to near $7$, so multiplying this range by the estimated coefficient of level shows it has an effect comparable in size to any of the other terms. Its negative sign reflects a tendency for students to do a little worse in the more challenging classes. It is very interesting to see this behavior emerge from the model, because level was never explicitly involved in determining the examination outcomes: it affected only how the students chose their classes.
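A quick computation makes the size of this effect explicit:
7 * coef(fit)["level"]   # about -1.5 across the full range of levels,
                         # comparable to the year-4 or 400-SAT-point effects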
(By the way, using the percentage ranks untransformed in the regression does not qualitatively change the results reported below.)
Let's vary things a bit. Suppose that instead of setting spread to $1$ we use $38$, thereby causing a wider (more realistic) distribution of students throughout the classes.
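Concretely, the only line that changes is the assignment step (repeated from above with the new spread):
assignments <- lapply(1:n.pop, function(i) pick.classes(i, students$n.classes[i], spread=38))
Rerunning everything from the top then gives these results: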
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.902006 0.349924 -14.01 < 2e-16 ***
as.factor(year)2 0.605444 0.130355 4.64 3.9e-06 ***
as.factor(year)3 1.707590 0.134649 12.68 < 2e-16 ***
as.factor(year)4 1.926272 0.136595 14.10 < 2e-16 ***
math 0.004667 0.000448 10.41 < 2e-16 ***
verbal 0.004019 0.000434 9.25 < 2e-16 ***
level -0.299475 0.026415 -11.34 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.3 on 883 degrees of freedom
Multiple R-squared: 0.282, Adjusted R-squared: 0.277
F-statistic: 57.9 on 6 and 883 DF, p-value: <2e-16
(In this scatterplot of class assignments, with spread set to $38$, students are sorted by increasing strength and classes are sorted by increasing level. When spread was originally set to $1$, the assignment plot fell in a tight diagonal band. Weaker students still tend to take easier classes and stronger students tend to take harder classes, but there are plenty of exceptions.)
This time the R-squared is much improved (although still not great). However, all the coefficients have increased by 20 to 100 percent. This table compares them along with the results of some additional simulations:
Simulation Intercept Year.2 Year.3 Year.4 Math Verbal Level R^2
Beta * 1.0 3.0 4.0 .010 .010 * *
Spread=1 -2.6 0.5 1.0 1.1 .003 .002 -0.21 7%
Spread=38 -4.9 0.6 1.7 1.9 .005 .004 -0.30 25%
Ability=1 -8.3 0.9 2.6 3.3 .008 .008 -0.63 58%
No error -11.2 1.1 3.3 4.4 .011 .011 -0.09 88%
Keeping spread at $38$ and changing ability from $2$ to $1$ (which is a very optimistic assessment of how predictable the student strengths are) yielded the penultimate line. Now the estimates (for student year and student SAT scores) are getting reasonably close to the true values. Finally, setting both ability and sigma to $0$, removing the error terms $\varepsilon_i$ and $\delta_{i,j}$ entirely, gives a high R-squared and produces estimates close to the correct values. (Notably, the coefficient of level then shrinks by an order of magnitude.)
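For reproducibility, the last two rows of the table correspond to the following settings (rerunning everything from the top after each change, with spread still at $38$):
beta$ability <- 1                    # "Ability=1" row
beta$ability <- 0; beta$sigma <- 0   # "No error" row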
This quick analysis shows that regression, at least as conducted here, will confound the inevitable forms of variation with the coefficients. Moreover, the coefficients also depend, to some extent, on how the students are distributed among the classes. This can be partially accommodated by including class attributes among the independent variables of the regression, as was done here, but even so the effect of the student distribution does not disappear.
Any lack of predictability in the true performance of the students, and any variation in student learning and actual performance on exams, apparently causes the estimated coefficients to shrink toward zero. They appear to do so uniformly, which suggests that ratios of coefficients may still be meaningful.
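One way to act on that observation (my suggestion, not part of the original analysis) is to examine the estimated coefficients relative to one of them, for example the math coefficient:
round(coef(fit) / coef(fit)["math"], 1)   # ratios are less affected by the
                                          # uniform shrinkage toward zero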