목차
- Fundamentals of Hypothesis Testing
- General ANOVA Setting
- Partitioning the Variation
- ANOVA F Test Statistic
- One-Way ANOVA Table
- How to Solve
- 문제
👀, 🤷♀️ , 📜, 📝
이 아이콘들을 누르시면 정답, 개념 부가 설명을 보실 수 있습니다:)
Fundamentals of Hypothesis Testing
즉, 가설검증이다.
이 추정의 결과로 인한 의사결정까지 한다는 것이다.
- One-Sample Tests: 우리나라 사람의 아이큐가 100 이상이다 -> 우리나라 사람의 아이큐라는 하나의 데이터 세트가 필요하다
- Two-Sample Tests: 우리나라와 일본의 아이큐를 비교해보면 우리나라가 더 높다 -> 우리나라 아이쿠 데이터 세트와 일본의 아이큐 데이터세트가 필요하다. 즉, 비교, 변화 검증에 많이 쓰인다.
- 📌 three or more sample test: 우리나라,일본, 한국의 아이큐 세개가 다 같냐 아니면 하나라도 다르냐?
즉 ANOVA TEST는 위의 3개 중 three or more sample test에 해당 한다
General ANOVA Setting
Investigator controls one or more factors of interest
- Each factor contains two or more levels
➡️ 즉 나누는 level이 2개 이상이라면 이는 최소 3개 이상의 데이터가 필요하다는 것이다 - Levels can be numerical or categorical
- Different levels produce different groups
- Think of each group as a sample from a different population
예를 들어보면,
- Factor or treatment variable
- store name
- Levels, group, categories etc
- 11st Gmkt coupang
Observe effects on the dependent variable
- Are the groups the same?
- Experimental design: the plan used to collect the data
One-Way Analysis of Variance
Evaluate the difference among the means of three or more groups
- Examples: Number of accidents for 1st, 2nd, and 3rd shift Expected mileage for five brands of tires
Assumptions
- Populations are normally distributed
- Populations have equal variances
- Samples are randomly and independently drawn
Hypotheses of One-Way ANOVA
- \(𝐻_0\): All population means are equal
- i.e., no factor effect (no variation in means among groups)
- 📌 중요한건 3개다 다르냐가 아니라 ‘세개 다 같냐?’ VS ‘하나라도 다른게 있냐?’ 라는 것이다
- At least one population mean is different
- i.e., there is a factor effect
- Does not mean that all population means are different (some pairs may be the same)
Partitioning the Variation
Total variation can be split into two parts:
- SST = Total Sum of Squares (Total variation)
- SSA = Sum of Squares Among Groups (Among-group variation)
- SSW = Sum of Squares Within Groups (Within-group variation)
보고도 무슨소린지 모를 확률이 높다….
그런 분들을 위해 조금 더 쉽게 풀어보겠다 :)
SST
Total Variation
the aggregate variation of the individual data values across the various factor levels (SST)
즉 중심에서 각각의 data들이 얼마나 떨어져있는지
- ⚪: data
- 🔷: data의 중심값
- 노란색 선: SST
즉 SST는 데이터들의 중심값들의 거리를 모두 구해서 더한 것이다
[구하는 공식]
where,
- SST = Total sum of squares
- c = number of groups or levels
- \(n_j\) = number of values in group j
- \(X_{ij}\) = \(i^{th}\) observation from group j
- 𝑋 ̅ ̅= grand mean (mean of all data values)
SSA
Among-Group Variation
variation among the factor sample means (SSA)
그룹들 간의 거리
- ⚪: data
- 🔷: data의 중심값
- 🔶: 각 그룹의 중심값
- 초록색 선: SSA
[구하는 과정]
1) 가까운 거리에 있는 데이터들끼리 임의로 그룹을 나눈다.
2) 그 그룹의 중심점 🔶을 찾는다
3) 전체 중심점 🔷과 그룹의 중심점간의 거리(초록색 선)를 구해 더한다.
[구하는 공식]
where,
- SSA = Sum of squares among group
- c = number of groups or levels
- \(n_j\) = number of values in group j
- \(𝑋 ̅_{j}\) = sample mean from group j
- 𝑋 ̅ ̅= grand mean (mean of all data values)
[MSA]
Mean Square Among = SSA/degrees of freedom
SSW
Within-Group Variation
variation that exists among the data values within a particular factor level (SSW)
각 그룹안의 중심점과 data사이의 거리
- ⚪: data
- 🔷: data의 중심값
- 🔶: 각 그룹의 중심값
- 주황색 선: SSW
[구하는 과정]
1) 가까운 거리에 있는 데이터들끼리 임의로 그룹을 나눈다.
2) 그 그룹의 중심점 🔶을 찾는다
3) 그 그룹의 중심점 🔶과 각 그룹의 데이터 ⚪ 사이의 거리(주황색 선)를 찾는다
[구하는 공식]
where,
- SSW = Sum of squares within groups
- c = number of groups or levels
- \(n_j\) = number of values in group j
- \(𝑋 ̅_{j}\) = sample mean from group j
- \(𝑋_{ij}\) = \(i^{th}\) observation in group j
[MSW]
Mean Square Within = SSW/degrees of freedo
그런데 SST = SSA + SSW 이므로,
ANOVA F Test Statistic
📌 F Test Statistic 이란?: 먼저 이게 뭔지부터 간단하게 공부하고 와보자
- \(H_0\): \(μ_1\) = \(μ_2\) = \(μ_3\) … \(μ_c\)
- \(H_1\): At least two population means are different
[Test Statistic]
- MSA is mean squares among groups
- MSW is mean squares within groups
[Degrees of freedom]
- \(df_1\) = c – 1 (c = number of groups)
- \(df_2\) = n – c (n = sum of sample sizes from all populations)
Interpreting One-Way ANOVA F Statistic
The F statistic is the ratio of the among estimate of variance and the within estimate of variance
- The ratio must always be positive
- \(df_1\) = c -1 will typically be small
- \(df_2\) = n - c will typically be large
Decision Rule:
- Reject \(H_0\) if \(F_{STAT}\) > \(F_α\), otherwise do not reject \(H_0\)
One-Way ANOVA Table
How to Solve
그럼 이제 문제로 지금까지 배운 개념들을 적용해보자
Example
You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the 0.05 significance level, is there a difference in mean distance?
SOLVE
1) Find the F critical value
F critical value 구하는 법?
sig = 0.05
degree of freedom
- \(df_1\) = c -1 (c= number of groups)
- \(df_2\) = n - c (n = sum of sample sizes from all populations)
\(F = F_{a, df_1 , df_2} = F_{0.05, 2, 12} = 3.89\)
2) find test statistic
각각의 평균을 구하고 전체 평균을 구해보자
그럼 이를 바탕으로 SSA를 구해보자
그럼 이제 SSW를 구해보자
그리고 이를 바탕으로 \(F_{stat}\)를 구해보면,
그럼 이제 이 표를 다 채워 넣을 수 있다
8) Dicision
Decision:
- Reject \(H_0\) at a = 0.05
Conclusion:
- There is evidence that at least one \(μ_j\) differs from the rest
문제
[TABLE A]
An airline wants to select a computer software package for its reservation system. Four software packages (1, 2, 3, and 4) are commercially available. The airline will choose the package that bumps as few passengers, on the average, as possible during a month. An experiment is set up in which each package is used to make reservations for 5 randomly selected weeks. (A total of 20 weeks was included in the experiment.) The number of passengers bumped each week is obtained, which gives rise to the following Excel output:
1. Referring to Table A, the within groups degrees of freedom is
a) 3
b) 4
c) 16
d) 19
📜 정답 보기
n-c = 20 - 4 = 16
c) 16
2. Referring to Table A, the total degrees of freedom is
a) 3
b) 4
c) 16
d) 19
📜 정답 보기
n - 1 = 20 - 1 - 19
d) 19
3. Referring to Table A, the among-group (between-group) mean squares is
a) 8.525
b) 70.8
c) 212.4
d) 637.2
📜 정답 보기
212.4/3 = 70.8
b) 70.8
4. Referring to Table A, at a significance level of 1%,
a) there is insufficient evidence to conclude that the average numbers of customers bumped by the 4 packages are not all the same.
b) there is insufficient evidence to conclude that the average numbers of customers bumped by the 4 packages are all the same.
c) there is sufficient evidence to conclude that the average numbers of customers bumped by the 4 packages are not all the same.
d) there is sufficient evidence to conclude that the average numbers of customers bumped by the 4 packages are all the same.
📜 정답 보기
reject이므로 하나라도 다르다
c) there is sufficient evidence to conclude that the average numbers of customers bumped by the 4 packages are not all the same.
[TABLE B]
A hotel chain has identically sized resorts in 5 locations. The data that follow resulted from analyzing the hotel occupancies on randomly selected days in the 5 locations.
5. Fill the blanks.
📜 정답 보기
6. True or False: Referring to Table B, if a level of significance of 0.01 is chosen, the null hypothesis should be rejected.
📜 정답 보기
-> rejet이다
true
7. True or False: Referring to Table B, if a level of significance of 0.01 is chosen, the decision made indicates that all 5 locations have different mean occupancy rates.
📜 정답 보기
False
다 같은건 아니다니까 5개가 모두 다르다는 아니다.
8. True or False: Referring to Table B, if a level of significance of 0.01 is chosen, the decision made indicates that at least 1 of the 5 locations has different mean occupancy rate.
📜 정답 보기
True
at least 1 of the 5 locations has different mean occupancy rate. 이게 맞다
9. Referring to Table B, the null hypothesis for Levene’s test for homogeneity of variances is
a) \(H_0\): \(𝜇_{A}\) = \(𝜇_{B}\) = \(𝜇_{C}\) = \(𝜇_{D}\)
b) \(H_0\): \(M_{A}\) = \(M_{B}\) = \(M_{C}\) = \(M_{D}\)
c) \(H_0\): \(𝜎_{A}^2\) = \(𝜎_{B}^2\) = \(𝜎_{C}^2\) = \(𝜎_{D}^2\)
d) \(H_0\): \(𝜋_{A}\) = \(𝜋_{B}\) = \(𝜋_{C}\) = \(𝜋_{D}\)
📜 정답 보기
Levene’s test -> 분산이 같냐 아니냐
c) \(H_0\): \(𝜎_{A}^2\) = \(𝜎_{B}^2\) = \(𝜎_{C}^2\) = \(𝜎_{D}^2\)
이게 뭔지 정도는 알자