Table of Contents

central tendency

the extent to which the values of a numerical variable group around a typical or central value.
즉, 표준 값 또는 중심 값 주위의 수치 변수 그룹 값의 범위이다!
가운데를 찾는 것이다!

Mean

즉 평균이다.
arithmetic mean(ften just called the “mean”)

산술 평균(= “평균”)
중심 경향을 나타내는 가장 일반적인 척도

[구하는 식]

[평균의 문제]
만능일것만 같은 평균에도 문제가 존재한다.
아래의 경우를 생각해보자.
아래와 같은 경우 평균이 데이터의 특징을 잘 나타내준다.

하지만 이 데이터에 특이값(outliers) 인 ‘20’이 들어왔다고 생각해보자
그럼 평균이 상식적으로 데이터를 잘 나타낸다고 보긴 힘들 것이다.

[해결]
이럴 땐 Median으로 해결하자

Median

즉 중앙값이다.
In an ordered array, the median is the “middle” number (50% above, 50% below)으로,
쉽게 쓰자면 그냥 제일 가운데에 있는 데이터를 뽑아오면 된다.
➡ ❤ Less sensitive than the mean to extreme values

[구하는 식]
smallest to largest

값이 숫자 순서일 때 중앙값의 위치

값의 개수가 홀수 일 때

number of values is odd
midian의 값은 중앙값이다 (middle number)

값의 개수가 짝수 일 때

number of values is even
midian의 값은 두 중간수의 평균 (average of the two middle numbers)

Mode

Value that occurs most often: 가장 자주 발생하는 값
Not affected by extreme values: 극단값의 영향을 받지 않음
❤ Used for either numerical or categorical data: 수치 또는 범주형 데이터에 사용 가능
There may be no Mode: Mode가 없을 수 있음
There may be several modes: Mode가 한개가 아닐 수 있다.

📝 예시 보기

variation

분산
the amount of dispersion or scattering away from a central value that the values of a numerical variable show.
수치 변수 값이 나타내는 중심 값에서 벗어나는 산포 또는 산란의 양
그 분포가 어떻게 생겼는지를 보는 것이다

같은 center여도 다른 variation을 갖을 수 있다.

Measures of variation give information on the spread or variability of the data values

Range

범위이다.
즉, 데이터 집합에서 가장 큰 값과 가장 작은 값의 차이

Simplest measure of variation: 가장 간단한 변동 측정이다.
Difference between the largest and the smallest values in a set of data

[range의 문제]
평균의 문제접과 비슷하다.
Sensitive to outliers

Sample Variance

즉 분산이다
평균에서 벗어난 값의 제곱 편차의 평균
주로 $S^2$ 로 나타낸다.

[구하는 식]
smaple인 경우 계산 했을 때

n-1로 나눔 ➡ 분산값이 더 큼

population인 경우 계산 했을 때

n으로 나눔 ➡ 분산값이 더 작음
우리가 고등학교때 배운것이다.
하지만 우리는 전체 데이터를 다루는 경우는 거의 힘들 다는 것을 배웠다.
그러므로 이제 대부분 sample임을 가정하고 계산을 해야한다.

Sample Standard Deviation

표준편차다.
그냥 분산에 루트 씌우면 된다.
퍼짐의 정도 이다
주로 $S$ 로 나타낸다.

Most commonly used measure of variation: 가장 일반적으로 사용되는 변동 측도
Shows variation about the mean: 평균에 대한 변동 표시
Is the square root of the variance: 분산의 제곱근
❤ Has the same units as the original data: 원본 데이터와 동일한 단위 보유(제곱을 했으니 뺴준다)

즉, 퍼짐의 정도이므로 아래와 같이 나타낼 수 있다.

📝 예시 보기

Coefficient of Variation

즉, 변동 계수이다.

상대적: Measures relative variation
퍼센트: Always in percentage (%)
평균으로 나눔1: Shows variation relative to mean
비교 가능: Can be used to compare the variability of two or more sets of data measured in different units

Xbar: mean

위의 설명으로 이해가 안될테니 실전 문제에 적용해보자!

[ex 1]
두 Stock의 Coefficient of Variation를 구해보고 해석해봐라
Stock A
• Average price last year = $50
• Standard deviation = $5

Stock B
• Average price last year = $100
• Standard deviation = $5

📜 정답, 해설 보기

Stock A 1년간 평균이 50달러
50달러를 중심으로 5달러만큼 흔들렸구나
5달러 만큼 흔들림
즉 변동폭이 비교적 큼

Stock B
1년간 평균이 100달러
100달러를 중심으로 5달러만큼 흔들렸구나
5달러 만큼 흔들림
즉 변동폭이 비교적 작음

➡ Both stocks have the same standard deviation, but stock B is less variable relative to its price
➡ 즉, Stock A가 더 불안정하다

[ex 2]
두 Stock의 Coefficient of Variation를 구해보고 해석해봐라
Stock A
• Average price last year = $50
• Standard deviation = $5

Stock C
• Average price last year = $8
• Standard deviation = $2

📜 정답, 해설 보기

Stock A

Stock C

➡ Stock C has a much smaller standard deviation but a much higher coefficient of variation

Z-Score

데이터 값의 Z-점수를 계산 방법
- 하려면 평균을 빼고 표준 편차로 나눔: subtract the mean and divide by the standard deviation.
Z-점수는 데이터 값이 평균에서 벗어나는 표준 편차의 수입니다.
데이터 값의 Z점수가 -3.0보다 작거나 +3.0보다 크면 극단 특이치(extreme outlier) 로 간주됨.
- 약 97%에 들지 않음
Z-점수의 절대값이 클수록 데이터 값이 평균에서 멀어짐.

Xbar: the sample mean
S: the sample standard deviation

[ex]
• Suppose the mean math SAT score is 490, with a standard deviation of 100.
• Compute the Z-score for a test score of 620.

📜 정답, 해설 보기

📌 “A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
➡ S를 1.3에 넣어보면 된다.

shape

the pattern of the distribution of values from the lowest value to the highest value.
Describes how data are distributed

Two useful shape related statistics are:

Skewness(왜도)
- Measures the extent to which data values are not symmetrical
- 데이터 값이 대칭적이지 않은 정도를 측정한다.
- 꼭지가 어디로 치우쳐있는지
Kurtosis(첨도)
- Kurtosis affects the peakedness of the curve
- 첨도는 곡선의 정점에 영향을 미칩니다.
- 즉 얼마나 뾰족한지