[STATA] Basic Statistical Analysis

by e-money2580 2023. 1. 10.

** su (sum, summarize) : display basic summary statistics (a fixed, predefined set of statistics)

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_1.dta", clear

su lived

su lived, de

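by gender, sort : su lived
// sort by gender, then show summary statistics of lived separately for each gender group

su lived if gender==0
// show summary statistics only for observations where gender is 0 (male)

label list
des gender

su lived if gender=="male":sexlbl
// "male":sexlbl returns the numeric code that value label sexlbl assigns to "male"
// (assuming sexlbl is the value label attached to gender, as reported by des gender), so this matches gender==0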

su lived
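return list // list the results that su stored in r()

gen lived_s=(lived-r(mean))/r(sd) 
// standardize lived: subtract the mean stored in r(mean) from each observation and divide by the standard deviation in r(sd)

su lived_s // check the standardized variable

su lived educ
return list // when su is run on several variables, r() holds only the results for the last variable (educ)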
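Because every r-class command overwrites r(), it can help to copy the values you need right away. The lines below are a minimal sketch, assuming data7_1.dta is still in memory; the names m_lived, sd_lived, and lived_z are hypothetical and not part of the original post. egen's std() function is shown as a one-step alternative to the manual formula.

```stata
* Sketch only: assumes data7_1.dta is loaded; m_lived, sd_lived, lived_z are made-up names
quietly summarize lived
scalar m_lived  = r(mean)      // copy r() results before another command overwrites them
scalar sd_lived = r(sd)
egen lived_z = std(lived)      // one-step standardization: (lived - mean) / sd
summarize lived_z              // mean should be about 0 and sd about 1
display "mean = " m_lived "   sd = " sd_lived
```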



** tabstat (table of statistics) : display summary statistics (you choose which statistics to show)

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_1.dta", clear

tabstat lived, s(mean median) by(gender) c(v) format(%9.2f)
// s(statistics to display)  by(grouping variable)  c(v) : variables in columns  c(s) : statistics in columns
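As a sketch of how these options combine, the call below requests several statistics for two variables at once. It assumes data7_1.dta is still in memory; educ is the variable used earlier in this post, and the particular statistics chosen are only illustrative.

```stata
* Sketch only: assumes data7_1.dta is loaded
tabstat lived educ, s(n mean sd min max) by(gender) c(s) format(%9.2f)
* c(s) puts the statistics in columns, so each variable becomes a row within each gender group
```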



** tab (tabulate) : summarize categorical (dummy) variables, mainly as frequencies

// one-way table (one variable)
tab gender

// two-way table (two variables)
tab gender kids 
tab gender meetings
tab gender contam
tab gender school

// tab1 : a separate one-way table for each listed variable
tab1 gender kids meetings

// tab2 : a two-way table for every pair of the listed variables
tab2 gender kids meetings



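** Chi-squared test : tests whether two variables are independent (the null hypothesis)
tab gender meetings, exp chi2
// exp : show expected frequencies in the table   chi2 : run Pearson's chi-squared test

/* <Chi-squared test output>

           | Attended meetings on
Respondent |       pollution
 's gender |        no        yes |     Total
-----------+----------------------+----------
      male |        42         18 |        60 
           |      41.6       18.4 |      60.0 
-----------+----------------------+----------
    female |        64         29 |        93 
           |      64.4       28.6 |      93.0 
-----------+----------------------+----------
     Total |       106         47 |       153 
           |     106.0       47.0 |     153.0 

          Pearson chi2(1) =   0.0240   Pr = 0.877
*/

// the p-value is 0.877, so the null hypothesis of independence cannot be rejected
// = the data show no evidence that gender and meeting attendance are associated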

tabi 42 18 \ 64 29, chi2 expected

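/* <Chi-squared test output>

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        42         18 |        60 
           |      41.6       18.4 |      60.0 
-----------+----------------------+----------
         2 |        64         29 |        93 
           |      64.4       28.6 |      93.0 
-----------+----------------------+----------
     Total |       106         47 |       153 
           |     106.0       47.0 |     153.0 

          Pearson chi2(1) =   0.0240   Pr = 0.877

*/
// the p-value is 0.877, so the null hypothesis of independence cannot be rejected
// = the data show no evidence that the row and column variables are associated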
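The test results can also be read back from r() and checked by hand. The lines below are a small sketch that uses only the cell counts shown above, so it runs without the dataset. It relies on tabi leaving r(chi2) and r(p) behind, as tabulate does when chi2 is specified, and it should reproduce the statistic, p-value, and first expected count printed in the output.

```stata
* Self-contained sketch: uses only the four cell counts from the table above
quietly tabi 42 18 \ 64 29, chi2
display "chi2 = " %6.4f r(chi2) "   p = " %5.3f r(p)
* Same p-value from the upper tail of the chi-squared distribution with 1 degree of freedom
display "p via chi2tail = " %5.3f chi2tail(1, r(chi2))
* Expected count for (male, no) = row total * column total / N
display "expected (male, no) = " %4.1f (60*106/153)
```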


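** ci : compute confidence intervals

ci mean lived // ci mean [variable name]

/* 

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
       lived |        153    19.26797    1.370703        16.55988    21.97606

*/
// the 95% confidence interval for the population mean of lived is 16.56 to 21.98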
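The interval is the mean plus or minus a t critical value times the standard error. A minimal sketch of that arithmetic follows, using only the numbers printed above; tcrit is a hypothetical scalar name.

```stata
* Sketch: rebuilds the interval from the printed mean, standard error, and n = 153
scalar tcrit = invttail(153 - 1, 0.025)   // two-sided 95% critical value of t with 152 df
display "lower = " %9.5f (19.26797 - tcrit*1.370703)
display "upper = " %9.5f (19.26797 + tcrit*1.370703)
```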


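ci proportions meetings, wald  
// confidence interval for the success proportion of a binomial (0/1) variable

/*

                                                         Binomial Wald
    Variable |        Obs  Proportion    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
    meetings |        153    .3071895    .0372962        .2340903    .3802888

*/
// the 95% confidence interval for the population proportion of meetings is 0.23 to 0.38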
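The Wald interval is the normal approximation: proportion plus or minus z times the standard error. The sketch below repeats it with the immediate command cii, taking the 47 "yes" answers out of 153 from the tabulation earlier in this post, and then by hand from the printed numbers.

```stata
* Sketch: 153 observations and 47 successes are read off the gender-by-meetings table above
cii proportions 153 47, wald
* Same bounds by hand: p +/- z(0.975) * SE
display "lower = " %8.7f (.3071895 - invnormal(0.975)*.0372962)
display "upper = " %8.7f (.3071895 + invnormal(0.975)*.0372962)
```

Dropping the wald option makes ci proportions fall back to its default exact (Clopper-Pearson) interval, which is usually preferred for small samples.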



use "D:\STATA연습데이터\STATA기초적이해와활용\data7_2.dta",clear

tab die

ci mean die, poisson exposure(followup)

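/*

                                                         Poisson exact
    Variable |   Exposure        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
         die |       5051    .0067313    .0011544        .0046616    .0094064

*/
// total follow-up (exposure) is 5,051 months (time unit: 1 month)
// the 95% confidence interval for the death rate per month of follow-up is 0.47% to 0.94%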
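The reported mean is the event count divided by total exposure. The sketch below backs out the count, which is an inference from the printed output (mean times exposure comes to about 34 deaths) and not a figure stated in the post, and then feeds it to the immediate command.

```stata
* Sketch: the count of 34 deaths is inferred from the output above, not given in the post
display "implied events = " %6.2f (.0067313*5051)
display "Poisson SE     = " %9.7f (sqrt(34)/5051)   // sqrt(events) / exposure
cii means 5051 34, poisson                            // exact Poisson interval from exposure and events
```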



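** t-test (1) one-sample test of a mean

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_3.dta",clear

ttest inc_male==3000
// two-sided test of the null hypothesis that the population mean of inc_male is 3000

/* <t-test output>

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
inc_male |      19        2728    508.9954    2218.659     1658.64     3797.36
------------------------------------------------------------------------------
    mean = mean(inc_male)                                         t =  -0.5344
H0: mean = 3000                                  Degrees of freedom =       18

   Ha: mean < 3000               Ha: mean != 3000               Ha: mean > 3000
 Pr(T < t) = 0.2998         Pr(|T| > |t|) = 0.5996          Pr(T > t) = 0.7002
*/
// the two-sided p-value is 0.5996, far above the 0.05 significance level, so the null hypothesis cannot be rejected
// = the data are consistent with a population mean of 3000 for inc_male (mean male income of 3,000 in units of 10,000 won, i.e. 30 million won)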
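The t statistic is the distance between the sample mean and the hypothesized value, measured in standard errors. A minimal sketch using only the printed summary numbers follows; t1 is a hypothetical scalar name. ttesti runs the same test without the dataset.

```stata
* Sketch: reproduces the test from the summary statistics printed above
scalar t1 = (2728 - 3000)/508.9954
display "t = " %7.4f t1
display "two-sided p = " %6.4f (2*ttail(18, abs(t1)))   // 18 = n - 1 degrees of freedom
ttesti 19 2728 2218.659 3000                             // #obs #mean #sd #hypothesized value
```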


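** t-test (2) paired test comparing the means of two matched variables

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_4.dta", clear

ttest husband == wife
// two-sided test of the null hypothesis that husbands and wives spend the same time on housework

/* <t-test output>

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
 husband |       5        13.8    1.984943    4.438468    8.288914    19.31109
    wife |       5          36    2.774887    6.204837    28.29568    43.70432
---------+--------------------------------------------------------------------
    diff |       5       -22.2    2.177154    4.868265   -28.24475   -16.15525
------------------------------------------------------------------------------
     mean(diff) = mean(husband - wife)                            t = -10.1968
 H0: mean(diff) = 0                              Degrees of freedom =        4

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0003         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.9997
*/
// the two-sided p-value is 0.0005, far below the 0.05 significance level, so the null hypothesis is rejected
// = within a household, the mean housework time of husbands and wives differs
// the 95% confidence interval for the mean difference (husband - wife) is -28.24 to -16.16 hours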
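A paired t-test is a one-sample t-test on the within-pair differences. The sketch below shows this two ways: on a generated difference variable (this part assumes data7_4.dta is in memory, and d is a hypothetical variable name) and from the printed summary of diff alone.

```stata
* Sketch: d is a made-up variable name; the first two lines need data7_4.dta loaded
generate d = husband - wife
ttest d == 0                    // same t, df, and p-values as ttest husband == wife
* Dataset-free version from the diff row printed above: #obs #mean #sd #hypothesized value
ttesti 5 -22.2 4.868265 0
```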


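** t-test (3) comparing the means of two independent variables

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_3.dta",clear

ttest inc_female == inc_male, unpaired
// two-sided test of the null hypothesis that male and female income levels are the same
// the two variables are independent samples, so the unpaired option is added

/*
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
inc_fe~e |      14    2272.714    674.6932    2524.471    815.1283      3730.3
inc_male |      19        2728    508.9954    2218.659     1658.64     3797.36
---------+--------------------------------------------------------------------
Combined |      33    2534.848    404.8982    2325.963    1710.098    3359.599
---------+--------------------------------------------------------------------
    diff |           -455.2857    828.3372               -2144.691    1234.119
------------------------------------------------------------------------------
    diff = mean(inc_female) - mean(inc_male)                      t =  -0.5496
H0: diff = 0                                     Degrees of freedom =       31

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.2933         Pr(|T| > |t|) = 0.5865          Pr(T > t) = 0.7067
*/
// the two-sided p-value is 0.5865, far above the 0.05 significance level, so the null hypothesis cannot be rejected
// = the data show no evidence that mean income differs between men and women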
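With equal variances assumed, the standard error of the difference comes from the pooled variance, and the degrees of freedom are n1 + n2 - 2 = 31. The sketch below redoes that arithmetic from the printed summary statistics; sp2 and se_pooled are hypothetical scalar names.

```stata
* Sketch: pooled-variance t-test rebuilt from the group sizes, means, and standard deviations above
scalar sp2 = ((14-1)*2524.471^2 + (19-1)*2218.659^2)/(14 + 19 - 2)   // pooled variance
scalar se_pooled = sqrt(sp2*(1/14 + 1/19))
display "pooled SE = " %9.4f se_pooled
display "t = " %7.4f ((2272.714 - 2728)/se_pooled)
ttesti 14 2272.714 2524.471 19 2728 2218.659                          // same test, immediate form
```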


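** t-test (4) comparing the means of two independent groups

use "D:\STATA연습데이터\STATA기초적이해와활용\data7_3.dta",clear

stack inc_female inc_male, into(income) // reshape wide data to long; _stack is 1 for inc_female and 2 for inc_male

ttest income, by(_stack) unequal welch
// two-sided test of the null hypothesis that the male and female groups have the same mean income
// unequal : the two groups are allowed to have different variances   welch : use Welch's degrees-of-freedom formula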

/*
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |      14    2272.714    674.6932    2524.471    815.1283      3730.3
       2 |      19        2728    508.9954    2218.659     1658.64     3797.36
---------+--------------------------------------------------------------------
Combined |      33    2534.848    404.8982    2325.963    1710.098    3359.599
---------+--------------------------------------------------------------------
    diff |           -455.2857    845.1551               -2187.312    1276.741
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -0.5387
H0: diff = 0                             Welch's degrees of freedom =  27.7141

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.2972         Pr(|T| > |t|) = 0.5944          Pr(T > t) = 0.7028
*/
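// the p-value is 0.5944, far above the 5% significance level, so the null hypothesis cannot be rejected
// = the data show no evidence of a difference in mean income between the male and female groups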
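With unequal variances, the standard error uses each group's own variance, and the welch option applies Welch's degrees-of-freedom formula, which differs slightly from the default Satterthwaite formula. The sketch below redoes both from the printed summary statistics; vn1, vn2, and df_w are hypothetical scalar names.

```stata
* Sketch: unequal-variance SE and Welch degrees of freedom from the summary statistics above
scalar vn1 = 2524.471^2/14      // s1^2 / n1
scalar vn2 = 2218.659^2/19      // s2^2 / n2
scalar df_w = -2 + (vn1 + vn2)^2/(vn1^2/(14 + 1) + vn2^2/(19 + 1))
display "SE(diff) = " %9.4f sqrt(vn1 + vn2)
display "Welch df = " %8.4f df_w
ttesti 14 2272.714 2524.471 19 2728 2218.659, unequal welch
```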



use "D:\STATA연습데이터\STATA기초적이해와활용\data7_5.dta", clear

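** Test the male-female difference in mean income (gender : 1 male, 2 female) by spouse status (marital1 : 1 has a spouse, 0 none)
** Employment status is restricted to full-time jobs (emp==1)

by marital1, sort : ttest income if emp==1, by(gender)
// tests the null hypothesis of no male-female difference in mean income, separately for those with and without a spouse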
/*
-> marital1 = 0

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     men |     329    1943.334    64.57033      1171.2     1816.31    2070.359
   women |     274    1667.588    81.39332    1347.299    1507.349    1827.826
---------+--------------------------------------------------------------------
Combined |     603    1818.036    51.34085    1260.729    1717.208    1918.865
---------+--------------------------------------------------------------------
    diff |            275.7468    102.5824                74.28314    477.2104
------------------------------------------------------------------------------
    diff = mean(men) - mean(women)                                t =   2.6881
H0: diff = 0                                     Degrees of freedom =      601

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9963         Pr(|T| > |t|) = 0.0074          Pr(T > t) = 0.0037

 -------------------------------------------------------------------------------
 -> marital1 = 1

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     men |     456      2346.8    62.71164    1339.154     2223.56    2470.041
   women |     344    1650.515    70.13199    1300.755    1512.572    1788.457
---------+--------------------------------------------------------------------
Combined |     800    2047.397    48.30306    1366.217    1952.582    2142.213
---------+--------------------------------------------------------------------
    diff |            696.2859    94.46541                510.8559     881.716
------------------------------------------------------------------------------
    diff = mean(men) - mean(women)                                t =   7.3708
H0: diff = 0                                     Degrees of freedom =      798

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
*/
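// with or without a spouse, the two-sided p-value is below the 5% significance level, so the null hypothesis is rejected in both groups
// = among full-time workers, mean income differs between men and women whether or not they have a spouse
// the positive diff values show that mean male income is higher than mean female income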
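Both tests above are labelled "with equal variances", so a reasonable follow-up is to check that assumption and see whether the conclusion survives without it. This is a sketch only, assuming data7_5.dta is still in memory; it is not part of the original analysis.

```stata
* Sketch only: assumes data7_5.dta is loaded
* Variance-ratio test of equal variances between men and women, within each marital1 group
bysort marital1: sdtest income if emp==1, by(gender)
* Same comparison without assuming equal variances
bysort marital1: ttest income if emp==1, by(gender) unequal
```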

 

[Source] 기초통계와 회귀분석 (Basic Statistics and Regression Analysis; 민인식, 최필선, 2012), 한국STATA학회 (Korean STATA Society) website (http://kastata.org/html/sub02-04.asp)
