[21] Python(Pandas)

20 Jun 2020

Reading time ~3 minutes

Table of Contents

목차
판다스 시작하기
데이터 추출하기
- 열 단위 데이터 추출
- 행 단위 데이터 추출
기초적인 통계 계산하기
- 그룹화
- 그래프를 그리기 위해 사용하는 매직함수

판다스 시작하기

데이터 효율적으로 불러오기

시리즈
데이터프레임: 시리즈들이 각 요소가 되는 딕셔너리

데이터 집합 불러오기

• read_csv
-> df = pd.read_csv( )
:데이터 집합 읽고 데이터프레임이라는 자료형으로 반환

• 데이터 확인 행 출력
-> print(df.head(개수))

country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106

• 타입확인
print(type(df))
<class ‘pandas.core.frame.DataFrame’>

• 행과 열의 크기 -> print(df.shape)
(1704, 6) • 데이터프레임의 열 이름 출력 -> print(df.columns)

Index([‘country’, ‘continent’, ‘year’, ‘lifeExp’, ‘pop’, ‘gdpPercap’], dtype=’object’)

• 데이터 구성 값 확인 -> Print(df.dtypes)

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

print(df.info(개수))

  <class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None

판다스와 파이썬 자료형 비교

데이터 추출하기

열 단위 데이터 추출

열 추출
파일이름 =df[‘’]
1개 열: 시리즈 얻음
2개이상 열: 데이터 프레임 얻음(df)

country_df = df['country']
print(country_df.tail( 뽑을 개수))-> 밑에서부터
1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object

여러 개의 열 한꺼번에 추출
df[[‘열 이름’, ‘열이름’,’열 이름’]]

행 단위 데이터 추출

인덱스 : 왼쪽 번호
loc 데이터가 추가, 삭제되면 변함
First, second와 같은 문자열로 지정가능

행번호 : 데이터의 순서 따라감
iloc 정수만으로 데이터 조회, 추출X
실제 데이터프레임에서 확인할 수 없는 값

• loc 로 행 데이터 추출

인덱스가 0인 행데이터(열 형태로) 추출
Print(df.loc[0]) -> 인덱스 번호

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

마지막행 데이터 추출
-1로 추출 불가
shape[] 행크기(1704)로 변환후 추출

A = df.shape[0]
B = A - 1
C = df.loc[B]
print(C)
> tail(1)

여러개 인덱스 한번에 추출

print(df.loc[[0,99,999]]) 
country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130

=>loc 속성이 반환하는 데이터 자료형: 시리즈
tail속성이 반환하는 데이터 자료형: 데이터 프레임

• iloc 로 행 데이터 추출
위와 같음
But -1 추출가능
아이에 없는 숫자는 X

• loc, iloc 활용 df.loc[[행],[열]] or df.iloc[[행],[열]]
-> iloc 에 문자열 리스트 안됌

Range: 정수리스트반환([0,1,2,3])아니라
제네레이터(데이터 추출X)(range(0.4)) 반환
->list를 통해서 리스트 반환 후 사용

슬라이싱과 리스트 가능

subset = df.loc[:,[‘year’, ‘pop’]
 print(subset.head())
 year       pop
0  1952   8425333
1  1957   9240934
2  1962  10267083
3  1967  11537966
4  1972  13079460

 subset = df.iloc[:,[ 2, 4, -1 ]]
 print(subset.head())
year       pop   gdpPercap
0  1952   8425333  779.445314
1  1957   9240934  820.853030
2  1962  10267083  853.100710
3  1967  11537966  836.197138
4  1972  13079460  739.981106

기초적인 통계 계산하기

그룹화

• 그룹화하여 평균구하기 ‘year’,’continent’열로 그룹화한 그룹 데이터프레임에서’lifeExp’,’gdpPercap’열만 추출하여 평균 구함

print(df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean())

• 그룹화하여 데이터 개수세기

print(df.groupby('continent')['country'].nunique())

그래프 그리기

그래프를 그리기 위해 사용하는 매직함수

%matplotlib inline

import matplotlib.pyplot as plt
global_yearly_life_expectancy=df.groupby('year')['lifeExp'].mean() ->(가로)[세로]
global_yearly_life_expectancy.plot()