Series와 DataFrame (데이터 분석을 위한 판다스 입문)

nowgnas

2021-09-07

나만의 데이터 만들기

시리즈와 데이터프레임 직접 만들기

s = pd.Series(['banana', 42])
print(s)
-----------------------------
0    banana
1        42
dtype: object

Series 메서드에 리스트를 전달하여 시리즈를 생성

s = pd.Series(['Wes MC', 'Createor'], index=['person', 'who'])
print(s)
-------------------------------------------------------------
person      Wes MC
who       Createor
dtype: object

index 인자를 이용해서 인덱스로 사용하고자 하는 문자열을 리스트에 담아 전달하면 된다

Data Frame

scientist = pd.DataFrame({
    'Name': ['frankllin', 'willam'],
    'Occupation': ['chemist', 'statistician'],
    'Born': ['1920', '1876']
})
----------------------------------------------
        Name    Occupation  Born
0  frankllin       chemist  1920
1     willam  statistician  1876

scientist_col = pd.DataFrame(
    data={'Occupation': ['chemist', 'statistician'],
          'Born': ['1920', '1876']
          },
    index=['Rosaline', 'willam'],
    columns=['Occupation', 'Born'])
----------------------------------------------------
            Occupation  Born
Rosaline       chemist  1920
willam    statistician  1876

인자를 사용해 index 와 colums 를 지정할 수 있다

scientist_dict = pd.DataFrame(OrderedDict([
    ('Name', ['Rosaline', 'Willam']),
    ('Occupation', ['chemist', 'statistician']),
    ('Born', ['1920', '1876'])
]))
-----------------------------------------------
       Name    Occupation  Born
0  Rosaline       chemist  1920
1    Willam  statistician  1876

딕셔너리는 데이터의 순서가 보장되지 않는다.
순서가 보정된 딕셔너리를 사용하기 위해 OrderedDirct 를 사용한다

시리즈 다루기 (기초)

데이터프레임에서 시리즈 선택하기

scientist = pd.DataFrame(
    data={'Occupation': ['chemist', 'statistician'],
          'Born': ['1920', '1876'],
          'Died': ['1958', '1937']},
    index=['Rosaline', 'William'],
    columns=['Occupation', 'Born', 'Died'])

first_row = scientist.loc['William']
print(type(first_row))
print(first_row)
-------------------------------------------------
<class 'pandas.core.series.Series'>

Occupation    statistician
Born                  1876
Died                  1937
Name: William, dtype: object

loc 속성을 이용해서 Willam의 정보를 얻을 수 있다

index, values 속성과 keys 메서드 사용하기

print(first_row.index)
print(first_row.values)
print(first_row.keys())
------------------------
# index
Index(['Occupation', 'Born', 'Died'], dtype='object')
# values
['statistician' '1876' '1937']
# keys()
Index(['Occupation', 'Born', 'Died'], dtype='object')

시리즈 다루기 (응용)

scientist = pd.read_csv('../doit_pandas/data/scientists.csv')

ages = scientist['Age']
print(ages.max())
---------------------------------
90

시리즈의 max() min() std() 등의 메서드를 호출해서 해당 열의 통계 수치를 알 수 있다

시리즈와 브로드캐스팅

1
2
3

print(ages)
print(ages + ages)
print(ages * ages)

같은 길이의 벡터로 더하기 연산과 곱하기 연산 수행이 가능하다
스칼라 값을 곱하거나 더해도 벡터의 모든 값에 스칼라를 적용하여 브로드캐스팅 된다

print(ages + pd.Series([1, 100]))
----------------------------------
0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64

길이가 다른 벡터를 연산하면 길이를 벗어난 값들은 NaN 이 반환되고 길이가 일치하는 만큼 연산이 이뤄진다

rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
-----------------------------------------
7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37

sort_index 메서드에 ascending 인자 값을 False로 지정해 인덱스 역순으로 데이터를 정렬가능하다
벡터와 벡터의 연산은 일치하는 인덱스 끼리 연산을 진행한다

시리즈와 데이터프레임의 데이터 처리하기

born_datetime = pd.to_datetime(scientist['Born'], format='%Y-%m-%d')
died_datetime = pd.to_datetime(scientist['Died'], format='%Y-%m-%d')

scientist['born_dt'], scientist['died_dt'] = (born_datetime, died_datetime)
print(scientist.head())

scientist['born_dt'] 을 사용해서 새로운 열을 만들 수 있다

1	scientist_dropped = scientist.drop(['Age'], axis=1)

열을 삭제하기 위해 drop()을 사용한다
axis=1을 사용하면 Age 열을 삭제할 수 있다

데이터 저장하고 불러오기

데이터를 피클, CSV, TSV 파일로 저장하고 불러오기

names = scientist['Name']
names.to_pickle('output/scientist_name.pickle')

read_name = pd.read_pickle('PATH')

.pickle을 확장자로 가지는 파일을 생성할 수 있다
pickle 파일을 읽어올 때는 read_pickle을 사용해야 한다