Quant : Pandas - API (1)

EDA : Exploratory Data Analysis

- In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods (Wikipedia).

1. MetaData

- data about the data

2. Univariate descriptive statistics

- summary statistics about individual variables (columns), e.g., mean, distribution

Metadata

.shape

- Check the size of the data

df.shape
# (681, 15) : row & column

.dtypes.value_counts()

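- Check the data type of each column (df.dtypes is a Series of dtypes; value_counts() counts the columns per dtype)

- You need to know what is returned before you can decide which operations and analyses to run.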

df.dtypes.value_counts()
# float64    15
# object      1
# dtype: int64

# df.get_dtype_counts() was deprecated in pandas 0.25.0; use df.dtypes.value_counts() instead
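A minimal sketch of the two metadata checks above, run on a made-up three-row frame. The tickers and numbers are hypothetical and only stand in for the real data set.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "ticker": ["A", "B", "C"],
    "PER(배)": [10.5, None, 7.2],
    "PBR(배)": [1.1, 0.8, 2.3],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape

# dtypes is a Series of dtypes; value_counts() counts columns per dtype
dtype_counts = toy.dtypes.value_counts()

print(n_rows, n_cols)
print(dtype_counts)
```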

.info()

- Summarizes information about each column

df.info()

#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 681 entries, 0 to 680
#Data columns (total 16 columns):
# #   Column     Non-Null Count  Dtype  
#---  ------     --------------  -----  
# 0   ticker     681 non-null    object 
# 1   매출액(억원)    680 non-null    float64
# 2   영업이익률(%)   680 non-null    float64
# 3   순이익률(%)    680 non-null    float64
# 4   당기순이익(억원)  680 non-null    float64
# 5   ROE(%)     665 non-null    float64
# 6   ROA(%)     665 non-null    float64
# 7   ROIC(%)    611 non-null    float64
# 8   EPS(원)     681 non-null    float64
# 9   BPS(원)     681 non-null    float64
# 10  SPS(원)     681 non-null    float64
# 11  PER(배)     668 non-null    float64
# 12  PBR(배)     681 non-null    float64
# 13  PSR(배)     668 non-null    float64
# 14  price      681 non-null    float64
# 15  price2     681 non-null    float64
# dtypes: float64(15), object(1)
# memory usage: 85.2+ KB

.dtype

- Gets the data type of a specific column

df['ticker'].dtype

# dtype('O')

.rename

- Rename columns

df = df.rename(columns={"ticker":"종목명"})

Summary statistics

.describe()

- Summary statistics of the data

df.describe()

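# count : number of non-NaN values (NaN is not counted)
# mean : mean
# std : standard deviation
# min / max : minimum / maximum
# 25% / 50% / 75% : quartiles (the 50% row is the median)

- Swap rows and columns (transpose)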

a = df.describe()
a.T

# Same as df.describe().T; non-numeric columns are excluded by default
# For this mixed-dtype DataFrame, the default matches df.describe(include=[np.number])

.describe(percentiles = x)

- Show the specified percentiles

df.describe(percentiles=[0.01, 0.03, 0.99]).T.head(2)
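As a quick check of the keyword name, here is a small sketch on invented PER values. Note that the argument is `percentiles` (plural), and each requested value shows up as a labeled column after transposing.

```python
import pandas as pd

# Hypothetical PER values, invented for illustration.
toy = pd.DataFrame({"PER(배)": [-12.0, -1.5, 6.0, 9.0, 15.0, 40.0]})

# Request the 1st, 3rd, and 99th percentiles, then transpose so that
# each original column becomes one row of the summary.
summary = toy.describe(percentiles=[0.01, 0.03, 0.99]).T

print(summary.columns.tolist())
```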

.describe(include = x)

.describe(exclude = x)

- Include/exclude non-numeric data: string, categorical, object

df.describe(include=[object, 'category']).T.head()
# df.describe(exclude=[np.number]).T.head() # same result as above for this DataFrame

# top : the most frequent value
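For non-numeric columns, describe() reports count, unique, top, and freq instead of mean and quartiles. A small sketch on hypothetical data, using `exclude=[np.number]` to pick the non-numeric side:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "B" appears twice so it becomes the top value.
toy = pd.DataFrame({
    "종목명": ["A", "B", "B", "C"],
    "EPS(원)": [100.0, 250.0, 250.0, -30.0],
})

# Drop numeric columns from the summary; only "종목명" remains.
non_numeric = toy.describe(exclude=[np.number]).T

print(non_numeric)
```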

.quantile(x)

- Returns the value at the given quantile

df['PER(배)'].quantile(.2)  # 0.2 quantile (20th percentile)
# -1.630518

df['PER(배)'].quantile([.1, .2, .3])
# 0.100   -10.562
# 0.200    -1.631
# 0.300     6.177
# Name: PER(배), dtype: float64
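To see how the returned numbers come about, here is a sketch on five evenly spaced, made-up values. By default quantile() interpolates linearly between neighboring data points.

```python
import pandas as pd

# Hypothetical values chosen so the interpolation is easy to follow.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="PER(배)")

# A scalar argument returns a scalar.
# Position = 0.2 * (5 - 1) = 0.8, so the result lies 80% of the way from 1.0 to 2.0.
q20 = s.quantile(0.2)

# A list argument returns a Series indexed by the requested quantiles.
qs = s.quantile([0.1, 0.5])

print(q20)
print(qs)
```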

.nunique()

- Returns the number of unique values in each column as a Series

- Null values (NaN) are not counted

df.nunique()

#종목명          681
#매출액(억원)      680
#영업이익률(%)     667
#순이익률(%)      672
#당기순이익(억원)    680
#ROE(%)       655
#ROA(%)       650
#ROIC(%)      610
#EPS(원)       681
#BPS(원)       681
#SPS(원)       681
#PER(배)       668
#PBR(배)       680
#PSR(배)       668
#price        628
#price2       620
#dtype: int64

df['종목명'].nunique()
# 681

.unique()

- Returns the unique values of the column as an array

df["종목명"].unique()

#array(['AK홀딩스', 'BGF', 'BNK금융지주', 'BYC', 'CJ', 'CJ CGV', 'CJ대한통운',
#       'CJ씨푸드', 'CJ제일제당', 'CS홀딩스', 'DB', 'DB금융투자', 'DB손해보험', 'DB하이텍',
#       'DGB금융지주', 'DRB동일', 'DSR', 'DSR제강', 'E1', 'F&F', 'GKL', 'GS',

.value_counts()

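- Returns how many times each value occurs

- .value_counts(normalize=True) : returns each value's share of the total as a fraction (0 to 1)

- df["Sector"].value_counts() : lets you track how sector composition changes by month/year (usable as an investment indicator)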

df['종목명'].value_counts()
# df['종목명'].value_counts(normalize=True)

# AK홀딩스     1
# 인터지스      1
# 이스타코      1
# 이아이디      1
# 이연제약      1
#         ..
# 롯데정밀화학    1
# 롯데지주      1
# 롯데칠성음료    1
# 롯데케미칼     1
# 흥아해운      1
# Name: 종목명, Length: 681, dtype: int64
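Since every name in the output above occurs exactly once, the effect of normalize is easier to see on a column with repeats. A sketch with a hypothetical "Sector" column; the sector labels are invented:

```python
import pandas as pd

# Hypothetical sector labels, invented for illustration.
sectors = pd.Series(["IT", "IT", "Finance", "Energy"], name="Sector")

counts = sectors.value_counts()                 # absolute frequency
shares = sectors.value_counts(normalize=True)   # fraction of the total

print(counts)
print(shares)
```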

* For functions related to value frequency, such as count(), unique(), nunique(), and value_counts(), make sure you know whether each one includes NaN in its result
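The differences can be checked directly on a tiny made-up Series containing one NaN. This sketch uses only the default behavior plus the `dropna` keyword:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two distinct numbers plus one NaN.
s = pd.Series([1.0, 1.0, 2.0, np.nan])

n_count = s.count()                               # NaN excluded
n_unique_default = s.nunique()                    # NaN excluded by default
n_unique_with_nan = s.nunique(dropna=False)       # NaN counted as a value
n_unique_array = len(s.unique())                  # unique() keeps NaN
n_vc_default = s.value_counts().sum()             # NaN excluded by default
n_vc_with_nan = s.value_counts(dropna=False).sum()

print(n_count, n_unique_default, n_unique_with_nan)
print(n_unique_array, n_vc_default, n_vc_with_nan)
```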

Sorting

.nsmallest() / .nlargest()

df.nsmallest(5, "PER(배)")
# Returns the 5 rows with the smallest PER

df.nsmallest(100, "PER(배)").nlargest(5, '당기순이익(억원)')
# Of the 100 stocks with the smallest PER, returns the 5 with the largest net income
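The same two-step screen can be traced by hand on a five-row toy frame. All names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    "종목명": ["A", "B", "C", "D", "E"],
    "PER(배)": [3.0, 5.0, 8.0, 12.0, 20.0],
    "당기순이익(억원)": [50.0, 900.0, 300.0, 1000.0, 10.0],
})

# Step 1: keep the 3 lowest-PER rows (A, B, C).
cheap = toy.nsmallest(3, "PER(배)")

# Step 2: among those, keep the 2 with the largest net income (B, then C).
picks = cheap.nlargest(2, "당기순이익(억원)")

print(picks["종목명"].tolist())
```

D has the largest net income overall but is dropped in step 1, which shows that the order of the two filters matters.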

.sort_values()

df.sort_values("EPS(원)")
# Sort by EPS in ascending order

df.sort_values("EPS(원)", ascending=False).head()
# Sort by EPS in descending order

df.sort_values(
    ['순이익률(%)', 'EPS(원)'],
    ascending=[True, False]
).head()
# Primary key: net margin ascending; secondary key (tie-breaker): EPS descending
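A sketch of the multi-key sort on invented data with deliberate ties in the first key, so the role of the second key is visible:

```python
import pandas as pd

# Hypothetical toy data; net margin has ties on purpose.
toy = pd.DataFrame({
    "종목명": ["A", "B", "C", "D"],
    "순이익률(%)": [5.0, 2.0, 5.0, 2.0],
    "EPS(원)": [100.0, 300.0, 400.0, 200.0],
})

# Net margin ascending first; within equal margins, EPS descending.
ordered = toy.sort_values(
    ["순이익률(%)", "EPS(원)"],
    ascending=[True, False],
)

print(ordered["종목명"].tolist())
```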

Selecting subset

By Columns

df['EPS(원)']
# Indexing with a single string returns a Series.

df[['EPS(원)', '종목명']]
# Indexing with a list returns a DataFrame.

type(df['순이익률(%)'])
# pandas.core.series.Series

type(df[['순이익률(%)', '당기순이익(억원)'] ])
# pandas.core.frame.DataFrame

.filter()

df.filter(like="RO").head()
# like : columns whose name contains 'RO'

df.filter(regex=r"P\w+R").head()
# regex : columns whose name matches the regular expression
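Because filter() matches on column names only, it can be tried on an empty frame. A sketch using a subset of the column names from this post plus a lowercase "price" column to show what the regex does not match:

```python
import pandas as pd

# An empty frame is enough: filter() looks at labels, not values.
toy = pd.DataFrame(
    columns=["종목명", "ROE(%)", "ROA(%)", "PER(배)", "PBR(배)", "price"]
)

# Substring match on the column name.
like_cols = toy.filter(like="RO").columns.tolist()

# Regex search: "P", one or more word characters, then "R".
# A raw string avoids Python treating "\w" as a string escape.
regex_cols = toy.filter(regex=r"P\w+R").columns.tolist()

print(like_cols)
print(regex_cols)
```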

By dtype

.dtypes.value_counts()

- Shows how many columns there are of each data type

df.dtypes.value_counts()

# float64    15
# object      1
# dtype: int64

.select_dtypes()

- Returns only the columns whose dtype matches

- int, float, object ...

df.select_dtypes(include=['float']).head()

df.select_dtypes(include=['object']).head()
# df.select_dtypes(include=['str']).head() (X) => use `object` instead
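A short sketch of dtype-based selection on a made-up frame with one text, one float, and one integer column. The "rank" column is hypothetical and only added to show that `np.number` covers both integers and floats.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three different dtypes.
toy = pd.DataFrame({
    "종목명": ["A", "B"],
    "EPS(원)": [1.5, 2.5],
    "rank": [1, 2],
})

floats = toy.select_dtypes(include=["float"]).columns.tolist()
numbers = toy.select_dtypes(include=[np.number]).columns.tolist()
non_numbers = toy.select_dtypes(exclude=[np.number]).columns.tolist()

print(floats, numbers, non_numbers)
```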

By Row

.iloc / .loc

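- Select data by row

- iloc : integer loc; rows are specified by integer position

- loc : rows are specified by index label

- Passing a list returns a DataFrame

- Passing a single label or position (e.g., a string for loc) returns a Series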

name_df.iloc[[0, 3]]

name_df.loc[['삼성전자', 'CJ']]

name_df.loc["가":"다"].head() # returns the rows whose labels fall between "가" and "다" in lexicographic order

loc : range indexing

name_df = name_df.sort_index()
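# Sort the index before range indexing with loc; slicing by labels that are missing from the index only works on a sorted (monotonic) index
# sort_index() returns a new, sorted DataFrame, which is assigned back to name_df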

name_df.index.is_monotonic_increasing
# True
# Confirms the index is sorted in ascending order

name_df.loc["삼성":"삼성전자"].head(2)
# Returns the rows whose labels fall between "삼성" and "삼성전자" in lexicographic order
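The post does not show how name_df is built; the sketch below assumes it is the stock table with the name column set as the index, and uses an invented five-row stand-in to show label slicing on a sorted index.

```python
import pandas as pd

# Hypothetical stand-in for name_df; assumes it was created with
# set_index("종목명"). The EPS values are invented.
toy = pd.DataFrame({
    "종목명": ["현대차", "CJ", "삼성전자", "삼성SDI", "BGF"],
    "EPS(원)": [1.0, 2.0, 3.0, 4.0, 5.0],
})
name_toy = toy.set_index("종목명").sort_index()

print(name_toy.index.is_monotonic_increasing)

# "삼성" itself is not in the index. On a sorted index the slice still
# works, and an endpoint that is present ("삼성전자") is included.
subset = name_toy.loc["삼성":"삼성전자"]

print(subset.index.tolist())
```

Unlike positional slicing, a label slice includes its end label, which is why "삼성전자" appears in the result.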

loc : row + column selection

# Look up the '순이익률(%)' column for the row with index label '삼성전자'
name_df.loc["삼성전자", "순이익률(%)"]
# 9.499

name_df.loc[["삼성SDI", "삼성전자"], ["순이익률(%)", "EPS(원)"]]
#          순이익률(%)   EPS(원)
# 종목명
# 삼성SDI     0.518   765.051
# 삼성전자     9.499  2197.652

iloc can be used in the same way

name_df.iloc[[0, 3], :]
name_df.iloc[[0, 3], [0,1]]

# iloc accepts integer positions only
# df.iloc[[0, 3], "상장일"]   # error
# df.iloc[[0, 3], ["상장일", "종가"]]   # error
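A sketch comparing label-based and position-based selection on a hypothetical three-row frame (rounded, invented values). When a column is known by name but iloc is needed, one option is to translate the label to a position with `columns.get_loc`.

```python
import pandas as pd

# Hypothetical stand-in for name_df, indexed by stock name.
toy = pd.DataFrame(
    {"순이익률(%)": [0.5, 9.5, 3.0], "EPS(원)": [765.0, 2197.0, 100.0]},
    index=pd.Index(["삼성SDI", "삼성전자", "CJ"], name="종목명"),
)

# Same 2x2 block selected by label and by integer position.
by_label = toy.loc[["삼성SDI", "삼성전자"], ["순이익률(%)", "EPS(원)"]]
by_position = toy.iloc[[0, 1], [0, 1]]

# Single row label + single column label returns a scalar.
scalar = toy.loc["삼성전자", "순이익률(%)"]

# iloc needs integers, so convert the column label to its position first.
col_pos = toy.columns.get_loc("EPS(원)")
mixed = toy.iloc[[0, 2], col_pos]

print(by_label.equals(by_position), scalar, mixed.tolist())
```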

By at

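- Faster than loc/iloc when accessing a single scalar value

- .at[] <-> .loc[]

- .iat[] <-> .iloc[]

- at/iat return a single scalar only; loc/iloc are the ones to use when fetching a DataFrame (table-shaped subset)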

df.at[100, '순이익률(%)']  

%timeit df.loc[100, '순이익률(%)']
%timeit df.at[100, '순이익률(%)'] 
# 6.53 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 3.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df['순이익률(%)'].iloc[100]
%timeit df['순이익률(%)'].iat[100]
# 6.48 µs ± 187 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 3.57 µs ± 100 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
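Apart from speed, at/iat behave like loc/iloc restricted to one cell. A sketch on invented values showing that the accessors agree and that at can also assign:

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy = pd.DataFrame({"순이익률(%)": [1.0, 2.0, 3.0]})

# Label-based scalar access: at mirrors loc.
same_by_label = toy.at[1, "순이익률(%)"] == toy.loc[1, "순이익률(%)"]

# Position-based scalar access: iat mirrors iloc.
col = toy["순이익률(%)"]
same_by_position = col.iat[2] == col.iloc[2]

# at can also write a single cell in place.
toy.at[0, "순이익률(%)"] = 10.0

print(same_by_label, same_by_position, toy.at[0, "순이익률(%)"])
```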
