[Python] Data manipulation with pandas(2)


Pandas

Aggregating dataframe
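- Called on a pandas Series, these return a single value (int/float); called on a pandas DataFrame, they return one value per column, as a Series
  - .mean() - mean
  - .median() - median
  - .mode() - mode (returned as a Series or DataFrame, since several values can tie)
  - .max() - maximum
  - .min() - minimum
  - .var() - variance
  - .std() - standard deviation
  - .sum() - sum
  - .quantile() - quantile
  - .agg() - applies custom functions
- Return an object of the same shape as the input
  - .cumsum() - cumulative sum
  - .cummax() - cumulative maximum
  - .cummin() - cumulative minimum
  - .cumprod() - cumulative product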
unique
- .drop_duplicates()
group_by
- .groupby()
pivot
- .pivot_table

# import data
import seaborn as sns
import pandas as pd
import numpy as np

iris = sns.load_dataset('iris')

.mean() & .median()

# General form, where df is any Series or DataFrame
# df.mean()
# df.median()

print(iris['sepal_length'].mean(), iris['sepal_length'].median())

5.843333333333334 5.8
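
The difference in return types between a Series and a DataFrame can be checked on a small hand-made frame. The data below is a hypothetical stand-in, not the iris dataset. `numeric_only=True` is passed because pandas 2.0 and later no longer silently skip string columns and may raise an error instead.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 6.0],
    'width': [2.0, 2.0, 4.0, 4.0],
    'label': ['a', 'a', 'b', 'b'],
})

# On a Series, mean() and median() return a single number
series_mean = df['length'].mean()
series_median = df['length'].median()

# On a DataFrame they return a Series with one value per column;
# numeric_only=True skips the string column 'label'
frame_mean = df.mean(numeric_only=True)

print(series_mean, series_median)
print(type(frame_mean).__name__)
print(frame_mean)
```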

.agg()

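To apply several functions to several columns, pass the functions in a list.

def iqr(column):
    # Interquartile range: 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Built-in aggregates are passed by name; recent pandas versions warn when given np.mean
iris[['sepal_length', 'sepal_width']].agg([iqr, 'mean'])

	sepal_length	sepal_width
iqr	1.300000	0.500000
mean	5.843333	3.057333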
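
Besides a list of functions, `.agg()` also accepts a dict that maps each column to its own function or functions. The following is a minimal sketch on hypothetical toy data; the `value_range` helper is made up for this example.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 6.0],
    'width': [2.0, 2.0, 4.0, 4.0],
})

def value_range(column):
    """Custom aggregate: distance between the largest and smallest value."""
    return column.max() - column.min()

# A different set of functions per column; built-in aggregates are given by name
result = df.agg({'length': ['mean', value_range], 'width': ['max']})
print(result)
```

Cells for a function that was not requested for a column come back as NaN, so the result has one row per distinct function name.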

.cumsum() & .cummax() & .cummin() & .cumprod()

# Build the cumulative columns in a separate frame so that iris itself
# stays unchanged for the later sections
sepal_length = iris['sepal_length']
cumulative = pd.DataFrame({
    'sepal_length': sepal_length,
    'sepal_length_cumsum': sepal_length.cumsum(),
    'sepal_length_cummax': sepal_length.cummax(),
    'sepal_length_cummin': sepal_length.cummin(),
    'sepal_length_cumprod': sepal_length.cumprod().round(2),
})
cumulative.head(10)

	sepal_length	sepal_length_cumsum	sepal_length_cummax	sepal_length_cummin	sepal_length_cumprod
0	5.1	5.1	5.1	5.1	5.10
1	4.9	10.0	5.1	4.9	24.99
2	4.7	14.7	5.1	4.7	117.45
3	4.6	19.3	5.1	4.6	540.28
4	5.0	24.3	5.1	4.6	2701.42
5	5.4	29.7	5.4	4.6	14587.66
6	4.6	34.3	5.4	4.6	67103.25
7	5.0	39.3	5.4	4.6	335516.24
8	4.4	43.7	5.4	4.4	1476271.46
9	4.9	48.6	5.4	4.4	7233730.13
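
The cumulative methods are easier to verify by hand on a short Series. The numbers below are hypothetical; each result has the same length as the input.

```python
import pandas as pd

# Hypothetical values, chosen so the results are easy to check by hand
values = pd.Series([3, 1, 4, 2])

running_sum = values.cumsum()    # 3, 4, 8, 10
running_max = values.cummax()    # 3, 3, 4, 4
running_min = values.cummin()    # 3, 1, 1, 1
running_prod = values.cumprod()  # 3, 3, 12, 24

print(pd.DataFrame({
    'value': values,
    'cumsum': running_sum,
    'cummax': running_max,
    'cummin': running_min,
    'cumprod': running_prod,
}))
```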

.drop_duplicates()

To deduplicate on two or more columns, pass the column names in a list.

print(
    iris.drop_duplicates('petal_width').shape,
    iris.drop_duplicates(['petal_length', 'petal_width']).shape
)

(22, 5) (102, 5)
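
To see which rows survive, here is a small sketch on hypothetical data. It also shows the `keep` parameter, which the example above leaves at its default of `'first'`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'length': [1.4, 1.4, 1.5, 1.5, 1.4],
    'width': [0.2, 0.2, 0.2, 0.4, 0.3],
})

# One column: keep the first row for each distinct width
by_width = df.drop_duplicates('width')

# Two columns: a row counts as a duplicate only if both values repeat
by_pair = df.drop_duplicates(['length', 'width'])

# keep='last' retains the last occurrence instead of the first
by_pair_last = df.drop_duplicates(['length', 'width'], keep='last')

print(by_width.shape, by_pair.shape)
print(list(by_pair.index), list(by_pair_last.index))
```

The original row labels are preserved, so the index shows which occurrence of each duplicate was kept.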

.value_counts()

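Shown here on a pandas Series (DataFrame.value_counts(), which counts distinct rows, is also available from pandas 1.1)
The normalize option returns proportions instead of counts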

iris['species'].value_counts(sort=True)

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

iris['species'].value_counts(sort=True, normalize=True)

setosa        0.333333
versicolor    0.333333
virginica     0.333333
Name: species, dtype: float64
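
Because every species in iris appears exactly 50 times, the effect of `normalize` is easier to see on unbalanced data. The sketch below uses a hypothetical frame and also calls `DataFrame.value_counts()`, assuming pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'setosa'],
    'size': ['small', 'small', 'large', 'large'],
})

# Counts and proportions for a single column
counts = df['species'].value_counts()
shares = df['species'].value_counts(normalize=True)

# On a DataFrame, value_counts() counts distinct rows
row_counts = df.value_counts()

print(counts)
print(shares)
print(row_counts)
```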

.groupby()

To group by two or more columns, pass the column names in a list.

iris.groupby('species')[['sepal_length', 'sepal_width']].mean()

# With two grouping columns
# iris.groupby(['species', 'petal_length'])[['sepal_length', 'sepal_width']].mean()

	sepal_length	sepal_width
species
setosa	5.006	3.428
versicolor	5.936	2.770
virginica	6.588	2.974

# Built-in aggregates are passed by name; recent pandas versions warn when given np.mean and friends
iris.groupby('species').agg(['mean', 'median', 'max', 'min'])

	sepal_length				sepal_width				petal_length				petal_width
	mean	median	max	min	mean	median	max	min	mean	median	max	min	mean	median	max	min
species
setosa	5.006	5.0	5.8	4.3	3.428	3.4	4.4	2.3	1.462	1.50	1.9	1.0	0.246	0.2	0.6	0.1
versicolor	5.936	5.9	7.0	4.9	2.770	2.8	3.4	2.0	4.260	4.35	5.1	3.0	1.326	1.3	1.8	1.0
virginica	6.588	6.5	7.9	4.9	2.974	3.0	3.8	2.2	5.552	5.55	6.9	4.5	2.026	2.0	2.5	1.4
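
Passing a list of functions produces two-level column labels, as in the table above. When flat, self-describing column names are preferred, named aggregation is one alternative. This is a sketch on hypothetical data, and the output column names are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'length': [1.0, 3.0, 4.0, 8.0],
    'width': [2.0, 4.0, 1.0, 1.0],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby('species').agg(
    length_mean=('length', 'mean'),
    length_max=('length', 'max'),
    width_min=('width', 'min'),
)
print(summary)
```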

.pivot_table

The default aggregation is mean; pass a list to aggfunc to compute several aggregates at once
values: the variables to aggregate
index: the variables to group by; the aggregated values are laid out one row per group
columns: the variables to group by when the aggregated values should be laid out one column per group
fill_value: the value to show in place of NaN
margins: set to True to add overall aggregates (the All row and column)

# Built-in aggregates are passed by name; recent pandas versions warn when given np.mean and np.max
iris.pivot_table(values = ['sepal_length', 'petal_length'],
                 index = ['species','petal_width'],
                 aggfunc = ['mean', 'max'],
                 margins = True)

		mean		max
		petal_length	sepal_length	petal_length	sepal_length
species	petal_width
setosa	0.1	1.380000	4.820000	1.5	5.2
	0.2	1.444828	4.972414	1.9	5.8
	0.3	1.428571	4.971429	1.7	5.7
	0.4	1.571429	5.300000	1.9	5.7
	0.5	1.700000	5.100000	1.7	5.1
	0.6	1.600000	5.000000	1.6	5.0
versicolor	1.0	3.628571	5.414286	4.1	6.0
	1.1	3.566667	5.400000	3.9	5.6
	1.2	4.240000	5.780000	4.7	6.1
	1.3	4.176923	5.884615	4.6	6.6
	1.4	4.500000	6.357143	4.8	7.0
	1.5	4.580000	6.190000	4.9	6.9
	1.6	4.766667	6.100000	5.1	6.3
	1.7	5.000000	6.700000	5.0	6.7
	1.8	4.800000	5.900000	4.8	5.9
virginica	1.4	5.600000	6.100000	5.6	6.1
	1.5	5.050000	6.150000	5.1	6.3
	1.6	5.800000	7.200000	5.8	7.2
	1.7	4.500000	4.900000	4.5	4.9
	1.8	5.381818	6.445455	6.3	7.3
	1.9	5.320000	6.340000	6.1	7.4
	2.0	5.550000	6.650000	6.7	7.9
	2.1	5.783333	6.916667	6.6	7.6
	2.2	6.033333	6.866667	6.7	7.7
	2.3	5.700000	6.912500	6.9	7.7
	2.4	5.433333	6.266667	5.6	6.7
	2.5	5.933333	6.733333	6.1	7.2
All		3.758000	5.843333	6.9	7.9
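
The example above does not use `columns` or `fill_value`. The sketch below shows both on a hypothetical frame in which one species has no rows of size 'large', so that cell would otherwise be NaN.

```python
import pandas as pd

# Hypothetical toy data: species 'b' has no 'large' rows
df = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b', 'b'],
    'size': ['small', 'large', 'small', 'small', 'small'],
    'length': [1.0, 3.0, 4.0, 6.0, 8.0],
})

table = df.pivot_table(
    values='length',    # variable to aggregate
    index='species',    # one row per species
    columns='size',     # one column per size
    aggfunc='mean',
    fill_value=0,       # shown instead of NaN for the empty ('b', 'large') cell
    margins=True,       # adds the 'All' row and column
)
print(table)
```

The 'All' margins are computed from the underlying rows, not from the filled-in cells, so the 0 placed in the empty cell does not pull the overall means down.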


ZSU
