[Python] Data manipulation with pandas(3)

5 minute read

Pandas

Slicing and Indexing DataFrames
- pandas에는 index와 loc을 사용해 더 깔끔한 코드로 subsetting을 수 있음
- 그러나 data가 index로 활용된 측면에서 이는 tidy data는 아님(이 문제는 .reset_index를 통해 해결 가능)
- 개인적으로 dataframe 정렬 후 slicing을 통해 dataframe을 subsetting할 때 유용할 수 있을 것이라고 생각
- .set_index()
- .reset_index()
- .loc()
- .sort_index()

# import data
import seaborn as sns
import pandas as pd
import numpy as np

# pd.set_option('display.max_rows', 1000)
# pd.set_option('display.max_columns', 1000)

iris = sns.load_dataset("iris")
temperatures = pd.read_csv("temperatures.csv")
temperatures.head(3)

	Unnamed: 0	date	city	country	avg_temp_c
0	0	2000-01-01	Abidjan	Côte D'Ivoire	27.293
1	1	2000-02-01	Abidjan	Côte D'Ivoire	27.685
2	2	2000-03-01	Abidjan	Côte D'Ivoire	29.061

.set_index()

index는 multilevel(2개 이상)으로 설정할 수 있음

# index가 1개인 경우
iris["name"] = iris["species"] + np.arange(len(iris)).astype(str)
iris.set_index("name").head()

	sepal_length	sepal_width	petal_length	petal_width	species
name
setosa0	5.1	3.5	1.4	0.2	setosa
setosa1	4.9	3.0	1.4	0.2	setosa
setosa2	4.7	3.2	1.3	0.2	setosa
setosa3	4.6	3.1	1.5	0.2	setosa
setosa4	5.0	3.6	1.4	0.2	setosa

# index가 2개인 경우
temperatures_multi_ind = temperatures.set_index(["country", "city"])
temperatures_multi_ind.head()

		Unnamed: 0	date	avg_temp_c
country	city
Côte D'Ivoire	Abidjan	0	2000-01-01	27.293
	Abidjan	1	2000-02-01	27.685
	Abidjan	2	2000-03-01	29.061
	Abidjan	3	2000-04-01	28.162
	Abidjan	4	2000-05-01	27.547

.reset_index()

index가 다시 column으로 바뀜
옵션 drop=True로 할 경우 index 열 제거

temperatures_ind = temperatures.set_index("city")
temperatures_ind.head(3)

	Unnamed: 0	date	country	avg_temp_c
city
Abidjan	0	2000-01-01	Côte D'Ivoire	27.293
Abidjan	1	2000-02-01	Côte D'Ivoire	27.685
Abidjan	2	2000-03-01	Côte D'Ivoire	29.061

temperatures_ind.reset_index().head(3)

	city	Unnamed: 0	date	country	avg_temp_c
0	Abidjan	0	2000-01-01	Côte D'Ivoire	27.293
1	Abidjan	1	2000-02-01	Côte D'Ivoire	27.685
2	Abidjan	2	2000-03-01	Côte D'Ivoire	29.061

temperatures_ind.reset_index(drop=True).head(3)

	Unnamed: 0	date	country	avg_temp_c
0	0	2000-01-01	Côte D'Ivoire	27.293
1	1	2000-02-01	Côte D'Ivoire	27.685
2	2	2000-03-01	Côte D'Ivoire	29.061

.loc()을 이용한 subsetting(1)

row에 대한 filter

# .loc을 활용해 (row에 대해) filter하는 방법1 (df.column으로 조건을 주면 됨) (추천)
iris.loc[(iris.species == "setosa") & (iris.sepal_length < 5.0)].head()

	sepal_length	sepal_width	petal_length	petal_width	species	name
1	4.9	3.0	1.4	0.2	setosa	setosa1
2	4.7	3.2	1.3	0.2	setosa	setosa2
3	4.6	3.1	1.5	0.2	setosa	setosa3
6	4.6	3.4	1.4	0.3	setosa	setosa6
8	4.4	2.9	1.4	0.2	setosa	setosa8

# .loc을 활용해 (row에 대해) filter하는 방법2 (filter할 것을 index로 설정 후 조건을 주면 됨)
temperatures_ind = temperatures.set_index("city")
temperatures_ind = temperatures_ind.loc[["Moscow", "Saint Petersburg"]]
temperatures_ind.reset_index().head(5)

# 방법1로 방법2를 할 경우의 코드
# temperatures.loc[temperatures.city.isin(["Moscow", "Saint Petersburg"])].head(5)

	city	Unnamed: 0	date	country	avg_temp_c	year
0	Moscow	10725	2000-01-01	Russia	-7.313	2000
1	Moscow	10726	2000-02-01	Russia	-3.551	2000
2	Moscow	10727	2000-03-01	Russia	-1.661	2000
3	Moscow	10728	2000-04-01	Russia	10.096	2000
4	Moscow	10729	2000-05-01	Russia	10.357	2000

# .loc을 활용해 (row에 대해 ) filter 하는 방법 2-2 (multi index인 경우, 리스트 안에 튜플로(outer_index, inner_index) 조건을 주어야 함)
temperatures_multi_ind = temperatures.set_index(["country", "city"])
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]
temperatures_multi_ind.loc[rows_to_keep]

# 방법1로 방법 2-2를 할 경우의 코드
# temperatures[(temperatures.country.isin(["Brazil", "Pakistan"])) & (temperatures.city.isin(["Rio De Janeiro", "Lahore"]))]

		Unnamed: 0	date	avg_temp_c
country	city
Brazil	Rio De Janeiro	12540	2000-01-01	25.974
	Rio De Janeiro	12541	2000-02-01	26.699
	Rio De Janeiro	12542	2000-03-01	26.270
	Rio De Janeiro	12543	2000-04-01	25.750
	Rio De Janeiro	12544	2000-05-01	24.356
...	...	...	...	...
Pakistan	Lahore	8575	2013-05-01	33.457
	Lahore	8576	2013-06-01	34.456
	Lahore	8577	2013-07-01	33.279
	Lahore	8578	2013-08-01	31.511
	Lahore	8579	2013-09-01	NaN

330 rows × 3 columns

.loc()을 이용한 subsetting(2)

row에 대한 filter & column에 대한 select

temperatures_srt = temperatures.set_index(["country", "city"]).sort_index()

# .loc을 활용해 (row에 대해) slice 하는 방법 1 (index slicing ("condition1":"condition2")로 조건을 주면 됨)
temperatures_srt.loc["Pakistan":"Russia"]

# multi index인 경우, inner index에 대해서만 조건을 줄 수 없음, 아래의 겨우 원하는 df를 반환하지 않음
# temperatures_srt.loc["Lahore":"Moscow"]

		Unnamed: 0	date	avg_temp_c	year
country	city
Pakistan	Faisalabad	4785	2000-01-01	12.792	2000
	Faisalabad	4786	2000-02-01	14.339	2000
	Faisalabad	4787	2000-03-01	20.309	2000
	Faisalabad	4788	2000-04-01	29.072	2000
	Faisalabad	4789	2000-05-01	34.845	2000
...	...	...	...	...	...
Russia	Saint Petersburg	13360	2013-05-01	12.355	2013
	Saint Petersburg	13361	2013-06-01	17.185	2013
	Saint Petersburg	13362	2013-07-01	17.234	2013
	Saint Petersburg	13363	2013-08-01	17.153	2013
	Saint Petersburg	13364	2013-09-01	NaN	2013

1155 rows × 4 columns

# .loc을 활용해 (row에 대해) slice 하는 방법 1-2 (multi index인 경우, 리스트 안에 튜플:튜플로("outer_index", "inner_index"):("outer_index", "inner_index") 조건을 주어야 함)
temperatures_srt.loc[("Pakistan","Lahore"):("Russia","Moscow")]

		Unnamed: 0	date	avg_temp_c
country	city
Pakistan	Lahore	8415	2000-01-01	12.792
	Lahore	8416	2000-02-01	14.339
	Lahore	8417	2000-03-01	20.309
	Lahore	8418	2000-04-01	29.072
	Lahore	8419	2000-05-01	34.845
...	...	...	...	...
Russia	Moscow	10885	2013-05-01	16.152
	Moscow	10886	2013-06-01	18.718
	Moscow	10887	2013-07-01	18.136
	Moscow	10888	2013-08-01	17.485
	Moscow	10889	2013-09-01	NaN

660 rows × 3 columns

temperatures_srt = temperatures.set_index(["country", "city"]).sort_index()

# .loc을 활용해 (column)에 대해 slice 하는 방법 2
temperatures_srt.loc[:,'date':'avg_temp_c']

		date	avg_temp_c
country	city
Afghanistan	Kabul	2000-01-01	3.326
	Kabul	2000-02-01	3.454
	Kabul	2000-03-01	9.612
	Kabul	2000-04-01	17.925
	Kabul	2000-05-01	24.658
...	...	...	...
Zimbabwe	Harare	2013-05-01	18.298
	Harare	2013-06-01	17.020
	Harare	2013-07-01	16.299
	Harare	2013-08-01	19.232
	Harare	2013-09-01	NaN

16500 rows × 2 columns

temperatures_srt = temperatures.set_index(["country", "city"]).sort_index()

# .loc을 활용해 (row와 column)에 대해 slice 하는 방법 2
temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"),"date":"avg_temp_c"]

		date	avg_temp_c
country	city
India	Hyderabad	2000-01-01	23.779
	Hyderabad	2000-02-01	25.826
	Hyderabad	2000-03-01	28.821
	Hyderabad	2000-04-01	32.698
	Hyderabad	2000-05-01	32.438
...	...	...	...
Iraq	Baghdad	2013-05-01	28.673
	Baghdad	2013-06-01	33.803
	Baghdad	2013-07-01	36.392
	Baghdad	2013-08-01	35.463
	Baghdad	2013-09-01	NaN

2145 rows × 2 columns

.sort_index

temperatures_ind = temperatures.set_index(["country", "city"])
temperatures_ind.sort_index(level=["country", "city"], ascending=[True,False]).reset_index(drop=True)

# 위와 같음
# temperatures.sort_values(["country", "city"], ascending = [True,False])[["Unnamed: 0", "date", "avg_temp_c"]]

	Unnamed: 0	date	avg_temp_c
0	7260	2000-01-01	3.326
1	7261	2000-02-01	3.454
2	7262	2000-03-01	9.612
3	7263	2000-04-01	17.925
4	7264	2000-05-01	24.658
...	...	...	...
16495	5605	2013-05-01	18.298
16496	5606	2013-06-01	17.020
16497	5607	2013-07-01	16.299
16498	5608	2013-08-01	19.232
16499	5609	2013-09-01	NaN

16500 rows × 3 columns

.iloc을 이용한 pivot_table subsetting

# dataframe["column"].dt.component을 통해 date의 year, month, day 등을 얻을 수 있음
temperatures["date"] = pd.to_datetime(temperatures["date"])
temperatures["year"] = temperatures["date"].dt.year

temp_by_country_city_vs_year = temperatures.pivot_table(
    values = "avg_temp_c", 
    index = ["country", "city"], 
    columns = "year")

temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi"), "2005":"2010"]

	year	2005	2006	2007	2008	2009	2010
country	city
Egypt	Cairo	22.006500	22.050000	22.361000	22.644500	22.625000	23.718250
Egypt	Gizeh	22.006500	22.050000	22.361000	22.644500	22.625000	23.718250
Ethiopia	Addis Abeba	18.312833	18.427083	18.142583	18.165000	18.765333	18.298250
France	Paris	11.552917	11.788500	11.750833	11.278250	11.464083	10.409833
Germany	Berlin	9.919083	10.545333	10.883167	10.657750	10.062500	8.606833
India	Ahmadabad	26.828083	27.282833	27.511167	27.048500	28.095833	28.017833
	Bangalore	25.476500	25.418250	25.464333	25.352583	25.725750	25.705250
	Bombay	27.035750	27.381500	27.634667	27.177750	27.844500	27.765417
	Calcutta	26.729167	26.986250	26.584583	26.522333	27.153250	27.288833
	Delhi	25.716083	26.365917	26.145667	25.675000	26.554250	26.520250

.pivot_table().mean()

.pivot_table도 dataframe이기에 mean()함수 적용 가능
.mean()의 옵션으로 axis = ‘columns’/’index’를 입력해 해당 axis에 대한 평균 산출 가능

mean_temp_by_year = temperatures.pivot_table(
    values = "avg_temp_c", 
    index = ["country", "city"], 
    columns = "year").mean()

mean_temp_by_city = temp_by_country_city_vs_year.mean(axis = 'columns')

# Filter for the city that had the lowest mean temp
mean_temp_by_city[mean_temp_by_city == min(mean_temp_by_city)]

country  city  
China    Harbin    4.876551
dtype: float64

Share on

Twitter Facebook LinkedIn

ZSU

[Python] Data manipulation with pandas(3)

Pandas

.set_index()

.reset_index()

.loc()을 이용한 subsetting(1)

.loc()을 이용한 subsetting(2)

.sort_index

.iloc을 이용한 pivot_table subsetting

.pivot_table().mean()

Share on

You may also enjoy

[Clean Architecture] CH9 LSP(리스코프 치환 원칙)

[Clean Architecture] CH8 OCP(개방-폐쇄 원칙)

[Clean Architecture] CH7 SRP(단일 책임 원칙)

[Clean Architecture] CH6 함수형 프로그래밍