[Python] Reshaping Data with pandas(2)

4 minute read

Pandas

.melt() - wide to long
pd.wide_to_long - wide to long
- function of pandas
.str - handling string in pandas

# import pkg & dataset
import numpy as np
import pandas as pd
df_book = pd.read_csv("books.csv", sep =",", index_col="bookID")
df_book.head(1)

	title	authors	average_rating	isbn	isbn13	language_code	num_pages	ratings_count	text_reviews_count	publication_date	publisher
bookID
1	Harry Potter and the Half-Blood Prince (Harry ...	J.K. Rowling/Mary GrandPré	4.57	0439785960	9780439785969	eng	652	2095690.0	27591.0	9/16/2006	Scholastic Inc.

.melt()

parameter
- id_vars - identifier variable
- value_vars - category variable (default는 전체)
- var_name - categroy variable cloumn name
- value_name - value column name

# value_vars default는 모든 열
df_book.melt(id_vars="title")

	title	variable	value
0	Harry Potter and the Half-Blood Prince (Harry ...	authors	J.K. Rowling/Mary GrandPré
1	Harry Potter and the Order of the Phoenix (Har...	authors	J.K. Rowling/Mary GrandPré
2	Harry Potter and the Chamber of Secrets (Harry...	authors	J.K. Rowling
3	Harry Potter and the Prisoner of Azkaban (Harr...	authors	J.K. Rowling/Mary GrandPré
4	Harry Potter Boxed Set Books 1-5 (Harry Potte...	authors	J.K. Rowling/Mary GrandPré
...	...	...	...
111305	Expelled from Eden: A William T. Vollmann Reader	publisher	Da Capo Press
111306	You Bright and Risen Angels	publisher	Penguin Books
111307	The Ice-Shirt (Seven Dreams #1)	publisher	Penguin Books
111308	Poor People	publisher	Ecco
111309	Las aventuras de Tom Sawyer	publisher	Edimat Libros

111310 rows × 3 columns

# 관심있는 열만 지정할 수 있음
df_book.melt(id_vars=["title",'authors'],
             value_vars=["ratings_count","num_pages"],
             var_name="feature",
             value_name="number")

	title	authors	feature	number
0	Harry Potter and the Half-Blood Prince (Harry ...	J.K. Rowling/Mary GrandPré	ratings_count	2095690.0
1	Harry Potter and the Order of the Phoenix (Har...	J.K. Rowling/Mary GrandPré	ratings_count	2153167.0
2	Harry Potter and the Chamber of Secrets (Harry...	J.K. Rowling	ratings_count	6333.0
3	Harry Potter and the Prisoner of Azkaban (Harr...	J.K. Rowling/Mary GrandPré	ratings_count	2339585.0
4	Harry Potter Boxed Set Books 1-5 (Harry Potte...	J.K. Rowling/Mary GrandPré	ratings_count	41428.0
...	...	...	...	...
22257	Expelled from Eden: A William T. Vollmann Reader	William T. Vollmann/Larry McCaffery/Michael He...	num_pages	512
22258	You Bright and Risen Angels	William T. Vollmann	num_pages	635
22259	The Ice-Shirt (Seven Dreams #1)	William T. Vollmann	num_pages	415
22260	Poor People	William T. Vollmann	num_pages	434
22261	Las aventuras de Tom Sawyer	Mark Twain	num_pages	272

22262 rows × 4 columns

pd.wide_to_long

parameter
- dataframe
- stubnames - prefix
- i - id
- j - suffix
- sep - separation criterion (default=””)
  - always assumed that prefix is immediately followed by a numeric suffix
- suffix - if suffix is not numeric, suffix can be assigned using regex

# dataset
df1 = df_book[["title","language_code","publication_date","num_pages"]]
df1 = df1.rename(columns={"language_code":"language","publication_date":"publication date","num_pages":"page number"})
df1 = df1.iloc[:3]
df1["publication number"] = [2,6,4]
df1.set_index("title", inplace=True)
df1

	language	publication date	page number	publication number
title
Harry Potter and the Half-Blood Prince (Harry Potter #6)	eng	9/16/2006	652	2
Harry Potter and the Order of the Phoenix (Harry Potter #5)	eng	9/1/2004	870	6
Harry Potter and the Chamber of Secrets (Harry Potter #2)	eng	11/1/2003	352	4

# wide_to_long은 index을 사용할 수 없기에 일반 column으로 전환
df1 = df1.reset_index(drop=False)

pd.wide_to_long(
    df1,
    stubnames=["publication", "page"],
    i=["title", "language"],
    j="feature",
    sep=" ",
    suffix="\w+"
)

			publication	page
title	language	feature
Harry Potter and the Half-Blood Prince (Harry Potter #6)	eng	date	9/16/2006	NaN
Harry Potter and the Half-Blood Prince (Harry Potter #6)	eng	number	2	652
Harry Potter and the Order of the Phoenix (Harry Potter #5)	eng	date	9/1/2004	NaN
Harry Potter and the Order of the Phoenix (Harry Potter #5)	eng	number	6	870
Harry Potter and the Chamber of Secrets (Harry Potter #2)	eng	date	11/1/2003	NaN
Harry Potter and the Chamber of Secrets (Harry Potter #2)	eng	number	4	352

.str

Pandas series를 string으로 데이터 처리 가능
.split으로 문자열 분리
- 첫 번째 parameter는 분리 기준
- expand=True를 통해 분리된 값을 dataframe으로 반환
.get으로 분리된 문자열 선택 가능

# dataset
df2 = df_book[["title", "num_pages", "ratings_count"]]
df2 = df2[df2["title"].str.contains("The Science of Sherlock Holmes|The New Annotated Sherlock Holmes", na=False)] # na=False - df2["title"] 값이 NA인 경우에는 제외
df2 = df2.rename(columns={"title":"main_title", "num_pages":"number_pages", "ratings_count":"number_ratings"})
df2["version"] = ["Vol I", "Vol II", "Vol I"]
df2.reset_index(drop=True, inplace=True)
df2

	main_title	number_pages	number_ratings	version
0	The New Annotated Sherlock Holmes: The Complet...	1878	1411.0	Vol I
1	The New Annotated Sherlock Holmes: The Novels	907	2203.0	Vol II
2	The Science of Sherlock Holmes: From Baskervil...	244	2037.0	Vol I

# Split main_title by a colon and assign it to two columns named title and subtitle 
df2[['title', 'subtitle']] = df2['main_title'].str.split(':', expand=True)

# Split version by a space and assign the second element to the column named volume 
df2['volume'] = df2['version'].str.split(' ').str.get(1)

# Drop the main_title and version columns modifying books_sh
df2.drop(['main_title', 'version'], axis=1, inplace=True)

# Reshape using title, subtitle and volume as index, name feature the new variable from columns starting with number, separated by undescore and ending in words 
df2_long = pd.wide_to_long(df2, 
                          stubnames="number", 
                          i=["title", "subtitle", "volume"], 
                          j="feature", 
                          sep="_", 
                          suffix="\w+")

df2_long

				number
title	subtitle	volume	feature
The New Annotated Sherlock Holmes	The Complete Short Stories	I	pages	1878
	The Complete Short Stories	I	ratings	1411.0
	The Novels	II	pages	907
	The Novels	II	ratings	2203.0
The Science of Sherlock Holmes	From Baskerville Hall to the Valley of Fear the Real Forensics Behind the Great Detective's Greatest Cases	I	pages	244
The Science of Sherlock Holmes		I	ratings	2037.0

Share on

Twitter Facebook LinkedIn

ZSU

[Python] Reshaping Data with pandas(2)

Pandas

.melt()

pd.wide_to_long

.str

Share on

You may also enjoy

[Kubernetes] Force Delete Pod, PV, and PVC

[Software Engineering at Google] CH8 Style Guides and Rules

[Software Engineering at Google] CH7 Measuring Engineering Productivity

[Software Engineering at Google] CH6 Leading at Scale