[Python] Data manipulation with pandas(4)

1 minute read


Pandas

  • Creating and Visualizing DataFrames
    • .plot()
      • kind=”bar”/ “line”/ “scatter”
    • .hist()
    • .legend()
  • Missing data
    • .isna()
  • Read & write dataframe
    • pd.read_csv()
    • .to_csv()

# import data
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

.plot()

  • kind=”bar”
petal_len_by_species = iris.groupby("species")[["petal_length"]].mean()
petal_len_by_species.plot(kind="bar")
plt.show()

  • kind = “line”
sp_wd_by_sp_len = iris.groupby("sepal_length")[["sepal_width"]].mean()
sp_wd_by_sp_len.plot(kind="line")
plt.show()

  • kind = “scatter”
iris.plot(x="sepal_length",
          y="sepal_width",
          kind="scatter",
          title="Sepal length x Sepal width")

plt.show()


.hist()

  • pandas Series로 subsetting 한 경우 <AxesSubplot:>이 되어 2개의 그래프가 하나의 그래프로 표현될 수 있음
iris[iris.species=="setosa"]['sepal_length'].hist(alpha=0.5, bins=20)
iris[iris.species=="versicolor"]['sepal_length'].hist(alpha=0.5, bins=20)
plt.legend(['setosa sepal length','versicolor sepal length'])
plt.show()


.isna()

  • .any()
  • .sum()
# column별 missing value 존재 유무 확인
iris.isna().any()

# column별 missing value 개수 확인
iris.isna().sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

.dropna()

  • 결측값이 존재하는 row 삭제
iris.dropna().head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

.fillna()

  • 결측값 채워넣기
iris.fillna(0).head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Creating dataframes

  • 1) from a dictionary of lists
  • 2) from a list of dictionaries
# Create a dictionary of lists with new data
class_dict = {
  "class": ["A", "B"],
  "number_of_students": [40, 38],
  "Teacher": ["Kwon", "Park"]
}

pd.DataFrame(class_dict)
class number_of_students Teacher
0 A 40 Kwon
1 B 38 Park
class_list = [
    {"class": "A", "number_of_students": 40, "Teacher":"Kwon"},
    {"class": "B", "number_of_students": 38, "Teacher":"Park"},
]

pd.DataFrame(class_list)
class number_of_students Teacher
0 A 40 Kwon
1 B 38 Park

pd.read_csv() & .to_csv()

tp = pd.read_csv("temperatures.csv")
tp.head()
Unnamed: 0 date city country avg_temp_c
0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293
1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685
2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061
3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162
4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547
tp.to_csv("name_of_file.csv")