캐글 (5)
메타 데이터를 이용한
데이터 관찰 및 준비
kaggle 관련 글
- 캐글 (1) Simple Matplotlib & Visualization Tips 공부하기
- 캐글 (2) 타이타닉 튜토리얼 1,2 공부하기
- 캐글 (3) 타이타닉 생존자 EDA 부터 분류까지
- 캐글 (4) 타이타닉 생존자 앙상블과 스태킹까지
- 캐글 (5) 메타 데이터를 이용한 데이터 관찰 및 준비
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
캐글의 Porto Seguro’s Safe Driver Prediction 데이터를 통해 메타 데이터를 이용하여 데이터 분석 준비하는 방법을 필사했습니다.
- Visual inspection of your data
- Defining the metadata
- Descriptive statistics
- Handling imbalanced classes
- Data quality checks
- Exploratory data visualization
- Feature engineering
- Feature selection
- Feature scaling
## 패키지 설치
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_columns', 100)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
데이터 확인하기
유사한 그룹에 속하는 feature은 태그가 지정됩니다. (ind,reg,car,calc)
feature 이름에서 bin은 이진 특성을, cat은 범주형 특성입니다.
그 외에는 연속 또는 순서형 특성입니다.
-1은 결측값입니다.
train.head()
id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_14 | ps_ind_15 | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_01_cat | ps_car_02_cat | ps_car_03_cat | ps_car_04_cat | ps_car_05_cat | ps_car_06_cat | ps_car_07_cat | ps_car_08_cat | ps_car_09_cat | ps_car_10_cat | ps_car_11_cat | ps_car_11 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 0 | 0.7 | 0.2 | 0.718070 | 10 | 1 | -1 | 0 | 1 | 4 | 1 | 0 | 0 | 1 | 12 | 2 | 0.400000 | 0.883679 | 0.370810 | 3.605551 | 0.6 | 0.5 | 0.2 | 3 | 1 | 10 | 1 | 10 | 1 | 5 | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0.8 | 0.4 | 0.766078 | 11 | 1 | -1 | 0 | -1 | 11 | 1 | 1 | 2 | 1 | 19 | 3 | 0.316228 | 0.618817 | 0.388716 | 2.449490 | 0.3 | 0.1 | 0.3 | 2 | 1 | 9 | 5 | 8 | 1 | 7 | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.0 | 0.0 | -1.000000 | 7 | 1 | -1 | 0 | -1 | 14 | 1 | 1 | 2 | 1 | 60 | 1 | 0.316228 | 0.641586 | 0.347275 | 3.316625 | 0.5 | 0.7 | 0.1 | 2 | 2 | 9 | 1 | 8 | 2 | 7 | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0.9 | 0.2 | 0.580948 | 7 | 1 | 0 | 0 | 1 | 11 | 1 | 1 | 3 | 1 | 104 | 1 | 0.374166 | 0.542949 | 0.294958 | 2.000000 | 0.6 | 0.9 | 0.1 | 2 | 4 | 7 | 1 | 8 | 4 | 2 | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 0 | 0 | 0.7 | 0.6 | 0.840759 | 11 | 1 | -1 | 0 | -1 | 14 | 1 | 1 | 2 | 1 | 82 | 3 | 0.316070 | 0.565832 | 0.365103 | 2.000000 | 0.4 | 0.6 | 0.0 | 2 | 2 | 6 | 3 | 10 | 2 | 12 | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
train.tail()
id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_14 | ps_ind_15 | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_01_cat | ps_car_02_cat | ps_car_03_cat | ps_car_04_cat | ps_car_05_cat | ps_car_06_cat | ps_car_07_cat | ps_car_08_cat | ps_car_09_cat | ps_car_10_cat | ps_car_11_cat | ps_car_11 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
595207 | 1488013 | 0 | 3 | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0.5 | 0.3 | 0.692820 | 10 | 1 | -1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 31 | 3 | 0.374166 | 0.684631 | 0.385487 | 2.645751 | 0.4 | 0.5 | 0.3 | 3 | 0 | 9 | 0 | 9 | 1 | 12 | 4 | 1 | 9 | 6 | 0 | 1 | 1 | 0 | 1 | 1 |
595208 | 1488016 | 0 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 0.9 | 0.7 | 1.382027 | 9 | 1 | -1 | 0 | -1 | 15 | 0 | 0 | 2 | 1 | 63 | 2 | 0.387298 | 0.972145 | -1.000000 | 3.605551 | 0.2 | 0.2 | 0.0 | 2 | 4 | 8 | 6 | 8 | 2 | 12 | 4 | 1 | 3 | 8 | 1 | 0 | 1 | 0 | 1 | 1 |
595209 | 1488017 | 0 | 1 | 1 | 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.9 | 0.2 | 0.659071 | 7 | 1 | -1 | 0 | -1 | 1 | 1 | 1 | 2 | 1 | 31 | 3 | 0.397492 | 0.596373 | 0.398748 | 1.732051 | 0.4 | 0.0 | 0.3 | 3 | 2 | 7 | 4 | 8 | 0 | 10 | 3 | 2 | 2 | 6 | 0 | 0 | 1 | 0 | 0 | 0 |
595210 | 1488021 | 0 | 5 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 1 | 0 | 0 | 0.9 | 0.4 | 0.698212 | 11 | 1 | -1 | 0 | -1 | 11 | 1 | 1 | 2 | 1 | 101 | 3 | 0.374166 | 0.764434 | 0.384968 | 3.162278 | 0.0 | 0.7 | 0.0 | 4 | 0 | 9 | 4 | 9 | 2 | 11 | 4 | 1 | 4 | 2 | 0 | 1 | 1 | 1 | 0 | 0 |
595211 | 1488027 | 0 | 0 | 1 | 8 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0.1 | 0.2 | -1.000000 | 7 | 0 | -1 | 0 | -1 | 0 | 1 | 0 | 2 | 1 | 34 | 2 | 0.400000 | 0.932649 | 0.378021 | 3.741657 | 0.4 | 0.0 | 0.5 | 2 | 3 | 10 | 4 | 10 | 2 | 5 | 4 | 4 | 3 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
train.shape
(595212, 59)
train.drop_duplicates()
train.shape
(595212, 59)
test.shape
(892816, 58)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 595212 non-null int64
1 target 595212 non-null int64
2 ps_ind_01 595212 non-null int64
3 ps_ind_02_cat 595212 non-null int64
4 ps_ind_03 595212 non-null int64
5 ps_ind_04_cat 595212 non-null int64
6 ps_ind_05_cat 595212 non-null int64
7 ps_ind_06_bin 595212 non-null int64
8 ps_ind_07_bin 595212 non-null int64
9 ps_ind_08_bin 595212 non-null int64
10 ps_ind_09_bin 595212 non-null int64
11 ps_ind_10_bin 595212 non-null int64
12 ps_ind_11_bin 595212 non-null int64
13 ps_ind_12_bin 595212 non-null int64
14 ps_ind_13_bin 595212 non-null int64
15 ps_ind_14 595212 non-null int64
16 ps_ind_15 595212 non-null int64
17 ps_ind_16_bin 595212 non-null int64
18 ps_ind_17_bin 595212 non-null int64
19 ps_ind_18_bin 595212 non-null int64
20 ps_reg_01 595212 non-null float64
21 ps_reg_02 595212 non-null float64
22 ps_reg_03 595212 non-null float64
23 ps_car_01_cat 595212 non-null int64
24 ps_car_02_cat 595212 non-null int64
25 ps_car_03_cat 595212 non-null int64
26 ps_car_04_cat 595212 non-null int64
27 ps_car_05_cat 595212 non-null int64
28 ps_car_06_cat 595212 non-null int64
29 ps_car_07_cat 595212 non-null int64
30 ps_car_08_cat 595212 non-null int64
31 ps_car_09_cat 595212 non-null int64
32 ps_car_10_cat 595212 non-null int64
33 ps_car_11_cat 595212 non-null int64
34 ps_car_11 595212 non-null int64
35 ps_car_12 595212 non-null float64
36 ps_car_13 595212 non-null float64
37 ps_car_14 595212 non-null float64
38 ps_car_15 595212 non-null float64
39 ps_calc_01 595212 non-null float64
40 ps_calc_02 595212 non-null float64
41 ps_calc_03 595212 non-null float64
42 ps_calc_04 595212 non-null int64
43 ps_calc_05 595212 non-null int64
44 ps_calc_06 595212 non-null int64
45 ps_calc_07 595212 non-null int64
46 ps_calc_08 595212 non-null int64
47 ps_calc_09 595212 non-null int64
48 ps_calc_10 595212 non-null int64
49 ps_calc_11 595212 non-null int64
50 ps_calc_12 595212 non-null int64
51 ps_calc_13 595212 non-null int64
52 ps_calc_14 595212 non-null int64
53 ps_calc_15_bin 595212 non-null int64
54 ps_calc_16_bin 595212 non-null int64
55 ps_calc_17_bin 595212 non-null int64
56 ps_calc_18_bin 595212 non-null int64
57 ps_calc_19_bin 595212 non-null int64
58 ps_calc_20_bin 595212 non-null int64
dtypes: float64(10), int64(49)
memory usage: 267.9 MB
float인지 int인지 자료형을 알 수 있으며 null 대신 -1이 들어갔으므로 null이 없습니다.
메타 데이터
메타데이터를 통해 데이터 정보를 저장합니다.
변수의 타입과 feature의 특성을 저장합니다.
data = []
for f in train.columns:
if f == 'target':
role = 'target'
elif f == 'id':
role = 'id'
else:
role = 'input'
if 'bin' in f or f == 'target':
level = 'binary'
elif 'cat' in f or f == 'id':
level = 'nominal'
elif train[f].dtype == float:
level = 'interval'
elif train[f].dtype == int:
level = 'ordinal'
keep = True
if f == 'id':
keep = False
dtype = train[f].dtype
f_dict = {
'varname': f,
'role': role,
'level': level,
'keep': keep,
'dtype': dtype
}
data.append(f_dict)
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
meta
role | level | keep | dtype | |
---|---|---|---|---|
varname | ||||
id | id | nominal | False | int64 |
target | target | binary | True | int64 |
ps_ind_01 | input | binary | True | int64 |
ps_ind_02_cat | input | nominal | True | int64 |
ps_ind_03 | input | nominal | True | int64 |
ps_ind_04_cat | input | nominal | True | int64 |
ps_ind_05_cat | input | nominal | True | int64 |
ps_ind_06_bin | input | binary | True | int64 |
ps_ind_07_bin | input | binary | True | int64 |
ps_ind_08_bin | input | binary | True | int64 |
ps_ind_09_bin | input | binary | True | int64 |
ps_ind_10_bin | input | binary | True | int64 |
ps_ind_11_bin | input | binary | True | int64 |
ps_ind_12_bin | input | binary | True | int64 |
ps_ind_13_bin | input | binary | True | int64 |
ps_ind_14 | input | binary | True | int64 |
ps_ind_15 | input | binary | True | int64 |
ps_ind_16_bin | input | binary | True | int64 |
ps_ind_17_bin | input | binary | True | int64 |
ps_ind_18_bin | input | binary | True | int64 |
ps_reg_01 | input | interval | True | float64 |
ps_reg_02 | input | interval | True | float64 |
ps_reg_03 | input | interval | True | float64 |
ps_car_01_cat | input | nominal | True | int64 |
ps_car_02_cat | input | nominal | True | int64 |
ps_car_03_cat | input | nominal | True | int64 |
ps_car_04_cat | input | nominal | True | int64 |
ps_car_05_cat | input | nominal | True | int64 |
ps_car_06_cat | input | nominal | True | int64 |
ps_car_07_cat | input | nominal | True | int64 |
ps_car_08_cat | input | nominal | True | int64 |
ps_car_09_cat | input | nominal | True | int64 |
ps_car_10_cat | input | nominal | True | int64 |
ps_car_11_cat | input | nominal | True | int64 |
ps_car_11 | input | nominal | True | int64 |
ps_car_12 | input | interval | True | float64 |
ps_car_13 | input | interval | True | float64 |
ps_car_14 | input | interval | True | float64 |
ps_car_15 | input | interval | True | float64 |
ps_calc_01 | input | interval | True | float64 |
ps_calc_02 | input | interval | True | float64 |
ps_calc_03 | input | interval | True | float64 |
ps_calc_04 | input | interval | True | int64 |
ps_calc_05 | input | interval | True | int64 |
ps_calc_06 | input | interval | True | int64 |
ps_calc_07 | input | interval | True | int64 |
ps_calc_08 | input | interval | True | int64 |
ps_calc_09 | input | interval | True | int64 |
ps_calc_10 | input | interval | True | int64 |
ps_calc_11 | input | interval | True | int64 |
ps_calc_12 | input | interval | True | int64 |
ps_calc_13 | input | interval | True | int64 |
ps_calc_14 | input | interval | True | int64 |
ps_calc_15_bin | input | binary | True | int64 |
ps_calc_16_bin | input | binary | True | int64 |
ps_calc_17_bin | input | binary | True | int64 |
ps_calc_18_bin | input | binary | True | int64 |
ps_calc_19_bin | input | binary | True | int64 |
ps_calc_20_bin | input | binary | True | int64 |
meta[(meta.level == 'nominal') & (meta.keep)].index
Index(['ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat', 'ps_ind_05_cat',
'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11'],
dtype='object', name='varname')
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
role | level | count | |
---|---|---|---|
0 | id | nominal | 1 |
1 | input | binary | 20 |
2 | input | interval | 21 |
3 | input | nominal | 16 |
4 | target | binary | 1 |
통계량 살펴보기
메타 데이터에서 interval인 값만 찾아서 describe를 사용할 수 있습니다.
ps_reg_03, ps_car_12, ps_car_15에 결측값이 있습니다.
v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()
ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.610991 | 0.439184 | 0.551102 | 0.379945 | 0.813265 | 0.276256 | 3.065899 | 0.449756 | 0.449589 | 0.449849 | 2.372081 | 1.885886 | 7.689445 | 3.005823 | 9.225904 | 2.339034 | 8.433590 | 5.441382 | 1.441918 | 2.872288 | 7.539026 |
std | 0.287643 | 0.404264 | 0.793506 | 0.058327 | 0.224588 | 0.357154 | 0.731366 | 0.287198 | 0.286893 | 0.287153 | 1.117219 | 1.134927 | 1.334312 | 1.414564 | 1.459672 | 1.246949 | 2.904597 | 2.332871 | 1.202963 | 1.694887 | 2.746652 |
min | 0.000000 | 0.000000 | -1.000000 | -1.000000 | 0.250619 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 0.200000 | 0.525000 | 0.316228 | 0.670867 | 0.333167 | 2.828427 | 0.200000 | 0.200000 | 0.200000 | 2.000000 | 1.000000 | 7.000000 | 2.000000 | 8.000000 | 1.000000 | 6.000000 | 4.000000 | 1.000000 | 2.000000 | 6.000000 |
50% | 0.700000 | 0.300000 | 0.720677 | 0.374166 | 0.765811 | 0.368782 | 3.316625 | 0.500000 | 0.400000 | 0.500000 | 2.000000 | 2.000000 | 8.000000 | 3.000000 | 9.000000 | 2.000000 | 8.000000 | 5.000000 | 1.000000 | 3.000000 | 7.000000 |
75% | 0.900000 | 0.600000 | 1.000000 | 0.400000 | 0.906190 | 0.396485 | 3.605551 | 0.700000 | 0.700000 | 0.700000 | 3.000000 | 3.000000 | 9.000000 | 4.000000 | 10.000000 | 3.000000 | 10.000000 | 7.000000 | 2.000000 | 4.000000 | 9.000000 |
max | 0.900000 | 1.800000 | 4.037945 | 1.264911 | 3.720626 | 0.636396 | 3.741657 | 0.900000 | 0.900000 | 0.900000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 12.000000 | 7.000000 | 25.000000 | 19.000000 | 10.000000 | 13.000000 | 23.000000 |
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
target | ps_ind_01 | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_14 | ps_ind_15 | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.036448 | 1.900378 | 0.393742 | 0.257033 | 0.163921 | 0.185304 | 0.000373 | 0.001692 | 0.009439 | 0.000948 | 0.012451 | 7.299922 | 0.660823 | 0.121081 | 0.153446 | 0.122427 | 0.627840 | 0.554182 | 0.287182 | 0.349024 | 0.153318 |
std | 0.187401 | 1.983789 | 0.488579 | 0.436998 | 0.370205 | 0.388544 | 0.019309 | 0.041097 | 0.096693 | 0.030768 | 0.127545 | 3.546042 | 0.473430 | 0.326222 | 0.360417 | 0.327779 | 0.483381 | 0.497056 | 0.452447 | 0.476662 | 0.360295 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 13.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
target=1인 비율이 target=0보다 훨씬 작기 때문에 target=1을 오버 샘플링 혹은 target=0으로 언더샘플링 할 수 있습니다.
언더 샘플링은 불균형한 데이터 셋에서 높은 비율을 차지하던 클래스의 데이터 수를 줄임으로써 데이터 불균형을 해소하는 아이디어 입니다.
하지만 이 방법은 학습에 사용되는 전체 데이터 수를 급격하게 감소시켜 오히려 성능이 떨어질 수 있습니다.
오버 샘플링은 낮은 비율 클래스의 데이터 수를 늘림으로써 데이터 불균형을 해소하는 아이디어 입니다.
desired_apriori=0.10
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)
idx_list = list(undersampled_idx) + list(idx_1)
train = train.loc[idx_list].reset_index(drop=True)
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246
ps_car_03_cat 및 ps_car_05_cat는 결측값이 있는 비율이 높습니다. 이 변수를 제거합니다.
결측값이 있는 다른 범주형 변수의 경우 결측값 -1을 그대로 둘 수 있습니다.
ps_reg_03 (continuous)에는 18%에 대한 결측값이 있습니다. 평균으로 대체합니다.
ps_car_11 (순서형)에는 결측 값이 5개만 있습니다. 모드로 대체합니다.
ps_car_12 (연속)에는 결측값이 1개만 있습니다. 평균으로 대체합니다.
ps_car_14 (연속)에는 7%에 대한 결측값이 있습니다. 평균으로 대체합니다.
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop),'keep'] = False # 메타 데이터 update
mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
v = meta[(meta.level == 'nominal') & (meta.keep)].index
for f in v:
dist_values = train[f].value_counts().shape[0]
print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_03 has 12 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values
Variable ps_car_11 has 4 distinct values
def add_noise(series, noise_level):
return series * (1 + noise_level * np.random.randn(len(series)))
def target_encode(trn_series=None,
tst_series=None,
target=None,
min_samples_leaf=1,
smoothing=1,
noise_level=0):
assert len(trn_series) == len(target)
assert trn_series.name == tst_series.name
temp = pd.concat([trn_series, target], axis=1)
# 평균 계산
averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
prior = target.mean()
averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
averages.drop(["mean", "count"], axis=1, inplace=True)
ft_trn_series = pd.merge(
trn_series.to_frame(trn_series.name),
averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
on=trn_series.name,
how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
ft_trn_series.index = trn_series.index
ft_tst_series = pd.merge(
tst_series.to_frame(tst_series.name),
averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
on=tst_series.name,
how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
ft_tst_series.index = tst_series.index
return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"],
test["ps_car_11_cat"],
target=train.target,
min_samples_leaf=100,
smoothing=10,
noise_level=0.01)
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False # 메타 데이터 업데이트
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)
탐색 데이터 시각화
범주값으로 결측값을 유지한다면 비율을 확인하기 좋습니다.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
for f in v:
plt.figure()
fig, ax = plt.subplots(figsize=(20,10))
cat_perc = train[[f, 'target']].groupby([f],as_index=False).mean()
cat_perc.sort_values(by='target', ascending=False, inplace=True)
sns.barplot(ax=ax, x=f, y='target', data=cat_perc, order=cat_perc[f])
plt.ylabel('% target', fontsize=18)
plt.xlabel(f, fontsize=18)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.show();
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
구간 변수는 상관 관계를 확인하는 것이 좋습니다.
def corr_heatmap(v):
correlations = train[v].corr()
cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
plt.show();
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)
s = train.sample(frac=0.1)
sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()
sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()
Feature Engineering
# 더미 변수 만들기
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns=v, drop_first=True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 121 variables in train
# 상호작용 변수 만들기
v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions.drop(v, axis=1, inplace=True)
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
C:\Users\keonj\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
Before creating interactions we have 121 variables in train
After creating interactions we have 352 variables in train
Feature Selection
# VarianceThreshold을 통해 분산이 작은 값 제거
selector = VarianceThreshold(threshold=.01)
selector.fit(train.drop(['id', 'target'], axis=1))
f = np.vectorize(lambda x : not x)
v = train.drop(['id', 'target'], axis=1).columns[f(selector.get_support())]
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
28 variables have too low variance.
These variables are ['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_12', 'ps_car_14', 'ps_car_11_cat_te', 'ps_ind_05_cat_2', 'ps_ind_05_cat_5', 'ps_car_01_cat_1', 'ps_car_01_cat_2', 'ps_car_04_cat_3', 'ps_car_04_cat_4', 'ps_car_04_cat_5', 'ps_car_04_cat_6', 'ps_car_04_cat_7', 'ps_car_06_cat_2', 'ps_car_06_cat_5', 'ps_car_06_cat_8', 'ps_car_06_cat_12', 'ps_car_06_cat_16', 'ps_car_06_cat_17', 'ps_car_09_cat_4', 'ps_car_10_cat_1', 'ps_car_10_cat_2', 'ps_car_12^2', 'ps_car_12 ps_car_14', 'ps_car_14^2']
# 랜덤 포레스트를 통한 특성 선택 및 모델 선택
X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']
feat_labels = X_train.columns
rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
indices = np.argsort(rf.feature_importances_)[::-1]
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
1) ps_car_11_cat_te 0.008718
2) ps_car_13 ps_calc_08 0.006701
3) ps_car_13 ps_calc_06 0.006655
4) ps_car_13^2 0.006577
5) ps_car_13 0.006575
6) ps_car_13 ps_car_14 0.006478
7) ps_car_12 ps_car_13 0.006459
8) ps_car_13 ps_car_15 0.006277
9) ps_reg_01 ps_car_13 0.006266
10) ps_reg_03 ps_car_13 0.006250
11) ps_car_14 ps_calc_08 0.006093
12) ps_car_14 ps_calc_06 0.006032
13) ps_car_14 ps_car_15 0.005884
14) ps_reg_03 ps_car_14 0.005808
15) ps_car_13 ps_calc_10 0.005789
16) ps_car_13 ps_calc_14 0.005764
17) ps_reg_03 ps_calc_08 0.005740
18) ps_car_13 ps_calc_11 0.005700
19) ps_reg_03 ps_car_12 0.005685
20) ps_reg_03 ps_calc_06 0.005609
21) ps_car_14 ps_calc_10 0.005494
22) ps_car_13 ps_calc_07 0.005493
23) ps_reg_03 ps_calc_10 0.005473
24) ps_car_14 ps_calc_14 0.005460
25) ps_reg_01 ps_car_14 0.005427
26) ps_car_14 ps_calc_11 0.005415
27) ps_car_13 ps_calc_04 0.005409
28) ps_reg_03 ps_car_15 0.005362
29) ps_reg_03 ps_calc_11 0.005356
30) ps_reg_03 ps_calc_14 0.005352
31) ps_reg_02 ps_car_13 0.005295
32) ps_car_13 ps_calc_09 0.005189
33) ps_car_14 0.005183
34) ps_car_14^2 0.005173
35) ps_reg_01 ps_reg_03 0.005137
36) ps_car_12 ps_car_14 0.005126
37) ps_reg_03 ps_calc_07 0.005110
38) ps_car_14 ps_calc_07 0.005089
39) ps_reg_03 0.005059
40) ps_car_13 ps_calc_01 0.005045
41) ps_reg_03^2 0.005036
42) ps_car_14 ps_calc_04 0.005024
43) ps_car_13 ps_calc_13 0.005012
44) ps_reg_03 ps_calc_04 0.005008
45) ps_car_13 ps_calc_03 0.004984
46) ps_car_13 ps_calc_02 0.004975
47) ps_car_13 ps_calc_05 0.004943
48) ps_reg_03 ps_calc_09 0.004808
49) ps_car_14 ps_calc_09 0.004771
50) ps_reg_03 ps_calc_13 0.004765
51) ps_car_14 ps_calc_02 0.004708
52) ps_car_14 ps_calc_13 0.004702
53) ps_calc_10 ps_calc_14 0.004656
54) ps_car_15 ps_calc_10 0.004655
55) ps_car_14 ps_calc_01 0.004647
56) ps_calc_10 ps_calc_11 0.004636
57) ps_car_14 ps_calc_03 0.004620
58) ps_reg_03 ps_calc_02 0.004619
59) ps_car_15 ps_calc_08 0.004616
60) ps_reg_03 ps_calc_03 0.004603
61) ps_calc_08 ps_calc_10 0.004573
62) ps_car_14 ps_calc_05 0.004566
63) ps_reg_03 ps_calc_05 0.004565
64) ps_car_15 ps_calc_14 0.004556
65) ps_reg_02 ps_car_14 0.004543
66) ps_reg_03 ps_calc_01 0.004534
67) ps_car_12 ps_calc_08 0.004515
68) ps_calc_08 ps_calc_14 0.004505
69) ps_car_12 ps_calc_10 0.004476
70) ps_car_12 ps_car_15 0.004467
71) ps_calc_11 ps_calc_14 0.004441
72) ps_car_15 ps_calc_11 0.004412
73) ps_calc_06 ps_calc_10 0.004408
74) ps_reg_02 ps_reg_03 0.004406
75) ps_car_12 ps_calc_14 0.004391
76) ps_car_12 ps_calc_06 0.004376
77) ps_ind_15 0.004353
78) ps_car_15 ps_calc_06 0.004341
79) ps_calc_06 ps_calc_14 0.004306
80) ps_calc_08 ps_calc_11 0.004282
81) ps_calc_06 ps_calc_08 0.004253
82) ps_car_12 ps_calc_11 0.004188
83) ps_calc_07 ps_calc_10 0.004163
84) ps_calc_06 ps_calc_11 0.004134
85) ps_calc_07 ps_calc_14 0.004076
86) ps_calc_02 ps_calc_10 0.004032
87) ps_calc_01 ps_calc_14 0.004025
88) ps_car_13 ps_calc_12 0.004025
89) ps_calc_01 ps_calc_10 0.004011
90) ps_reg_01 ps_calc_10 0.004011
91) ps_calc_03 ps_calc_10 0.004008
92) ps_calc_10 ps_calc_13 0.003989
93) ps_calc_04 ps_calc_10 0.003985
94) ps_calc_03 ps_calc_14 0.003971
95) ps_calc_13 ps_calc_14 0.003970
96) ps_car_15 ps_calc_07 0.003945
97) ps_calc_02 ps_calc_14 0.003937
98) ps_reg_01 ps_calc_14 0.003925
99) ps_calc_09 ps_calc_10 0.003895
100) ps_calc_04 ps_calc_14 0.003877
101) ps_calc_07 ps_calc_08 0.003863
102) ps_reg_03 ps_calc_12 0.003849
103) ps_calc_01 ps_calc_11 0.003822
104) ps_calc_09 ps_calc_14 0.003819
105) ps_reg_02 ps_car_15 0.003806
106) ps_calc_07 ps_calc_11 0.003779
107) ps_reg_02 ps_calc_10 0.003771
108) ps_calc_03 ps_calc_11 0.003771
109) ps_calc_02 ps_calc_11 0.003757
110) ps_reg_02 ps_calc_14 0.003757
111) ps_reg_01 ps_car_15 0.003754
112) ps_car_14 ps_calc_12 0.003745
113) ps_calc_01 ps_calc_08 0.003725
114) ps_calc_05 ps_calc_10 0.003722
115) ps_car_15 ps_calc_13 0.003717
116) ps_car_12 ps_calc_01 0.003715
117) ps_calc_11 ps_calc_13 0.003708
118) ps_calc_03 ps_calc_08 0.003693
119) ps_car_15 ps_calc_02 0.003684
120) ps_car_12 ps_calc_07 0.003682
121) ps_calc_06 ps_calc_07 0.003680
122) ps_car_15 ps_calc_03 0.003676
123) ps_car_12 ps_calc_03 0.003671
124) ps_car_12 ps_calc_02 0.003670
125) ps_car_15 ps_calc_01 0.003668
126) ps_calc_08 ps_calc_13 0.003664
127) ps_car_15 ps_calc_09 0.003660
128) ps_reg_01 ps_car_12 0.003653
129) ps_car_15 ps_calc_04 0.003648
130) ps_calc_02 ps_calc_08 0.003647
131) ps_reg_01 ps_calc_11 0.003636
132) ps_calc_04 ps_calc_11 0.003621
133) ps_reg_02 ps_calc_11 0.003607
134) ps_calc_05 ps_calc_14 0.003590
135) ps_calc_09 ps_calc_11 0.003569
136) ps_reg_01 ps_calc_08 0.003557
137) ps_calc_04 ps_calc_08 0.003544
138) ps_reg_02 ps_car_12 0.003506
139) ps_car_12 ps_calc_13 0.003504
140) ps_reg_02 ps_calc_08 0.003498
141) ps_calc_08 ps_calc_09 0.003486
142) ps_calc_02 ps_calc_07 0.003485
143) ps_calc_06 ps_calc_13 0.003480
144) ps_calc_01 ps_calc_06 0.003478
145) ps_calc_03 ps_calc_07 0.003461
146) ps_calc_03 ps_calc_06 0.003435
147) ps_car_12 ps_calc_04 0.003432
148) ps_calc_02 ps_calc_06 0.003429
149) ps_calc_01 ps_calc_13 0.003413
150) ps_reg_02 ps_calc_06 0.003413
151) ps_calc_01 ps_calc_07 0.003402
152) ps_calc_01 ps_calc_02 0.003385
153) ps_calc_02 ps_calc_13 0.003379
154) ps_car_15 ps_calc_05 0.003358
155) ps_calc_03 ps_calc_13 0.003357
156) ps_calc_04 ps_calc_06 0.003357
157) ps_calc_02 ps_calc_03 0.003345
158) ps_calc_01 ps_calc_03 0.003338
159) ps_calc_05 ps_calc_11 0.003313
160) ps_calc_05 ps_calc_08 0.003306
161) ps_car_12 ps_calc_09 0.003306
162) ps_reg_01 ps_calc_06 0.003295
163) ps_calc_06 ps_calc_09 0.003271
164) ps_calc_03 ps_calc_09 0.003266
165) ps_calc_03 ps_calc_04 0.003259
166) ps_reg_02 ps_calc_01 0.003258
167) ps_calc_02 ps_calc_09 0.003207
168) ps_calc_01 ps_calc_09 0.003206
169) ps_reg_02 ps_calc_03 0.003195
170) ps_reg_02 ps_calc_02 0.003194
171) ps_calc_10 ps_calc_12 0.003190
172) ps_calc_02 ps_calc_04 0.003190
173) ps_calc_01 ps_calc_04 0.003184
174) ps_calc_07 ps_calc_13 0.003152
175) ps_reg_02 ps_calc_07 0.003144
176) ps_calc_12 ps_calc_14 0.003128
177) ps_reg_01 ps_calc_07 0.003110
178) ps_car_12 ps_calc_05 0.003106
179) ps_reg_01 ps_calc_13 0.003098
180) ps_calc_02 ps_calc_05 0.003069
181) ps_calc_05 ps_calc_06 0.003065
182) ps_calc_01 ps_calc_05 0.003065
183) ps_calc_03 ps_calc_05 0.003060
184) ps_reg_01 ps_calc_03 0.003045
185) ps_reg_02 ps_calc_13 0.003041
186) ps_calc_09 ps_calc_13 0.003035
187) ps_reg_01 ps_calc_01 0.003024
188) ps_reg_01 ps_calc_02 0.003007
189) ps_calc_04 ps_calc_13 0.002981
190) ps_calc_07 ps_calc_09 0.002955
191) ps_calc_11 ps_calc_12 0.002941
192) ps_calc_04 ps_calc_07 0.002920
193) ps_reg_02 ps_calc_04 0.002900
194) ps_reg_02 ps_calc_09 0.002900
195) ps_ind_01 0.002898
196) ps_car_15 ps_calc_12 0.002890
197) ps_reg_01 ps_reg_02 0.002855
198) ps_reg_01 ps_calc_09 0.002844
199) ps_calc_05 ps_calc_13 0.002819
200) ps_reg_01 ps_calc_04 0.002815
201) ps_reg_02 ps_calc_05 0.002813
202) ps_calc_02 ps_calc_12 0.002800
203) ps_calc_08 ps_calc_12 0.002791
204) ps_calc_05 ps_calc_07 0.002779
205) ps_calc_01 ps_calc_12 0.002768
206) ps_calc_03 ps_calc_12 0.002737
207) ps_car_12 ps_calc_12 0.002715
208) ps_reg_01 ps_calc_05 0.002691
209) ps_calc_10 0.002666
210) ps_calc_06 ps_calc_12 0.002662
211) ps_calc_10^2 0.002659
212) ps_calc_04 ps_calc_09 0.002657
213) ps_calc_14^2 0.002583
214) ps_ind_05_cat_0 0.002573
215) ps_calc_14 0.002549
216) ps_calc_05 ps_calc_09 0.002523
217) ps_calc_12 ps_calc_13 0.002499
218) ps_calc_07 ps_calc_12 0.002487
219) ps_calc_04 ps_calc_05 0.002485
220) ps_reg_02 ps_calc_12 0.002445
221) ps_calc_09 ps_calc_12 0.002351
222) ps_reg_01 ps_calc_12 0.002345
223) ps_calc_11^2 0.002291
224) ps_calc_11 0.002261
225) ps_calc_04 ps_calc_12 0.002256
226) ps_car_15 0.002243
227) ps_car_15^2 0.002221
228) ps_car_12 0.002203
229) ps_car_12^2 0.002182
230) ps_calc_05 ps_calc_12 0.002146
231) ps_calc_08 0.002133
232) ps_calc_08^2 0.002118
233) ps_calc_03 0.001906
234) ps_calc_01 0.001902
235) ps_calc_01^2 0.001899
236) ps_calc_03^2 0.001897
237) ps_calc_02^2 0.001890
238) ps_calc_02 0.001872
239) ps_calc_06^2 0.001859
240) ps_calc_06 0.001849
241) ps_reg_02^2 0.001763
242) ps_ind_17_bin 0.001735
243) ps_reg_02 0.001712
244) ps_calc_13^2 0.001696
245) ps_calc_13 0.001695
246) ps_calc_07^2 0.001607
247) ps_calc_07 0.001594
248) ps_reg_01 0.001457
249) ps_reg_01^2 0.001445
250) ps_calc_09^2 0.001385
251) ps_calc_09 0.001382
252) ps_calc_04^2 0.001287
253) ps_calc_04 0.001279
254) ps_calc_05 0.001251
255) ps_calc_05^2 0.001245
256) ps_car_07_cat_1 0.001210
257) ps_calc_12^2 0.001203
258) ps_calc_12 0.001184
259) ps_ind_16_bin 0.001039
260) ps_ind_05_cat_6 0.000984
261) ps_ind_07_bin 0.000935
262) ps_car_09_cat_1 0.000840
263) ps_ind_06_bin 0.000810
264) ps_ind_02_cat_1 0.000746
265) ps_car_01_cat_9 0.000733
266) ps_calc_17_bin 0.000731
267) ps_ind_04_cat_1 0.000725
268) ps_ind_04_cat_0 0.000720
269) ps_car_07_cat_0 0.000716
270) ps_calc_19_bin 0.000713
271) ps_ind_02_cat_2 0.000712
272) ps_calc_18_bin 0.000710
273) ps_car_01_cat_11 0.000710
274) ps_calc_16_bin 0.000702
275) ps_ind_05_cat_4 0.000700
276) ps_ind_03_7 0.000691
277) ps_ind_03_1 0.000691
278) ps_car_09_cat_2 0.000678
279) ps_ind_05_cat_2 0.000672
280) ps_ind_08_bin 0.000672
281) ps_car_01_cat_7 0.000672
282) ps_ind_03_6 0.000671
283) ps_car_06_cat_1 0.000669
284) ps_car_11_3 0.000646
285) ps_car_09_cat_0 0.000643
286) ps_ind_03_5 0.000635
287) ps_car_11_2 0.000629
288) ps_calc_15_bin 0.000627
289) ps_ind_03_8 0.000613
290) ps_calc_20_bin 0.000608
291) ps_car_04_cat_2 0.000606
292) ps_car_01_cat_4 0.000604
293) ps_ind_18_bin 0.000579
294) ps_car_06_cat_11 0.000572
295) ps_ind_03_10 0.000571
296) ps_ind_03_3 0.000563
297) ps_ind_03_2 0.000561
298) ps_car_01_cat_8 0.000558
299) ps_ind_02_cat_4 0.000557
300) ps_car_01_cat_10 0.000555
301) ps_ind_09_bin 0.000550
302) ps_ind_02_cat_3 0.000543
303) ps_car_06_cat_14 0.000535
304) ps_ind_03_9 0.000534
305) ps_car_01_cat_6 0.000526
306) ps_ind_03_11 0.000522
307) ps_ind_03_4 0.000515
308) ps_car_01_cat_5 0.000511
309) ps_car_02_cat_0 0.000507
310) ps_car_02_cat_1 0.000504
311) ps_ind_14 0.000502
312) ps_car_06_cat_6 0.000483
313) ps_car_06_cat_4 0.000478
314) ps_car_08_cat_1 0.000471
315) ps_car_01_cat_3 0.000445
316) ps_car_01_cat_0 0.000439
317) ps_car_06_cat_10 0.000438
318) ps_car_06_cat_7 0.000432
319) ps_ind_05_cat_1 0.000422
320) ps_car_09_cat_3 0.000404
321) ps_car_04_cat_1 0.000402
322) ps_ind_12_bin 0.000380
323) ps_car_06_cat_9 0.000372
324) ps_car_06_cat_15 0.000365
325) ps_car_11_1 0.000363
326) ps_car_10_cat_1 0.000362
327) ps_car_09_cat_4 0.000355
328) ps_ind_05_cat_3 0.000334
329) ps_car_01_cat_2 0.000324
330) ps_car_06_cat_3 0.000321
331) ps_car_06_cat_17 0.000276
332) ps_car_06_cat_12 0.000270
333) ps_car_06_cat_16 0.000235
334) ps_car_01_cat_1 0.000232
335) ps_car_04_cat_8 0.000218
336) ps_car_04_cat_9 0.000212
337) ps_ind_05_cat_5 0.000184
338) ps_car_06_cat_13 0.000180
339) ps_car_06_cat_5 0.000178
340) ps_ind_11_bin 0.000143
341) ps_car_04_cat_6 0.000132
342) ps_ind_13_bin 0.000095
343) ps_car_04_cat_3 0.000087
344) ps_car_06_cat_2 0.000066
345) ps_car_04_cat_5 0.000058
346) ps_car_04_cat_7 0.000056
347) ps_car_06_cat_8 0.000045
348) ps_car_10_cat_2 0.000045
349) ps_ind_10_bin 0.000045
350) ps_car_04_cat_4 0.000027
sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 350
C:\Users\keonj\anaconda3\lib\site-packages\sklearn\base.py:438: UserWarning: X has feature names, but SelectFromModel was fitted without feature names
warnings.warn(
Number of features after selection: 175
# Feature scaling
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis=1))
array([[-0.90494248, -0.45941104, 1.25877984, ..., 0.40315483,
0.14885213, -0.62460393],
[ 0.24006954, 1.55538958, 1.25877984, ..., -0.17489762,
-0.04208459, -0.33950182],
[ 1.64508122, 1.05168943, 1.25877984, ..., -0.17489762,
0.53072557, 0.77897569],
...,
[ 1.73477713, -0.9631112 , -0.7944201 , ..., -0.83552899,
-0.99676819, -0.62460393],
[ 1.73485162, -0.9631112 , 1.25877984, ..., 0.40315483,
-0.36031245, -1.06322256],
[ 1.73512631, -0.45941104, -0.7944201 , ..., 0.40315483,
0.91259901, 0.36228799]])