/ PROGRAMMING

캐글 (5)
메타 데이터를 이용한
데이터 관찰 및 준비

kaggle 관련 글

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

캐글의 Porto Seguro’s Safe Driver Prediction 데이터를 통해 메타 데이터를 이용하여 데이터 분석 준비하는 방법을 필사했습니다.

  1. Visual inspection of your data
  2. Defining the metadata
  3. Descriptive statistics
  4. Handling imbalanced classes
  5. Data quality checks
  6. Exploratory data visualization
  7. Feature engineering
  8. Feature selection
  9. Feature scaling
## 패키지 설치

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 100)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

데이터 확인하기

유사한 그룹에 속하는 feature은 태그가 지정됩니다. (ind,reg,car,calc)
feature 이름에서 bin은 이진 특성을, cat은 범주형 특성입니다.
그 외에는 연속 또는 순서형 특성입니다.
-1은 결측값입니다.

train.head()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
0 7 0 2 2 5 1 0 0 1 0 0 0 0 0 0 0 11 0 1 0 0.7 0.2 0.718070 10 1 -1 0 1 4 1 0 0 1 12 2 0.400000 0.883679 0.370810 3.605551 0.6 0.5 0.2 3 1 10 1 10 1 5 9 1 5 8 0 1 1 0 0 1
1 9 0 1 1 7 0 0 0 0 1 0 0 0 0 0 0 3 0 0 1 0.8 0.4 0.766078 11 1 -1 0 -1 11 1 1 2 1 19 3 0.316228 0.618817 0.388716 2.449490 0.3 0.1 0.3 2 1 9 5 8 1 7 3 1 1 9 0 1 1 0 1 0
2 13 0 5 4 9 1 0 0 0 1 0 0 0 0 0 0 12 1 0 0 0.0 0.0 -1.000000 7 1 -1 0 -1 14 1 1 2 1 60 1 0.316228 0.641586 0.347275 3.316625 0.5 0.7 0.1 2 2 9 1 8 2 7 4 2 7 7 0 1 1 0 1 0
3 16 0 0 1 2 0 0 1 0 0 0 0 0 0 0 0 8 1 0 0 0.9 0.2 0.580948 7 1 0 0 1 11 1 1 3 1 104 1 0.374166 0.542949 0.294958 2.000000 0.6 0.9 0.1 2 4 7 1 8 4 2 2 2 4 9 0 0 0 0 0 0
4 17 0 0 2 0 1 0 1 0 0 0 0 0 0 0 0 9 1 0 0 0.7 0.6 0.840759 11 1 -1 0 -1 14 1 1 2 1 82 3 0.316070 0.565832 0.365103 2.000000 0.4 0.6 0.0 2 2 6 3 10 2 12 3 1 1 3 0 0 0 1 1 0
train.tail()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
595207 1488013 0 3 1 10 0 0 0 0 0 1 0 0 0 0 0 13 1 0 0 0.5 0.3 0.692820 10 1 -1 0 1 1 1 1 0 1 31 3 0.374166 0.684631 0.385487 2.645751 0.4 0.5 0.3 3 0 9 0 9 1 12 4 1 9 6 0 1 1 0 1 1
595208 1488016 0 5 1 3 0 0 0 0 0 1 0 0 0 0 0 6 1 0 0 0.9 0.7 1.382027 9 1 -1 0 -1 15 0 0 2 1 63 2 0.387298 0.972145 -1.000000 3.605551 0.2 0.2 0.0 2 4 8 6 8 2 12 4 1 3 8 1 0 1 0 1 1
595209 1488017 0 1 1 10 0 0 1 0 0 0 0 0 0 0 0 12 1 0 0 0.9 0.2 0.659071 7 1 -1 0 -1 1 1 1 2 1 31 3 0.397492 0.596373 0.398748 1.732051 0.4 0.0 0.3 3 2 7 4 8 0 10 3 2 2 6 0 0 1 0 0 0
595210 1488021 0 5 2 3 1 0 0 0 1 0 0 0 0 0 0 12 1 0 0 0.9 0.4 0.698212 11 1 -1 0 -1 11 1 1 2 1 101 3 0.374166 0.764434 0.384968 3.162278 0.0 0.7 0.0 4 0 9 4 9 2 11 4 1 4 2 0 1 1 1 0 0
595211 1488027 0 0 1 8 0 0 1 0 0 0 0 0 0 0 0 7 1 0 0 0.1 0.2 -1.000000 7 0 -1 0 -1 0 1 0 2 1 34 2 0.400000 0.932649 0.378021 3.741657 0.4 0.0 0.5 2 3 10 4 10 2 5 4 4 3 8 0 1 0 0 0 0
train.shape
(595212, 59)
train.drop_duplicates()
train.shape
(595212, 59)
test.shape 
(892816, 58)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 59 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              595212 non-null  int64  
 1   target          595212 non-null  int64  
 2   ps_ind_01       595212 non-null  int64  
 3   ps_ind_02_cat   595212 non-null  int64  
 4   ps_ind_03       595212 non-null  int64  
 5   ps_ind_04_cat   595212 non-null  int64  
 6   ps_ind_05_cat   595212 non-null  int64  
 7   ps_ind_06_bin   595212 non-null  int64  
 8   ps_ind_07_bin   595212 non-null  int64  
 9   ps_ind_08_bin   595212 non-null  int64  
 10  ps_ind_09_bin   595212 non-null  int64  
 11  ps_ind_10_bin   595212 non-null  int64  
 12  ps_ind_11_bin   595212 non-null  int64  
 13  ps_ind_12_bin   595212 non-null  int64  
 14  ps_ind_13_bin   595212 non-null  int64  
 15  ps_ind_14       595212 non-null  int64  
 16  ps_ind_15       595212 non-null  int64  
 17  ps_ind_16_bin   595212 non-null  int64  
 18  ps_ind_17_bin   595212 non-null  int64  
 19  ps_ind_18_bin   595212 non-null  int64  
 20  ps_reg_01       595212 non-null  float64
 21  ps_reg_02       595212 non-null  float64
 22  ps_reg_03       595212 non-null  float64
 23  ps_car_01_cat   595212 non-null  int64  
 24  ps_car_02_cat   595212 non-null  int64  
 25  ps_car_03_cat   595212 non-null  int64  
 26  ps_car_04_cat   595212 non-null  int64  
 27  ps_car_05_cat   595212 non-null  int64  
 28  ps_car_06_cat   595212 non-null  int64  
 29  ps_car_07_cat   595212 non-null  int64  
 30  ps_car_08_cat   595212 non-null  int64  
 31  ps_car_09_cat   595212 non-null  int64  
 32  ps_car_10_cat   595212 non-null  int64  
 33  ps_car_11_cat   595212 non-null  int64  
 34  ps_car_11       595212 non-null  int64  
 35  ps_car_12       595212 non-null  float64
 36  ps_car_13       595212 non-null  float64
 37  ps_car_14       595212 non-null  float64
 38  ps_car_15       595212 non-null  float64
 39  ps_calc_01      595212 non-null  float64
 40  ps_calc_02      595212 non-null  float64
 41  ps_calc_03      595212 non-null  float64
 42  ps_calc_04      595212 non-null  int64  
 43  ps_calc_05      595212 non-null  int64  
 44  ps_calc_06      595212 non-null  int64  
 45  ps_calc_07      595212 non-null  int64  
 46  ps_calc_08      595212 non-null  int64  
 47  ps_calc_09      595212 non-null  int64  
 48  ps_calc_10      595212 non-null  int64  
 49  ps_calc_11      595212 non-null  int64  
 50  ps_calc_12      595212 non-null  int64  
 51  ps_calc_13      595212 non-null  int64  
 52  ps_calc_14      595212 non-null  int64  
 53  ps_calc_15_bin  595212 non-null  int64  
 54  ps_calc_16_bin  595212 non-null  int64  
 55  ps_calc_17_bin  595212 non-null  int64  
 56  ps_calc_18_bin  595212 non-null  int64  
 57  ps_calc_19_bin  595212 non-null  int64  
 58  ps_calc_20_bin  595212 non-null  int64  
dtypes: float64(10), int64(49)
memory usage: 267.9 MB

float인지 int인지 자료형을 알 수 있으며 null 대신 -1이 들어갔으므로 null이 없습니다.

메타 데이터

메타데이터를 통해 데이터 정보를 저장합니다.
변수의 타입과 feature의 특성을 저장합니다.

data = []
for f in train.columns:
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'
        
    keep = True
    if f == 'id':
        keep = False
    
    dtype = train[f].dtype
    
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
meta
role level keep dtype
varname
id id nominal False int64
target target binary True int64
ps_ind_01 input binary True int64
ps_ind_02_cat input nominal True int64
ps_ind_03 input nominal True int64
ps_ind_04_cat input nominal True int64
ps_ind_05_cat input nominal True int64
ps_ind_06_bin input binary True int64
ps_ind_07_bin input binary True int64
ps_ind_08_bin input binary True int64
ps_ind_09_bin input binary True int64
ps_ind_10_bin input binary True int64
ps_ind_11_bin input binary True int64
ps_ind_12_bin input binary True int64
ps_ind_13_bin input binary True int64
ps_ind_14 input binary True int64
ps_ind_15 input binary True int64
ps_ind_16_bin input binary True int64
ps_ind_17_bin input binary True int64
ps_ind_18_bin input binary True int64
ps_reg_01 input interval True float64
ps_reg_02 input interval True float64
ps_reg_03 input interval True float64
ps_car_01_cat input nominal True int64
ps_car_02_cat input nominal True int64
ps_car_03_cat input nominal True int64
ps_car_04_cat input nominal True int64
ps_car_05_cat input nominal True int64
ps_car_06_cat input nominal True int64
ps_car_07_cat input nominal True int64
ps_car_08_cat input nominal True int64
ps_car_09_cat input nominal True int64
ps_car_10_cat input nominal True int64
ps_car_11_cat input nominal True int64
ps_car_11 input nominal True int64
ps_car_12 input interval True float64
ps_car_13 input interval True float64
ps_car_14 input interval True float64
ps_car_15 input interval True float64
ps_calc_01 input interval True float64
ps_calc_02 input interval True float64
ps_calc_03 input interval True float64
ps_calc_04 input interval True int64
ps_calc_05 input interval True int64
ps_calc_06 input interval True int64
ps_calc_07 input interval True int64
ps_calc_08 input interval True int64
ps_calc_09 input interval True int64
ps_calc_10 input interval True int64
ps_calc_11 input interval True int64
ps_calc_12 input interval True int64
ps_calc_13 input interval True int64
ps_calc_14 input interval True int64
ps_calc_15_bin input binary True int64
ps_calc_16_bin input binary True int64
ps_calc_17_bin input binary True int64
ps_calc_18_bin input binary True int64
ps_calc_19_bin input binary True int64
ps_calc_20_bin input binary True int64
meta[(meta.level == 'nominal') & (meta.keep)].index
Index(['ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat', 'ps_ind_05_cat',
       'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat',
       'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat',
       'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11'],
      dtype='object', name='varname')
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
role level count
0 id nominal 1
1 input binary 20
2 input interval 21
3 input nominal 16
4 target binary 1

통계량 살펴보기

메타 데이터에서 interval인 값만 찾아서 describe를 사용할 수 있습니다.
ps_reg_03, ps_car_12, ps_car_15에 결측값이 있습니다.

v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()
ps_reg_01 ps_reg_02 ps_reg_03 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.610991 0.439184 0.551102 0.379945 0.813265 0.276256 3.065899 0.449756 0.449589 0.449849 2.372081 1.885886 7.689445 3.005823 9.225904 2.339034 8.433590 5.441382 1.441918 2.872288 7.539026
std 0.287643 0.404264 0.793506 0.058327 0.224588 0.357154 0.731366 0.287198 0.286893 0.287153 1.117219 1.134927 1.334312 1.414564 1.459672 1.246949 2.904597 2.332871 1.202963 1.694887 2.746652
min 0.000000 0.000000 -1.000000 -1.000000 0.250619 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.400000 0.200000 0.525000 0.316228 0.670867 0.333167 2.828427 0.200000 0.200000 0.200000 2.000000 1.000000 7.000000 2.000000 8.000000 1.000000 6.000000 4.000000 1.000000 2.000000 6.000000
50% 0.700000 0.300000 0.720677 0.374166 0.765811 0.368782 3.316625 0.500000 0.400000 0.500000 2.000000 2.000000 8.000000 3.000000 9.000000 2.000000 8.000000 5.000000 1.000000 3.000000 7.000000
75% 0.900000 0.600000 1.000000 0.400000 0.906190 0.396485 3.605551 0.700000 0.700000 0.700000 3.000000 3.000000 9.000000 4.000000 10.000000 3.000000 10.000000 7.000000 2.000000 4.000000 9.000000
max 0.900000 1.800000 4.037945 1.264911 3.720626 0.636396 3.741657 0.900000 0.900000 0.900000 5.000000 6.000000 10.000000 9.000000 12.000000 7.000000 25.000000 19.000000 10.000000 13.000000 23.000000
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
target ps_ind_01 ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.036448 1.900378 0.393742 0.257033 0.163921 0.185304 0.000373 0.001692 0.009439 0.000948 0.012451 7.299922 0.660823 0.121081 0.153446 0.122427 0.627840 0.554182 0.287182 0.349024 0.153318
std 0.187401 1.983789 0.488579 0.436998 0.370205 0.388544 0.019309 0.041097 0.096693 0.030768 0.127545 3.546042 0.473430 0.326222 0.360417 0.327779 0.483381 0.497056 0.452447 0.476662 0.360295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 7.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000
75% 0.000000 3.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 10.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 7.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 4.000000 13.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

target=1인 비율이 target=0보다 훨씬 작기 때문에 target=1을 오버 샘플링 혹은 target=0으로 언더샘플링 할 수 있습니다.
언더 샘플링은 불균형한 데이터 셋에서 높은 비율을 차지하던 클래스의 데이터 수를 줄임으로써 데이터 불균형을 해소하는 아이디어 입니다.
하지만 이 방법은 학습에 사용되는 전체 데이터 수를 급격하게 감소시켜 오히려 성능이 떨어질 수 있습니다.
오버 샘플링은 낮은 비율 클래스의 데이터 수를 늘림으로써 데이터 불균형을 해소하는 아이디어 입니다.

desired_apriori=0.10

idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))

undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)

idx_list = list(undersampled_idx) + list(idx_1)

train = train.loc[idx_list].reset_index(drop=True)
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246

ps_car_03_cat 및 ps_car_05_cat는 결측값이 있는 비율이 높습니다. 이 변수를 제거합니다.
결측값이 있는 다른 범주형 변수의 경우 결측값 -1을 그대로 둘 수 있습니다.
ps_reg_03 (continuous)에는 18%에 대한 결측값이 있습니다. 평균으로 대체합니다.
ps_car_11 (순서형)에는 결측 값이 5개만 있습니다. 모드로 대체합니다.
ps_car_12 (연속)에는 결측값이 1개만 있습니다. 평균으로 대체합니다.
ps_car_14 (연속)에는 7%에 대한 결측값이 있습니다. 평균으로 대체합니다.

vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop),'keep'] = False  # 메타 데이터 update

mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_03 has 12 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Variable ps_car_11_cat has 104 distinct values
Variable ps_car_11 has 4 distinct values
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # 평균 계산
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])

    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    
    prior = target.mean()

    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)

    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)

    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)

    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"], 
                             test["ps_car_11_cat"], 
                             target=train.target, 
                             min_samples_leaf=100,
                             smoothing=10,
                             noise_level=0.01)
    
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False  # 메타 데이터 업데이트
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)

탐색 데이터 시각화

범주값으로 결측값을 유지한다면 비율을 확인하기 좋습니다.

v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    plt.figure()
    fig, ax = plt.subplots(figsize=(20,10))
    cat_perc = train[[f, 'target']].groupby([f],as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    sns.barplot(ax=ax, x=f, y='target', data=cat_perc, order=cat_perc[f])
    plt.ylabel('% target', fontsize=18)
    plt.xlabel(f, fontsize=18)
    plt.tick_params(axis='both', which='major', labelsize=18)
    plt.show();
<Figure size 432x288 with 0 Axes>

output_28_1

<Figure size 432x288 with 0 Axes>

output_28_3

<Figure size 432x288 with 0 Axes>

output_28_5

<Figure size 432x288 with 0 Axes>

output_28_7

<Figure size 432x288 with 0 Axes>

output_28_9

<Figure size 432x288 with 0 Axes>

output_28_11

<Figure size 432x288 with 0 Axes>

output_28_13

<Figure size 432x288 with 0 Axes>

output_28_15

<Figure size 432x288 with 0 Axes>

output_28_17

<Figure size 432x288 with 0 Axes>

output_28_19

<Figure size 432x288 with 0 Axes>

output_28_21

<Figure size 432x288 with 0 Axes>

output_28_23

<Figure size 432x288 with 0 Axes>

output_28_25

구간 변수는 상관 관계를 확인하는 것이 좋습니다.

def corr_heatmap(v):
    correlations = train[v].corr()

    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show();
    
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)

output_30_0

s = train.sample(frac=0.1)
sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()

output_32_0

sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()

output_33_0

sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()

output_34_0

sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()

output_35_0

Feature Engineering

# 더미 변수 만들기
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns=v, drop_first=True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 121 variables in train
# 상호작용 변수 만들기
v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions.drop(v, axis=1, inplace=True)
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
C:\Users\keonj\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)


Before creating interactions we have 121 variables in train
After creating interactions we have 352 variables in train

Feature Selection

# VarianceThreshold을 통해 분산이 작은 값 제거
selector = VarianceThreshold(threshold=.01)
selector.fit(train.drop(['id', 'target'], axis=1))
f = np.vectorize(lambda x : not x) 
v = train.drop(['id', 'target'], axis=1).columns[f(selector.get_support())]
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
28 variables have too low variance.
These variables are ['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_car_12', 'ps_car_14', 'ps_car_11_cat_te', 'ps_ind_05_cat_2', 'ps_ind_05_cat_5', 'ps_car_01_cat_1', 'ps_car_01_cat_2', 'ps_car_04_cat_3', 'ps_car_04_cat_4', 'ps_car_04_cat_5', 'ps_car_04_cat_6', 'ps_car_04_cat_7', 'ps_car_06_cat_2', 'ps_car_06_cat_5', 'ps_car_06_cat_8', 'ps_car_06_cat_12', 'ps_car_06_cat_16', 'ps_car_06_cat_17', 'ps_car_09_cat_4', 'ps_car_10_cat_1', 'ps_car_10_cat_2', 'ps_car_12^2', 'ps_car_12 ps_car_14', 'ps_car_14^2']
# 랜덤 포레스트를 통한 특성 선택 및 모델 선택

X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
 1) ps_car_11_cat_te               0.008718
 2) ps_car_13 ps_calc_08           0.006701
 3) ps_car_13 ps_calc_06           0.006655
 4) ps_car_13^2                    0.006577
 5) ps_car_13                      0.006575
 6) ps_car_13 ps_car_14            0.006478
 7) ps_car_12 ps_car_13            0.006459
 8) ps_car_13 ps_car_15            0.006277
 9) ps_reg_01 ps_car_13            0.006266
10) ps_reg_03 ps_car_13            0.006250
11) ps_car_14 ps_calc_08           0.006093
12) ps_car_14 ps_calc_06           0.006032
13) ps_car_14 ps_car_15            0.005884
14) ps_reg_03 ps_car_14            0.005808
15) ps_car_13 ps_calc_10           0.005789
16) ps_car_13 ps_calc_14           0.005764
17) ps_reg_03 ps_calc_08           0.005740
18) ps_car_13 ps_calc_11           0.005700
19) ps_reg_03 ps_car_12            0.005685
20) ps_reg_03 ps_calc_06           0.005609
21) ps_car_14 ps_calc_10           0.005494
22) ps_car_13 ps_calc_07           0.005493
23) ps_reg_03 ps_calc_10           0.005473
24) ps_car_14 ps_calc_14           0.005460
25) ps_reg_01 ps_car_14            0.005427
26) ps_car_14 ps_calc_11           0.005415
27) ps_car_13 ps_calc_04           0.005409
28) ps_reg_03 ps_car_15            0.005362
29) ps_reg_03 ps_calc_11           0.005356
30) ps_reg_03 ps_calc_14           0.005352
31) ps_reg_02 ps_car_13            0.005295
32) ps_car_13 ps_calc_09           0.005189
33) ps_car_14                      0.005183
34) ps_car_14^2                    0.005173
35) ps_reg_01 ps_reg_03            0.005137
36) ps_car_12 ps_car_14            0.005126
37) ps_reg_03 ps_calc_07           0.005110
38) ps_car_14 ps_calc_07           0.005089
39) ps_reg_03                      0.005059
40) ps_car_13 ps_calc_01           0.005045
41) ps_reg_03^2                    0.005036
42) ps_car_14 ps_calc_04           0.005024
43) ps_car_13 ps_calc_13           0.005012
44) ps_reg_03 ps_calc_04           0.005008
45) ps_car_13 ps_calc_03           0.004984
46) ps_car_13 ps_calc_02           0.004975
47) ps_car_13 ps_calc_05           0.004943
48) ps_reg_03 ps_calc_09           0.004808
49) ps_car_14 ps_calc_09           0.004771
50) ps_reg_03 ps_calc_13           0.004765
51) ps_car_14 ps_calc_02           0.004708
52) ps_car_14 ps_calc_13           0.004702
53) ps_calc_10 ps_calc_14          0.004656
54) ps_car_15 ps_calc_10           0.004655
55) ps_car_14 ps_calc_01           0.004647
56) ps_calc_10 ps_calc_11          0.004636
57) ps_car_14 ps_calc_03           0.004620
58) ps_reg_03 ps_calc_02           0.004619
59) ps_car_15 ps_calc_08           0.004616
60) ps_reg_03 ps_calc_03           0.004603
61) ps_calc_08 ps_calc_10          0.004573
62) ps_car_14 ps_calc_05           0.004566
63) ps_reg_03 ps_calc_05           0.004565
64) ps_car_15 ps_calc_14           0.004556
65) ps_reg_02 ps_car_14            0.004543
66) ps_reg_03 ps_calc_01           0.004534
67) ps_car_12 ps_calc_08           0.004515
68) ps_calc_08 ps_calc_14          0.004505
69) ps_car_12 ps_calc_10           0.004476
70) ps_car_12 ps_car_15            0.004467
71) ps_calc_11 ps_calc_14          0.004441
72) ps_car_15 ps_calc_11           0.004412
73) ps_calc_06 ps_calc_10          0.004408
74) ps_reg_02 ps_reg_03            0.004406
75) ps_car_12 ps_calc_14           0.004391
76) ps_car_12 ps_calc_06           0.004376
77) ps_ind_15                      0.004353
78) ps_car_15 ps_calc_06           0.004341
79) ps_calc_06 ps_calc_14          0.004306
80) ps_calc_08 ps_calc_11          0.004282
81) ps_calc_06 ps_calc_08          0.004253
82) ps_car_12 ps_calc_11           0.004188
83) ps_calc_07 ps_calc_10          0.004163
84) ps_calc_06 ps_calc_11          0.004134
85) ps_calc_07 ps_calc_14          0.004076
86) ps_calc_02 ps_calc_10          0.004032
87) ps_calc_01 ps_calc_14          0.004025
88) ps_car_13 ps_calc_12           0.004025
89) ps_calc_01 ps_calc_10          0.004011
90) ps_reg_01 ps_calc_10           0.004011
91) ps_calc_03 ps_calc_10          0.004008
92) ps_calc_10 ps_calc_13          0.003989
93) ps_calc_04 ps_calc_10          0.003985
94) ps_calc_03 ps_calc_14          0.003971
95) ps_calc_13 ps_calc_14          0.003970
96) ps_car_15 ps_calc_07           0.003945
97) ps_calc_02 ps_calc_14          0.003937
98) ps_reg_01 ps_calc_14           0.003925
99) ps_calc_09 ps_calc_10          0.003895
100) ps_calc_04 ps_calc_14          0.003877
101) ps_calc_07 ps_calc_08          0.003863
102) ps_reg_03 ps_calc_12           0.003849
103) ps_calc_01 ps_calc_11          0.003822
104) ps_calc_09 ps_calc_14          0.003819
105) ps_reg_02 ps_car_15            0.003806
106) ps_calc_07 ps_calc_11          0.003779
107) ps_reg_02 ps_calc_10           0.003771
108) ps_calc_03 ps_calc_11          0.003771
109) ps_calc_02 ps_calc_11          0.003757
110) ps_reg_02 ps_calc_14           0.003757
111) ps_reg_01 ps_car_15            0.003754
112) ps_car_14 ps_calc_12           0.003745
113) ps_calc_01 ps_calc_08          0.003725
114) ps_calc_05 ps_calc_10          0.003722
115) ps_car_15 ps_calc_13           0.003717
116) ps_car_12 ps_calc_01           0.003715
117) ps_calc_11 ps_calc_13          0.003708
118) ps_calc_03 ps_calc_08          0.003693
119) ps_car_15 ps_calc_02           0.003684
120) ps_car_12 ps_calc_07           0.003682
121) ps_calc_06 ps_calc_07          0.003680
122) ps_car_15 ps_calc_03           0.003676
123) ps_car_12 ps_calc_03           0.003671
124) ps_car_12 ps_calc_02           0.003670
125) ps_car_15 ps_calc_01           0.003668
126) ps_calc_08 ps_calc_13          0.003664
127) ps_car_15 ps_calc_09           0.003660
128) ps_reg_01 ps_car_12            0.003653
129) ps_car_15 ps_calc_04           0.003648
130) ps_calc_02 ps_calc_08          0.003647
131) ps_reg_01 ps_calc_11           0.003636
132) ps_calc_04 ps_calc_11          0.003621
133) ps_reg_02 ps_calc_11           0.003607
134) ps_calc_05 ps_calc_14          0.003590
135) ps_calc_09 ps_calc_11          0.003569
136) ps_reg_01 ps_calc_08           0.003557
137) ps_calc_04 ps_calc_08          0.003544
138) ps_reg_02 ps_car_12            0.003506
139) ps_car_12 ps_calc_13           0.003504
140) ps_reg_02 ps_calc_08           0.003498
141) ps_calc_08 ps_calc_09          0.003486
142) ps_calc_02 ps_calc_07          0.003485
143) ps_calc_06 ps_calc_13          0.003480
144) ps_calc_01 ps_calc_06          0.003478
145) ps_calc_03 ps_calc_07          0.003461
146) ps_calc_03 ps_calc_06          0.003435
147) ps_car_12 ps_calc_04           0.003432
148) ps_calc_02 ps_calc_06          0.003429
149) ps_calc_01 ps_calc_13          0.003413
150) ps_reg_02 ps_calc_06           0.003413
151) ps_calc_01 ps_calc_07          0.003402
152) ps_calc_01 ps_calc_02          0.003385
153) ps_calc_02 ps_calc_13          0.003379
154) ps_car_15 ps_calc_05           0.003358
155) ps_calc_03 ps_calc_13          0.003357
156) ps_calc_04 ps_calc_06          0.003357
157) ps_calc_02 ps_calc_03          0.003345
158) ps_calc_01 ps_calc_03          0.003338
159) ps_calc_05 ps_calc_11          0.003313
160) ps_calc_05 ps_calc_08          0.003306
161) ps_car_12 ps_calc_09           0.003306
162) ps_reg_01 ps_calc_06           0.003295
163) ps_calc_06 ps_calc_09          0.003271
164) ps_calc_03 ps_calc_09          0.003266
165) ps_calc_03 ps_calc_04          0.003259
166) ps_reg_02 ps_calc_01           0.003258
167) ps_calc_02 ps_calc_09          0.003207
168) ps_calc_01 ps_calc_09          0.003206
169) ps_reg_02 ps_calc_03           0.003195
170) ps_reg_02 ps_calc_02           0.003194
171) ps_calc_10 ps_calc_12          0.003190
172) ps_calc_02 ps_calc_04          0.003190
173) ps_calc_01 ps_calc_04          0.003184
174) ps_calc_07 ps_calc_13          0.003152
175) ps_reg_02 ps_calc_07           0.003144
176) ps_calc_12 ps_calc_14          0.003128
177) ps_reg_01 ps_calc_07           0.003110
178) ps_car_12 ps_calc_05           0.003106
179) ps_reg_01 ps_calc_13           0.003098
180) ps_calc_02 ps_calc_05          0.003069
181) ps_calc_05 ps_calc_06          0.003065
182) ps_calc_01 ps_calc_05          0.003065
183) ps_calc_03 ps_calc_05          0.003060
184) ps_reg_01 ps_calc_03           0.003045
185) ps_reg_02 ps_calc_13           0.003041
186) ps_calc_09 ps_calc_13          0.003035
187) ps_reg_01 ps_calc_01           0.003024
188) ps_reg_01 ps_calc_02           0.003007
189) ps_calc_04 ps_calc_13          0.002981
190) ps_calc_07 ps_calc_09          0.002955
191) ps_calc_11 ps_calc_12          0.002941
192) ps_calc_04 ps_calc_07          0.002920
193) ps_reg_02 ps_calc_04           0.002900
194) ps_reg_02 ps_calc_09           0.002900
195) ps_ind_01                      0.002898
196) ps_car_15 ps_calc_12           0.002890
197) ps_reg_01 ps_reg_02            0.002855
198) ps_reg_01 ps_calc_09           0.002844
199) ps_calc_05 ps_calc_13          0.002819
200) ps_reg_01 ps_calc_04           0.002815
201) ps_reg_02 ps_calc_05           0.002813
202) ps_calc_02 ps_calc_12          0.002800
203) ps_calc_08 ps_calc_12          0.002791
204) ps_calc_05 ps_calc_07          0.002779
205) ps_calc_01 ps_calc_12          0.002768
206) ps_calc_03 ps_calc_12          0.002737
207) ps_car_12 ps_calc_12           0.002715
208) ps_reg_01 ps_calc_05           0.002691
209) ps_calc_10                     0.002666
210) ps_calc_06 ps_calc_12          0.002662
211) ps_calc_10^2                   0.002659
212) ps_calc_04 ps_calc_09          0.002657
213) ps_calc_14^2                   0.002583
214) ps_ind_05_cat_0                0.002573
215) ps_calc_14                     0.002549
216) ps_calc_05 ps_calc_09          0.002523
217) ps_calc_12 ps_calc_13          0.002499
218) ps_calc_07 ps_calc_12          0.002487
219) ps_calc_04 ps_calc_05          0.002485
220) ps_reg_02 ps_calc_12           0.002445
221) ps_calc_09 ps_calc_12          0.002351
222) ps_reg_01 ps_calc_12           0.002345
223) ps_calc_11^2                   0.002291
224) ps_calc_11                     0.002261
225) ps_calc_04 ps_calc_12          0.002256
226) ps_car_15                      0.002243
227) ps_car_15^2                    0.002221
228) ps_car_12                      0.002203
229) ps_car_12^2                    0.002182
230) ps_calc_05 ps_calc_12          0.002146
231) ps_calc_08                     0.002133
232) ps_calc_08^2                   0.002118
233) ps_calc_03                     0.001906
234) ps_calc_01                     0.001902
235) ps_calc_01^2                   0.001899
236) ps_calc_03^2                   0.001897
237) ps_calc_02^2                   0.001890
238) ps_calc_02                     0.001872
239) ps_calc_06^2                   0.001859
240) ps_calc_06                     0.001849
241) ps_reg_02^2                    0.001763
242) ps_ind_17_bin                  0.001735
243) ps_reg_02                      0.001712
244) ps_calc_13^2                   0.001696
245) ps_calc_13                     0.001695
246) ps_calc_07^2                   0.001607
247) ps_calc_07                     0.001594
248) ps_reg_01                      0.001457
249) ps_reg_01^2                    0.001445
250) ps_calc_09^2                   0.001385
251) ps_calc_09                     0.001382
252) ps_calc_04^2                   0.001287
253) ps_calc_04                     0.001279
254) ps_calc_05                     0.001251
255) ps_calc_05^2                   0.001245
256) ps_car_07_cat_1                0.001210
257) ps_calc_12^2                   0.001203
258) ps_calc_12                     0.001184
259) ps_ind_16_bin                  0.001039
260) ps_ind_05_cat_6                0.000984
261) ps_ind_07_bin                  0.000935
262) ps_car_09_cat_1                0.000840
263) ps_ind_06_bin                  0.000810
264) ps_ind_02_cat_1                0.000746
265) ps_car_01_cat_9                0.000733
266) ps_calc_17_bin                 0.000731
267) ps_ind_04_cat_1                0.000725
268) ps_ind_04_cat_0                0.000720
269) ps_car_07_cat_0                0.000716
270) ps_calc_19_bin                 0.000713
271) ps_ind_02_cat_2                0.000712
272) ps_calc_18_bin                 0.000710
273) ps_car_01_cat_11               0.000710
274) ps_calc_16_bin                 0.000702
275) ps_ind_05_cat_4                0.000700
276) ps_ind_03_7                    0.000691
277) ps_ind_03_1                    0.000691
278) ps_car_09_cat_2                0.000678
279) ps_ind_05_cat_2                0.000672
280) ps_ind_08_bin                  0.000672
281) ps_car_01_cat_7                0.000672
282) ps_ind_03_6                    0.000671
283) ps_car_06_cat_1                0.000669
284) ps_car_11_3                    0.000646
285) ps_car_09_cat_0                0.000643
286) ps_ind_03_5                    0.000635
287) ps_car_11_2                    0.000629
288) ps_calc_15_bin                 0.000627
289) ps_ind_03_8                    0.000613
290) ps_calc_20_bin                 0.000608
291) ps_car_04_cat_2                0.000606
292) ps_car_01_cat_4                0.000604
293) ps_ind_18_bin                  0.000579
294) ps_car_06_cat_11               0.000572
295) ps_ind_03_10                   0.000571
296) ps_ind_03_3                    0.000563
297) ps_ind_03_2                    0.000561
298) ps_car_01_cat_8                0.000558
299) ps_ind_02_cat_4                0.000557
300) ps_car_01_cat_10               0.000555
301) ps_ind_09_bin                  0.000550
302) ps_ind_02_cat_3                0.000543
303) ps_car_06_cat_14               0.000535
304) ps_ind_03_9                    0.000534
305) ps_car_01_cat_6                0.000526
306) ps_ind_03_11                   0.000522
307) ps_ind_03_4                    0.000515
308) ps_car_01_cat_5                0.000511
309) ps_car_02_cat_0                0.000507
310) ps_car_02_cat_1                0.000504
311) ps_ind_14                      0.000502
312) ps_car_06_cat_6                0.000483
313) ps_car_06_cat_4                0.000478
314) ps_car_08_cat_1                0.000471
315) ps_car_01_cat_3                0.000445
316) ps_car_01_cat_0                0.000439
317) ps_car_06_cat_10               0.000438
318) ps_car_06_cat_7                0.000432
319) ps_ind_05_cat_1                0.000422
320) ps_car_09_cat_3                0.000404
321) ps_car_04_cat_1                0.000402
322) ps_ind_12_bin                  0.000380
323) ps_car_06_cat_9                0.000372
324) ps_car_06_cat_15               0.000365
325) ps_car_11_1                    0.000363
326) ps_car_10_cat_1                0.000362
327) ps_car_09_cat_4                0.000355
328) ps_ind_05_cat_3                0.000334
329) ps_car_01_cat_2                0.000324
330) ps_car_06_cat_3                0.000321
331) ps_car_06_cat_17               0.000276
332) ps_car_06_cat_12               0.000270
333) ps_car_06_cat_16               0.000235
334) ps_car_01_cat_1                0.000232
335) ps_car_04_cat_8                0.000218
336) ps_car_04_cat_9                0.000212
337) ps_ind_05_cat_5                0.000184
338) ps_car_06_cat_13               0.000180
339) ps_car_06_cat_5                0.000178
340) ps_ind_11_bin                  0.000143
341) ps_car_04_cat_6                0.000132
342) ps_ind_13_bin                  0.000095
343) ps_car_04_cat_3                0.000087
344) ps_car_06_cat_2                0.000066
345) ps_car_04_cat_5                0.000058
346) ps_car_04_cat_7                0.000056
347) ps_car_06_cat_8                0.000045
348) ps_car_10_cat_2                0.000045
349) ps_ind_10_bin                  0.000045
350) ps_car_04_cat_4                0.000027
sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 350


C:\Users\keonj\anaconda3\lib\site-packages\sklearn\base.py:438: UserWarning: X has feature names, but SelectFromModel was fitted without feature names
  warnings.warn(


Number of features after selection: 175
# Feature scaling
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis=1))
array([[-0.90494248, -0.45941104,  1.25877984, ...,  0.40315483,
         0.14885213, -0.62460393],
       [ 0.24006954,  1.55538958,  1.25877984, ..., -0.17489762,
        -0.04208459, -0.33950182],
       [ 1.64508122,  1.05168943,  1.25877984, ..., -0.17489762,
         0.53072557,  0.77897569],
       ...,
       [ 1.73477713, -0.9631112 , -0.7944201 , ..., -0.83552899,
        -0.99676819, -0.62460393],
       [ 1.73485162, -0.9631112 ,  1.25877984, ...,  0.40315483,
        -0.36031245, -1.06322256],
       [ 1.73512631, -0.45941104, -0.7944201 , ...,  0.40315483,
         0.91259901,  0.36228799]])