Santander Customer Transaction Prediction: EDA and Baseline

1 Description

At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

2 Prepare The Data

2.1 Import and preparation

First we import the packages that we might need in the solution.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import six.moves.urllib as urllib
import sklearn
import scipy
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
import lightgbm as lgb
%matplotlib inline
PATH='E:/kaggle/santander-customer-transaction-prediction/'
train=pd.read_csv(PATH+'train.csv')
test=pd.read_csv(PATH+'test.csv')

Check the data information

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB

Check the dimension of the data

train.shape
(200000, 202)
train.head()
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 ... var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
0 train_0 0 8.9255 -6.7863 11.9081 5.0930 11.4607 -9.2834 5.1187 18.6266 ... 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914
1 train_1 0 11.5006 -4.1473 13.8588 5.3890 12.3622 7.0433 5.6208 16.5338 ... 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518
2 train_2 0 8.6093 -2.7457 12.0805 7.8928 10.5825 -9.0837 6.9427 14.6155 ... 2.9057 9.7905 1.6704 1.6858 21.6042 3.1417 -6.5213 8.2675 14.7222 0.3965
3 train_3 0 11.0604 -2.1518 8.9522 7.1957 12.5846 -1.8361 5.8428 14.9250 ... 4.4666 4.7433 0.7178 1.4214 23.0347 -1.2706 -2.9275 10.2922 17.9697 -8.9996
4 train_4 0 9.8369 -1.4834 12.8746 6.6375 12.2772 2.4486 5.9405 19.2514 ... -1.4905 9.5214 -0.1508 9.1942 13.2876 -1.5121 3.9267 9.5031 17.9974 -8.8104

5 rows × 202 columns

This gives us a first look at the data. The columns are anonymized, so neither the column names nor the raw values tell us anything concrete about the underlying features, and we will have to explore further. Before that, let's first check whether there are any missing values.

2.2 Check the Data

# check the missing values
data_na=(train.isnull().sum()/len(train))*100
data_na=data_na.drop(data_na[data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'MissingRatio':data_na})
print(missing_data)
Empty DataFrame
Columns: [MissingRatio]
Index: []

We can see there are no missing values.

train.target.value_counts()
0    179902
1     20098
Name: target, dtype: int64

The dataset is quite imbalanced: roughly 90 percent of the rows have target '0' while only about 10 percent have target '1'.
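For the exact proportions, value_counts can be normalized (a small sketch using the train frame loaded above):

# exact class proportions: roughly 0.90 for target 0 and 0.10 for target 1
print(train['target'].value_counts(normalize=True))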

We first extract all the features here.

features=[col for col in train.columns if col not in ['ID_code','target']]

3 EDA

3.1 Check the Train-test Distribution

Before doing any modeling, we want to understand how the data is distributed. Ideally the train and test sets should look similar in every respect, so we examine this point first.

First we check the mean values per row.

# check the distribution
plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per row in the train and test set')
sns.distplot(train[features].mean(axis=1),color='green',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=1),color='red',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: Distribution of mean values per row in the train and test set]

Then we apply the same operation to the columns.

plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per column in the train and test set')
sns.distplot(train[features].mean(axis=0),color='purple',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=0),color='orange',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: Distribution of mean values per column in the train and test set]

The standard deviation is also worth examining.

plt.figure(figsize=(18,10))
plt.title('Distribution of std values per row in the train and test set')
sns.distplot(train[features].std(axis=1),color='black',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=1),color='yellow',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: Distribution of std values per row in the train and test set]

plt.figure(figsize=(18,10))
plt.title('Distribution of std values per column in the train and test set')
sns.distplot(train[features].std(axis=0),color='blue',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=0),color='green',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: Distribution of std values per column in the train and test set]

We can see that the per-row and per-column distributions of the train set and the test set are very close to each other.
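To back this visual impression with a number, one option (an optional sketch, not part of the original kernel) is a two-sample Kolmogorov-Smirnov test on every raw feature, using scipy.stats and the features list defined above.

from scipy.stats import ks_2samp

# compare the train and test distribution of every raw feature;
# a small p-value would flag a feature whose distributions differ
ks_pvalues = {f: ks_2samp(train[f], test[f]).pvalue for f in features}
print('Features with KS p-value < 0.01:',
      [f for f, p in ks_pvalues.items() if p < 0.01])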

3.2 Check the Feature Correlation

# check the feature correlation
corrmat=train.corr()
plt.subplots(figsize=(18,18))
sns.heatmap(corrmat,vmax=0.9,square=True)
<matplotlib.axes._subplots.AxesSubplot at 0x25c953f7358>

[Figure: Correlation heatmap of the training features]

We can see that the correlations between features are negligible. It is still worth checking the largest correlation values explicitly.

%%time
correlations=train[features].corr().unstack().sort_values(kind='quicksort').reset_index()
correlations=correlations[correlations['level_0']!=correlations['level_1']]
Wall time: 16.2 s
correlations.tail(10)
level_0 level_1 0
39790 var_122 var_132 0.008956
39791 var_132 var_122 0.008956
39792 var_146 var_169 0.009071
39793 var_169 var_146 0.009071
39794 var_189 var_183 0.009359
39795 var_183 var_189 0.009359
39796 var_174 var_81 0.009490
39797 var_81 var_174 0.009490
39798 var_165 var_81 0.009714
39799 var_81 var_165 0.009714
correlations.head(10)
level_0 level_1 0
0 var_26 var_139 -0.009844
1 var_139 var_26 -0.009844
2 var_148 var_53 -0.009788
3 var_53 var_148 -0.009788
4 var_80 var_6 -0.008958
5 var_6 var_80 -0.008958
6 var_1 var_80 -0.008855
7 var_80 var_1 -0.008855
8 var_13 var_2 -0.008795
9 var_2 var_13 -0.008795

The maximum absolute correlation between distinct features is below 0.01, so feature interactions are unlikely to give us any useful information here.
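The same number can be read off directly from the pairwise table built above (the value column is labelled 0 by reset_index); a one-line sketch:

# maximum absolute correlation between two distinct features
print('Max |corr|:', correlations[0].abs().max())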

3.3 Further Exploring

What about the distribution of each individual feature? Here we plot all of the per-feature distributions on a single grid.

# check the distribution of each feature
def plot_features(df1,df2,label1,label2,features):
    sns.set_style('whitegrid')
    plt.figure()
    fig,ax=plt.subplots(10,20,figsize=(18,22))
    i=0
    for feature in features:
        i+=1
        plt.subplot(10,20,i)
        sns.distplot(df1[feature],hist=False,label=label1)
        sns.distplot(df2[feature],hist=False,label=label2)
        plt.xlabel(feature,fontsize=9)
        locs, labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=6,pad=-6)
        plt.tick_params(axis='y',which='major',labelsize=6)
    plt.show()
        
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[2:202]
plot_features(t0,t1,'0','1',features)
<Figure size 432x288 with 0 Axes>

[Figure: Per-feature distributions, target 0 vs target 1]

features=train.columns.values[2:202]
plot_features(train,test,'train','test',features)
<Figure size 432x288 with 0 Axes>

[Figure: Per-feature distributions, train vs test]

The per-feature distributions are very similar between the train and test sets, which makes our work much more convenient.

3.4 Other Statistical Indicators Worth Checking

To get a more comprehensive picture of the data, we can also look at other statistical indicators (min, max, skewness, kurtosis) that might provide more information.

# Distribution of min and max
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
plt.figure(figsize=(18,10))
plt.title('Distribution of min values per row in the train set')
sns.distplot(t0[features].min(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of min values per row in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of min values per column in the train set')
sns.distplot(t0[features].min(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of min values per column in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of max values per row in the train set')
sns.distplot(t0[features].max(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of max values per row in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of max values per column in the train set')
sns.distplot(t0[features].max(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of max values per column in the train set, by target]

# skewness and kurtosis
plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per row in the train set')
sns.distplot(t0[features].skew(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of skew values per row in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per column in the train set')
sns.distplot(t0[features].skew(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of skew values per column in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per row in the train set')
sns.distplot(t0[features].kurtosis(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of kurtosis values per row in the train set, by target]

plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per column in the train set')
sns.distplot(t0[features].kurtosis(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: Distribution of kurtosis values per column in the train set, by target]

4 Feature Engineering and Modeling

4.1 Create New Features

We can add these row-wise statistical indicators to the dataset as new features; they may be useful for modeling.

# creating new features
idx = train.columns.values[2:202]  # the 200 raw var_* feature columns
for df in [train,test]:
    df['sum']=df[idx].sum(axis=1)
    df['min']=df[idx].min(axis=1)
    df['max']=df[idx].max(axis=1)
    df['mean']=df[idx].mean(axis=1)
    df['std']=df[idx].std(axis=1)
    df['skew']=df[idx].skew(axis=1)
    df['kurt']=df[idx].kurtosis(axis=1)
    df['med']=df[idx].median(axis=1)
train[train.columns[202:]].head(10)
sum min max mean std skew kurt med
0 1456.3182 -21.4494 43.1127 7.281591 9.331540 0.101580 1.331023 6.77040
1 1415.3636 -47.3797 40.5632 7.076818 10.336130 -0.351734 4.110215 7.22315
2 1240.8966 -22.4038 33.8820 6.204483 8.753387 -0.056957 0.546438 5.89940
3 1288.2319 -35.1659 38.1015 6.441160 9.594064 -0.480116 2.630499 6.70260
4 1354.2310 -65.4863 41.1037 6.771155 11.287122 -1.463426 9.787399 6.94735
5 1272.3216 -44.7257 35.2664 6.361608 9.313012 -0.920439 4.581343 6.23790
6 1509.4490 -29.9763 39.9599 7.547245 9.246130 -0.133489 1.816453 7.47605
7 1438.5083 -27.2543 31.9043 7.192541 9.162558 -0.300415 1.174273 6.97300
8 1369.7375 -31.7855 42.4798 6.848688 9.837520 0.084047 1.997040 6.32870
9 1303.1155 -39.3042 34.4640 6.515577 9.943238 -0.670024 2.521160 6.36320
test[test.columns[201:]].head(10)
sum min max mean std skew kurt med
0 1416.6404 -31.9891 42.0248 7.083202 9.910632 -0.088518 1.871262 7.31440
1 1249.6860 -41.1924 35.6020 6.248430 9.541267 -0.559785 3.391068 6.43960
2 1430.2599 -34.3488 39.3654 7.151300 9.967466 -0.135084 2.326901 7.26355
3 1411.4447 -21.4797 40.3383 7.057224 8.257204 -0.167741 2.253054 6.89675
4 1423.7364 -24.8254 45.5510 7.118682 10.043542 0.293484 2.044943 6.83375
5 1273.1592 -19.8952 30.2647 6.365796 8.728466 -0.031814 0.113763 5.83800
6 1440.7387 -18.7481 37.4611 7.203693 8.676615 -0.045407 0.653782 6.66335
7 1429.5281 -22.7363 33.2387 7.147640 9.697687 -0.017784 0.713021 7.44665
8 1270.4978 -17.4719 28.1225 6.352489 8.257376 -0.138639 0.342360 6.55820
9 1271.6875 -32.8776 38.3319 6.358437 9.489171 -0.354497 1.934290 6.83960

Now let’s check the distributions of the new features.

def plot_new_features(df1,df2,label1,label2,features):
    sns.set_style('whitegrid')
    plt.figure()
    fig,ax=plt.subplots(2,4,figsize=(18,8))
    i=0
    for feature in features:
        i+=1
        plt.subplot(2,4,i)
        sns.kdeplot(df1[feature],bw=0.5,label=label1)
        sns.kdeplot(df2[feature],bw=0.5,label=label2)
        plt.xlabel(feature,fontsize=11)
        locs,labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=8)
        plt.tick_params(axis='y',which='major',labelsize=8)
    plt.show()
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[202:]
plot_new_features(t0,t1,'0','1',features)
<Figure size 432x288 with 0 Axes>

[Figure: Distributions of the new aggregate features, target 0 vs target 1]

print('Columns in train_set:{} Columns in test_set:{}'.format(len(train.columns),len(test.columns)))
Columns in train_set:210 Columns in test_set:209
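The one-column difference is simply the target, which the test set does not have; a quick set difference confirms it (sketch):

# the only column in train that is missing from test should be 'target'
print(set(train.columns) - set(test.columns))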

4.2 Training the Model

Here’s a baseline model that uses LightGBM.

# training the model
features=[col for col in train.columns if col not in ['ID_code','target']]
target=train['target']
param={
    'bagging_freq':5,
    'bagging_fraction':0.4,
    'boost':'gbdt',
    'boost_from_average':'false',
    'feature_fraction':0.05,
    'learning_rate':0.01,
    'max_depth':-1,
    'metric':'auc',
    'min_data_in_leaf':80,
    'min_sum_hessian_in_leaf':10.0,
    'num_leaves':13,
    'num_threads':8,
    'tree_learner':'serial',
    'objective':'binary',
    'verbosity':1
}
folds = StratifiedKFold(n_splits=10, shuffle=False, random_state=44000)  # note: random_state has no effect when shuffle=False
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)
    oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900229	valid_1's auc: 0.881617
[2000]	training's auc: 0.91128	valid_1's auc: 0.889429
[3000]	training's auc: 0.918765	valid_1's auc: 0.893439
[4000]	training's auc: 0.924616	valid_1's auc: 0.895931
[5000]	training's auc: 0.929592	valid_1's auc: 0.897636
[6000]	training's auc: 0.933838	valid_1's auc: 0.898786
[7000]	training's auc: 0.937858	valid_1's auc: 0.899318
[8000]	training's auc: 0.941557	valid_1's auc: 0.899733
[9000]	training's auc: 0.94517	valid_1's auc: 0.899901
[10000]	training's auc: 0.948529	valid_1's auc: 0.900143
[11000]	training's auc: 0.951807	valid_1's auc: 0.900281
[12000]	training's auc: 0.954903	valid_1's auc: 0.900269
[13000]	training's auc: 0.957815	valid_1's auc: 0.900107
[14000]	training's auc: 0.960655	valid_1's auc: 0.89994
Early stopping, best iteration is:
[11603]	training's auc: 0.953681	valid_1's auc: 0.900347
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900404	valid_1's auc: 0.882765
[2000]	training's auc: 0.911307	valid_1's auc: 0.889508
[3000]	training's auc: 0.918917	valid_1's auc: 0.893254
[4000]	training's auc: 0.924779	valid_1's auc: 0.895682
[5000]	training's auc: 0.929704	valid_1's auc: 0.897004
[6000]	training's auc: 0.933907	valid_1's auc: 0.897785
[7000]	training's auc: 0.93784	valid_1's auc: 0.89799
[8000]	training's auc: 0.941511	valid_1's auc: 0.898383
[9000]	training's auc: 0.945033	valid_1's auc: 0.898701
[10000]	training's auc: 0.94837	valid_1's auc: 0.898763
[11000]	training's auc: 0.951605	valid_1's auc: 0.89877
[12000]	training's auc: 0.954709	valid_1's auc: 0.898751
[13000]	training's auc: 0.957618	valid_1's auc: 0.898634
Early stopping, best iteration is:
[10791]	training's auc: 0.950935	valid_1's auc: 0.89889
Fold 2
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.90084	valid_1's auc: 0.87531
[2000]	training's auc: 0.911957	valid_1's auc: 0.883717
[3000]	training's auc: 0.919463	valid_1's auc: 0.888423
[4000]	training's auc: 0.925317	valid_1's auc: 0.891101
[5000]	training's auc: 0.930106	valid_1's auc: 0.892821
[6000]	training's auc: 0.93436	valid_1's auc: 0.89362
[7000]	training's auc: 0.938282	valid_1's auc: 0.89429
[8000]	training's auc: 0.941897	valid_1's auc: 0.894544
[9000]	training's auc: 0.945462	valid_1's auc: 0.894652
[10000]	training's auc: 0.948798	valid_1's auc: 0.894821
[11000]	training's auc: 0.952036	valid_1's auc: 0.894888
[12000]	training's auc: 0.955136	valid_1's auc: 0.894657
[13000]	training's auc: 0.958081	valid_1's auc: 0.894511
[14000]	training's auc: 0.960904	valid_1's auc: 0.894327
Early stopping, best iteration is:
[11094]	training's auc: 0.952334	valid_1's auc: 0.894948
Fold 3
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900276	valid_1's auc: 0.882173
[2000]	training's auc: 0.911124	valid_1's auc: 0.889171
[3000]	training's auc: 0.918758	valid_1's auc: 0.893614
[4000]	training's auc: 0.92463	valid_1's auc: 0.89627
[5000]	training's auc: 0.929475	valid_1's auc: 0.897519
[6000]	training's auc: 0.933971	valid_1's auc: 0.898018
[7000]	training's auc: 0.937925	valid_1's auc: 0.898396
[8000]	training's auc: 0.941684	valid_1's auc: 0.898475
[9000]	training's auc: 0.945229	valid_1's auc: 0.898597
[10000]	training's auc: 0.948626	valid_1's auc: 0.898725
[11000]	training's auc: 0.951822	valid_1's auc: 0.898657
[12000]	training's auc: 0.95488	valid_1's auc: 0.898504
[13000]	training's auc: 0.957871	valid_1's auc: 0.898503
Early stopping, best iteration is:
[10712]	training's auc: 0.950891	valid_1's auc: 0.898759
Fold 4
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900213	valid_1's auc: 0.883231
[2000]	training's auc: 0.911052	valid_1's auc: 0.890297
[3000]	training's auc: 0.918649	valid_1's auc: 0.894252
[4000]	training's auc: 0.924548	valid_1's auc: 0.896724
[5000]	training's auc: 0.92951	valid_1's auc: 0.897923
[6000]	training's auc: 0.93393	valid_1's auc: 0.898887
[7000]	training's auc: 0.937896	valid_1's auc: 0.899048
[8000]	training's auc: 0.941556	valid_1's auc: 0.899335
[9000]	training's auc: 0.945033	valid_1's auc: 0.899469
[10000]	training's auc: 0.94841	valid_1's auc: 0.899536
[11000]	training's auc: 0.951679	valid_1's auc: 0.899371
[12000]	training's auc: 0.954731	valid_1's auc: 0.899314
[13000]	training's auc: 0.95771	valid_1's auc: 0.899024
Early stopping, best iteration is:
[10307]	training's auc: 0.949415	valid_1's auc: 0.899591
Fold 5
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899832	valid_1's auc: 0.887942
[2000]	training's auc: 0.910762	valid_1's auc: 0.895511
[3000]	training's auc: 0.918306	valid_1's auc: 0.899303
[4000]	training's auc: 0.924334	valid_1's auc: 0.901522
[5000]	training's auc: 0.929353	valid_1's auc: 0.902569
[6000]	training's auc: 0.933747	valid_1's auc: 0.903396
[7000]	training's auc: 0.937725	valid_1's auc: 0.903844
[8000]	training's auc: 0.941422	valid_1's auc: 0.904181
[9000]	training's auc: 0.944946	valid_1's auc: 0.904167
[10000]	training's auc: 0.948326	valid_1's auc: 0.903872
[11000]	training's auc: 0.951534	valid_1's auc: 0.903846
Early stopping, best iteration is:
[8408]	training's auc: 0.942866	valid_1's auc: 0.904303
Fold 6
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899935	valid_1's auc: 0.884744
[2000]	training's auc: 0.910967	valid_1's auc: 0.892097
[3000]	training's auc: 0.918595	valid_1's auc: 0.896277
[4000]	training's auc: 0.924503	valid_1's auc: 0.898606
[5000]	training's auc: 0.929414	valid_1's auc: 0.89991
[6000]	training's auc: 0.933745	valid_1's auc: 0.900743
[7000]	training's auc: 0.937714	valid_1's auc: 0.901066
[8000]	training's auc: 0.94139	valid_1's auc: 0.900995
[9000]	training's auc: 0.944926	valid_1's auc: 0.901016
Early stopping, best iteration is:
[6986]	training's auc: 0.937661	valid_1's auc: 0.901085
Fold 7
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899968	valid_1's auc: 0.881017
[2000]	training's auc: 0.910826	valid_1's auc: 0.889131
[3000]	training's auc: 0.918484	valid_1's auc: 0.893968
[4000]	training's auc: 0.924432	valid_1's auc: 0.896794
[5000]	training's auc: 0.929348	valid_1's auc: 0.898531
[6000]	training's auc: 0.933656	valid_1's auc: 0.899541
[7000]	training's auc: 0.937572	valid_1's auc: 0.899903
[8000]	training's auc: 0.941255	valid_1's auc: 0.900259
[9000]	training's auc: 0.944865	valid_1's auc: 0.900205
[10000]	training's auc: 0.948314	valid_1's auc: 0.900135
[11000]	training's auc: 0.951556	valid_1's auc: 0.900281
[12000]	training's auc: 0.954647	valid_1's auc: 0.900202
[13000]	training's auc: 0.957629	valid_1's auc: 0.900083
[14000]	training's auc: 0.960473	valid_1's auc: 0.900019
Early stopping, best iteration is:
[11028]	training's auc: 0.951647	valid_1's auc: 0.900328
Fold 8
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899642	valid_1's auc: 0.889764
[2000]	training's auc: 0.91067	valid_1's auc: 0.897589
[3000]	training's auc: 0.918364	valid_1's auc: 0.901604
[4000]	training's auc: 0.92421	valid_1's auc: 0.903614
[5000]	training's auc: 0.929197	valid_1's auc: 0.904601
[6000]	training's auc: 0.933471	valid_1's auc: 0.905101
[7000]	training's auc: 0.93741	valid_1's auc: 0.905128
[8000]	training's auc: 0.941136	valid_1's auc: 0.905215
[9000]	training's auc: 0.944594	valid_1's auc: 0.905207
[10000]	training's auc: 0.948042	valid_1's auc: 0.905092
[11000]	training's auc: 0.951259	valid_1's auc: 0.905037
Early stopping, best iteration is:
[8028]	training's auc: 0.941228	valid_1's auc: 0.905247
Fold 9
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900193	valid_1's auc: 0.884426
[2000]	training's auc: 0.911194	valid_1's auc: 0.891741
[3000]	training's auc: 0.918785	valid_1's auc: 0.895999
[4000]	training's auc: 0.924653	valid_1's auc: 0.8984
[5000]	training's auc: 0.929607	valid_1's auc: 0.899584
[6000]	training's auc: 0.933898	valid_1's auc: 0.900395
[7000]	training's auc: 0.937896	valid_1's auc: 0.900785
[8000]	training's auc: 0.941574	valid_1's auc: 0.900916
[9000]	training's auc: 0.945132	valid_1's auc: 0.901081
[10000]	training's auc: 0.948568	valid_1's auc: 0.901075
[11000]	training's auc: 0.951714	valid_1's auc: 0.901069
[12000]	training's auc: 0.954815	valid_1's auc: 0.901025
[13000]	training's auc: 0.957792	valid_1's auc: 0.901129
Early stopping, best iteration is:
[10567]	training's auc: 0.950365	valid_1's auc: 0.901193
CV score: 0.90025 
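Since roc_curve is already imported, we can also plot the out-of-fold ROC curve behind this CV score (an optional sketch, not part of the original kernel):

# plot the out-of-fold ROC curve for the CV predictions
fpr, tpr, _ = roc_curve(target, oof)
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, label='OOF AUC = {:.5f}'.format(roc_auc_score(target, oof)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()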

We are also interested in feature importance: which features count most during prediction?

cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.show()

[Figure: Feature importance averaged over folds]

5 Submission and Final Result

submission=pd.DataFrame({"ID_code":test['ID_code'].values})
submission['target']=predictions
submission.to_csv(PATH+'submission.csv',index=False)
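Before uploading, a quick sanity check that the file has one row per test ID and the two expected columns does not hurt (a small sketch):

# re-read the file and verify its shape: (len(test), 2)
sub_check = pd.read_csv(PATH + 'submission.csv')
print(sub_check.shape)
print(sub_check.head())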

This simple submission scores 0.89889 on the public leaderboard and 0.90021 on the private leaderboard, good for rank 329/8780 (top 3.7%) on the private leaderboard.
