Preparing the Data

In this section, we will prepare the data for modeling, training and testing.

Identify feature and target columns

The target (as noted previously) is the ‘passed’ column. Here I’ll list the feature columns to get an idea of what’s there.

Variable | Description | Data Values
Dalc | workday alcohol consumption | 1, 2, 3, 4, 5
Fedu | father’s education | 0, 1, 2, 3, 4
Fjob | father’s job | at_home, health, other, services, teacher
Medu | mother’s education | 0, 1, 2, 3, 4
Mjob | mother’s job | at_home, health, other, services, teacher
Pstatus | parent’s cohabitation status | A, T
Walc | weekend alcohol consumption | 1, 2, 3, 4, 5
absences | number of school absences | 0 ... 75
activities | extra-curricular activities | no, yes
address | student’s home address type | R, U
age | student’s age | 15, 16, 17, 18, 19, 20, 21, 22
failures | number of past class failures | 0, 1, 2, 3
famrel | quality of family relationships | 1, 2, 3, 4, 5
famsize | family size | GT3, LE3
famsup | family educational support | no, yes
freetime | free time after school | 1, 2, 3, 4, 5
goout | going out with friends | 1, 2, 3, 4, 5
guardian | student’s guardian | father, mother, other
health | current health status | 1, 2, 3, 4, 5
higher | wants to take higher education | no, yes
internet | Internet access at home | no, yes
nursery | attended nursery school | no, yes
paid | extra paid classes within the course subject (Math or Portuguese) | no, yes
reason | reason to choose this school | course, home, other, reputation
romantic | within a romantic relationship | no, yes
school | student’s school | GP, MS
schoolsup | extra educational support | no, yes
sex | student’s sex | F, M
studytime | weekly study time | 1, 2, 3, 4
traveltime | home to school travel time | 1, 2, 3, 4
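
As a rough illustration of separating the features from the target, here is a minimal pandas sketch. The tiny data frame is a made-up stand-in for the student data (only a few of the columns listed above, with invented rows), and the names student_data, feature_columns, X_all and y_all are assumptions for this example, not taken from the original code.

```python
import pandas as pd

# Made-up stand-in for the student data: a few of the listed columns, invented rows.
student_data = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
    "internet": ["yes", "no", "yes"],
    "passed": ["yes", "no", "yes"],
})

# The target is the 'passed' column; every other column is a feature.
target_column = "passed"
feature_columns = [name for name in student_data.columns if name != target_column]

X_all = student_data[feature_columns]  # feature values
y_all = student_data[target_column]    # target values

print("Feature columns:", feature_columns)
print("Target column:", target_column)
```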

Preprocess feature columns

Some machine learning algorithms (e.g. logistic regression) require numeric data, so the columns holding string data had to be transformed. Columns with ‘yes’ or ‘no’ values were converted to 1 and 0 respectively. Columns with other kinds of categorical data were transformed into dummy-variable columns.

The target data was changed in the same way, so that it contains ‘1’ and ‘0’ instead of ‘yes’ and ‘no’.

  • Original Feature Columns: 30
  • With Dummies: 48

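The dummy variables add 18 columns to the feature data (48 - 30 = 18). This matches the table above: each of the nine categorical columns whose values are not ‘yes’ or ‘no’ is replaced by one column per category (27 columns in place of 9).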
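
The transformation described in this section could be sketched roughly as follows. This is an assumption about how one might do it with pandas, not the original code: preprocess_features is a hypothetical helper, and the small data frame is invented for the example.

```python
import pandas as pd

def preprocess_features(features: pd.DataFrame) -> pd.DataFrame:
    """Return an all-numeric version of `features` (hypothetical helper)."""
    pieces = []
    for name, column in features.items():
        if not pd.api.types.is_numeric_dtype(column):
            if set(column.unique()) <= {"yes", "no"}:
                # yes/no columns become 1/0
                column = column.map({"yes": 1, "no": 0})
            else:
                # other categorical columns become one dummy column per category,
                # e.g. 'school' -> 'school_GP', 'school_MS'
                column = pd.get_dummies(column, prefix=name, dtype=int)
        pieces.append(column)
    return pd.concat(pieces, axis=1)

# Invented rows using a few of the columns from the table above.
features = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
    "internet": ["yes", "no", "yes"],
    "Mjob": ["at_home", "teacher", "other"],
})
target = pd.Series(["yes", "no", "yes"], name="passed")

X = preprocess_features(features)
y = target.map({"yes": 1, "no": 0})  # the target gets the same yes/no -> 1/0 treatment

print("Original feature columns:", features.shape[1])
print("With dummies:", X.shape[1])
```

In this toy example, school expands into two columns and Mjob into three (only three of its categories appear in the invented rows), while age and internet stay as single columns.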

Split data into training and test sets

Next the data was shuffled and then split into training and testing sets.

Training and Testing Data
Set | Count
Training Instances | 300
Test Instances | 95
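
A shuffle followed by a split of this size might look like the sketch below. The source does not say which function was used, so this version uses plain pandas (DataFrame.sample to shuffle, then positional slicing). The data frame is a synthetic stand-in with 395 rows (300 + 95), and the random seed is an arbitrary choice for reproducibility.

```python
import pandas as pd

NUM_TRAIN = 300  # training instances, from the table above
NUM_TEST = 95    # test instances, from the table above

# Synthetic stand-in: 395 rows with one feature and a 0/1 target.
data = pd.DataFrame({
    "feature": range(NUM_TRAIN + NUM_TEST),
    "passed": [i % 2 for i in range(NUM_TRAIN + NUM_TEST)],
})

# Shuffle all rows (frac=1 samples every row without replacement), then split by position.
shuffled = data.sample(frac=1, random_state=0)
train = shuffled.iloc[:NUM_TRAIN]
test = shuffled.iloc[NUM_TRAIN:NUM_TRAIN + NUM_TEST]

X_train, y_train = train.drop(columns="passed"), train["passed"]
X_test, y_test = test.drop(columns="passed"), test["passed"]

print("Training instances:", len(train))
print("Test instances:", len(test))
```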