Machine learning is like sex in secondary school: everybody is talking about it, only a few know what to do, and only your teacher is actually doing it. If you have ever tried to read articles about machine learning on the Internet, you have most likely found two kinds: dense academic papers full of theorems (I could not even get through half of one) or off-putting fairy tales about artificial intelligence, data-science magic, and the jobs of the future.
Kaggle is a platform for data scientists to share and find datasets, and to explore, build, and train data models to make predictions in a web-based environment.
A few key points about machine learning and Kaggle follow, to replace those fairy tales with the basics.
A decision tree is a type of data model.
Capturing patterns from data is called fitting or training the model.
The data used to fit the model is called the training data.
After the model is fit, you can apply it to new data to make predictions.
The pandas library in Python is easy to use and well known for data analysis and manipulation:
import pandas as pd
A DataFrame holds data in a tabular format.
We can load and inspect data using the functions below:
data = pd.read_csv('path/to/file.csv')  # read a CSV file into a DataFrame
data.shape  # (rows, columns) of the data set; shape is an attribute, not a method
data.head()  # returns the first 5 rows to give you an idea of the data
data.describe()  # summary statistics (count, mean, std, min, max, ...)
Suppose we have housing-price data to analyse, with the end goal of predicting prices for a particular area.
To achieve this, we need to focus on a couple of things:
Prediction target: simply the value we are trying to predict, also called the output.
Features: in simple terms, the columns of our data, the variables we want the model to use to make its predictions.
By convention, data scientists use a lowercase y for the prediction target and an uppercase X for the feature matrix.
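As a minimal sketch, assuming the housing CSV has a Price column plus feature columns named Rooms, Bathroom, and Landsize (hypothetical names; substitute the columns in your own data set):
y = data.Price  # prediction target: the prices we want to predict
feature_names = ['Rooms', 'Bathroom', 'Landsize']  # hypothetical feature columns
X = data[feature_names]  # feature matrix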
Model the data with a decision tree; scikit-learn's DecisionTreeRegressor is used here.
Fit the model on the features and the prediction target.
y.head() shows the first few rows of the prediction target (prices).
X.head() shows the first few rows of the features.
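A minimal sketch of fitting and predicting, assuming the X and y defined above:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)  # fixing random_state makes runs reproducible
model.fit(X, y)  # fit features and prediction target into the model
print(model.predict(X.head()))  # predicted prices for the first few rows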
There are many metrics for summarizing model quality, such as MAE (mean absolute error), which we can use to validate the model. MAE is easy to compute with scikit-learn's sklearn.metrics module.
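For example, a sketch of the in-sample MAE, computed on the same training data the model was fit on (which tends to look optimistic, as the next points explain):
from sklearn.metrics import mean_absolute_error

predictions = model.predict(X)
print(mean_absolute_error(y, predictions))  # average absolute gap between actual and predicted prices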
Sometimes it is wise to split the data into separate training and validation segments and evaluate the MAE on the held-out segment; a lower validation MAE gives us a better idea of which model serves the prediction target best. BUT beware of overfitting and underfitting.
Overfitting is where the model matches the training data almost perfectly but does poorly on validation or other new data.
Underfitting is where the model misses the most important distinctions and patterns in the data, so it performs poorly even on the training data.
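As a sketch of finding the balance, assuming the X and y from above: DecisionTreeRegressor's max_leaf_nodes parameter controls tree size, so very small trees underfit and very large trees overfit, and the validation MAE shows the sweet spot.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

for max_leaf_nodes in [5, 50, 500, 5000]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(max_leaf_nodes, mae)  # pick the size with the lowest validation MAE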
Instead of putting all your data into a single decision tree, it is more advisable to train multiple trees (a random forest model), which usually gives better results because it averages the predictions of all the trees. That can get you a lower MAE, which is what we are looking for.
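A minimal random forest sketch, assuming the train/validation split from above:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)  # an ensemble of many decision trees
forest_model.fit(train_X, train_y)
print(mean_absolute_error(val_y, forest_model.predict(val_X)))  # typically lower than a single tree's MAE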
Kaggle offers a great forum for learning and for applying the knowledge and skills you have gained to actual competitions and datasets. There are a number of datasets to choose from, and each contest provides a very welcoming community to assist you. I would suggest signing up for Kaggle and trying it out!