A Neural Network is a series of algorithms designed to recognize patterns, inspired by how the human brain works. It consists of interconnected layers of nodes (neurons) that process input data and “learn” from it over multiple iterations.
- Input Layer: This is where data enters the network (e.g., features like age, income, etc.).
- Hidden Layers: The network performs its magic in these layers, extracting patterns from the data.
- Output Layer: This layer produces the final output (e.g., a classification, like “yes” or “no”).
Each connection between neurons has a weight that gets adjusted during learning to improve predictions. The process of adjusting these weights is done using backpropagation.
What is a Perceptron?
The perceptron is the building block of an Artificial Neural Network (ANN). It’s the simplest form of a neuron in an ANN. Just like how biological neurons take inputs, process them, and decide to “fire” or not, a perceptron takes in inputs, processes them, and makes a decision (classification).
Real-World Example: E-commerce Product Classification
Imagine you’re running an online store, and you want to automatically classify whether a product is “popular” or “not popular” based on certain features like:
- Number of reviews
- Average rating
- Price
A perceptron could be used to classify these products into “popular” (1) or “not popular” (0).
How Does It Work?
- Input Features: The perceptron takes in multiple inputs (e.g., number of reviews, price, etc.).
- Weights: Each input is multiplied by a weight, which signifies how important that input is in making the decision.
- Summation: The weighted inputs are added together.
- Activation Function: The sum is passed through a step function (or another activation function) to determine if the result is greater than a threshold (like a simple yes/no decision).
In real-world applications, perceptrons are the building blocks of more complex neural networks.
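To make those steps concrete, here is a minimal sketch of a single perceptron decision in plain NumPy. The weights, bias, and threshold below are made-up values for illustration, not learned ones.
import numpy as np
# One product: [number of reviews, average rating, price]
x = np.array([50, 4.5, 20])
# Hypothetical weights (how important each feature is) and bias
w = np.array([0.03, 0.8, -0.05])
b = -2.0
# Summation: weighted inputs plus bias
z = np.dot(x, w) + b
# Step activation: "fire" (1 = popular) if the sum crosses the threshold of 0
prediction = 1 if z > 0 else 0
print(f"z = {z:.2f}, prediction = {prediction}")
The scikit-learn example below does the same thing, except that the weights are learned from data instead of chosen by hand.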
Why Do Perceptrons Matter?
Perceptrons introduced the idea that a model could “learn” weights from data and make decisions, which is foundational for all neural networks. They are a key part of modern AI technologies.
Applications:
- Spam Detection: Email systems use perceptron-based models to classify whether a message is spam or not based on email content.
- Fraud Detection: Banks use more complex networks (like MLPs) built on perceptrons to detect unusual spending patterns.
from sklearn.linear_model import Perceptron
import numpy as np
# Example data (features: number of reviews, average rating, price)
X = np.array([[50, 4.5, 20], [10, 2.5, 10], [100, 4.8, 30], [5, 1.5, 5]])
# Labels (1: Popular, 0: Not Popular)
y = np.array([1, 0, 1, 0])
# Initialize and train the perceptron
perceptron = Perceptron()
perceptron.fit(X, y)
# Predict whether a new product is popular
new_product = np.array([[30, 3.8, 15]]) # New product with some reviews, avg rating, price
prediction = perceptron.predict(new_product)
print(f"Prediction (1 = Popular, 0 = Not Popular): {prediction[0]}")
Neural Network / MLP (Multi-Layer Perceptron)
In an Artificial Neural Network (ANN), there are many perceptrons (also called neurons) working together to solve more complex problems. Each neuron receives input, applies weights, processes the sum, and produces an output.
So, a neuron in an ANN is like a single decision-making unit, but when you combine many neurons together, they can learn more complex patterns.
Real-World Example: Predicting Customer Behavior in Travel
A single neuron might predict something simple like whether a customer is likely to search for a flight. However, when many neurons are connected together in an ANN, they can predict more complex behaviors, like whether a customer will book a flight after visiting a travel website multiple times.
In a neural network, the computation for each neuron (node) works like this:
- Input x: These are the features or data points that you’re feeding into the network (e.g., age, income, etc.).
- Weight w: The weight represents the importance of a particular input. If a feature is more important for the prediction, it will have a higher weight. These weights are learned by the model during training.
- Bias b: This is an extra term added to help the model fit the data better. Think of it like a shift or offset. It ensures that even if the input is zero, the neuron can still activate.
- Mathematical Operation: For each neuron, you take the input x, multiply it by the weight w, and then add the bias b. So, mathematically, it looks like:
z = x × w + b
Real-World Analogy:
Imagine you’re deciding whether to buy a house. The input could be factors like the house size, location, and number of bedrooms. The weight would be how important each factor is to you — maybe size is more important than location. The bias could be your overall budget constraint. You combine all these factors to decide if the house is a good deal or not.
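Here is a tiny sketch of that computation for one neuron; the feature values, weights, and bias below are invented purely for illustration.
import numpy as np
# Hypothetical inputs: [house size (sq ft), number of bedrooms, distance to center (km)]
x = np.array([1500, 3, 12])
# Hypothetical weights (importance of each input) and bias (offset)
w = np.array([0.2, 10.0, -5.0])
b = 50.0
# z = x * w + b (weighted sum of the inputs plus the bias)
z = np.dot(x, w) + b
print(f"z = {z}")  # this raw value is what gets passed to an activation function next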
Activation Functions
Now, after calculating z = x × w + b, the neuron needs to decide whether to “activate” and produce an output. This is where the activation function comes in.
- Why Do We Need Activation Functions? Without an activation function, the output would just be a linear combination of the inputs. But real-world problems are often non-linear, so activation functions introduce non-linearity, allowing the network to model more complex relationships.
- It helps the model make decisions. For instance, it helps decide if an email is spam or not, based on multiple factors (inputs).
Common Activation Functions (a short code sketch of each follows this list):
1. Sigmoid: Maps the output to a value between 0 and 1. Great for binary classification problems.
- Example: Will this customer make a purchase (yes/no)?
2. ReLU (Rectified Linear Unit): Sets all negative values to zero and leaves positive values unchanged. The most commonly used activation in deep networks.
- Example: Used in complex tasks like image recognition.
3. Tanh: Maps the output to a value between -1 and 1. Similar to sigmoid but centered around zero.
- Example: Used in cases where you need a stronger signal on both sides (positive and negative).
4. Softmax: Converts the outputs into probabilities for multi-class classification problems.
- Example: Used to classify an image as one of several categories (e.g., dog, cat, car).
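Here is a minimal sketch of these four functions in NumPy; the input values are arbitrary.
import numpy as np
z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # example pre-activation values
sigmoid = 1 / (1 + np.exp(-z))   # squashes values into (0, 1)
relu = np.maximum(0, z)          # negatives become 0, positives pass through
tanh = np.tanh(z)                # squashes values into (-1, 1), centered at 0
# Softmax: turns a vector of scores into probabilities that sum to 1
softmax = np.exp(z) / np.sum(np.exp(z))
print("sigmoid:", np.round(sigmoid, 3))
print("relu:   ", relu)
print("tanh:   ", np.round(tanh, 3))
print("softmax:", np.round(softmax, 3), "sum =", softmax.sum())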
Multi-Class Classification:
Multi-class classification is a machine learning task where you need to classify data into more than two categories. Unlike binary classification (where the output is one of two classes), in multi-class classification, the output can be one of several classes.
Real-World Examples:
- Image Classification: Predicting whether an image is a cat, dog, or bird.
- Product Categorization: In e-commerce, predicting whether a product belongs to the electronics, clothing, or home goods category.
- Customer Segmentation: Predicting whether a customer belongs to low-spending, medium-spending, or high-spending groups.
How Does Multi-Class Classification Work?
For multi-class classification, you often use algorithms that can handle more than two categories. Here’s an overview of common approaches:
1. One-vs-Rest (OvR):
- The classifier learns to predict one class versus all others. For each class, a separate binary classifier is trained.
- Example: For an image dataset with three classes (cat, dog, bird), you build three classifiers:
- Cat vs. not-cat
- Dog vs. not-dog
- Bird vs. not-bird
- How it Works: Each classifier gives a probability score, and the class with the highest score is chosen as the final prediction.
2. One-vs-One (OvO):
- Every possible pair of classes gets its own binary classifier.
- Example: For 3 classes (cat, dog, bird), you train:
- Cat vs. Dog
- Cat vs. Bird
- Dog vs. Bird
- How it Works: When you make a prediction, each classifier votes for a class, and the class with the most votes is the final prediction.
3. Softmax Classifier (Direct Multi-Class):
- Unlike OvR or OvO, the softmax classifier can handle all classes at once. It directly computes probabilities for each class, and the one with the highest probability is selected.
- How it Works: The softmax activation function converts the raw outputs (logits) into probabilities that sum to 1, making it a natural fit for multi-class problems. (A short sketch of all three strategies follows.)
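As a rough sketch, scikit-learn lets you try all three strategies on the same data. The tiny dataset below is made up just to show the API; with three classes (0 = cat, 1 = dog, 2 = bird) it mirrors the example above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
# Toy features and labels (0 = cat, 1 = dog, 2 = bird)
X = np.array([[1, 2], [2, 1], [8, 9], [9, 8], [4, 9], [5, 8]])
y = np.array([0, 0, 1, 1, 2, 2])
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)   # one binary classifier per class
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)    # one binary classifier per pair of classes
softmax_clf = LogisticRegression().fit(X, y)                # direct multi-class (softmax/multinomial)
new_point = np.array([[3, 8]])
print("OvR prediction:", ovr.predict(new_point))
print("OvO prediction:", ovo.predict(new_point))
print("Softmax probabilities:", softmax_clf.predict_proba(new_point).round(3))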
Which Algorithms are Good for Multi-Class Classification?
1. Logistic Regression (with Softmax):
- Works well for problems where the classes are linearly separable.
- Used in problems like document classification or image classification.
2. Random Forest:
- Good for problems where data may not be linearly separable.
- Works well in cases like customer segmentation or fraud detection.
3. Support Vector Machines (SVM):
- Great for high-dimensional data, like text classification.
- Works with the One-vs-Rest strategy for multi-class tasks.
4. Neural Networks (using Softmax for Output Layer):
- Powerful for complex data like images or speech. For example, classifying images in e-commerce sites into categories like clothing, shoes, and accessories.
How Do You Decide Which to Use?
- Type of Data: If your data has complex patterns (like images, text), neural networks are a great fit. If your data is structured (like tables with numerical values), logistic regression or random forests could work well.
- Performance: Neural networks can capture complex relationships, but they require more data and computation. Simpler models like logistic regression can be faster but may not work well for complex data.
- Amount of Data: Neural networks tend to perform better with larger datasets, while models like SVM or logistic regression may work well with smaller datasets.
Real-World Applications
- E-Commerce Product Classification: In an online marketplace, multi-class classifiers help categorize products based on features like title, description, and price. For example, a classifier can predict whether a product belongs to electronics, furniture, or clothing.
- Travel Recommendation Systems: Multi-class classifiers can categorize customers into segments based on their travel preferences. For example, you can predict whether a customer prefers adventure trips, beach vacations, or city tours.
Problem: Predicting Customer Satisfaction Level (Low, Medium, High)
Imagine an e-commerce platform wants to predict the satisfaction level of customers based on several features like:
- Purchase Amount: How much they spent.
- Delivery Time: How many days it took to deliver the product.
- Customer Support Response Time: How quickly support responded to their inquiry.
- Product Rating: How they rated the product.
We will classify customer satisfaction into three categories:
- Low (0)
- Medium (1)
- High (2)
Python Example
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
# Step 1: Create a larger synthetic dataset
np.random.seed(42)
# Generate random data for purchase amount, delivery time, support response time, and product rating
X = np.random.randint(50, 1000, size=(100, 4)) # Features: purchase amount, delivery time, etc.
y = np.random.randint(0, 3, size=(100,)) # Target: customer satisfaction (0: Low, 1: Medium, 2: High)
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Step 2: Train the Random Forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Step 3: Predict on the test set
y_pred = model.predict(X_test)
# Step 4: Print predictions and actual labels to compare
print(f"Predictions: {y_pred}")
print(f"Actual labels: {y_test}")
# Step 5: Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Step 6: Confusion matrix to check where the model is making mistakes
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
What This Code Does:
1. Data Generation:
- Creates a synthetic dataset (X) with random values for features like purchase amount, delivery time, etc.
- Randomly assigns customer satisfaction levels (y): 0 (Low), 1 (Medium), and 2 (High).
2. Model Training:
- Splits the dataset into a training set (70%) and a test set (30%).
- Trains a RandomForestClassifier on the training data.
3. Evaluation:
- Predicts customer satisfaction for the test set.
- Prints the predictions and actual labels.
- Calculates the accuracy of the model.
- Prints a confusion matrix to analyze the performance for each class.
By the way, I only got about 40% accuracy with this model.
Cost Functions and Gradient Descent
In machine learning, the cost function (also known as the loss function) tells you how far off your model’s predictions are from the actual results. If you’re getting a 40% accuracy, it means your model’s predictions aren’t closely matching the real values. The cost function helps quantify this error.
Let’s break this down with an e-commerce example:
Scenario:
Imagine you’re running an online store, and you want to predict whether customers will buy products (Category A, Category B, or Category C) based on features like the amount of time spent browsing, the product’s price, and the customer’s past purchase history.
To make accurate predictions, your model needs to minimize its errors — this is where the cost function comes in.
Cost Function:
- In simple terms, it calculates how wrong your model’s predictions are.
- For multi-class classification, a common cost function is Cross-Entropy Loss. It penalizes wrong predictions more severely and helps guide the model towards making better predictions.
Example (related to e-commerce): Let’s say you predicted a customer will buy Category A when they actually bought Category B. The cost function will calculate the “penalty” for this wrong prediction. Your model aims to minimize this penalty.
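To make the penalty concrete, here is a small sketch of the cross-entropy calculation for one customer; the predicted probabilities are invented for illustration.
import numpy as np
# Model's predicted probabilities for [Category A, Category B, Category C]
predicted = np.array([0.7, 0.2, 0.1])  # the model is fairly confident it's Category A
# The customer actually bought Category B (one-hot encoded)
actual = np.array([0, 1, 0])
# Cross-entropy loss: -sum(actual * log(predicted))
loss = -np.sum(actual * np.log(predicted))
print(f"Cross-entropy loss: {loss:.3f}")  # about 1.609: a big penalty for a confident wrong guess
# A prediction that favors the correct class gets a much smaller penalty
better = np.array([0.1, 0.8, 0.1])
print(f"Loss for a better prediction: {-np.sum(actual * np.log(better)):.3f}")  # about 0.223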
Gradient Descent:
Now, once you have the cost (error), you need a way to reduce it. This is where Gradient Descent comes in.
- Gradient Descent is like taking steps downhill on a cost “landscape” to find the lowest point, which corresponds to the best model (one with the least error).
- Imagine a hill where you want to find the lowest point, and each step is an adjustment of the model’s weights and biases.
How it Works:
- Initialization: Start with some random weights for your model.
- Compute Gradient: The gradient tells you how much to change your weights to reduce the cost.
- Update Weights: Adjust the weights slightly in the opposite direction of the gradient (like taking steps downhill).
- Repeat: Keep updating weights until the cost is minimized, or at least small enough.
The key idea is that the model is “learning” from its mistakes, adjusting its weights and biases to make better predictions.
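Here is a minimal gradient descent loop on a one-parameter toy cost function, J(w) = (w - 3)^2, just to show the update rule; the cost function, starting weight, and learning rate are all made up for illustration.
# Toy cost function with its minimum at w = 3
def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)  # derivative of the cost with respect to w

w = 0.0                 # 1. Initialization: start from an arbitrary weight
learning_rate = 0.1
for step in range(25):
    grad = gradient(w)               # 2. Compute gradient: direction that increases the cost
    w = w - learning_rate * grad     # 3. Update: step in the opposite direction
    # 4. Repeat until the cost is small enough
print(f"w = {w:.4f}, cost = {cost(w):.6f}")  # w ends up close to 3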
Why is This Useful?
For your current model with only 40% accuracy, this process helps the model to gradually improve. If the cost function sees a large error, it adjusts the weights and biases in the right direction, making better predictions over time.
Let’s code it.
- Cost Function (Cross-Entropy Loss): This is already handled by the MLPClassifier from scikit-learn. Its default loss function for classification tasks is cross-entropy.
- Gradient Descent: The MLPClassifier uses stochastic gradient descent (or a variant such as Adam) as part of its learning process. By adjusting the learning rate, the number of iterations, and the solver, you can directly influence how gradient descent works.
Let’s modify the code to highlight how the loss function and gradient descent come into play and tune it for better accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, log_loss
# Simulated e-commerce dataset: time_spent, product_price, customer_purchase_history
X = np.array([
[30, 100, 3], # 30 minutes browsing, $100 price, 3 past purchases
[20, 80, 1],
[25, 120, 4],
[40, 60, 0],
[35, 110, 2],
[50, 150, 5],
[60, 90, 0],
])
# Labels: 0 = Category A, 1 = Category B, 2 = Category C
y = np.array([0, 1, 2, 0, 1, 2, 0])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features (recommended for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Define the neural network (MLPClassifier uses cross-entropy loss by default)
mlp = MLPClassifier(hidden_layer_sizes=(5,),
                    max_iter=500,
                    learning_rate_init=0.01,  # Starting learning rate
                    solver='adam',            # Adam is an optimization method (gradient descent variant)
                    random_state=42)
# Train the model
mlp.fit(X_train, y_train)
# Predict on test data
y_pred = mlp.predict(X_test)
y_pred_proba = mlp.predict_proba(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Calculate Cross-Entropy Loss (same as log loss in scikit-learn)
loss = log_loss(y_test, y_pred_proba, labels=mlp.classes_)  # pass labels in case the small test set is missing a class
print(f"Cross-Entropy Loss: {loss}")
# Print weight coefficients learned by the model
print(f"Weights: {mlp.coefs_}")
How It Works:
- Cross-Entropy Loss: This measures the difference between predicted probabilities and actual labels. The log_loss function calculates it, and it is automatically minimized by the model during training.
- Gradient Descent: The solver 'adam' is used, which is a variant of gradient descent. It's known for faster convergence and is widely used in deep learning models.
- Learning Rate: The learning_rate_init parameter sets the initial step size for gradient descent. A lower value might result in slower learning but better convergence. (A short sketch comparing learning rates follows.)
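As a rough illustration (reusing X_train, X_test, y_train, and y_test from the code above), you can loop over a few initial learning rates and compare the resulting loss; with such a tiny dataset the exact numbers will vary, so treat this as a sketch of the tuning process rather than a benchmark.
for lr in [0.001, 0.01, 0.1]:
    mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500,
                        learning_rate_init=lr, solver='adam', random_state=42)
    mlp.fit(X_train, y_train)
    proba = mlp.predict_proba(X_test)
    print(f"learning_rate_init={lr}: "
          f"accuracy={accuracy_score(y_test, mlp.predict(X_test)):.2f}, "
          f"loss={log_loss(y_test, proba, labels=mlp.classes_):.3f}")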
Feedforward (How Neural Networks Make Predictions)
- Input Layer: In feedforward, data starts at the input layer. Each node represents a feature (e.g., for house price prediction, inputs could be square footage, location, etc.)
- Weights & Bias: Each input is multiplied by its corresponding weight, which signifies the strength or importance of that input in predicting the output. A bias is then added to shift the activation.
- Activation Function: After getting the weighted sum (z), we pass it through an activation function (e.g., ReLU, sigmoid, etc.), which determines if the neuron should be “activated” (fire a signal) or not.
- Hidden Layers: The output from each neuron in one layer becomes the input to neurons in the next layer. This process continues across all hidden layers in the network.
- Output Layer: Finally, the result from the last hidden layer goes through the output layer, which gives us the model’s prediction (for classification, it may predict probabilities for each class). A small NumPy sketch of this pass follows.
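Here is a minimal NumPy sketch of one feedforward pass through a tiny network (2 inputs, one hidden layer of 3 neurons, 2 output classes); all weights and biases are random rather than trained, so the output is meaningless, but the flow of the computation is the point.
import numpy as np
np.random.seed(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.5, 1.2])      # input layer: 2 features
W1 = np.random.randn(2, 3)    # weights from input to hidden layer (3 neurons)
b1 = np.random.randn(3)
W2 = np.random.randn(3, 2)    # weights from hidden layer to output (2 classes)
b2 = np.random.randn(2)
h = relu(x @ W1 + b1)         # hidden layer: weighted sum + bias, then activation
output = softmax(h @ W2 + b2) # output layer: probabilities for each class
print("Predicted class probabilities:", output.round(3))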
Backpropagation (How Neural Networks Learn)
After making predictions with feedforward, backpropagation adjusts weights to improve predictions in future iterations:
1. Loss Function Calculation: First, the model calculates how far off the prediction is from the actual result. This is measured using a loss function (e.g., mean squared error for regression or cross-entropy for classification).
- Example: If the model predicted house prices, the loss would measure the difference between the predicted and actual prices.
2. Compute Gradients: Backpropagation calculates how much each weight in the network contributed to the error. It uses the chain rule from calculus to determine how a small change in each weight would affect the loss.
3. Update Weights: Using gradient descent, the model adjusts the weights in the direction that minimizes the loss. Each weight is updated by subtracting the gradient multiplied by a learning rate.
4. Repeat: This process of feedforward prediction and backpropagation weight adjustment is repeated many times (iterations), allowing the network to learn from its mistakes.
In simpler terms, feedforward is about making predictions using weights, while backpropagation is about adjusting those weights based on the errors in predictions. Together, these steps enable the neural network to gradually improve its accuracy.
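Here is a from-scratch sketch that puts both steps together for a single sigmoid neuron trained on a tiny made-up dataset; a full network repeats the same idea layer by layer using the chain rule.
import numpy as np
np.random.seed(1)

# Tiny made-up dataset: 2 features per sample, binary label
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.1, 0.4], [2.0, 1.8]])
y = np.array([1, 1, 0, 1])

w = np.random.randn(2)  # randomly initialized weights
b = 0.0                 # bias
lr = 0.5                # learning rate

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for epoch in range(200):
    # Feedforward: predictions for every sample
    y_hat = sigmoid(X @ w + b)
    # Loss: binary cross-entropy, averaged over samples
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Backpropagation: gradients of the loss with respect to w and b (chain rule)
    error = y_hat - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    # Gradient descent: step in the direction that reduces the loss
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss: {loss:.4f}, weights: {w.round(3)}, bias: {b:.3f}")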
Summary of the Steps:
- Collect Data: Prepare the dataset and define features (X) and target (y).
- Initialize Inputs: Define input features (X) and output (y).
- Initialize Weights and Biases: Randomly initialize weights (w) and biases (b).
- Apply Activation Function: Add non-linearity with activation functions (e.g., ReLU, Sigmoid, Softmax).
- Feed Forward: Pass inputs through the network and get predictions.
- Compute Loss: Measure how far predictions are from actual outputs.
- Backpropagation and Gradient Descent: Adjust weights to minimize the loss.
- Train the Network: Iterate over multiple epochs, updating weights each time.
- Evaluate the Model: Measure accuracy, confusion matrix, and other metrics.
- Make Predictions: Use the trained model to predict new outcomes.