If you are an AI enthusiast, you have probably thought about training a neural network to predict or classify something. Most people use existing frameworks like Keras, PyTorch, or TensorFlow to train their models and save a good amount of time, but have you ever wondered what happens under the hood? If so, this article is for you. Here we will build a simple model that predicts the XOR of two binary inputs. The task may seem trivial, but it will help you understand how a model adjusts its parameters internally and gets better with training.


Prerequisites:

  1. Basic Python Programming
  2. NumPy Operations
  3. Basic Matrix Operations & Calculus

The required libraries, numpy and matplotlib, can be easily installed using pip or conda.

Let's first understand,

What is a Machine Learning/Deep Learning model?

These models are function approximators: they can replicate the behavior of a function. They may not produce the exact value, but they produce values that are very close to the function's true output. In the case of deep learning, the results generally improve as the model is trained on more data. This means that if you have a large enough dataset, you can build a robust model.

Now let's focus on a basic skeleton of a model and later we will be expanding each part.

  • First, you need a dataset to train your model. Let's call it X.
  • Design your model. You can think of the model as a mathematical function F(X).

In a linear model, the relation is F(X) = W.X + b, where W is the weight of the layer and b is the bias.

  • Calculate the loss with a loss function (L) that tracks the model's performance.

loss = L(Y_predicted, Y_target), where Y_target is the desired output and Y_predicted is F(X), i.e. the predicted output.

  • Write an optimizer function that updates the parameters (W, b) using the partial derivatives of the loss with respect to them. Let's call it O(loss, W, b).


Now let's see the basic training loop,

do until 'n' epochs:
    Y_predicted = F(X)               # feed-forward step
    loss = L(Y_predicted, Y_target)  # calculate loss
    O(loss, W, b)                    # backpropagation and parameter update
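As a concrete, minimal sketch of this loop, the snippet below fits a linear model F(X) = W.X + b with a mean squared error loss and plain gradient descent. The toy data (y = 3x + 1), the learning rate, and the epoch count are assumptions chosen only for illustration; they are not part of the XOR model built later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 3x + 1, features stored row-wise
X = rng.uniform(-1, 1, size=(1, 200))
Y_target = 3 * X + 1

# Parameters of the linear model F(X) = W.X + b
W = np.zeros((1, 1))
b = np.zeros((1, 1))
learning_rate = 0.1  # assumed value

losses = []
for epoch in range(500):
    Y_predicted = np.dot(W, X) + b                  # feed-forward step
    loss = np.mean((Y_predicted - Y_target) ** 2)   # L: mean squared error
    losses.append(loss)

    # O: partial derivatives of the loss, then a gradient-descent update
    n = X.shape[1]
    dY = 2 * (Y_predicted - Y_target) / n
    dW = np.dot(dY, X.T)
    db = np.sum(dY, axis=1, keepdims=True)
    W -= learning_rate * dW
    b -= learning_rate * db

print(W, b, losses[-1])
```

After training, W and b should end up close to 3 and 1, the values used to generate the toy data.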


Let's dive into the code now.

We will need numpy (for easier computation) and matplotlib (for visualizing our model's performance). First, let's generate some random data for our XOR model.

In ML, the inputs to the model are called the features of the data. They are usually stored row-wise or column-wise, and the other axis indexes the records (samples) we want to feed to our network. Thus our dataset X has dimension (features, n) or (n, features), and the output Y has dimension (1, n) or (n, 1). Here the output will be 0 or 1, so we could use a regression approach, but binary classification also works. We will store the features row-wise, so X has shape (2, 1000): each row is a feature and each column is a record.

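# import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import random

# generate dataset: 2 features (rows) x 1000 records (columns) of 0/1 values
X = np.random.randint(0, 2, size=(2, 1000))
# target: XOR of the two feature rows, shape (1, 1000)
y = np.bitwise_xor(X[0:1], X[1:]).reshape(1, -1)
features = X.shape[0]  # number of input features (2)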
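A quick sanity check on the generated data can catch shape mistakes early. The sketch below regenerates the dataset and compares every column against the XOR truth table; the fixed seed and the `expected` helper array are assumptions added only to make the check repeatable.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only to make the run repeatable

X = np.random.randint(0, 2, size=(2, 1000))
y = np.bitwise_xor(X[0:1], X[1:]).reshape(1, -1)

print(X.shape, y.shape)  # (2, 1000) (1, 1000)

# XOR truth table: the output is 1 only when the two inputs differ
expected = (X[0] != X[1]).astype(int)
print(np.array_equal(y[0], expected))  # True

# First five records: rows are input 1, input 2, XOR output
print(np.vstack([X[:, :5], y[:, :5]]))
```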


Next, we will describe our neural net. Here we will make a simple neural network with one hidden layer. A neural network can be described by its parameters (weights and biases), which are arrays of numbers; we will store them in a dictionary. From here onwards you will encounter the term hyperparameter: a setting, such as the number of hidden nodes, that is chosen before training rather than learned, and that affects the model's predictions. We usually find the best set of hyperparameters by comparing the loss curve and the validation accuracy curve, or with methods like grid search and random search.

def make_model(features, n_hidden_nodes, output_dim):
    # layer 1 weights, initialized to small random values
    W1 = np.random.randn(n_hidden_nodes, features) * 0.01
    # layer 1 biases
    b1 = np.zeros(shape=(n_hidden_nodes, 1))
    # layer 2 weights, initialized to small random values
    W2 = np.random.randn(output_dim, n_hidden_nodes) * 0.01
    # layer 2 biases
    b2 = np.zeros(shape=(output_dim, 1))
    # collect all parameters in one dictionary
    return {
        "W1": W1,
        "b1": b1,
        "W2": W2,
        "b2": b2
    }

hidden_nodes = 2  # hyperparameter
output_dim = 1
# architecture of our model: 2 nodes in the input layer,
# 2 nodes in the hidden layer, and 1 node in the output layer
model_parameters = make_model(features, hidden_nodes, output_dim)
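To see what the parameter dictionary holds, the following sketch rebuilds it in compact form and prints each array's shape and the total number of trainable values. The `shapes` and `total` names are illustrative helpers, not part of the original code.

```python
import numpy as np

def make_model(features, n_hidden_nodes, output_dim):
    # Compact version of the constructor described above
    return {
        "W1": np.random.randn(n_hidden_nodes, features) * 0.01,
        "b1": np.zeros((n_hidden_nodes, 1)),
        "W2": np.random.randn(output_dim, n_hidden_nodes) * 0.01,
        "b2": np.zeros((output_dim, 1)),
    }

model_parameters = make_model(features=2, n_hidden_nodes=2, output_dim=1)

# Shape of every parameter array and the total count of trainable numbers
shapes = {name: value.shape for name, value in model_parameters.items()}
total = sum(value.size for value in model_parameters.values())
print(shapes)
print(total)  # 9
```

With 2 inputs, 2 hidden nodes, and 1 output, the network has 4 + 2 + 2 + 1 = 9 trainable values, which is why it trains quickly even in plain numpy.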

Now let's make a feed-forward function for our model. It will be simple, as we will only be performing some basic calculations. If you propagate forward using only F, i.e. F(X) = W.X + b, the model is just a set of linear equations, and stacking such layers still gives a linear function, which cannot represent XOR. So we will look at activation functions, which give the model its non-linear property. Here we will use the sigmoid activation function, whose output lies in the interval (0, 1). There are many others, like ReLU and tanh, and you can try them out; remember that the activation function is also a hyperparameter. The output of F(X) is taken as the input to the activation layer, which produces the input for the next layer. We will also keep track of the values at the different layers, as they will be used to compute the gradients in backpropagation.

# sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# feed forward function
def forward(x, model_parameters):
    W1, b1, W2, b2 = model_parameters["W1"], model_parameters["b1"], model_parameters["W2"], model_parameters["b2"]
    # hidden layer: linear step followed by the activation
    Z1 = np.dot(W1, x) + b1
    A1 = sigmoid(Z1)
    # A1 = np.tanh(Z1)
    # output layer: linear step followed by the activation
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    # keep the intermediate values for backpropagation
    cache = { "Z1" : Z1, "A1" : A1, "Z2" : Z2, "A2" : A2 }
    return A2, cache
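The need for an activation function can be checked numerically. The sketch below uses arbitrary random weights (an assumption for illustration) to show that two linear layers without an activation collapse into a single linear layer, while inserting a sigmoid between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))
x = rng.integers(0, 2, size=(2, 5)).astype(float)

# Two linear layers with no activation in between ...
two_layers = np.dot(W2, np.dot(W1, x) + b1) + b2

# ... equal one linear layer with merged weights and bias
W_merged = np.dot(W2, W1)
b_merged = np.dot(W2, b1) + b2
one_layer = np.dot(W_merged, x) + b_merged
print(np.allclose(two_layers, one_layer))  # True

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With a sigmoid in between, the merged layer is no longer equivalent in general
with_activation = np.dot(W2, sigmoid(np.dot(W1, x) + b1)) + b2
print(np.allclose(with_activation, one_layer))
```

Since a single linear layer cannot separate the XOR classes, the non-linearity is what makes the hidden layer useful at all.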


Now our prediction model is ready: together with the linear weights, biases, and activation layers, it can produce predictions. Congrats, you have reached your first milestone.

Now we will make a loss function, which measures how close our model's predictions are to the target values. Our goal is to minimize this loss with each epoch (step). Some well-known loss functions are mean squared error, L1 loss, and cross-entropy loss. Here the single output takes one of two values, and we are treating this as a classification problem with two classes, so we can use binary cross-entropy (BCE), which is the cross-entropy loss for two-class output. In general, for n records with targets y_i and predictions p_i,

BCE = -(1/n) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
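A minimal numpy sketch of binary cross-entropy follows. The function name, the `eps` clipping, and the example predictions are assumptions for illustration, not the article's original code. It shows that the loss is small for confident correct predictions, about log(2) for undecided ones, and large for confident wrong ones.

```python
import numpy as np

def binary_cross_entropy(y_predicted, y_target, eps=1e-12):
    # eps (an assumed safeguard) keeps log() away from exactly 0 or 1
    y_predicted = np.clip(y_predicted, eps, 1 - eps)
    losses = -(y_target * np.log(y_predicted)
               + (1 - y_target) * np.log(1 - y_predicted))
    return float(np.mean(losses))

y_target = np.array([[0, 1, 1, 0]])
confident_right = np.array([[0.1, 0.9, 0.9, 0.1]])
undecided = np.array([[0.5, 0.5, 0.5, 0.5]])
confident_wrong = np.array([[0.9, 0.1, 0.1, 0.9]])

print(binary_cross_entropy(confident_right, y_target))  # about 0.105
print(binary_cross_entropy(undecided, y_target))        # about 0.693
print(binary_cross_entropy(confident_wrong, y_target))  # about 2.303
```

A freshly initialized network with tiny weights outputs values near 0.5, so its starting loss sits near 0.693; a loss curve that never leaves that value suggests the model has not learned anything yet.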


Here the loss curve isn't that good, as it does not converge to any value.

This was the loss curve generated for this model using BCE. I hope this was helpful; you can experiment with the hyperparameters to find the best model. That's all, and keep exploring.
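As a starting point for such experiments, here is a hedged sketch of a full training loop with backpropagation for a sigmoid network trained with BCE. It is not the article's original code: the `backward` and `bce` helpers, the 4 hidden nodes, the unit-scale weight initialization, the learning rate, and the epoch count are all assumptions, and whether the loss converges will depend on those choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same layout as the article: features row-wise, one column per record
X = rng.integers(0, 2, size=(2, 1000))
y = np.bitwise_xor(X[0:1], X[1:])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, p):
    A1 = sigmoid(np.dot(p["W1"], x) + p["b1"])
    A2 = sigmoid(np.dot(p["W2"], A1) + p["b2"])
    return A2, A1

def bce(a2, target, eps=1e-12):
    a2 = np.clip(a2, eps, 1 - eps)
    return float(-np.mean(target * np.log(a2) + (1 - target) * np.log(1 - a2)))

def backward(x, target, p):
    # Gradients of the mean BCE loss for a sigmoid output layer
    n = x.shape[1]
    A2, A1 = forward(x, p)
    dZ2 = (A2 - target) / n
    dZ1 = np.dot(p["W2"].T, dZ2) * A1 * (1 - A1)
    return {
        "W1": np.dot(dZ1, x.T), "b1": np.sum(dZ1, axis=1, keepdims=True),
        "W2": np.dot(dZ2, A1.T), "b2": np.sum(dZ2, axis=1, keepdims=True),
    }

# Assumed hyperparameters, chosen only for illustration
hidden_nodes, learning_rate, epochs = 4, 1.0, 5000
params = {
    "W1": rng.normal(size=(hidden_nodes, 2)), "b1": np.zeros((hidden_nodes, 1)),
    "W2": rng.normal(size=(1, hidden_nodes)), "b2": np.zeros((1, 1)),
}

losses = []
for epoch in range(epochs):
    A2, _ = forward(X, params)
    losses.append(bce(A2, y))            # track the loss for plotting
    grads = backward(X, y, params)       # backpropagation
    for name in params:                  # gradient-descent update
        params[name] -= learning_rate * grads[name]

print(losses[0], losses[-1])
```

Plotting `losses` with matplotlib for different values of `hidden_nodes` and `learning_rate` is one way to compare hyperparameter settings.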



