Hello, this is MiTornAve.
In our last session, we witnessed the magic of Backpropagation, where millions, even billions, of weights are carved by the sophisticated blade of derivatives. While the Multi-Layer Perceptron (MLP) is undoubtedly a powerful logic machine, the moment you command this brilliant machine to "look at a picture of a cat and tell if it's a cat," it struggles badly.
Why is that? It is because the MLP perceives the world only through a "one-dimensional narrow slit."
1. The Destruction of the Plane: Why Can't Perceptrons "See" Images?
To a computer, an image is a 2D matrix of numbers arranged in a grid of horizontal and vertical pixels. The human eye intuitively grasps the "distance" and "positional relationships" between pixels on this 2D plane. We understand the "Spatial Context"—that a cat’s two ears are located on either side of the top of the head, and the eyes and nose form a triangle below them.
However, the input layer of an MLP consists only of a single, long 1D line. To feed a small 28x28 pixel cat photo into an MLP, you must ruthlessly tear this plane apart and flatten it into a sequence of 784 (28x28) numbers.
In this process, pixels that were vertical neighbors in the image end up 28 positions apart, and pixels that once formed the nose and those that formed the eyes can become separated by hundreds of positions in the 1D array. The precious 2D spatial information (up, down, left, and right) is no longer built into the input; the network would have to rediscover it from data alone. Asking the machine to identify a shape from this flattened data is as cruel as asking someone to reconstruct the appearance of a cow just by looking at ground beef.
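To make the damage concrete, here is a minimal sketch of where pixels land after flattening. It assumes row-major flattening (the default in NumPy and PyTorch); the helper name `flat_index` is made up for this illustration.

```python
# Row-major flattening: pixel (row, col) lands at index row * WIDTH + col.
WIDTH = 28
HEIGHT = 28


def flat_index(row: int, col: int, width: int = WIDTH) -> int:
    """Position of pixel (row, col) in the flattened 1D array."""
    return row * width + col


center = flat_index(10, 10)
right_neighbor = flat_index(10, 11)  # one pixel to the right
lower_neighbor = flat_index(11, 10)  # one pixel below
far_below = flat_index(20, 10)       # ten rows below

print(right_neighbor - center)  # 1
print(lower_neighbor - center)  # 28
print(far_below - center)       # 280
```

A horizontal neighbor stays next door, but a vertical neighbor is pushed 28 positions away, and the MLP receives no hint that these two numbers were ever adjacent.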
To overcome this, researchers drew inspiration from David Hubel and Torsten Wiesel’s experiments on the feline visual cortex, which began in the late 1950s. Hubel and Wiesel discovered that cat brain cells do not process the entire visual field at once; instead, specific cells respond only to local regions, such as "lines" or "edges" in certain directions, and are connected hierarchically. These findings inspired Kunihiko Fukushima's Neocognitron (1980) and, later, Yann LeCun's convolutional networks trained with backpropagation (1989), which grew into the modern Convolutional Neural Network (CNN).
2. Convolution: The Mathematical Magnifying Glass Scanning the World
A CNN does not force images into a single line. It preserves the 2D structure and observes the image by sliding a tiny magnifying glass called a "Filter" or "Kernel" across it.
The Principle: Convolution Operation
This filter is not just glass; it is a "matrix of weights," typically 3x3 or 5x5 in size. When the filter lands on a specific pixel area of the original image, it multiplies the numbers in the same positions of both matrices (pixel values and weights) and then adds them all up.
Mathematically, it looks like this:
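$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i+m, j+n)\, K(m, n)$$
(Where $I$ is the image matrix, $K$ is the filter matrix, and $S$ is the output. Strictly speaking, this un-flipped form is called cross-correlation; deep learning libraries implement it under the name "convolution.")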
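The double sum translates almost word for word into nested loops. The following is a minimal pure-Python sketch, assuming no padding and a stride of 1; the function name `conv2d_valid` and the tiny 4x4 image are invented for this example.

```python
def conv2d_valid(image, kernel):
    """S(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n), with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # every entry is -5.0
```

This kernel subtracts the lower-right pixel from the upper-left one. In this image that difference is always -5, so the 3x3 output is constant.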
What does this operation mean? Suppose we apply a filter designed to find "vertical lines" (e.g., the left column is 1 and the right column is -1).
When the filter passes over a plain area like a clear sky, the pixels under the left and right columns are nearly equal, so the positive and negative weights cancel and the result is near 0 (no response).
But the moment it scans a strong vertical boundary, like a building wall or a tree trunk, it produces a dramatically large positive or negative value (strong response).
The trail left by the filter as it scans the entire image creates a new map where only specific features (vertical lines, in this case) shine brightly. We call this summarized map a Feature Map.
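You can check both behaviors on a toy image before touching a real photo. This is a small sketch under assumed values: a 5x6 image whose left half is bright (200) and right half is dark (50), scanned with the vertical-line filter described above. The helper `conv2d_valid` is a hypothetical name, not a library function.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


# Left column +1, middle column 0, right column -1
vertical_edge_filter = [[1, 0, -1] for _ in range(3)]

# Synthetic image: bright left half, dark right half, boundary between columns 2 and 3
image = [[200, 200, 200, 50, 50, 50] for _ in range(5)]

feature_map = conv2d_valid(image, vertical_edge_filter)
print(feature_map[0])  # [0, 450, 450, 0]
```

The flat regions on either side give exactly 0, while the two window positions that straddle the boundary give 450.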
3. Hierarchy of Features: From Points to Lines, from Lines to Objects
A CNN uses more than just one filter. A single layer employs dozens or hundreds of different filters to churn out various feature maps—horizontal lines, diagonals, specific colors, textures, and more. As the network gets deeper, something amazing happens.
Early Layers: They look at extremely narrow Receptive Fields. Visualization reveals simple, primitive geometric features like vertical/horizontal lines and gradients.
Intermediate Layers: They combine the "line" information from previous layers. Lines meet to form corners, circles, and textures (e.g., the texture of fur or the round shape of an eye).
Deep Layers: At the end of the network, the receptive field widens to grasp the context of the entire image. Now, it extracts abstract, high-level object-level features like a "dog's snout" or a "car's wheel."
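How quickly does the receptive field widen? Here is a rough sketch using the standard recurrence for stacked layers: each layer adds (kernel size - 1) times the accumulated stride. The layer stacks below are hypothetical examples, not a specific published architecture.

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Five stacked 3x3 convolutions with stride 1
print(receptive_field([(3, 1)] * 5))  # [3, 5, 7, 9, 11]

# Same 3x3 convolutions with 2x2 stride-2 downsampling in between
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # [3, 4, 8, 10, 18]
```

With stride 1 alone the view grows by only two pixels per layer; adding downsampling steps lets deep layers cover a much larger part of the image.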
4. Why Is CNN Superior? (Mathematical Dominance)
CNNs dominate image processing not just because they maintain 2D structures, but because of two decisive structural masterstrokes.
Weight Sharing: The Miracle Diet
Imagine an MLP processing a high-resolution 1000x1000 pixel image. A single neuron in the first hidden layer would require one million (1,000,000) weights, so a hidden layer of just 1,000 neurons would need a billion. Memory use balloons, and training becomes impractical.
However, a CNN reuses (shares) the same 3x3 filter (only 9 weights) across the entire image. Just as a magnifying glass doesn't change its lens when you move it, the weights remain constant. This drastically reduces the number of parameters, allowing even massive networks to be trained quickly and efficiently.
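The arithmetic is simple enough to check by hand. The sketch below counts weights only (biases are left out), and the layer sizes of 1,000 hidden neurons and 64 filters are assumptions chosen purely for illustration.

```python
def dense_weights(n_inputs, n_neurons):
    """Weights in a fully connected layer: every input connects to every neuron."""
    return n_inputs * n_neurons


def conv_weights(in_channels, out_channels, kernel_size):
    """Weights in a conv layer: one small kernel per (input, output) channel pair."""
    return out_channels * in_channels * kernel_size * kernel_size


n_pixels = 1000 * 1000  # grayscale 1000x1000 image

print(dense_weights(n_pixels, 1))     # 1000000
print(dense_weights(n_pixels, 1000))  # 1000000000
print(conv_weights(1, 1, 3))          # 9
print(conv_weights(1, 64, 3))         # 576
```

Note that the convolutional count does not depend on the image size at all: the same 9 weights serve a 28x28 thumbnail and a 1000x1000 photo.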
Translation Invariance: "Finding It Anywhere"
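A dog in a photo is still a dog, whether it’s in the center or the bottom-right corner. For an MLP, a change in position reshuffles the sequence of input numbers, making it look like entirely new data. But since a CNN's filter scans the entire image with the same weights, it responds to a feature wherever it appears; the response simply shows up at the corresponding position in the feature map. (Strictly speaking, this property is called translation equivariance; it becomes invariance once later layers pool or summarize over positions.) This provides a robust vision that is unshaken by changes in position.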
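Here is a small experiment to illustrate this. It is only a sketch: the 3x3 "plus" pattern, the 8x8 blank canvas, and the helper names are all invented for the demonstration. The same filter scans two images that differ only in where the pattern sits.

```python
PATTERN = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def make_image(top, left, size=8):
    """Blank image with the plus pattern placed at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for m in range(3):
        for n in range(3):
            image[top + m][left + n] = PATTERN[m][n]
    return image


def strongest_response(feature_map):
    """Return (value, (row, col)) of the largest entry in the feature map."""
    return max(
        (value, (i, j))
        for i, row in enumerate(feature_map)
        for j, value in enumerate(row)
    )


# Use the pattern itself as the filter and scan both images
print(strongest_response(conv2d_valid(make_image(1, 1), PATTERN)))  # (5, (1, 1))
print(strongest_response(conv2d_valid(make_image(4, 3), PATTERN)))  # (5, (4, 3))
```

The peak value is 5 in both cases, and only its location moves along with the pattern. Taking the maximum over the whole feature map therefore gives the same answer no matter where the pattern appears.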
[Hands-on Section] Applying Filters in Google Colab
It's hard to fully appreciate this process through formulas alone. Open Google Colab and try the following code. With just a few lines of PyTorch, you can see an image transform into numbers and be reborn as a "Feature Map."
```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from skimage import data

# 1. Load a test grayscale image (Cameraman, 512x512)
image = data.camera()

# Convert to a float tensor shaped (Batch 1, Channel 1, Height 512, Width 512)
image_tensor = torch.tensor(image, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

# 2. Craft a 3x3 filter to extract 'Vertical Edges'
# Left column weighs +1, middle column is ignored (0), right column weighs -1,
# so the response is large wherever brightness changes from left to right.
vertical_edge_filter = torch.tensor([[[[1., 0., -1.],
                                       [1., 0., -1.],
                                       [1., 0., -1.]]]])

# 3. Create a Convolutional (Conv2d) layer
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)

# Overwrite the randomly initialized weights with our custom filter
with torch.no_grad():
    conv.weight.copy_(vertical_edge_filter)

# 4. Apply filter to the image (Sliding!)
# No padding is used, so the 512x512 input yields a 510x510 feature map.
feature_map = conv(image_tensor)

# 5. Visualize Results
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image, cmap='gray')
axes[0].set_title("Original Image")
axes[1].imshow(feature_map.squeeze().detach().numpy(), cmap='gray')
axes[1].set_title("Feature Map (Vertical Edges)")
for ax in axes:
    ax.axis('off')
plt.show()
```
[Interactive Simulation] Try Convolution Yourself
If the code feels a bit daunting, use the interactive guides on this blog to visualize the math.
5. Summary: The Birth of a Giant Retina
Today, we explored the visual mechanism of CNNs, where AI moves beyond 1D limitations to scan 2D spaces and extract key features using a mathematical magnifying glass (filters). The resulting feature maps record where each pattern appears in the image, giving later layers refined visual information to build on.
However, data in our world isn't always static like an image. There are continuous sequences where meaning shifts over time, such as human speech, stock market trends, or real-time sensor signals.
In the next session, we’ll move beyond a fixed gaze to explore the AI’s memory system: RNN and LSTM. We will finally see how artificial intelligence begins to grasp the dimension of 'Time' by projecting past memories into present decisions.