Image classification is the task of taking an input image and outputting a class (a cat, dog, etc.) or a probability distribution over classes that best describes the image. We humans learn this skill within the first months of life, and for the rest of our lives it comes naturally and effortlessly. We’re able to quickly and seamlessly identify the environment we are in as well as the objects that surround us, all without even consciously noticing. These skills of quickly recognizing patterns, generalizing from prior knowledge, and adapting to different image environments are ones that we do not share with our fellow machines.
When a computer sees an image (takes an image as input), it sees an array of pixel values. Depending on the resolution and size of the image, it might see, say, a 32 x 32 x 3 array of numbers (the 3 refers to the RGB color channels). Just to drive home the point, let's say we have a color image in JPG form and its size is 480 x 480. The representative array will be 480 x 480 x 3. Each of these numbers has a value from 0 to 255 which describes the pixel intensity at that point (for more detail on this, see How do computers see an image?). These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The computer takes this array of numbers and outputs numbers that describe the probability of the image being a certain class (say, .80 for cat, .15 for dog, .05 for bird, etc.).
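A minimal NumPy sketch of this idea (the image here is random noise and the class scores are made-up numbers, purely for illustration): the input is a height x width x 3 array of 0-255 intensities, and a classifier's raw scores are commonly turned into class probabilities with a softmax.

```python
import numpy as np

# A hypothetical 480 x 480 RGB image: height x width x 3 color channels,
# each entry a pixel intensity from 0 to 255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(480, 480, 3), dtype=np.uint8)
print(image.shape)   # (480, 480, 3)
print(image.size)    # 691200 numbers -- all the computer actually "sees"

# Made-up raw scores (logits) for three classes: cat, dog, bird.
# A softmax converts them into probabilities that sum to 1.
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(2))  # [0.79 0.18 0.04]
```

The key point is simply the shapes: the classifier maps 691,200 raw numbers down to one small vector of class probabilities.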
In the object localization task, our job is not only to produce a class label, as in image classification, but also a bounding box that describes where the object is in the picture.
We also have the task of object detection, where localization needs to be done on all of the objects in the image. Therefore, you will have multiple bounding boxes and multiple class labels.
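As a sketch of what these outputs look like (real formats vary by library; the `Detection` record and coordinate convention here are made up for illustration), localization yields one label-plus-box record, while detection yields a list of them:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                        # class label, e.g. "cat"
    score: float                      # classifier confidence
    box: tuple                        # (x_min, y_min, x_max, y_max) in pixels

# Object localization: one class label plus one bounding box.
localization = Detection("cat", 0.92, (48, 60, 310, 420))

# Object detection: localization applied to every object in the image,
# so the output is multiple boxes and multiple class labels.
detections = [
    Detection("cat", 0.92, (48, 60, 310, 420)),
    Detection("dog", 0.88, (330, 120, 470, 400)),
]
for d in detections:
    x_min, y_min, x_max, y_max = d.box
    print(d.label, "box area:", (x_max - x_min) * (y_max - y_min), "pixels")
```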
Finally, we also have object segmentation where the task is to output a class label as well as an outline of every object in the input image.
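To make the contrast with bounding boxes concrete, a segmentation output can be sketched as a per-pixel label map the same height and width as the input image (a minimal illustration, not any particular library's format; the rectangular regions stand in for real object outlines, which follow the object's shape pixel by pixel):

```python
import numpy as np

# Class ids: 0 = background, 1 = cat, 2 = dog.
mask = np.zeros((480, 480), dtype=np.uint8)
mask[60:420, 48:310] = 1     # pixels belonging to the cat
mask[120:400, 330:470] = 2   # pixels belonging to the dog

# Unlike a bounding box, the mask assigns a class to every single pixel,
# so the object's outline is recoverable from it.
print(np.unique(mask))           # class ids present in the image
print((mask == 1).sum(), "cat pixels")
```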