Raccoon Or Raccoon Dog?

What Is This?

This is a run-through of some of the ideas from Lesson 0 of the FastAI Practical Deep Learning for Coders course (sort of, there's a 2022 version that I'm using which doesn't seem to exactly match the lectures on the website). In it we search for photos using a search engine and build a neural-network to classify the images that belong to one of the two classes of photos that we use. This is an image classification example, like the Cats vs Dogs post but it has the added feature of demonstrating how to build your own dataset using a search-engine. I'll be using Tanuki (the Japanese Raccoon Dog) and Raccoon images as the categories to classify.

Imports

For the search engine we'll use DuckDuckGo via the duckduckgo-search package (from pypi) and its ddg_images function..

# python
from functools import partial
from pathlib import Path
from time import sleep

import os, warnings

# pypi
from duckduckgo_search import ddg_images
from dotenv import load_dotenv

import torch

# fastai
from fastai.data.all import (
    CategoryBlock,
    DataBlock,
    parent_label,
    RandomSplitter,
)
from fastai.vision.all import (
    download_images,
    get_image_files,
    ImageBlock,
    Resize,
    resnet18,
    resize_images,
    verify_images,
    vision_learner,
    error_rate,
    PILImage,
)

from fastcore.net import urlsave

# monkey shines
from graeae import Timer

TIMER = Timer()
load_dotenv()

DATA_PATH = Path(os.environ["FASTAI_DATA"])/"raccoon-vs-tanuki"
assert torch.cuda.is_available()

Note: The DATA_PATH is where we're going to store the images we download. We are going to use a function (parent_label) that uses the folders within this directory to label the images within the folders (so images in a folder named "herbert" will be labeled "herbert"). This means that it should only have the folders that we are going to use to build the model. I originally set it to the fastai root root data path which then made the data loader think that all the other data folders were labels as well, so I created a sub-folder named "raccoon-vs-tanuki" to isolate the images I need to train the model.

Getting the Images

We're going to create an alias for the ddg_images function to make the search and then return only the URLs of the images (or their thumbnails) that DuckDuckGo finds.

def get_image_urls(keywords: str,
                   max_images: int=200,
                   license_image: str="any",
                   key="image") -> list:
    """Search duckduckgo images

    Args:
     keywords: A string with keywords to give to duckduckgo
     max_images: the upper limit for how many images to return

    Returns:
     a list-like object with the URLs of the images found
    """
    return [output.get(key) for output in 
            ddg_images(keywords,
                       type_image="photo",
                       license_image=license_image,
                       max_results=max_images)]

A Test Of Tanuki

We'll start by checking that our searcher is working using the keywords "tanuki" and "racoon". First, what does ddg_images return when we search for tanuki?

o = ddg_images("tanuki", type_image="photo", max_results=1)
print(o)
[{'title': 'Tanuki | Animal Jam Fanon Wiki | Fandom', 'image': 'https://vignette.wikia.nocookie.net/ajfanideas/images/6/6a/God_damnit.png/revision/latest?cb=20190222141158', 'thumbnail': 'https://tse3.mm.bing.net/th?id=OIP.74LPltCuN75QxFq2RHLhywHaFj&pid=Api', 'url': 'https://ajfanideas.fandom.com/wiki/Tanuki', 'height': 1200, 'width': 1600, 'source': 'Bing'}]

So it looks like it returns a list of json/dict objects. I'll print it out in a table to maybe make it easier to see. First, the title contains 'pipes' that break the table so I'll replace them with dashes.

o[0]["title"] = o[0]["title"].replace("|", ",")

Now the table.

print("|Key | Value|")
print("|-+-|")
for key, value in o[0].items():
    print(f"|{key}| {value}|")
Key Value
title Tanuki , Animal Jam Fanon Wiki , Fandom
image latest?cb=20190222141158
thumbnail https://tse3.mm.bing.net/th?id=OIP.74LPltCuN75QxFq2RHLhywHaFj&pid=Api
url https://ajfanideas.fandom.com/wiki/Tanuki
height 1200
width 1600
source Bing

Looking at the image URL you might not guess that it was an image of a tanuki (is the tanuki named "God damnit"?), but the title suggests that it is. Interestingly, if you follow the URL to the page where the image comes from you'll see that it's a wiki dedicated to a game called "Animal Jam" but the author of the page says that they couldn't find an image of the tanuki from the game so it is, indeed, a photo of a real tanuki, not a game character.

That's the output of ddg_images but we created get_image_urls to make it a little simpler to get just the URLs so let's search for "tanuki" images again but this time I'm going to download and show the image so I'll specify that I want to pull the image from the Public Domain.

TANUKI_URLS = get_image_urls("tanuki", max_images=1, license_image="Public")
print(TANUKI_URLS[0])
https://c.pxhere.com/photos/04/ff/animal_marten_raccoon_dog_tanuki_enok_obstfuchs_omnivore_fur-705793.jpg!d

The images are usually pretty big so let's download a thumbnail of the image and take a look at it to make sure we're getting the image we expect. The original fastai notebook uses the fastai download_url function (which appears to come from another fastai project called fastdownload (github, documentation)) but, it looks like all this function is doing is starting a progress bar (which I can't use here) and then calling urlsave from another fastai library called fastcore (documentation) so I'll use urlsave instead.

THUMBS = get_image_urls("tanuki", max_images=1, license_image="Public", key="thumbnail")
TANUKI_OUTPUT = "/tmp/tanuki_thumb.jpg"
urlsave(url=THUMBS[0], dest=TANUKI_OUTPUT)

And Here's the thumbnail we downloaded.

tanuki_thumb.jpg

Seems to work. One disadvantage to using the get_image_urls to alias the ddg_images function is that we end up throwing away the other information, so to get the source URL to see the page where the image comes from we have to make another function call.

print(get_image_urls("tanuki", max_images=1, license_image="Public", key="url")[0])
https://pxhere.com/en/photo/705793

This image comes from pxhere.com which appears to be a public domain image hosting site.

Now for the raccoon.

RACCOON_OUTPUT = "/tmp/raccoon_thumb.jpg"
RACCOON_URLS = get_image_urls("raccoon",
                             max_images=1,
                             license_image="Public", key="thumbnail")

urlsave(url=RACCOON_URLS[0],
        dest=RACCOON_OUTPUT)

raccoon_thumb.jpg

print(get_image_urls("raccoon",
                     max_images=1,
                     license_image="Public", key="url")[0])
http://www.publicdomainpictures.net/view-image.php?image=33712&picture=raccoon-4&large=1

Build A Data Set

Now that we've done a little check of what our function does we can move on to creating our dataset using it. When you download an archived dataset from fastai it saves it to the ~/.fastai/ directory, so I'll put this dataset there too. I'll use fastai's download_images function to do the actual downloading.

print(download_images.__doc__)
Download images listed in text file `url_file` to path `dest`, at most `max_pics`

These are the arguments it takes.

Argument Meaning Default
dest Folder Path to save files to None (required)
url_file Text file with one URL per line to use as source None (only used if urls is None)
urls Iterable collection of URLs to download None
max_pics Limit on the number of images to download 1000
n_workers Number of parallel threads to use 8
timeout Seconds to allow for a download 4
preserve_filename Whether to use the filename in the URL False

We'll add two extra keywords - "sun" and "shade" to the search to hopefully get images that match those conditions and between each search query I'll put in a sleep so that we aren't hitting the server too hard. We'll also use fastai's resize_images to make sure that none of the images are too big. The argument max_size gives the maximum number of pixels either dimension (height or width) can have.

def download_and_resize(destination: Path, search_terms: str, max_size: int=400) -> None:
    """Download images and resize them

    Args:
     destination: path to parent folder
     search_terms: keywords to use to search for images
     max_size: maximum size for height and width of images
    """
    download_images(
        dest=destination,
        urls=get_image_urls(SEARCH_TERMS)
    )

    resize_images(path=destination,
                  max_size=max_size,
                  dest=destination)
    return

The path argument is the source of the images and the dest is where you want to put the resized images. Normally I don't suppose you'd want to remove the original images, but in this case I do so they're set to the same folder.

ANIMALS = ("tanuki", "raccoon")

PAUSE = 10
PAUSE_BETWEEN_SEARCHES = partial(sleep, PAUSE)
CONDITIONS = tuple(("", "sun ", "shade "))

print(f"Estimated Run Time: {len(CONDITIONS) * len(ANIMALS) * PAUSE + 15} seconds")

with TIMER:
    print("Searching for:")
    for animal in ANIMALS:
        destination = DATA_PATH/animal
        destination.mkdir(exist_ok=True, parents=True)

        for condition in CONDITIONS:
            SEARCH_TERMS = f"{animal} {condition}"
            print(f" - '{SEARCH_TERMS}'")
            download_and_resize(destination, SEARCH_TERMS)
            PAUSE_BETWEEN_SEARCHES()
Estimated Run Time: 75 seconds
Started: 2022-12-08 18:24:00.371278
Searching for:
 - 'tanuki '
 - 'tanuki sun '
 - 'tanuki shade '
 - 'raccoon '
 - 'raccoon sun '
 - 'raccoon shade '
Ended: 2022-12-08 18:27:07.360951
Elapsed: 0:03:06.989673

Verify the Dataset

Some of the images might be invalid for whatever reason, we'll use a fastai builtin function (verify_images) to check them and Path.unlnk to delete the files that were deemed invalid. verify_images works by trying to open each file as an image. It adds some parallelism to speed it up but isn't doing anything fancy and, depending on how many files you have and their size, might take a little while to cmoplete.

with TIMER:
    failed = verify_images(get_image_files(DATA_PATH))
    failed.map(Path.unlink)
print(f"{len(failed)} images were deemed failures.")
Started: 2022-12-08 18:28:17.277347
Ended: 2022-12-08 18:28:22.806488
Elapsed: 0:00:05.529141
18 images were deemed failures.

Training the Model

loaders = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=[Resize(192, method='squish')]
).dataloaders(DATA_PATH)
Parameter Argument Description
blocks (ImageBlock, CategoryBlock) Defines the inputs as images and outputs as categories
get_items get_image_files A function to search for image files.
splitter RandomSplitter A class to split the data into training (80%) and validation (20%)
get_y parent_label A function that grabs the name of the folder to use as an image label.
item_tfms Resize Resize all the images to a uniform size (192 x 192) by squishing them.

And now we train the categorizer.

with warnings.catch_warnings() as catcher:
    warnings.simplefilter("ignore")
    learner = vision_learner(loaders, resnet18, metrics=error_rate)

with learner.no_bar() as nobar, Timer() as timmy:
    learner.fine_tune(3)
Started: 2022-12-08 18:28:55.158226
[0, 0.29742079973220825, 0.0608036108314991, 0.020555555820465088, '00:12']
[0, 0.05481018126010895, 0.0367087759077549, 0.009444444440305233, '00:16']
[1, 0.029521549120545387, 0.0178204495459795, 0.004999999888241291, '00:17']
[2, 0.012186083942651749, 0.013458597473800182, 0.006666666828095913, '00:17']
Ended: 2022-12-08 18:29:59.272703
Elapsed: 0:01:04.114477

I put the supression of the warnings in because somebody (I assume FastAI) is calling pytorch with deprecated arguments.

Some Examples

A Helper

def predict_category(path: str, learner) -> tuple:

    with learner.no_bar():
        prediction, probability_index, probabilities = learner.predict(
            PILImage.create(path))
    print(f"This is a {prediction}.")
    print(f"Probability it's a {prediction}: {float(probabilities[int(probability_index)]):.2f}")
    return prediction, probability_index, probabilities

predict = partial(predict_category, learner=learner)

A Tanuki

Let's look at the output of the learner.predict method when we pass the model the picture of a raccoon dog that we looked at when we were looking at the duckduckgo search example.

TANUKI_PATH = "/tmp/tanuki_image.jpg"
urlsave(url=TANUKI_URLS[0], dest=TANUKI_PATH)
prediction, probability_index, probabilities = predict(TANUKI_PATH)
This is a raccoon.
Probability it's a raccoon: 0.99

tanuki_image.jpg

Prediction

The prediction returned by learner.predict is a string version of whatever your labeling function (parent_label in this case) returns.

print(prediction)
raccoon

In this case it thinks it's a raccoon, not a raccoon dog, so our model probably isn't ready for prime-time, but let's look at rest of the output anyway.

Probabilities

The probabilities is a TensorBase which, for our purposes, acts like a list of the probabilities that our image belongs to one of the classifications.

print(probabilities)
TensorBase([0.9894, 0.0106])

There are two probabilities because we have two classifications (raccoon and tanuki). When I first encountered fastai one of the things I couldn't figure out is which probability matches which classification. To figure that out you need our next value, the probability_index.

Probability Index

The probability_index tells you which one of the probabilities matches the predicted classification.

print(probability_index)
TensorBase(0)

Our model predicted that the image was a raccon and since the probability index is 0, the "raccoon" category matches the first entry in the probabilities collection, and looking back at the probabilities this means that the model is 99% sure that this is a raccoon.

Now, a Raccoon

RACCOON_PATH = "/tmp/raccoon_image.jpg"
urlsave(url=RACCOON_URLS[0], dest=RACCOON_PATH)
prediction, probability_index, probabilities = predict(RACCOON_PATH)
This is a raccoon.
Probability it's a raccoon: 1.00

raccoon_image.jpg

It's really sure this is a raccoon.

And Then, the End

The final loss for the model during training was pretty low (less than 1%) but it wasn't able to identify our one tanuki test image. On the one hand, less than 1% loss isn't 0 loss, so I might just have chosen one example that is particularly hard. It might also be important that tanuki and raccoons do look quite a bit alike, so this is a harder problem than, say, cats versus dogs. Also, our method for gathering images isn't checking that the images are unique (although the URLs are, they might be redundant postings), and tanuki might be obscure enough that there aren't a huge variety of images out there, making it harder for the model to train to identify them.