Raccoon Or Raccoon Dog?
What Is This?
This is a run-through of some of the ideas from Lesson 0 of the FastAI Practical Deep Learning for Coders course (sort of: I'm using a 2022 version which doesn't seem to exactly match the lectures on the website). In it we search for photos using a search engine and build a neural network that classifies each image as one of the two classes of photos we collected. This is an image classification example, like the Cats vs Dogs post, but it has the added feature of demonstrating how to build your own dataset using a search engine. I'll be using Tanuki (the Japanese Raccoon Dog) and Raccoon images as the categories to classify.
Imports
For the search engine we'll use DuckDuckGo via the duckduckgo-search package (from pypi) and its ddg_images function.
# python
from functools import partial
from pathlib import Path
from time import sleep
import os, warnings
# pypi
from duckduckgo_search import ddg_images
from dotenv import load_dotenv
import torch
# fastai
from fastai.data.all import (
CategoryBlock,
DataBlock,
parent_label,
RandomSplitter,
)
from fastai.vision.all import (
download_images,
get_image_files,
ImageBlock,
Resize,
resnet18,
resize_images,
verify_images,
vision_learner,
error_rate,
PILImage,
)
from fastcore.net import urlsave
# monkey shines
from graeae import Timer
TIMER = Timer()
load_dotenv()
DATA_PATH = Path(os.environ["FASTAI_DATA"])/"raccoon-vs-tanuki"
assert torch.cuda.is_available()
Note: The DATA_PATH is where we're going to store the images we download. We are going to use a function (parent_label) that uses the folders within this directory to label the images within the folders (so images in a folder named "herbert" will be labeled "herbert"). This means that it should only contain the folders that we are going to use to build the model. I originally set it to the fastai root data path, which made the data loader think that all the other data folders were labels as well, so I created a sub-folder named "raccoon-vs-tanuki" to isolate the images I need to train the model.
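To make the folder-as-label idea concrete, here is a minimal sketch of what a parent-folder labeler does. The function name label_from_parent and the file paths are made up for illustration; fastai's own parent_label is the one actually used in this post.

```python
from pathlib import Path


def label_from_parent(path) -> str:
    """Return the name of the folder that directly contains the file."""
    return Path(path).parent.name


# Hypothetical paths, only to show the folder name becoming the label
print(label_from_parent("raccoon-vs-tanuki/herbert/0001.jpg"))
print(label_from_parent("raccoon-vs-tanuki/tanuki/0002.jpg"))
```

This is also why stray folders under the data path show up as extra categories: every immediate parent folder name is treated as a label.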
Getting the Images
We're going to create a wrapper around the ddg_images function that makes the search and then returns only the URLs of the images (or their thumbnails) that DuckDuckGo finds.
def get_image_urls(keywords: str,
                   max_images: int=200,
                   license_image: str="any",
                   key: str="image") -> list:
    """Search duckduckgo images

    Args:
     keywords: A string with keywords to give to duckduckgo
     max_images: the upper limit for how many images to return
     license_image: the image license to filter the search by
     key: which entry of each search result to return

    Returns:
     a list-like object with the URLs of the images found
    """
    return [output.get(key) for output in
            ddg_images(keywords,
                       type_image="photo",
                       license_image=license_image,
                       max_results=max_images)]
A Test Of Tanuki
We'll start by checking that our searcher is working using the keywords "tanuki" and "raccoon". First, what does ddg_images return when we search for tanuki?
o = ddg_images("tanuki", type_image="photo", max_results=1)
print(o)
[{'title': 'Tanuki | Animal Jam Fanon Wiki | Fandom', 'image': 'https://vignette.wikia.nocookie.net/ajfanideas/images/6/6a/God_damnit.png/revision/latest?cb=20190222141158', 'thumbnail': 'https://tse3.mm.bing.net/th?id=OIP.74LPltCuN75QxFq2RHLhywHaFj&pid=Api', 'url': 'https://ajfanideas.fandom.com/wiki/Tanuki', 'height': 1200, 'width': 1600, 'source': 'Bing'}]
So it looks like it returns a list of json/dict objects. I'll print it out in a table to maybe make it easier to see. First, the title contains 'pipes' that break the table so I'll replace them with commas.
o[0]["title"] = o[0]["title"].replace("|", ",")
Now the table.
print("|Key | Value|")
print("|-+-|")
for key, value in o[0].items():
    print(f"|{key}| {value}|")
Key | Value |
---|---|
title | Tanuki , Animal Jam Fanon Wiki , Fandom |
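image | https://vignette.wikia.nocookie.net/ajfanideas/images/6/6a/God_damnit.png/revision/latest?cb=20190222141158 |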
thumbnail | https://tse3.mm.bing.net/th?id=OIP.74LPltCuN75QxFq2RHLhywHaFj&pid=Api |
url | https://ajfanideas.fandom.com/wiki/Tanuki |
height | 1200 |
width | 1600 |
source | Bing |
Looking at the image URL you might not guess that it was an image of a tanuki (is the tanuki named "God damnit"?), but the title suggests that it is. Interestingly, if you follow the URL to the page where the image comes from you'll see that it's a wiki dedicated to a game called "Animal Jam" but the author of the page says that they couldn't find an image of the tanuki from the game so it is, indeed, a photo of a real tanuki, not a game character.
That's the output of ddg_images, but we created get_image_urls to make it a little simpler to get just the URLs, so let's search for "tanuki" images again. This time I'm going to download and show the image, so I'll specify that I want to pull the image from the Public Domain.
TANUKI_URLS = get_image_urls("tanuki", max_images=1, license_image="Public")
print(TANUKI_URLS[0])
https://c.pxhere.com/photos/04/ff/animal_marten_raccoon_dog_tanuki_enok_obstfuchs_omnivore_fur-705793.jpg!d
The images are usually pretty big so let's download a thumbnail of the image and take a look at it to make sure we're getting the image we expect. The original fastai notebook uses the fastai download_url function (which appears to come from another fastai project called fastdownload (github, documentation)), but it looks like all this function is doing is starting a progress bar (which I can't use here) and then calling urlsave from another fastai library called fastcore (documentation), so I'll use urlsave instead.
THUMBS = get_image_urls("tanuki", max_images=1, license_image="Public", key="thumbnail")
TANUKI_OUTPUT = "/tmp/tanuki_thumb.jpg"
urlsave(url=THUMBS[0], dest=TANUKI_OUTPUT)
And here's the thumbnail we downloaded.
Seems to work. One disadvantage of wrapping the ddg_images function with get_image_urls is that we end up throwing away the other information, so to get the source URL to see the page where the image comes from we have to make another function call.
print(get_image_urls("tanuki", max_images=1, license_image="Public", key="url")[0])
https://pxhere.com/en/photo/705793
This image comes from pxhere.com which appears to be a public domain image hosting site.
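If the extra search call is a bother, one alternative is to keep the whole result records and pull out several fields at once. This is only a sketch: pick_fields is a hypothetical helper, and SAMPLE_RESULTS is a stand-in copied from the earlier test output so that the example runs without hitting the search engine.

```python
def pick_fields(results: list, keys: tuple = ("image", "thumbnail", "url")) -> list:
    """Keep only the requested keys from each search result."""
    return [{key: result.get(key) for key in keys} for result in results]


# Stand-in for what ddg_images returned in the first tanuki search
SAMPLE_RESULTS = [{
    "title": "Tanuki | Animal Jam Fanon Wiki | Fandom",
    "image": "https://vignette.wikia.nocookie.net/ajfanideas/images/6/6a/God_damnit.png/revision/latest?cb=20190222141158",
    "thumbnail": "https://tse3.mm.bing.net/th?id=OIP.74LPltCuN75QxFq2RHLhywHaFj&pid=Api",
    "url": "https://ajfanideas.fandom.com/wiki/Tanuki",
    "height": 1200,
    "width": 1600,
    "source": "Bing",
}]

picked = pick_fields(SAMPLE_RESULTS)
print(picked[0]["thumbnail"])
print(picked[0]["url"])
```

With the records kept together, the thumbnail and the page it came from are available from a single search.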
Now for the raccoon.
RACCOON_OUTPUT = "/tmp/raccoon_thumb.jpg"
RACCOON_URLS = get_image_urls("raccoon",
max_images=1,
license_image="Public", key="thumbnail")
urlsave(url=RACCOON_URLS[0],
dest=RACCOON_OUTPUT)
print(get_image_urls("raccoon",
max_images=1,
license_image="Public", key="url")[0])
http://www.publicdomainpictures.net/view-image.php?image=33712&picture=raccoon-4&large=1
Build A Data Set
Now that we've done a little check of what our function does we can move on to creating our dataset using it. When you download an archived dataset from fastai it saves it to the ~/.fastai/ directory, so I'll put this dataset there too. I'll use fastai's download_images function to do the actual downloading.
print(download_images.__doc__)
Download images listed in text file `url_file` to path `dest`, at most `max_pics`
These are the arguments it takes.
Argument | Meaning | Default |
---|---|---|
dest | Folder Path to save files to | None (required) |
url_file | Text file with one URL per line to use as source | None (only used if urls is None) |
urls | Iterable collection of URLs to download | None |
max_pics | Limit on the number of images to download | 1000 |
n_workers | Number of parallel threads to use | 8 |
timeout | Seconds to allow for a download | 4 |
preserve_filename | Whether to use the filename in the URL | False |
We'll add two extra keywords, "sun" and "shade", to the search to hopefully get images that match those conditions, and between each search query I'll put in a sleep so that we aren't hitting the server too hard. We'll also use fastai's resize_images to make sure that none of the images are too big. The argument max_size gives the maximum number of pixels either dimension (height or width) can have.
def download_and_resize(destination: Path, search_terms: str, max_size: int=400) -> None:
    """Download images and resize them

    Args:
     destination: path to the folder to save the images in
     search_terms: keywords to use to search for images
     max_size: maximum size for height and width of images
    """
    # use the search_terms argument, not a global variable
    download_images(
        dest=destination,
        urls=get_image_urls(search_terms)
    )
    resize_images(path=destination,
                  max_size=max_size,
                  dest=destination)
    return
The path argument is the source of the images and the dest is where you want to put the resized images. Normally I don't suppose you'd want to overwrite the original images, but in this case I do, so they're set to the same folder.
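To illustrate what a max_size cap means for an individual image, here is a small sketch of the arithmetic. It assumes the aspect ratio is kept and that images already within the limit are left alone; fit_within is a hypothetical helper, not fastai's implementation.

```python
def fit_within(width: int, height: int, max_size: int = 400) -> tuple:
    """Scale (width, height) so that the longer side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        # already small enough, nothing to do
        return width, height
    ratio = max_size / longest
    return int(width * ratio), int(height * ratio)


# The first tanuki search result was 1600 x 1200
print(fit_within(1600, 1200))  # (400, 300)
print(fit_within(300, 200))    # (300, 200)
```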
ANIMALS = ("tanuki", "raccoon")
PAUSE = 10
PAUSE_BETWEEN_SEARCHES = partial(sleep, PAUSE)
CONDITIONS = tuple(("", "sun ", "shade "))
print(f"Estimated Run Time: {len(CONDITIONS) * len(ANIMALS) * PAUSE + 15} seconds")
with TIMER:
    print("Searching for:")
    for animal in ANIMALS:
        destination = DATA_PATH/animal
        destination.mkdir(exist_ok=True, parents=True)
        for condition in CONDITIONS:
            SEARCH_TERMS = f"{animal} {condition}"
            print(f" - '{SEARCH_TERMS}'")
            download_and_resize(destination, SEARCH_TERMS)
            PAUSE_BETWEEN_SEARCHES()
Estimated Run Time: 75 seconds
Started: 2022-12-08 18:24:00.371278
Searching for:
 - 'tanuki '
 - 'tanuki sun '
 - 'tanuki shade '
 - 'raccoon '
 - 'raccoon sun '
 - 'raccoon shade '
Ended: 2022-12-08 18:27:07.360951
Elapsed: 0:03:06.989673
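The search strings in the output carry trailing spaces because of how the animal and condition are glued together. I don't know whether that matters to the search engine, but here is a sketch of one way to build the same six queries without the stray whitespace. The build_search_terms helper is hypothetical.

```python
from itertools import product


def build_search_terms(animals, conditions) -> list:
    """Combine each animal with each condition, skipping empty conditions."""
    return [" ".join(part for part in (animal, condition) if part)
            for animal, condition in product(animals, conditions)]


terms = build_search_terms(("tanuki", "raccoon"), ("", "sun", "shade"))
for term in terms:
    print(f" - '{term}'")
```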
Verify the Dataset
Some of the images might be invalid for whatever reason, so we'll use a fastai builtin function (verify_images) to check them and Path.unlink to delete the files that were deemed invalid. verify_images works by trying to open each file as an image. It adds some parallelism to speed it up but isn't doing anything fancy and, depending on how many files you have and their size, might take a little while to complete.
with TIMER:
    failed = verify_images(get_image_files(DATA_PATH))
    failed.map(Path.unlink)
print(f"{len(failed)} images were deemed failures.")
Started: 2022-12-08 18:28:17.277347
Ended: 2022-12-08 18:28:22.806488
Elapsed: 0:00:05.529141
18 images were deemed failures.
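The check-every-file-in-parallel pattern can be sketched with just the standard library. Note the swap: instead of opening each file with an image library, as verify_images is described as doing, this stand-in only looks at the first few bytes for a JPEG or PNG signature, so it is a much weaker test. The names looks_like_image and find_failures are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile

# Leading bytes of JPEG and PNG files (a simplification for this sketch)
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def looks_like_image(path: Path) -> bool:
    """Cheap stand-in check: does the file start with a known signature?"""
    try:
        with open(path, "rb") as reader:
            header = reader.read(8)
    except OSError:
        return False
    return header.startswith(SIGNATURES)


def find_failures(paths: list, workers: int = 4) -> list:
    """Check every path using a thread pool and return the ones that fail."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        checks = list(pool.map(looks_like_image, paths))
    return [path for path, okay in zip(paths, checks) if not okay]


with tempfile.TemporaryDirectory() as folder:
    good = Path(folder) / "good.png"
    bad = Path(folder) / "bad.jpg"
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    # e.g. an error page that was saved with an image extension
    bad.write_bytes(b"<html>404 not found</html>")
    failed_names = [path.name for path in find_failures([good, bad])]
    print(failed_names)
```

A downloaded error page saved with a .jpg extension is the kind of file this would catch; a truncated but correctly-started image is the kind it would miss.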
Training the Model
loaders = DataBlock(
blocks=(ImageBlock, CategoryBlock),
get_items=get_image_files,
splitter=RandomSplitter(valid_pct=0.2, seed=42),
get_y=parent_label,
item_tfms=[Resize(192, method='squish')]
).dataloaders(DATA_PATH)
Parameter | Argument | Description |
---|---|---|
blocks | (ImageBlock, CategoryBlock) | Defines the inputs as images and outputs as categories |
get_items | get_image_files | A function to search for image files. |
splitter | RandomSplitter | A class to split the data into training (80%) and validation (20%) |
get_y | parent_label | A function that grabs the name of the folder to use as an image label. |
item_tfms | Resize | Resize all the images to a uniform size (192 x 192) by squishing them. |
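As a rough illustration of what a seeded random split with valid_pct=0.2 amounts to, here is a sketch using only the standard library. It uses Python's random module rather than whatever fastai uses internally, so the particular files chosen won't match RandomSplitter; random_split and the file names are made up.

```python
import random


def random_split(items: list, valid_pct: float = 0.2, seed: int = 42) -> tuple:
    """Shuffle the indices reproducibly and cut off valid_pct for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * len(items))
    validation = [items[index] for index in indices[:cut]]
    training = [items[index] for index in indices[cut:]]
    return training, validation


files = [f"image_{number}.jpg" for number in range(10)]
training, validation = random_split(files)
print(len(training), len(validation))  # 8 2
```

Fixing the seed means the same images land in the validation set on every run, which keeps the error rates comparable between runs.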
And now we train the categorizer.
with warnings.catch_warnings() as catcher:
    warnings.simplefilter("ignore")
    learner = vision_learner(loaders, resnet18, metrics=error_rate)
    with learner.no_bar() as nobar, Timer() as timmy:
        learner.fine_tune(3)
Started: 2022-12-08 18:28:55.158226
[0, 0.29742079973220825, 0.0608036108314991, 0.020555555820465088, '00:12']
[0, 0.05481018126010895, 0.0367087759077549, 0.009444444440305233, '00:16']
[1, 0.029521549120545387, 0.0178204495459795, 0.004999999888241291, '00:17']
[2, 0.012186083942651749, 0.013458597473800182, 0.006666666828095913, '00:17']
Ended: 2022-12-08 18:29:59.272703
Elapsed: 0:01:04.114477
I put the suppression of the warnings in because somebody (I assume FastAI) is calling pytorch with deprecated arguments.
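The metric we passed to the learner, error_rate, is the fraction of validation images that were labeled incorrectly (one minus the accuracy). Assuming the rows in the training output follow fastai's usual order of epoch, training loss, validation loss, metric, and time, it's the fourth value in each row. Here's a sketch of the calculation with made-up labels; error_rate_from_labels is a hypothetical helper, not the fastai function.

```python
def error_rate_from_labels(predicted: list, actual: list) -> float:
    """Fraction of predictions that don't match the true label."""
    wrong = sum(1 for guess, truth in zip(predicted, actual) if guess != truth)
    return wrong / len(actual)


# Made-up labels: one tanuki out of five images is mistaken for a raccoon
actual = ["raccoon", "raccoon", "tanuki", "tanuki", "raccoon"]
predicted = ["raccoon", "raccoon", "raccoon", "tanuki", "raccoon"]
rate = error_rate_from_labels(predicted, actual)
print(f"{rate:.2f}")  # 0.20
```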
Some Examples
A Helper
def predict_category(path: str, learner) -> tuple:
    with learner.no_bar():
        prediction, probability_index, probabilities = learner.predict(
            PILImage.create(path))
    print(f"This is a {prediction}.")
    print(f"Probability it's a {prediction}: {float(probabilities[int(probability_index)]):.2f}")
    return prediction, probability_index, probabilities
predict = partial(predict_category, learner=learner)
A Tanuki
Let's look at the output of the learner.predict method when we pass the model the picture of a raccoon dog that we looked at in the duckduckgo search example.
TANUKI_PATH = "/tmp/tanuki_image.jpg"
urlsave(url=TANUKI_URLS[0], dest=TANUKI_PATH)
prediction, probability_index, probabilities = predict(TANUKI_PATH)
This is a raccoon.
Probability it's a raccoon: 0.99
Prediction
The prediction returned by learner.predict is a string version of whatever your labeling function (parent_label in this case) returns.
print(prediction)
raccoon
In this case it thinks it's a raccoon, not a raccoon dog, so our model probably isn't ready for prime-time, but let's look at the rest of the output anyway.
Probabilities
The probabilities is a TensorBase which, for our purposes, acts like a list of the probabilities that our image belongs to one of the classifications.
print(probabilities)
TensorBase([0.9894, 0.0106])
There are two probabilities because we have two classifications (raccoon and tanuki). When I first encountered fastai one of the things I couldn't figure out is which probability matches which classification. To figure that out you need our next value, the probability_index.
Probability Index
The probability_index tells you which one of the probabilities matches the predicted classification.
print(probability_index)
TensorBase(0)
Our model predicted that the image was a raccoon and, since the probability index is 0, the "raccoon" category matches the first entry in the probabilities collection. Looking back at the probabilities, this means that the model is 99% sure that this is a raccoon.
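Another way to see which probability goes with which category is to pair the probabilities with the ordered list of class names. The sketch below uses plain lists as stand-ins: the vocab order (alphabetical) is an assumption inferred from the output above, and in fastai I believe the real list is available as learner.dls.vocab.

```python
# Stand-ins for the real values: the assumed class order and the
# probabilities that were printed for the tanuki image
vocab = ["raccoon", "tanuki"]
probabilities = [0.9894, 0.0106]

# Pair each class name with its probability
by_label = dict(zip(vocab, probabilities))

# The index of the largest probability plays the role of probability_index
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)

print(by_label)
print(f"index {best_index} -> {vocab[best_index]}")
```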
Now, a Raccoon
RACCOON_PATH = "/tmp/raccoon_image.jpg"
urlsave(url=RACCOON_URLS[0], dest=RACCOON_PATH)
prediction, probability_index, probabilities = predict(RACCOON_PATH)
This is a raccoon.
Probability it's a raccoon: 1.00
It's really sure this is a raccoon.
And Then, the End
The final error rate for the model during training was pretty low (less than 1%) but it wasn't able to identify our one tanuki test image. On the one hand, an error rate of less than 1% isn't an error rate of 0, so I might just have chosen one example that is particularly hard. It might also be important that tanuki and raccoons do look quite a bit alike, so this is a harder problem than, say, cats versus dogs. Also, our method for gathering images isn't checking that the images are unique (although the URLs are, they might be redundant postings), and tanuki might be obscure enough that there isn't a huge variety of images out there, making it harder for the model to learn to identify them.
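One cheap way to check for redundant postings would be to hash the contents of each downloaded file and group the files that share a digest. This is a sketch of that idea, not something done in this post: find_duplicates is a hypothetical helper, and it only catches byte-for-byte identical files, so a re-encoded or cropped copy of the same photo would slip through.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicates(paths: list) -> list:
    """Group files whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(path)
    return [group for group in by_digest.values() if len(group) > 1]


with tempfile.TemporaryDirectory() as folder:
    # Fake "images": two identical files and one different one
    contents = {"a.jpg": b"same bytes", "b.jpg": b"same bytes", "c.jpg": b"different"}
    paths = []
    for name, content in contents.items():
        path = Path(folder) / name
        path.write_bytes(content)
        paths.append(path)
    duplicate_names = [[path.name for path in group]
                       for group in find_duplicates(paths)]
    print(duplicate_names)
```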