Pulling Images For the Abecedarium

Imports and Setup

Imports

# python
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from google_images_download.google_images_download import googleimagesdownload as GoogleImages

Set Up

ENV_PATH = Path("~/projects/ape-iron/.env").expanduser()
assert ENV_PATH.is_file()

load_dotenv(ENV_PATH, override=True)

OUTPUT = Path(os.environ["ABECEDARIUM"]).expanduser()
google_images = GoogleImages()

Ape

A First (Failed) Attempt

APE_PATH = OUTPUT/"Ape/"
keywords = dict(
    keywords="gorilla",
    limit=20,
    type="photo",
    output_directory=str(APE_PATH),
    chromedriver="/usr/bin/chromedriver",
)
paths = google_images.download(keywords)

Item no.: 1 --> Item name = pig
Evaluating...
Starting Download...


Unfortunately all 20 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0

Oops. It looks like the google-images-download project has been abandoned and no longer works. There are multiple bug reports for it. One mentions a fork that looks like it has more activity, but another bug report mentions a project called icrawler, which seems to be a more general web crawler that also says it has a Google Images downloader, so maybe I'll try that instead.

Once Again With icrawler

from icrawler.builtin import GoogleImageCrawler
storage = dict(root_dir=APE_PATH)
crawler = GoogleImageCrawler(storage=storage)

Besides the `keyword`, you can pass some other arguments. One notable option is a `filters` dictionary that helps specify the type of image.

| Key | Options |
|---------|---------|
| type | “photo”, “face”, “clipart”, “linedrawing”, “animated” |
| color | “color”, “blackandwhite”, “transparent”, “red”, “orange”, “yellow”, “green”, “teal”, “blue”, “purple”, “pink”, “white”, “gray”, “black”, “brown” |
| size | “large”, “medium”, “icon”; larger than a given size (e.g. “>640x480”); exactly a given size (e.g. “=1024x768”) |
| license | “noncommercial” (labeled for noncommercial reuse); “commercial” (labeled for reuse); “noncommercial,modify” (labeled for noncommercial reuse with modification); “commercial,modify” (labeled for reuse with modification) |
| date | “pastday”, “pastweek”; a tuple of dates, e.g. ((2016, 1, 1), (2017, 1, 1)) or ((2016, 1, 1), None) |

The icrawler documentation suggests using the date-ranges to get past the 1,000 (or however many) image limit.
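As a rough sketch of the date-range idea, the filters for a run of years could be built up front. The helper name `yearly_date_filters` is made up for illustration, and the only thing it relies on is the tuple-of-dates format from the table above; I haven't checked how many extra images this actually gets you.

```python
def yearly_date_filters(first_year, last_year, image_type="photo"):
    """Build one filters dict per calendar year (hypothetical helper).

    Each dict uses the ((year, month, day), (year, month, day)) form
    listed for the "date" key in the filters table.
    """
    filters = []
    for year in range(first_year, last_year + 1):
        filters.append(dict(
            type=image_type,
            date=((year, 1, 1), (year, 12, 31)),
        ))
    return filters


for filters in yearly_date_filters(2020, 2022):
    print(filters["date"])
    # With a crawler like the one set up earlier, each dict could be passed
    # to a separate crawl (untested here):
    # crawler.crawl(keyword="gorilla", filters=filters)
```

Each crawl would presumably restart the file numbering, so the later crawls would need some way to avoid overwriting the earlier files.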

The arguments to the `crawl` method are:

| Argument | Default | Description |
|----------|---------|-------------|
| keyword | | The search term for the images |
| filters | None | A dictionary of image filter options (see above) |
| offset | 0 | Where to start in the search (e.g. skip first 10) |
| max_num | 1000 | Maximum number of images to pull (the service will set some limit, usually 1,000) |
| min_size | None | Minimum pixel size as a tuple (x-pixels, y-pixels) |
| max_size | None | Maximum image size as a tuple (x-pixels, y-pixels) |
| language | None | If you're not using English, I assume |
| file_idx_offset | 0 | icrawler names the downloaded files as numbers; this will be the next number to use |

The code says that you can't set max_num to anything greater than 1,000. I think the offset is the amount to skip within that 1,000. Maybe. icrawler ignores the file names given in the URLs and names the files with numbers (e.g. 000001.jpg). If you make another crawl into the same folder and don't want to clobber an earlier set of files, you probably need to change the file_idx_offset (I think; I haven't tried it yet).
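One way to pick a value for file_idx_offset might be to look at what's already in the folder. This is only a sketch: `highest_file_index` is a hypothetical helper, and it assumes icrawler adds the offset to its own counter, so that the highest existing number is the right offset. I haven't verified that against the library.

```python
from pathlib import Path


def highest_file_index(directory):
    """Return the largest numeric file stem in directory (0 if there is none).

    Files whose names aren't plain numbers (e.g. notes.txt) are ignored.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    stems = (path.stem for path in directory.iterdir() if path.is_file())
    numbers = [int(stem) for stem in stems if stem.isdecimal()]
    return max(numbers, default=0)


# Possible use with a second crawl into the same folder (untested):
# crawler.crawl(keyword="gorilla",
#               file_idx_offset=highest_file_index(APE_PATH))
```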

Let's see what happens if we pull gorilla photos.

crawler.crawl(keyword="gorilla",
              filters=dict(
                  type="photo"
              ))
2023-05-29 16:20:21,749 - INFO - icrawler.crawler - start crawling...
2023-05-29 16:20:21,752 - INFO - icrawler.crawler - starting 1 feeder threads...
2023-05-29 16:20:21,756 - INFO - icrawler.crawler - starting 1 parser threads...
2023-05-29 16:20:21,759 - INFO - icrawler.crawler - starting 1 downloader threads...
2023-05-29 16:20:22,776 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=0&start=0&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:20:23,223 - ERROR - downloader - Response status code 404, file https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Gorille_des_plaines_de_l%27ouest_%C3%A0_l%27Espace_Zoologique.jpg
2023-05-29 16:20:23,871 - INFO - downloader - image #1  https://files.worldwildlife.org/wwfcmsprod/images/Mountain_Gorilla_Silverback_WW22557/story_full_width/36fcoamev0_Mountain_Gorilla_Silverback_WW22557.jpg
2023-05-29 16:20:24,774 - INFO - downloader - image #2  https://i.natgeofe.com/n/2d706180-e778-4110-9c15-1a7435b72114/mountain-gorillas-rwanda-02_3x4.jpg
2023-05-29 16:20:25,115 - INFO - downloader - image #3  https://www.nczoo.org/sites/default/files/2020-05/Gorilla-5.jpg
2023-05-29 16:20:25,465 - INFO - downloader - image #4  https://cdn.britannica.com/79/20279-050-ECDF21A7/mountain-gorilla-Virunga-National-Park-Democratic-Republic.jpg
2023-05-29 16:20:25,651 - ERROR - downloader - Response status code 401, file https://optimise2.assets-servd.host/maniacal-finch/production/animals/WL_Gorilla.jpg
2023-05-29 16:20:25,859 - INFO - downloader - image #5  https://media.npr.org/assets/img/2021/10/08/ap21280523738198-1--cc4c958352f2bd1a4a45c203f1f7807fc0193457-s1100-c50.jpg
2023-05-29 16:20:26,463 - INFO - downloader - image #6  https://media-cldnry.s-nbcnews.com/image/upload/rockcms/2022-01/220125-atlanta-zoo-ozzie-male-gorilla-obit-ac-849p-0cdfbe.jpg
2023-05-29 16:20:26,510 - INFO - downloader - image #7  https://files.worldwildlife.org/wwfcmsprod/images/HERO_Mountain_Gorilla_Silverback_WW22557/hero_small/17l7fosr27_Mountain_Gorilla_Silverback_WW22557.jpg
2023-05-29 16:20:27,116 - INFO - downloader - image #8  https://images.immediate.co.uk/production/volatile/sites/23/2014/07/GettyImages-157862378-7432ede.jpg
2023-05-29 16:20:27,314 - INFO - downloader - image #9  https://t4.ftcdn.net/jpg/05/65/55/03/360_F_565550348_QbLFP5eFniY4cGy230zuhGcz0pcG56YC.jpg
2023-05-29 16:20:30,876 - INFO - downloader - image #10 https://www.wwf.org.uk/sites/default/files/styles/hero_m/public/2019-08/mountain_gorilla_Rwanda.jpg
2023-05-29 16:20:31,178 - INFO - downloader - image #11 https://detroitzoo.org/wp-content/uploads/2015/08/Gorilla-Pende.jpg
2023-05-29 16:20:31,792 - INFO - downloader - image #12 https://cbsaustin.com/resources/media/eab965d5-d9d1-4739-90ad-7b029219bd7c-large16x9_ScreenShot20230407at12.47.20PM.png
2023-05-29 16:20:32,583 - INFO - downloader - image #13 https://newschannel20.com/resources/media/eab965d5-d9d1-4739-90ad-7b029219bd7c-large3x4_ScreenShot20230407at12.47.20PM.png
2023-05-29 16:20:33,167 - ERROR - downloader - Response status code 400, file https://th-thumbnailer.cdn-si-edu.com/J_03GIL6RkkkoUYxjX1bZ2uhArg\u003d/1072x720/filters:no_upscale()/https://tf-cmsv2-smithsonianmag-media.s3.amazonaws.com/filer/45/e8/45e81e74-8044-4a89-a679-6d0eaa70d6fc/caters_gorilla_punch_03.jpg
2023-05-29 16:20:33,883 - INFO - downloader - image #14 https://upload.wikimedia.org/wikipedia/commons/5/50/Male_gorilla_in_SF_zoo.jpg
2023-05-29 16:20:34,899 - INFO - downloader - image #15 https://i.natgeofe.com/n/e180e488-e4e1-472d-9d4f-b2138978903c/01-gorilla-stare_4x3.jpg
2023-05-29 16:20:35,491 - INFO - downloader - image #16 https://cdn.britannica.com/79/126579-050-17FD9CF2/lowland-gorilla.jpg
2023-05-29 16:20:36,169 - INFO - downloader - image #17 https://foxchattanooga.com/resources/media/44a66631-49af-4355-ad52-845f43e9411d-pittsburghzoobabygorilla2.png
2023-05-29 16:20:36,550 - INFO - downloader - image #18 https://gorillafund.org/wp-content/uploads/2022/09/Ubwitange.jpg
2023-05-29 16:20:36,964 - INFO - downloader - image #19 https://www.rainforest-alliance.org/wp-content/uploads/2021/06/baby-mountain-gorilla-1.jpg
2023-05-29 16:20:40,643 - INFO - downloader - image #20 https://ichef.bbci.co.uk/news/976/cpsprodpb/2533/production/_119132590_2.60637068.jpg
2023-05-29 16:20:40,986 - INFO - downloader - image #21 https://images.theconversation.com/files/246210/original/file-20181119-76137-1c4570v.jpg
2023-05-29 16:20:41,194 - INFO - downloader - image #22 https://m.media-amazon.com/images/I/81uzcxVyiaL._AC_UF894,1000_QL80_.jpg
2023-05-29 16:20:41,621 - INFO - downloader - image #23 https://zooatlanta.org/wp-content/uploads/gorilla_baby2023_230425_shalia_baby_ZA_2T0A6711.jpg
2023-05-29 16:20:41,878 - ERROR - downloader - Response status code 403, file https://gray-whsv-prod.cdn.arcpublishing.com/resizer/5SrqNB2HFQ5OH7-D5Da0plW2zH4\u003d/1200x675/smart/filters:quality(85)/cloudfront-us-east-1.images.arcpublishing.com/gray/TB36MKZ2ENACBLSWNP6INGD3HA.jpg
2023-05-29 16:20:42,635 - INFO - downloader - image #24 https://ichef.bbci.co.uk/news/976/cpsprodpb/C173/production/_119132594_2.60637064.jpg
2023-05-29 16:20:43,442 - INFO - downloader - image #25 https://www.cmzoo.org/wp-content/uploads/Roxy-1024x680.jpg
2023-05-29 16:20:43,747 - INFO - downloader - image #26 https://media.npr.org/assets/img/2022/01/26/western-lowland-gorilla-ozzie_zoo-atlanta-1b335bebb3e98145cd3216e20b668c617b370a58.jpg
2023-05-29 16:20:44,376 - INFO - downloader - image #27 https://www.columbuszoo.org/sites/default/files/styles/square_large/public/assets/animals/Gorilla%20%28Ktembe%29%2009956%20-%20Grahm%20S.%20Jones%2C%20Columbus%20Zoo%20and%20Aquarium.jpg
2023-05-29 16:20:47,970 - ERROR - downloader - Response status code 500, file https://npr.brightspotcdn.com/dims4/default/4ef2621/2147483647/strip/true/crop/1995x1047+0+117/resize/1200x630!/quality/90/?url\u003dhttp%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2F5e%2F11%2Fe4ca0a63450c8922417ecba1048e%2F20230308-pz2-8574-cpaulselvaggio-web.jpg
2023-05-29 16:20:48,432 - INFO - downloader - image #28 https://i.natgeofe.com/n/8fa82b8d-0110-48d4-9e01-2926d359c784/mountain-gorillas-rwanda-05_square.jpg
2023-05-29 16:20:48,823 - INFO - downloader - image #29 https://assets3.thrillist.com/v1/image/2782161/1020x765/scale;webp\u003dauto;jpeg_quality\u003d60.jpg
2023-05-29 16:20:49,148 - INFO - downloader - image #30 https://www.awf.org/sites/default/files/Website_SpeciesPage_MountainGorilla01_Hero.jpg
2023-05-29 16:20:49,639 - INFO - downloader - image #31 https://www.jsonline.com/gcdn/presto/2022/07/28/PMJS/9508fe0d-e41f-4c90-922a-f738e6254964-GorillaStare2_4k.jpg
2023-05-29 16:20:50,000 - ERROR - downloader - Response status code 403, file https://www.bostonglobe.com/resizer/UdvN4BJ3rNxChcQaeBBvfctPC6k\u003d/arc-anglerfish-arc2-prod-bostonglobe/public/XLLV4FEUWWSBUJGFQCPXXOEOSY.jpg
2023-05-29 16:20:50,418 - INFO - downloader - image #32 https://louisvillezoo.org/wp-content/uploads/2019/06/Kindi.jpg
2023-05-29 16:20:50,991 - INFO - downloader - image #33 https://sdzsafaripark.org/sites/default/files/styles/hero_with_nav_gradient/public/hero/hero-gorilla.jpg
2023-05-29 16:20:51,377 - INFO - downloader - image #34 https://media.11alive.com/assets/WXIA/images/dba319a2-1f5b-475c-abc2-2070f5695624/dba319a2-1f5b-475c-abc2-2070f5695624_1920x1080.jpg
2023-05-29 16:20:51,790 - INFO - downloader - image #35 https://cbsaustin.com/resources/media/8c872d35-142a-497b-8e40-c7318cf731ce-medium16x9_ScreenShot20230407at1.29.59PM.png
2023-05-29 16:20:54,454 - INFO - downloader - image #36 https://1.bp.blogspot.com/-BSQMRiOCqII/YElP3Sr0qUI/AAAAAAAATEo/E2SvBKQFnmEIjywIvw7oZQ4g5jgLChvkwCLcBGAsYHQ/s2048/Zuna%2B4x6%2Bhigh%2Bres.jpg
2023-05-29 16:20:54,771 - INFO - downloader - image #37 https://d21yqjvcoayho7.cloudfront.net/wp-content/uploads/2022/08/31/Mashika1.jpg
2023-05-29 16:20:55,344 - INFO - downloader - image #38 https://media.cnn.com/api/v1/images/stellar/prod/190422153943-gorillas-selfie-virunga-national-park.jpg
2023-05-29 16:20:55,950 - INFO - downloader - image #39 https://whc.vetmed.ucdavis.edu/sites/g/files/dgvnsk5261/files/styles/sf_landscape_16x9/public/images/article/1-infant%20mtn%20gorilla-Katwe%20Group-OCT%2019%20Bwindi-copyright%20Gorilla%20Doctors-compressed.jpg
2023-05-29 16:20:56,185 - INFO - downloader - image #40 https://media.npr.org/assets/img/2018/01/11/pasikanpr-188adc923bb86ae3a69237493c0aa76f70e75a4a-s1100-c50.jpg
2023-05-29 16:20:56,665 - INFO - downloader - image #41 https://images.newscientist.com/wp-content/uploads/2016/07/13163431/lead_whittier20150521001-5.jpg
2023-05-29 16:20:57,702 - INFO - downloader - image #42 https://good-nature-blog-uploads.s3.amazonaws.com/uploads/2018/01/silverback-gorilla-africa-Benjamin_Thomas.jpg
2023-05-29 16:20:58,455 - INFO - downloader - image #43 https://comozooconservatory.org/wp-content/uploads/2023/04/Gorilla-credit-Steve-Solmonson.jpg
2023-05-29 16:20:59,291 - INFO - downloader - image #44 https://nationalzoo.si.edu/sites/default/files/styles/768_scale/public/newsroom/20230527-valschultz-012-gorilla-infant.jpg
2023-05-29 16:20:59,632 - ERROR - downloader - Response status code 400, file https://cdn.theatlantic.com/thumbor/zN3P8Eg5R2KCRWXgbG3B9VUqxB0\u003d/243x0:3243x2250/1200x900/media/img/mt/2016/11/RTR3NO4M/original.jpg
2023-05-29 16:20:59,940 - INFO - downloader - image #45 https://gorillafund.org/wp-content/uploads/2022/05/Silverback-gorilla-Mafunzo-1024x768.jpg
2023-05-29 16:21:00,189 - ERROR - downloader - Response status code 400, file https://people.com/thmb/zJuWJJxNflt35JwcK9ifuDmcHV4\u003d/1500x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(749x0:751x2)/prince-charles-baby-mountain-gorilla-ubwuzuzanye-090222-231cd500ae6a48aab86d28b031c64d1a.jpg
2023-05-29 16:21:00,523 - ERROR - downloader - Response status code 500, file https://ca-times.brightspotcdn.com/dims4/default/dd8d9f5/2147483647/strip/true/crop/5500x2888+0+518/resize/1200x630!/quality/80/?url\u003dhttps%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F6c%2Fb9%2Fc874005242be8d9bab87c9b81f4a%2Fla-zoo-gorilla-94162.jpg
2023-05-29 16:21:00,884 - ERROR - downloader - Response status code 500, file https://ewscripps.brightspotcdn.com/dims4/default/4fe6984/2147483647/strip/true/crop/1230x646+0+84/resize/1200x630!/quality/90/?url\u003dhttp%3A%2F%2Fewscripps-brightspot.s3.amazonaws.com%2Fe2%2F03%2F781609d74315a69123afb278c6b4%2Fscreen-shot-2022-10-10-at-2.08.29%20PM.png
2023-05-29 16:21:01,469 - INFO - downloader - image #46 https://cloudfront-us-east-1.images.arcpublishing.com/advancelocal/ELEEDHM3YVEYZGD5KZNQZS2ZSY.jpg
2023-05-29 16:21:01,892 - INFO - downloader - image #47 https://www.first5la.org/wp-content/uploads/2020/08/baby-gorilla-tuena-3950983001_534.jpg
2023-05-29 16:21:05,250 - INFO - downloader - image #48 https://images.csmonitor.com/csm/2013/03/gorilla.jpg
2023-05-29 16:21:06,130 - INFO - downloader - image #49 https://cdn.hswstatic.com/gif/gorillas.jpg
2023-05-29 16:21:06,905 - INFO - downloader - image #50 https://files.worldwildlife.org/wwfcmsprod/images/mountain_gorilla_tom_deuitch/story_full_width/3tsywpr4hc___Tom_Deuitch.jpg
2023-05-29 16:21:07,576 - INFO - downloader - image #51 https://www.northeastohioparent.com/wp-content/uploads/2022/10/Kayembe-on-Freddy.jpg
2023-05-29 16:21:08,096 - INFO - downloader - image #52 https://media.apenheul.nl/aphl-cache/0/a/a/9/f/5/0aa9f5cdc1936be4e7937b33e6d625c2b46d4eff.jpg
2023-05-29 16:21:08,182 - ERROR - downloader - Response status code 400, file https://people.com/thmb/9KajO2h4En_kPrf_ZstX4qa0tyQ\u003d/1500x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(688x459:690x461)/kiki-the-gorilla-2000-d0450f98d6894f47892c9e8e088aec61.jpg
2023-05-29 16:21:08,512 - INFO - downloader - image #53 https://9b16f79ca967fd0708d1-2713572fef44aa49ec323e813b06d2d9.ssl.cf2.rackcdn.com/1140x_a10-7_cTC/pittsburgh-zoo-gorilla-1681372480.jpg
2023-05-29 16:21:09,233 - INFO - downloader - image #54 https://cdnph.upi.com/ph/st/th/7381444250950/2015/upi/e113f94e3f13ce7e262fd708f10b66e8/v1.5/8-things-you-didnt-know-about-baby-gorillas.jpg
2023-05-29 16:21:09,824 - INFO - downloader - image #55 https://assets-fortworthbusiness-com.s3-accelerate.amazonaws.com/2022/11/baby-gorillajpg-2-scaled.jpg
2023-05-29 16:21:10,084 - ERROR - downloader - Response status code 500, file https://npr.brightspotcdn.com/dims4/default/efcc501/2147483647/strip/true/crop/1200x739+0+124/resize/880x542!/quality/90/?url\u003dhttp%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2Flegacy%2Fsites%2Fkera%2Ffiles%2F201807%2Fbaby.jpg
2023-05-29 16:21:11,683 - INFO - downloader - image #56 https://images.thdstatic.com/productImages/31840f43-31cb-42c1-94d2-104e9c4ac138/svn/design-toscano-garden-statues-ne110088-44_600.jpg
2023-05-29 16:21:12,344 - INFO - downloader - image #57 https://d.newsweek.com/en/full/2012821/mambie-gorilla.jpg
2023-05-29 16:21:12,793 - INFO - downloader - image #58 https://npr.brightspotcdn.com/legacy/sites/wjct/files/201902/gandai_by_lynde_nunn__1_.jpg
2023-05-29 16:21:14,785 - INFO - downloader - image #59 https://cloudfront-us-east-1.images.arcpublishing.com/bostonglobe/GZCMY47RIA323XUKPTLSPEPAPY.jpg
2023-05-29 16:21:15,175 - INFO - downloader - image #60 https://cbsaustin.com/resources/media/d86faa81-97a5-4cb5-9dbf-6c50944fdf15-medium16x9_ScreenShot20230407at1.58.11PM.png
2023-05-29 16:21:20,475 - ERROR - downloader - Exception caught when downloading file https://dl0.creation.com/articles/p150/c15079/Gorilla.jpg, error: HTTPSConnectionPool(host='dl0.creation.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2023-05-29 16:21:20,660 - INFO - downloader - image #61 https://dl0.creation.com/articles/p150/c15079/Gorilla.jpg
2023-05-29 16:21:21,628 - INFO - downloader - image #62 https://i0.wp.com/sitn.hms.harvard.edu/wp-content/uploads/2021/04/Picture1.jpg
2023-05-29 16:21:22,070 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=1&start=100&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:22,321 - INFO - downloader - image #63 https://www.pbs.org/wnet/nature/files/2021/07/amy-reed-XB5E4D-Ipco-unsplash-scaled-e1627330380270.jpg
2023-05-29 16:21:22,687 - INFO - downloader - image #64 https://images.theconversation.com/files/267521/original/file-20190404-131437-psnnwu.jpg
2023-05-29 16:21:22,841 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=2&start=200&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:23,133 - INFO - downloader - image #65 https://assets.rebelmouse.io/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbWFnZSI6Imh0dHBzOi8vYXNzZXRzLnJibC5tcy8yNjM3OTM0Ny9vcmlnaW4uanBnIiwiZXhwaXJlc19hdCI6MTY5MjY4ODUxMH0.-R0AvhdriipRZSfYTpfq-CFuNPsCRhl6gd0Z9VNrA88/img.jpg
2023-05-29 16:21:23,522 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=3&start=300&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:23,593 - INFO - feeder - thread feeder-001 exit
2023-05-29 16:21:23,930 - INFO - downloader - image #66 https://i0.wp.com/eastafricanjunglesafaris.com/wp-content/uploads/2019/08/Header-1.jpg
2023-05-29 16:21:24,222 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=4&start=400&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:24,998 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=5&start=500&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:25,822 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=6&start=600&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:26,593 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=7&start=700&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:27,387 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=8&start=800&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:28,307 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=9&start=900&tbs=itp%3Aphoto&tbm=isch
2023-05-29 16:21:28,932 - INFO - downloader - downloader-001 is waiting for new download tasks
2023-05-29 16:21:30,411 - INFO - parser - no more page urls for thread parser-001 to parse
2023-05-29 16:21:30,412 - INFO - parser - thread parser-001 exit
2023-05-29 16:21:33,934 - INFO - downloader - no more download task for thread downloader-001
2023-05-29 16:21:33,935 - INFO - downloader - thread downloader-001 exit
2023-05-29 16:21:34,776 - INFO - icrawler.crawler - Crawling task done!
print(f"Downloaded: {sum(1 for path in APE_PATH.iterdir())}")
Downloaded: 66

So, even though the default maximum number of images is 1,000, it only pulled down 66. The logger's output shows a dozen failed downloads (HTTP 4xx and 5xx responses), but that doesn't come close to explaining the discrepancy. Oh, well, okay for a first try.
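To see how much of the gap the errors account for, the log could be tallied rather than eyeballed. This is a minimal sketch that assumes the log lines look like the ones printed above; `summarize_log` and the two patterns are my own names, and other kinds of errors (like the read timeout) are simply skipped.

```python
import re
from collections import Counter

# Patterns based on the log lines shown in the crawl output.
SAVED = re.compile(r"INFO - downloader - image #(\d+)")
STATUS = re.compile(r"ERROR - downloader - Response status code (\d{3})")


def summarize_log(lines):
    """Count saved images and failed downloads grouped by HTTP status code."""
    saved = 0
    failures = Counter()
    for line in lines:
        if SAVED.search(line):
            saved += 1
            continue
        match = STATUS.search(line)
        if match:
            failures[int(match.group(1))] += 1
    return saved, failures


sample = [
    "2023-05-29 16:20:25,115 - INFO - downloader - image #3  https://www.nczoo.org/sites/default/files/2020-05/Gorilla-5.jpg",
    "2023-05-29 16:20:25,651 - ERROR - downloader - Response status code 401, file https://optimise2.assets-servd.host/maniacal-finch/production/animals/WL_Gorilla.jpg",
    "2023-05-29 16:21:06,130 - INFO - downloader - image #49 https://cdn.hswstatic.com/gif/gorillas.jpg",
]
saved, failures = summarize_log(sample)
print(f"Saved: {saved}, failures by status: {dict(failures)}")
```

Feeding it the full log (say, captured to a file) would give the saved count and a per-status breakdown of the failures.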

Bison

Looking at the output of the Ape crawl, it looks like it only used one thread per object (i.e. one feeder, one parser, and one downloader). Maybe I'll try pushing that up to see what happens.

BISON_PATH = OUTPUT/"Bison"
storage["root_dir"] = BISON_PATH

crawler = GoogleImageCrawler(storage=storage,
                             feeder_threads=2,
                             parser_threads=2,
                             downloader_threads=4)

filters = dict(type="photo")

crawler.crawl(keyword="bison",
              filters=filters,
              )
print(f"Downloaded: {sum(1 for path in BISON_PATH.iterdir())}")
Downloaded: 97

I ran this twice. The first time I got 70 files, then for this run I disconnected the VPN, which seemed to increase the file count a little, but not a lot.

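Chimpanzee

The setup is the same for every crawl from here on, so I'll wrap it in a function.

def get_images(keyword):
    """Crawl Google Images for photos of the keyword into its own folder."""
    # the folder is named after the keyword (e.g. "cow" -> OUTPUT/Cow)
    path = OUTPUT/keyword.capitalize()
    crawler = GoogleImageCrawler(storage=dict(root_dir=path))
    crawler.crawl(keyword=keyword,
                  filters=dict(type="photo"))
    return

get_images("chimpanzee")

Cow

get_images("cow")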

Despite putting type=photo in there, the cow output had some drawings among the images. I guess the filter isn't perfect.

get_images("cow horned")

Dragon

get_images("dragon temple statue")

Elephant

get_images("elephant")

Frog

get_images("frog")

Goat (Horned)

get_images("goat horned")

Hippo

get_images("hippopotamus")

Iguana

get_images("iguana")

Jackrabbit

get_images("jackrabbit")
get_images("antelope american")

Krampus

get_images("krampus mask")

Lobster

get_images("lobster animal")

Minotaur

get_images("minotaur statue")

Namahage

get_images("namahage costume")

Octopus

get_images("octopus ocean")

Pig

get_images("pig european adult")

Quetzalcoatl

get_images("quetzalcoatl statue")

Rhinoceros

get_images("rhinoceros")

Samurai

get_images("samurai armor")

Tortoise

get_images("tortoise")

Unicorn (horse)

get_images("horse")

Vampire Fish

get_images("vampire fish -lamprey")

Wasp

get_images("wasp")

Chicken (Rooster)

get_images("rooster")
get_images("x-ray specs")

Yak

get_images("yak")
get_images("uncle sam hat")

Zebu

get_images("zebu")
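Since every section after the first few is just one or two calls to get_images, the whole alphabet could also be driven from a single mapping. This is only a sketch of that idea: `crawl_all` and `SEARCHES` are hypothetical names, only two letters are filled in, and the example passes `print` as a stand-in so nothing gets downloaded.

```python
def crawl_all(searches, fetch):
    """Call fetch once per keyword, in letter order; return the keywords used."""
    done = []
    for letter in sorted(searches):
        for keyword in searches[letter]:
            fetch(keyword)
            done.append(keyword)
    return done


# Letter -> search keywords (a partial, illustrative mapping).
SEARCHES = {
    "Y": ["yak", "uncle sam hat"],
    "Z": ["zebu"],
}

# print stands in for the real downloader; swap in get_images to crawl.
crawl_all(SEARCHES, fetch=print)
```

Keeping the keywords in one place would also make it easier to re-run a single letter after tweaking its search terms.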