Pulling Images For the Abecedarium
Imports and Setup
Imports
# python
from pathlib import Path
import os
# pypi
from dotenv import load_dotenv
from google_images_download.google_images_download import googleimagesdownload as GoogleImages
Set Up
ENV_PATH = Path("~/projects/ape-iron/.env").expanduser()
assert ENV_PATH.is_file()
load_dotenv(ENV_PATH, override=True)
OUTPUT = Path(os.environ["ABECEDARIUM"]).expanduser()
google_images = GoogleImages()
Ape
A First (Failed) Attempt
APE_PATH = OUTPUT/"Ape/"
keywords = dict(
    keywords="gorilla",
    limit=20,
    type="photo",
    output_directory=str(APE_PATH),
    chromedriver="/usr/bin/chromedriver",
)
paths = google_images.download(keywords)
Item no.: 1 --> Item name = pig
Evaluating...
Starting Download...
Unfortunately all 20 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!
Errors: 0
Oops. It looks like the google-images-download project has been abandoned and no longer works. There are multiple bug reports for it. One mentions a fork that looks like it has more activity, but this bug report mentions another project called icrawler, which seems to be a more general web crawler that also says it has a Google Images downloader, so I'll try that instead.
Once Again With icrawler
from icrawler.builtin import GoogleImageCrawler
storage = dict(root_dir=APE_PATH)
crawler = GoogleImageCrawler(storage=storage)
Besides the keyword (keywords?) you can pass some other arguments. One notable option is a `filters` dictionary that helps specify the type of image.
Key | Options |
---|---|
type | “photo”, “face”, “clipart”, “linedrawing”, “animated” |
color | “color”, “blackandwhite”, “transparent”, “red”, “orange”, “yellow”, “green”, “teal”, “blue”, “purple”, “pink”, “white”, “gray”, “black”, “brown” |
size | “large”, “medium”, “icon”, larger than a given size (e.g. “>640x480”), or exactly a given size (e.g. “=1024x768”) |
license | “noncommercial” (labeled for noncommercial reuse), “commercial” (labeled for reuse), “noncommercial,modify” (labeled for noncommercial reuse with modification), “commercial,modify” (labeled for reuse with modification) |
date | “pastday”, “pastweek”, or a tuple of dates, e.g. ((2016, 1, 1), (2017, 1, 1)) or ((2016, 1, 1), None) |
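As a rough illustration of how the options in the table combine, here is a minimal sketch of a `filters` dictionary. The particular values are just picked from the table, and `crawl_with_filters` is a hypothetical helper of my own (not part of icrawler) that is defined but not called here.

```python
# A sketch of a filters dictionary built from the options in the table.
# The specific values chosen here are only for illustration.
filters = dict(
    type="photo",
    color="color",
    size="large",
    license="noncommercial",
    date=((2016, 1, 1), (2017, 1, 1)),
)


def crawl_with_filters(keyword, root_dir, filters):
    """Hypothetical helper: crawl Google Images using the given filters."""
    # imported here so the dictionary above can be inspected without icrawler
    from icrawler.builtin import GoogleImageCrawler

    crawler = GoogleImageCrawler(storage=dict(root_dir=str(root_dir)))
    crawler.crawl(keyword=keyword, filters=filters)
    return


for key, value in filters.items():
    print(f"{key}: {value}")
```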
The icrawler documentation suggests using the date-ranges to get past the 1,000 (or however many) image limit.
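I haven't tried the date-range trick, but a sketch of it might look something like this. Both `yearly_ranges` and `crawl_by_year` are hypothetical helpers, and passing `file_idx_offset="auto"` to continue the file numbering between crawls is an assumption based on my reading of the icrawler documentation, so treat it as untested.

```python
def yearly_ranges(first_year, last_year):
    """Build one (start-date, end-date) tuple per year for the date filter."""
    return [((year, 1, 1), (year + 1, 1, 1))
            for year in range(first_year, last_year + 1)]


def crawl_by_year(keyword, root_dir, first_year, last_year):
    """Hypothetical helper: run one crawl per year-long date range."""
    # imported here so yearly_ranges can be used without icrawler installed
    from icrawler.builtin import GoogleImageCrawler

    crawler = GoogleImageCrawler(storage=dict(root_dir=str(root_dir)))
    for date_range in yearly_ranges(first_year, last_year):
        crawler.crawl(
            keyword=keyword,
            filters=dict(type="photo", date=date_range),
            # assumption: "auto" continues numbering after the existing files
            file_idx_offset="auto",
        )
    return


print(yearly_ranges(2016, 2017))
```

Each crawl is still subject to the per-search limit, so the idea is only that several smaller searches might add up to more images than one big one.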
The arguments to the `crawl` method are:
Argument | Default | Description |
---|---|---|
keyword | | The search term for the images |
filters | None | A dictionary of image filter options (see above) |
offset | 0 | Where to start in the search (e.g. skip first 10) |
max_num | 1000 | Maximum number of images to pull (the service will set some limit, usually 1,000) |
min_size | None | Minimum pixel size as a tuple (x-pixels, y-pixels) |
max_size | None | Maximum image size as a tuple (x-pixels, y-pixels) |
language | None | If you're not using English, I assume |
file_idx_offset | 0 | icrawler names the downloaded files as numbers, this will be the next number to use |
The code says that you can't set `max_num` to anything greater than 1,000. I think the `offset` is the amount to skip within that 1,000. Maybe. icrawler ignores the names of the files given in the URLs and names them with numbers (e.g. `000001.jpg`). If you make another crawl and don't want to clobber an earlier set of files you probably need to change the `file_idx_offset` (I think. I haven't tried it yet).
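If the numbering does need to be bumped by hand, one way to find the value might be to look for the largest number already used in the folder. This is a minimal sketch with a hypothetical helper (`last_file_index`), and it assumes that icrawler adds the `file_idx_offset` to its own running count, so that passing the largest existing number would make the next file one higher.

```python
import tempfile
from pathlib import Path


def last_file_index(directory):
    """Return the largest number used as a file name (0 if there are none)."""
    indices = [int(path.stem) for path in Path(directory).iterdir()
               if path.is_file() and path.stem.isdecimal()]
    return max(indices, default=0)


# try it on a throwaway directory with two numbered files and one other file
with tempfile.TemporaryDirectory() as folder:
    for name in ("000001.jpg", "000002.png", "notes.txt"):
        (Path(folder) / name).touch()
    offset = last_file_index(folder)

print(offset)
```

The result could then be passed in as `file_idx_offset` on the next call to `crawl`.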
Let's see what happens if we pull gorilla photos.
crawler.crawl(keyword="gorilla",
              filters=dict(
                  type="photo"
              ))
2023-05-29 16:20:21,749 - INFO - icrawler.crawler - start crawling... 2023-05-29 16:20:21,752 - INFO - icrawler.crawler - starting 1 feeder threads... 2023-05-29 16:20:21,756 - INFO - icrawler.crawler - starting 1 parser threads... 2023-05-29 16:20:21,759 - INFO - icrawler.crawler - starting 1 downloader threads... 2023-05-29 16:20:22,776 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=0&start=0&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:20:23,223 - ERROR - downloader - Response status code 404, file https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Gorille_des_plaines_de_l%27ouest_%C3%A0_l%27Espace_Zoologique.jpg 2023-05-29 16:20:23,871 - INFO - downloader - image #1 https://files.worldwildlife.org/wwfcmsprod/images/Mountain_Gorilla_Silverback_WW22557/story_full_width/36fcoamev0_Mountain_Gorilla_Silverback_WW22557.jpg 2023-05-29 16:20:24,774 - INFO - downloader - image #2 https://i.natgeofe.com/n/2d706180-e778-4110-9c15-1a7435b72114/mountain-gorillas-rwanda-02_3x4.jpg 2023-05-29 16:20:25,115 - INFO - downloader - image #3 https://www.nczoo.org/sites/default/files/2020-05/Gorilla-5.jpg 2023-05-29 16:20:25,465 - INFO - downloader - image #4 https://cdn.britannica.com/79/20279-050-ECDF21A7/mountain-gorilla-Virunga-National-Park-Democratic-Republic.jpg 2023-05-29 16:20:25,651 - ERROR - downloader - Response status code 401, file https://optimise2.assets-servd.host/maniacal-finch/production/animals/WL_Gorilla.jpg 2023-05-29 16:20:25,859 - INFO - downloader - image #5 https://media.npr.org/assets/img/2021/10/08/ap21280523738198-1--cc4c958352f2bd1a4a45c203f1f7807fc0193457-s1100-c50.jpg 2023-05-29 16:20:26,463 - INFO - downloader - image #6 https://media-cldnry.s-nbcnews.com/image/upload/rockcms/2022-01/220125-atlanta-zoo-ozzie-male-gorilla-obit-ac-849p-0cdfbe.jpg 2023-05-29 16:20:26,510 - INFO - downloader - image #7 
https://files.worldwildlife.org/wwfcmsprod/images/HERO_Mountain_Gorilla_Silverback_WW22557/hero_small/17l7fosr27_Mountain_Gorilla_Silverback_WW22557.jpg 2023-05-29 16:20:27,116 - INFO - downloader - image #8 https://images.immediate.co.uk/production/volatile/sites/23/2014/07/GettyImages-157862378-7432ede.jpg 2023-05-29 16:20:27,314 - INFO - downloader - image #9 https://t4.ftcdn.net/jpg/05/65/55/03/360_F_565550348_QbLFP5eFniY4cGy230zuhGcz0pcG56YC.jpg 2023-05-29 16:20:30,876 - INFO - downloader - image #10 https://www.wwf.org.uk/sites/default/files/styles/hero_m/public/2019-08/mountain_gorilla_Rwanda.jpg 2023-05-29 16:20:31,178 - INFO - downloader - image #11 https://detroitzoo.org/wp-content/uploads/2015/08/Gorilla-Pende.jpg 2023-05-29 16:20:31,792 - INFO - downloader - image #12 https://cbsaustin.com/resources/media/eab965d5-d9d1-4739-90ad-7b029219bd7c-large16x9_ScreenShot20230407at12.47.20PM.png 2023-05-29 16:20:32,583 - INFO - downloader - image #13 https://newschannel20.com/resources/media/eab965d5-d9d1-4739-90ad-7b029219bd7c-large3x4_ScreenShot20230407at12.47.20PM.png 2023-05-29 16:20:33,167 - ERROR - downloader - Response status code 400, file https://th-thumbnailer.cdn-si-edu.com/J_03GIL6RkkkoUYxjX1bZ2uhArg\u003d/1072x720/filters:no_upscale()/https://tf-cmsv2-smithsonianmag-media.s3.amazonaws.com/filer/45/e8/45e81e74-8044-4a89-a679-6d0eaa70d6fc/caters_gorilla_punch_03.jpg 2023-05-29 16:20:33,883 - INFO - downloader - image #14 https://upload.wikimedia.org/wikipedia/commons/5/50/Male_gorilla_in_SF_zoo.jpg 2023-05-29 16:20:34,899 - INFO - downloader - image #15 https://i.natgeofe.com/n/e180e488-e4e1-472d-9d4f-b2138978903c/01-gorilla-stare_4x3.jpg 2023-05-29 16:20:35,491 - INFO - downloader - image #16 https://cdn.britannica.com/79/126579-050-17FD9CF2/lowland-gorilla.jpg 2023-05-29 16:20:36,169 - INFO - downloader - image #17 https://foxchattanooga.com/resources/media/44a66631-49af-4355-ad52-845f43e9411d-pittsburghzoobabygorilla2.png 2023-05-29 16:20:36,550 - 
INFO - downloader - image #18 https://gorillafund.org/wp-content/uploads/2022/09/Ubwitange.jpg 2023-05-29 16:20:36,964 - INFO - downloader - image #19 https://www.rainforest-alliance.org/wp-content/uploads/2021/06/baby-mountain-gorilla-1.jpg 2023-05-29 16:20:40,643 - INFO - downloader - image #20 https://ichef.bbci.co.uk/news/976/cpsprodpb/2533/production/_119132590_2.60637068.jpg 2023-05-29 16:20:40,986 - INFO - downloader - image #21 https://images.theconversation.com/files/246210/original/file-20181119-76137-1c4570v.jpg 2023-05-29 16:20:41,194 - INFO - downloader - image #22 https://m.media-amazon.com/images/I/81uzcxVyiaL._AC_UF894,1000_QL80_.jpg 2023-05-29 16:20:41,621 - INFO - downloader - image #23 https://zooatlanta.org/wp-content/uploads/gorilla_baby2023_230425_shalia_baby_ZA_2T0A6711.jpg 2023-05-29 16:20:41,878 - ERROR - downloader - Response status code 403, file https://gray-whsv-prod.cdn.arcpublishing.com/resizer/5SrqNB2HFQ5OH7-D5Da0plW2zH4\u003d/1200x675/smart/filters:quality(85)/cloudfront-us-east-1.images.arcpublishing.com/gray/TB36MKZ2ENACBLSWNP6INGD3HA.jpg 2023-05-29 16:20:42,635 - INFO - downloader - image #24 https://ichef.bbci.co.uk/news/976/cpsprodpb/C173/production/_119132594_2.60637064.jpg 2023-05-29 16:20:43,442 - INFO - downloader - image #25 https://www.cmzoo.org/wp-content/uploads/Roxy-1024x680.jpg 2023-05-29 16:20:43,747 - INFO - downloader - image #26 https://media.npr.org/assets/img/2022/01/26/western-lowland-gorilla-ozzie_zoo-atlanta-1b335bebb3e98145cd3216e20b668c617b370a58.jpg 2023-05-29 16:20:44,376 - INFO - downloader - image #27 https://www.columbuszoo.org/sites/default/files/styles/square_large/public/assets/animals/Gorilla%20%28Ktembe%29%2009956%20-%20Grahm%20S.%20Jones%2C%20Columbus%20Zoo%20and%20Aquarium.jpg 2023-05-29 16:20:47,970 - ERROR - downloader - Response status code 500, file 
https://npr.brightspotcdn.com/dims4/default/4ef2621/2147483647/strip/true/crop/1995x1047+0+117/resize/1200x630!/quality/90/?url\u003dhttp%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2F5e%2F11%2Fe4ca0a63450c8922417ecba1048e%2F20230308-pz2-8574-cpaulselvaggio-web.jpg 2023-05-29 16:20:48,432 - INFO - downloader - image #28 https://i.natgeofe.com/n/8fa82b8d-0110-48d4-9e01-2926d359c784/mountain-gorillas-rwanda-05_square.jpg 2023-05-29 16:20:48,823 - INFO - downloader - image #29 https://assets3.thrillist.com/v1/image/2782161/1020x765/scale;webp\u003dauto;jpeg_quality\u003d60.jpg 2023-05-29 16:20:49,148 - INFO - downloader - image #30 https://www.awf.org/sites/default/files/Website_SpeciesPage_MountainGorilla01_Hero.jpg 2023-05-29 16:20:49,639 - INFO - downloader - image #31 https://www.jsonline.com/gcdn/presto/2022/07/28/PMJS/9508fe0d-e41f-4c90-922a-f738e6254964-GorillaStare2_4k.jpg 2023-05-29 16:20:50,000 - ERROR - downloader - Response status code 403, file https://www.bostonglobe.com/resizer/UdvN4BJ3rNxChcQaeBBvfctPC6k\u003d/arc-anglerfish-arc2-prod-bostonglobe/public/XLLV4FEUWWSBUJGFQCPXXOEOSY.jpg 2023-05-29 16:20:50,418 - INFO - downloader - image #32 https://louisvillezoo.org/wp-content/uploads/2019/06/Kindi.jpg 2023-05-29 16:20:50,991 - INFO - downloader - image #33 https://sdzsafaripark.org/sites/default/files/styles/hero_with_nav_gradient/public/hero/hero-gorilla.jpg 2023-05-29 16:20:51,377 - INFO - downloader - image #34 https://media.11alive.com/assets/WXIA/images/dba319a2-1f5b-475c-abc2-2070f5695624/dba319a2-1f5b-475c-abc2-2070f5695624_1920x1080.jpg 2023-05-29 16:20:51,790 - INFO - downloader - image #35 https://cbsaustin.com/resources/media/8c872d35-142a-497b-8e40-c7318cf731ce-medium16x9_ScreenShot20230407at1.29.59PM.png 2023-05-29 16:20:54,454 - INFO - downloader - image #36 https://1.bp.blogspot.com/-BSQMRiOCqII/YElP3Sr0qUI/AAAAAAAATEo/E2SvBKQFnmEIjywIvw7oZQ4g5jgLChvkwCLcBGAsYHQ/s2048/Zuna%2B4x6%2Bhigh%2Bres.jpg 2023-05-29 16:20:54,771 - INFO - downloader - 
image #37 https://d21yqjvcoayho7.cloudfront.net/wp-content/uploads/2022/08/31/Mashika1.jpg 2023-05-29 16:20:55,344 - INFO - downloader - image #38 https://media.cnn.com/api/v1/images/stellar/prod/190422153943-gorillas-selfie-virunga-national-park.jpg 2023-05-29 16:20:55,950 - INFO - downloader - image #39 https://whc.vetmed.ucdavis.edu/sites/g/files/dgvnsk5261/files/styles/sf_landscape_16x9/public/images/article/1-infant%20mtn%20gorilla-Katwe%20Group-OCT%2019%20Bwindi-copyright%20Gorilla%20Doctors-compressed.jpg 2023-05-29 16:20:56,185 - INFO - downloader - image #40 https://media.npr.org/assets/img/2018/01/11/pasikanpr-188adc923bb86ae3a69237493c0aa76f70e75a4a-s1100-c50.jpg 2023-05-29 16:20:56,665 - INFO - downloader - image #41 https://images.newscientist.com/wp-content/uploads/2016/07/13163431/lead_whittier20150521001-5.jpg 2023-05-29 16:20:57,702 - INFO - downloader - image #42 https://good-nature-blog-uploads.s3.amazonaws.com/uploads/2018/01/silverback-gorilla-africa-Benjamin_Thomas.jpg 2023-05-29 16:20:58,455 - INFO - downloader - image #43 https://comozooconservatory.org/wp-content/uploads/2023/04/Gorilla-credit-Steve-Solmonson.jpg 2023-05-29 16:20:59,291 - INFO - downloader - image #44 https://nationalzoo.si.edu/sites/default/files/styles/768_scale/public/newsroom/20230527-valschultz-012-gorilla-infant.jpg 2023-05-29 16:20:59,632 - ERROR - downloader - Response status code 400, file https://cdn.theatlantic.com/thumbor/zN3P8Eg5R2KCRWXgbG3B9VUqxB0\u003d/243x0:3243x2250/1200x900/media/img/mt/2016/11/RTR3NO4M/original.jpg 2023-05-29 16:20:59,940 - INFO - downloader - image #45 https://gorillafund.org/wp-content/uploads/2022/05/Silverback-gorilla-Mafunzo-1024x768.jpg 2023-05-29 16:21:00,189 - ERROR - downloader - Response status code 400, file 
https://people.com/thmb/zJuWJJxNflt35JwcK9ifuDmcHV4\u003d/1500x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(749x0:751x2)/prince-charles-baby-mountain-gorilla-ubwuzuzanye-090222-231cd500ae6a48aab86d28b031c64d1a.jpg 2023-05-29 16:21:00,523 - ERROR - downloader - Response status code 500, file https://ca-times.brightspotcdn.com/dims4/default/dd8d9f5/2147483647/strip/true/crop/5500x2888+0+518/resize/1200x630!/quality/80/?url\u003dhttps%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F6c%2Fb9%2Fc874005242be8d9bab87c9b81f4a%2Fla-zoo-gorilla-94162.jpg 2023-05-29 16:21:00,884 - ERROR - downloader - Response status code 500, file https://ewscripps.brightspotcdn.com/dims4/default/4fe6984/2147483647/strip/true/crop/1230x646+0+84/resize/1200x630!/quality/90/?url\u003dhttp%3A%2F%2Fewscripps-brightspot.s3.amazonaws.com%2Fe2%2F03%2F781609d74315a69123afb278c6b4%2Fscreen-shot-2022-10-10-at-2.08.29%20PM.png 2023-05-29 16:21:01,469 - INFO - downloader - image #46 https://cloudfront-us-east-1.images.arcpublishing.com/advancelocal/ELEEDHM3YVEYZGD5KZNQZS2ZSY.jpg 2023-05-29 16:21:01,892 - INFO - downloader - image #47 https://www.first5la.org/wp-content/uploads/2020/08/baby-gorilla-tuena-3950983001_534.jpg 2023-05-29 16:21:05,250 - INFO - downloader - image #48 https://images.csmonitor.com/csm/2013/03/gorilla.jpg 2023-05-29 16:21:06,130 - INFO - downloader - image #49 https://cdn.hswstatic.com/gif/gorillas.jpg 2023-05-29 16:21:06,905 - INFO - downloader - image #50 https://files.worldwildlife.org/wwfcmsprod/images/mountain_gorilla_tom_deuitch/story_full_width/3tsywpr4hc___Tom_Deuitch.jpg 2023-05-29 16:21:07,576 - INFO - downloader - image #51 https://www.northeastohioparent.com/wp-content/uploads/2022/10/Kayembe-on-Freddy.jpg 2023-05-29 16:21:08,096 - INFO - downloader - image #52 https://media.apenheul.nl/aphl-cache/0/a/a/9/f/5/0aa9f5cdc1936be4e7937b33e6d625c2b46d4eff.jpg 2023-05-29 16:21:08,182 - ERROR - downloader - Response status code 400, file 
https://people.com/thmb/9KajO2h4En_kPrf_ZstX4qa0tyQ\u003d/1500x0/filters:no_upscale():max_bytes(150000):strip_icc():focal(688x459:690x461)/kiki-the-gorilla-2000-d0450f98d6894f47892c9e8e088aec61.jpg 2023-05-29 16:21:08,512 - INFO - downloader - image #53 https://9b16f79ca967fd0708d1-2713572fef44aa49ec323e813b06d2d9.ssl.cf2.rackcdn.com/1140x_a10-7_cTC/pittsburgh-zoo-gorilla-1681372480.jpg 2023-05-29 16:21:09,233 - INFO - downloader - image #54 https://cdnph.upi.com/ph/st/th/7381444250950/2015/upi/e113f94e3f13ce7e262fd708f10b66e8/v1.5/8-things-you-didnt-know-about-baby-gorillas.jpg 2023-05-29 16:21:09,824 - INFO - downloader - image #55 https://assets-fortworthbusiness-com.s3-accelerate.amazonaws.com/2022/11/baby-gorillajpg-2-scaled.jpg 2023-05-29 16:21:10,084 - ERROR - downloader - Response status code 500, file https://npr.brightspotcdn.com/dims4/default/efcc501/2147483647/strip/true/crop/1200x739+0+124/resize/880x542!/quality/90/?url\u003dhttp%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2Flegacy%2Fsites%2Fkera%2Ffiles%2F201807%2Fbaby.jpg 2023-05-29 16:21:11,683 - INFO - downloader - image #56 https://images.thdstatic.com/productImages/31840f43-31cb-42c1-94d2-104e9c4ac138/svn/design-toscano-garden-statues-ne110088-44_600.jpg 2023-05-29 16:21:12,344 - INFO - downloader - image #57 https://d.newsweek.com/en/full/2012821/mambie-gorilla.jpg 2023-05-29 16:21:12,793 - INFO - downloader - image #58 https://npr.brightspotcdn.com/legacy/sites/wjct/files/201902/gandai_by_lynde_nunn__1_.jpg 2023-05-29 16:21:14,785 - INFO - downloader - image #59 https://cloudfront-us-east-1.images.arcpublishing.com/bostonglobe/GZCMY47RIA323XUKPTLSPEPAPY.jpg 2023-05-29 16:21:15,175 - INFO - downloader - image #60 https://cbsaustin.com/resources/media/d86faa81-97a5-4cb5-9dbf-6c50944fdf15-medium16x9_ScreenShot20230407at1.58.11PM.png 2023-05-29 16:21:20,475 - ERROR - downloader - Exception caught when downloading file https://dl0.creation.com/articles/p150/c15079/Gorilla.jpg, error: 
HTTPSConnectionPool(host='dl0.creation.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2 2023-05-29 16:21:20,660 - INFO - downloader - image #61 https://dl0.creation.com/articles/p150/c15079/Gorilla.jpg 2023-05-29 16:21:21,628 - INFO - downloader - image #62 https://i0.wp.com/sitn.hms.harvard.edu/wp-content/uploads/2021/04/Picture1.jpg 2023-05-29 16:21:22,070 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=1&start=100&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:22,321 - INFO - downloader - image #63 https://www.pbs.org/wnet/nature/files/2021/07/amy-reed-XB5E4D-Ipco-unsplash-scaled-e1627330380270.jpg 2023-05-29 16:21:22,687 - INFO - downloader - image #64 https://images.theconversation.com/files/267521/original/file-20190404-131437-psnnwu.jpg 2023-05-29 16:21:22,841 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=2&start=200&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:23,133 - INFO - downloader - image #65 https://assets.rebelmouse.io/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbWFnZSI6Imh0dHBzOi8vYXNzZXRzLnJibC5tcy8yNjM3OTM0Ny9vcmlnaW4uanBnIiwiZXhwaXJlc19hdCI6MTY5MjY4ODUxMH0.-R0AvhdriipRZSfYTpfq-CFuNPsCRhl6gd0Z9VNrA88/img.jpg 2023-05-29 16:21:23,522 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=3&start=300&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:23,593 - INFO - feeder - thread feeder-001 exit 2023-05-29 16:21:23,930 - INFO - downloader - image #66 https://i0.wp.com/eastafricanjunglesafaris.com/wp-content/uploads/2019/08/Header-1.jpg 2023-05-29 16:21:24,222 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=4&start=400&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:24,998 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=5&start=500&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:25,822 - INFO - parser - parsing result page 
https://www.google.com/search?q=gorilla&ijn=6&start=600&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:26,593 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=7&start=700&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:27,387 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=8&start=800&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:28,307 - INFO - parser - parsing result page https://www.google.com/search?q=gorilla&ijn=9&start=900&tbs=itp%3Aphoto&tbm=isch 2023-05-29 16:21:28,932 - INFO - downloader - downloader-001 is waiting for new download tasks 2023-05-29 16:21:30,411 - INFO - parser - no more page urls for thread parser-001 to parse 2023-05-29 16:21:30,412 - INFO - parser - thread parser-001 exit 2023-05-29 16:21:33,934 - INFO - downloader - no more download task for thread downloader-001 2023-05-29 16:21:33,935 - INFO - downloader - thread downloader-001 exit 2023-05-29 16:21:34,776 - INFO - icrawler.crawler - Crawling task done!
print(f"Downloaded: {sum(1 for path in APE_PATH.iterdir())}")
Downloaded: 66
So, even though the default maximum number of images is 1,000, it actually only pulled down 66. There are some error codes in the logger's output, but that seems like a big discrepancy. Oh, well, okay for a first try.
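One way to check how much of the gap the errors explain would be to tally the log itself. This is a rough sketch: `summarize` is a hypothetical helper, the sample is just three entries copied from the log above, and the patterns only match the "image #" and "Response status code" messages (so things like the timeout exception wouldn't be counted).

```python
import re
from collections import Counter

# three entries copied from the gorilla crawl's log
SAMPLE_LOG = (
    "2023-05-29 16:20:25,115 - INFO - downloader - image #3 "
    "https://www.nczoo.org/sites/default/files/2020-05/Gorilla-5.jpg "
    "2023-05-29 16:20:25,651 - ERROR - downloader - Response status code 401, file "
    "https://optimise2.assets-servd.host/maniacal-finch/production/animals/WL_Gorilla.jpg "
    "2023-05-29 16:21:06,130 - INFO - downloader - image #49 "
    "https://cdn.hswstatic.com/gif/gorillas.jpg"
)

IMAGE = re.compile(r"INFO - downloader - image #\d+")
STATUS = re.compile(r"ERROR - downloader - Response status code (\d+)")


def summarize(log_text):
    """Count saved images and failed downloads (by HTTP status code)."""
    saved = len(IMAGE.findall(log_text))
    failures = Counter(STATUS.findall(log_text))
    return saved, failures


saved, failures = summarize(SAMPLE_LOG)
print(f"Saved: {saved}, Failed: {sum(failures.values())} {dict(failures)}")
```

Running it over the whole captured log would give the real totals to compare against the count of files in the folder.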
Bison
Looking at the output of the Ape crawl, it looks like it only used one thread per object (i.e. one `feeder`, one `parser`, and one `downloader`). Maybe I'll try and push that up and see what happens.
BISON_PATH = OUTPUT/"Bison"
storage["root_dir"] = BISON_PATH
crawler = GoogleImageCrawler(storage=storage,
                             feeder_threads=2,
                             parser_threads=2,
                             downloader_threads=4)
filters = dict(type="photo")
crawler.crawl(keyword="bison",
              filters=filters)
print(f"Downloaded: {sum(1 for path in BISON_PATH.iterdir())}")
Downloaded: 97
I ran this twice. The first time I got 70 files, then for this run I disconnected the VPN, which seemed to increase the file count a little, but not a lot.
Chimpanzee
def get_images(keyword):
    """Download photos matching the keyword into a folder named after it."""
    path = OUTPUT/keyword.capitalize()
    crawler = GoogleImageCrawler(storage=dict(root_dir=path))
    crawler.crawl(keyword=keyword,
                  filters=dict(type="photo"))
    return
get_images("chimpanzee")
Cow
get_images("cow")
Despite putting `type="photo"` in there, the cow output had some drawings among the images. I guess the filter isn't perfect.
get_images("cow horned")
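Since `get_images` builds the folder name with `keyword.capitalize()`, each distinct search string ends up in its own folder. Here's a small sketch of what the names come out as, using a placeholder path in place of the real `OUTPUT` (the `"abecedarium"` folder name is made up for the example).

```python
from pathlib import PurePosixPath

# placeholder standing in for the OUTPUT path loaded from the .env file
OUTPUT = PurePosixPath("abecedarium")

keywords = ("cow", "cow horned", "dragon temple statue")
# capitalize only upper-cases the first letter, so spaces stay in the name
folders = [OUTPUT / keyword.capitalize() for keyword in keywords]

for keyword, folder in zip(keywords, folders):
    print(f"{keyword!r} -> {folder}")
```

So the "cow" and "cow horned" images are kept apart rather than mixed into one "Cow" folder.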
Dragon
get_images("dragon temple statue")
Elephant
get_images("elephant")
Frog
get_images("frog")
Goat (Horned)
get_images("goat horned")
Hippo
get_images("hippopotamus")
Iguana
get_images("iguana")
Jackrabbit
get_images("jackrabbit")
get_images("antelope american")
Krampus
get_images("krampus mask")
Lobster
get_images("lobster animal")
Minotaur
get_images("minotaur statue")
Namahage
get_images("namahage costume")
Octopus
get_images("octopus ocean")
Pig
get_images("pig european adult")
Quetzalcoatl
get_images("quetzalcoatl statue")
Rhinoceros
get_images("rhinoceros")
Samurai
get_images("samurai armor")
Tortoise
get_images("tortoise")
Unicorn (horse)
get_images("horse")
Vampire Fish
get_images("vampire fish -lamprey")
Wasp
get_images("wasp")
Chicken (Rooster)
get_images("rooster")
get_images("x-ray specs")
Yak
get_images("yak")
get_images("uncle sam hat")
Zebu
get_images("zebu")
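To see how each letter did, the file-counting one-liner used for the Ape and Bison folders could be wrapped up to cover every folder at once. This is a minimal sketch with a hypothetical helper (`count_downloads`), demonstrated on a throwaway directory standing in for `OUTPUT` with made-up file counts.

```python
import tempfile
from pathlib import Path


def count_downloads(output):
    """Map each sub-directory's name to the number of files in it."""
    return {
        folder.name: sum(1 for path in folder.iterdir() if path.is_file())
        for folder in sorted(Path(output).iterdir())
        if folder.is_dir()
    }


# demonstrate with a throwaway directory standing in for OUTPUT
with tempfile.TemporaryDirectory() as output:
    for name, total in (("Ape", 3), ("Bison", 2)):
        folder = Path(output) / name
        folder.mkdir()
        for index in range(1, total + 1):
            (folder / f"{index:06d}.jpg").touch()
    counts = count_downloads(output)

for name, total in counts.items():
    print(f"{name}: {total}")
```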