DuckDuckGo Image Search

DuckDuckGo Image Search

This is a post for some notes on the ddg_images function from the duckduckgo-search library (link to GitHub) which downloads images using duckduckgo (and thus bing in a way).

The Parameters

Let's start by looking at the arguments that the function takes.

from duckduckgo_search import ddg_images

print(ddg_images.__doc__)
DuckDuckGo images search. Query params: https://duckduckgo.com/params

    Args:
        keywords (str): keywords for query.
        region (str, optional): wt-wt, us-en, uk-en, ru-ru, etc. Defaults to "wt-wt".
        safesearch (str, optional): On, Moderate, Off. Defaults to "Moderate".
        time (Optional[str], optional): Day, Week, Month, Year. Defaults to None.
        size (Optional[str], optional): Small, Medium, Large, Wallpaper. Defaults to None.
        color (Optional[str], optional): color, Monochrome, Red, Orange, Yellow, Green, Blue,
            Purple, Pink, Brown, Black, Gray, Teal, White. Defaults to None.
        type_image (Optional[str], optional): photo, clipart, gif, transparent, line.
            Defaults to None.
        layout (Optional[str], optional): Square, Tall, Wide. Defaults to None.
        license_image (Optional[str], optional): any (All Creative Commons), Public (PublicDomain),
            Share (Free to Share and Use), ShareCommercially (Free to Share and Use Commercially),
            Modify (Free to Modify, Share, and Use), ModifyCommercially (Free to Modify, Share, and
            Use Commercially). Defaults to None.
        max_results (int, optional): maximum number of results, max=1000. Defaults to 100.
        output (Optional[str], optional): csv, json, print. Defaults to None.
        download (bool, optional): if True, download and save images to 'keywords' folder.
            Defaults to False.

    Returns:
        Optional[List[dict]]: DuckDuckGo text search results.

Hopefully the arguments are pretty straight forward. I couldn't find an official help page for image searches but they seem pretty straight-forward.

The Output

Let's do a search for images of lop-eared rabbits and then take a look at what ddg_images returns.

output = ddg_images("rabbit lop", type_image="photo",
                    license_image="public", 
                    max_results=1)
pprint(output[0])
{'height': 848,
 'image': 'https://cdn.pixabay.com/photo/2015/12/22/20/27/animals-1104748_1280.jpg',
 'source': 'Bing',
 'thumbnail': 'https://tse1.mm.bing.net/th?id=OIP.lEqFD_LPmRGbc1GyIUoNygHaE6&pid=Api',
 'title': 'Flemish Lop Rabbit Very · Free photo on Pixabay',
 'url': 'https://pixabay.com/en/flemish-lop-rabbit-rabbit-1104748/',
 'width': 1280}

I thought that the output argument would change the format of the returned values but it instead seems to direct the values to a file (counting stdout as a file) and then return the same function output as before. Here's what "print" does'.

output = ddg_images("rabbit lop", type_image="photo",
                    license_image="public",
                    output="print",
                    max_results=1)

print(output[0])
1. {
    "title": "Flemish Lop Rabbit Very · Free photo on Pixabay",
    "image": "https://cdn.pixabay.com/photo/2015/12/22/20/27/animals-1104748_1280.jpg",
    "thumbnail": "https://tse1.mm.bing.net/th?id=OIP.lEqFD_LPmRGbc1GyIUoNygHaE6&pid=Api",
    "url": "https://pixabay.com/en/flemish-lop-rabbit-rabbit-1104748/",
    "height": 848,
    "width": 1280,
    "source": "Bing"
}
{'title': 'Flemish Lop Rabbit Very · Free photo on Pixabay', 'image': 'https://cdn.pixabay.com/photo/2015/12/22/20/27/animals-1104748_1280.jpg', 'thumbnail': 'https://tse1.mm.bing.net/th?id=OIP.lEqFD_LPmRGbc1GyIUoNygHaE6&pid=Api', 'url': 'https://pixabay.com/en/flemish-lop-rabbit-rabbit-1104748/', 'height': 848, 'width': 1280, 'source': 'Bing'}

Under The Hood

I was hoping that I'd be able to link to some official documentation to explain what's going on, but from what I can tell duckduckgo doesn't have an advertised API for its image search, and duckduckgo-search isn't too heavily documented, but if you look at the duckduckgo-search code it appears to be using a special token (vqd) that you can use to make queries to a special endpoint (https://www.duckduckgo.com/i.js) that you can use to get the search results from, which I couldn't find documented in the duckduckgo-search repository but is mentioned in this StackOverflow answer. I don't know how they figured it out, but using the vqd parameter and o=json makes it work pretty much like an API, although the code is also handing pagination so it seems almost like a hybrid web-scraping and API request.

The VQD

I'll show a request for "rabbit lop" using URL parameters. The actual code is using the requests library and sends them as a payload dictionary instead, but I thought it might be more familiar to see them as URL parameters (and it makes it so you can copy and paste the output into a browser to see what happens).

The first thing that ddg_images does is make a POST request to "https://duckduckgo.com/" using the search terms as data ({q="rabbit lop"}).

This returns some HTML, within which is some javascript that contains the "vqd" value that we need to pass in as an argument to make the proper search query.

<script type="text/JavaScript">
  function nrji() {
    nrj('/t.js?q=rabbit%20lop&l=us-en&s=0&dl=en&ct=US&ss_mkt=us&p_ent=&ex=-1&dfrsp=1')
    DDG.deep.initialize('/d.js?q=rabbit%20lop&l=us-en&s=0&dl=en&ct=US&ss_mkt=us&vqd=3-175223187338608511244788076450682226312-294342034741290420994864096532891255767&p_ent=&ex=-1&sp=1&dfrsp=1');;
  }
  DDG.ready(nrji, 1)
</script>

The "vqd" is one of the arguments to the DDG.deep.initialize function in the javascript/HTML that's returned from the request and the duckduckgo-search code extracts it using substring searching, giving us just the vqd.

vqd=3-175223187338608511244788076450682226312-294342034741290420994864096532891255767

Then a second request is made using the VQD token as one of the parameters along with whatever other parameters you want - in this case we're setting the region (l=wt-wt meaning "no region"), asking that the response be JSON (o=json), using the keywords "rabbit lop" (q="rabbit lop"), and turning safe-search off (p=-1). To make it work you also need to send the request to a special endpoint https://duckduckgo.com/i.js. So we end up with a request that looks like this:

https://duckduckgo.com/i.js?q=rabbit+lop&o=json&l=wt-wt&s=0&f=%2C%2C%2Ctype%3Aphoto%2C%2Clicense%3Apublic&p=-1&vqd=3-175223187338608511244788076450682226312-294342034741290420994864096532891255767

The payload to the response is a JSON blob that gets converted by the python code into a dictionary (if you paste the example url into a browser address bar you should be able to see the response).

Python Translation

The previous section was my attempt to explain to myself more or less how the code works, but I thought it might be easier to understand if we steal some of the code from duckduckgo-search and modify it to get a single output more or less the way ddg_images does it..

First we set up our requests session.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Referer": "https://duckduckgo.com/",
}

SESSION = requests.Session()
SESSION.headers.update(HEADERS)

Now we get the vqd by doing a POST and then searching the returned HTML for the string.

payload = dict(q="rabbit lop")

response = SESSION.post("https://duckduckgo.com", data=payload, timeout=10)

PREFIX = b"vqd='"

vqd_index_start = response.content.index(PREFIX) + len(PREFIX)
vqd_index_end = response.content.index(b"'", start=vqd_index_start)

vqd_bytes = response.content[vqd_index_start:vqd_index_end]

# convert it from bytes back to a string before using it as a payload value
vqd = vqd_bytes.decode()
print(vqd)
3-175223187338608511244788076450682226312-294342034741290420994864096532891255767

Now that we have the VQD (whatever it is) we can make our actual search request by building up the payload dictionary and sending the request to duckduckgo.

payload["o"] = "json"
payload["l"] = "wt-wt"
payload["s"] = 0
payload["f"] = ",,,type:photo,,license:public"
payload["p"] = -1
payload["vqd"] = vqd

response = SESSION.get("https://duckduckgo.com/i.js", params=payload)

The response contains multiple search results but let's unpack the first one as a demonstration.

page_data = response.json()["results"]

row = page_data[0]

output = {
    "title": row["title"],
    "image": row["image"],
    "thumbnail": row["thumbnail"],
    "url": row["url"],
    "height": row["height"],
    "width": row["width"],
    "source": row["source"],
}

SESSION.close()
pprint(output)
{'height': 848,
 'image': 'https://cdn.pixabay.com/photo/2015/12/22/20/27/animals-1104748_1280.jpg',
 'source': 'Bing',
 'thumbnail': 'https://tse1.mm.bing.net/th?id=OIP.lEqFD_LPmRGbc1GyIUoNygHaE6&pid=Api',
 'title': 'Flemish Lop Rabbit Very · Free photo on Pixabay',
 'url': 'https://pixabay.com/en/flemish-lop-rabbit-rabbit-1104748/',
 'width': 1280}

The ddg_images function is doing more than this, but for future reference, here's the basics of what's going on and how the author made the search work.