Web Scraper | Download Instagram Images using Tag(s)

Introduction

Web scraping has become an important part of data collection and analysis in today's data-driven world. Instagram is one of the most popular social media platforms, and there are many use cases where one may need to scrape Instagram data, particularly images. In this blog, we will introduce PYNSTA-SCRAPER, a Selenium-based web scraper that can fetch/download Instagram images within a few seconds.

Pynsta Scraper

Pynsta Scraper is a web scraper built with Selenium, a popular browser-automation tool. The scraper fetches and downloads Instagram images based on tags: the user can search for images under tags like memes, cats, dogs, morning, quotes, and more. It can be used to assemble training data for machine learning/deep learning models, for tag-based scraping of recent Instagram images, for Selenium automation practice, and in data science work generally. To use the scraper, you need the following:

  1. Selenium (pip install selenium)

  2. Requests (pip install requests)

  3. Shutil (ships with the Python standard library)

  4. Time (ships with the Python standard library)

  5. Gecko Webdriver (geckodriver, the Firefox driver binary, not a Python package)

Source code - Here

Code Walkthrough

import time
import requests
import shutil

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

In this section, we import the modules we need: time, requests, and shutil from the standard library, along with webdriver, By, and Service from Selenium.

tags = input("Enter the tags: ")
tag_list = tags.split(" ")

all_urls = []
number = 0

serv = Service("PATH_TO_GECKO_WEBDRIVER")

driver = webdriver.Firefox(service=serv)

In this section, we prompt the user to enter the tags they want to search for on Instagram, and split the space-separated input into a list called tag_list. We then create an empty list called all_urls to store the URLs of all the images we find, and set the number variable to 0; we will use it to ensure that each downloaded image gets a unique file name.
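The role of number can be illustrated with a small, self-contained sketch; unique_name is a hypothetical helper written for this example, not part of the scraper itself:

```python
def unique_name(index, offset):
    # Combine the per-tag index with the running offset so file
    # names never collide across different tags.
    return "Image-" + str(index + offset) + ".jpg"

number = 0
first_tag_names = [unique_name(i, number) for i in range(3)]   # 3 images for tag 1
number += 3
second_tag_names = [unique_name(i, number) for i in range(2)]  # 2 images for tag 2

print(first_tag_names)   # ['Image-0.jpg', 'Image-1.jpg', 'Image-2.jpg']
print(second_tag_names)  # ['Image-3.jpg', 'Image-4.jpg']
```

Because the offset advances by the number of images already downloaded, the second tag's files pick up where the first tag's left off.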

We then create a Service object that points to the path of the Gecko web driver. This is necessary because we will be using Firefox as our browser. We then create a webdriver object using the Firefox browser and the Service object we created earlier.

for tag in tag_list:

    print("Searching for tag #" + tag)

    driver.get("https://www.instagram.com/explore/tags/" + tag)
    time.sleep(10)

    images = driver.find_elements(By.CLASS_NAME, "_aagt")

    print("Total Results found for #" + tag + " : " + str(len(images)))

    for image in images:
        url = image.get_attribute('src')
        all_urls.append(str(url))

In this section, we iterate through each tag in the tag_list. For each tag, we print a message indicating that we are searching for that tag. We then navigate to the Instagram tags page for that tag and wait for 10 seconds to allow the page to load fully.

We then use the find_elements method of the driver to collect all the image elements on the page, using By.CLASS_NAME to match elements with the class name "_aagt". This class name comes from Instagram's HTML structure at the time of writing; Instagram's class names are machine-generated and change frequently, so expect to update this selector when the markup changes. We then print how many images were found for the tag, and loop over them, extracting each image's URL with the get_attribute method.
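The URL we navigate to is just the explore/tags path with the tag appended. A small hypothetical helper (tag_url, written for this example) makes the construction explicit and also percent-encodes the tag, which matters for non-ASCII hashtags:

```python
from urllib.parse import quote

BASE = "https://www.instagram.com/explore/tags/"

def tag_url(tag):
    # Strip a leading '#' if the user typed one, then percent-encode
    # the rest so non-ASCII hashtags still form a valid URL.
    return BASE + quote(tag.lstrip("#"))

print(tag_url("cats"))    # https://www.instagram.com/explore/tags/cats
print(tag_url("#memes"))  # https://www.instagram.com/explore/tags/memes
```

The scraper's plain string concatenation works for ASCII tags; encoding is only needed for hashtags containing spaces or non-Latin characters.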

    try:
        for i in range(len(images)):
            file_name = "Image-" + str(i + number) + ".jpg"
            res = requests.get(all_urls[i + number], stream=True)

            if res.status_code == 200:
                with open(file_name, "wb") as f:
                    shutil.copyfileobj(res.raw, f)
            else:
                print("Download failed for " + file_name + ".")
    except Exception as e:
        print("Oops! Something went wrong: " + str(e))

    number = number + len(images)

In this section, we use a try block to handle any errors that may occur during the image download process. We loop over the images found for the current tag; for each index i, the sum i + number gives a globally unique index into the all_urls list (looping over all of all_urls here would run past the end of the list once number is non-zero). We build a unique file name for each image by appending that index to the string "Image-" and the file extension ".jpg". We then use the requests library to send a GET request to the image URL with the stream parameter set to True. This lets us stream the image data instead of loading the entire image into memory at once.

We then check the status code of the response object to make sure that the request was successful. If the status code is 200 (which indicates a successful request), we open a new file with the unique file name we created earlier and use the shutil library to copy the image data from the response object to the file. If the status code is not 200, we print a message indicating that the download failed for that image.

If an exception is raised during the download process (for example, if a URL is invalid or an image is no longer available), we print a message indicating that something went wrong. Finally, we add the number of images found for the current tag to number, which keeps file names unique across tags.
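The streaming copy at the heart of the download step can be exercised without any network access. In this sketch, io.BytesIO stands in for the response's raw stream (fake_raw and the byte payload are illustrative stand-ins, not real image data):

```python
import io
import os
import shutil
import tempfile

payload = b"\x89PNG fake image bytes"
fake_raw = io.BytesIO(payload)  # plays the role of res.raw

# shutil.copyfileobj streams in chunks, so the whole image is never
# held in memory at once, which is the pattern the scraper relies on.
path = os.path.join(tempfile.mkdtemp(), "Image-0.jpg")
with open(path, "wb") as f:
    shutil.copyfileobj(fake_raw, f)

with open(path, "rb") as f:
    print("copied", len(f.read()), "bytes")
```

Swapping fake_raw for res.raw (with stream=True on the request) gives exactly the download logic used above.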

driver.quit()

In this section, we quit the web driver, which closes the browser and frees up system resources.

Conclusion

Overall, this script uses the Selenium library to automate web browsing on Instagram: it searches for images based on the tags entered by the user and downloads them, using the requests library to fetch the image data and shutil to write it to disk. The code can be useful for machine learning and data science projects, where large amounts of image data may be needed for training models. However, web scraping can be against the terms of service of some websites, so it is always best to check the website's policies before using any web scraping tools.

Resources

  1. Watch the demo - Here!

  2. Configure Gecko for Selenium - Here!

  3. Web Scraping with Selenium - Here!