
Building an Upwork Job Scraper Bot

javascript · automation · puppeteer · web-scraping

Automation is not about replacing human intelligence—it’s about enhancing it. Let the machines do the heavy lifting, and focus on what you do best.

Check out the code: github.com/RidaEn-nasry/upwork-bot

Why JavaScript?

None of your business!!

Doing it the simple way

I like simplicity! Who doesn’t, right? So one way to tackle the problem is:

  1. Use the built-in https module to make a request directly to a pre-defined URL
  2. Since the data arrives in chunks, collect it from the response stream and, once complete, save it to an HTML file
  3. Parse the thing
  4. Send a local notification or an API request or a goddamn RPC or however the heck you want to be alerted about new jobs

It would look something like this:

const fs = require('fs');
const https = require('https');

const url = 'https://www.upwork.com/ab/jobs/search/?q=javascript&sort=recency';

https.get(url, (res) => {
    let data = '';
    // the response arrives in chunks, so collect them from the stream
    res.on('data', (chunk) => {
        data += chunk;
    });
    res.on('end', () => {
        // save it as an html file
        fs.writeFile('upwork.html', data, (err) => {
            if (err) throw err;
            // if a match (one of your job keywords) is found, send a notification
            // parse() and sendNotification() are placeholders for your own logic
            if (parse(data)) {
                sendNotification();
            }
        });
    });
}).on('error', (err) => {
    console.error(err);
});

Hmmmm! Not so fast!!! At step 3, in comes Cloudflare!!

Cloudflare Waiting Room

Upwork is a Cloudflare client, and Cloudflare doesn't like bots, as you may have noticed! If you're not familiar with Cloudflare, it's one of the biggest CDN providers in the world (or at least that's what they're best known for), but they also provide a shit load of other services: load balancing, firewalls, etc.

The service that's blocking us here is the WAF (Web Application Firewall): a firewall that protects web applications from bad stuff (DDoS attacks, cross-site scripting) and also provides useful features like performance optimization and cache control. Bot management is one of those features.

Bots, if left uncontrolled, can do a lot of damage to web properties by consuming resources and potentially causing a denial-of-service attack. So Cloudflare manages these matrix creatures using behavioral analysis and machine learning.

But wait! What about the good bots? Search engine crawlers (Google, Bing), performance monitoring tools, and other stuff that's necessary for the web to function properly! Well, Cloudflare differentiates between good bots and bad bots by whitelisting the former and blacklisting the latter.

You and I, as you may have figured out, are bad bots!

How to bypass the Cloudflare waiting room

We should look like real humans. We need our request to look like it’s coming from a real human in a real browser! Puppeteer to the rescue!

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It can also be configured to run full (non-headless) Chrome or Chromium. The point is that you can control a headless browser and make it do all sorts of things a real human might do: clicking links, scrolling pages, filling out forms. And if Cloudflare or any other bot manager is giving you a hard time and thinks you're a bot, you can use Puppeteer to add some randomness to your script by injecting delays and mouse movements that make your browser look less robotic.

So we're gonna use Puppeteer to lie to Cloudflare and tell it that we're a real human and not a bot.

After installing Puppeteer (npm install puppeteer), let's put it to work:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // pretend to be a real Chrome browser
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36');
    await page.goto('https://www.upwork.com/ab/jobs/search/?q=javascript&sort=recency');
    // give the page a moment to finish loading
    await page.waitForTimeout(1000);

    const html = await page.content();
    fs.writeFile('upwork.html', html, (err) => {
        if (err) throw err;
        console.log('The file has been saved!');
    });
    await browser.close();
})();

Our code actually didn’t change much:

  • First, we define an async function (an immediately invoked one) that will execute our logic
  • Then we create a new instance of a headless browser using the puppeteer.launch() method
  • Then we create a new page in the browser using the browser.newPage() method, which we'll use to navigate to the Upwork website and scrape the job listings

Here’s the important part:

  • We set the user agent for the page using the page.setUserAgent() method, to make the browser appear to be a real web browser and not a bot

The user agent is a string of text that a web browser sends to identify itself and its capabilities. The server uses this information to decide which content and features to serve. In this case, the user agent identifies the browser as a recent version of Google Chrome on Linux. This makes the headless browser look like a real web browser rather than a bot, which may help bypass any anti-bot measures the website has in place.

  • Next, the code navigates to the Upwork website using the page.goto() method, then waits a beat using the page.waitForTimeout() method to imitate real human behavior (not so human.. but still)
  • Once the page has loaded, we use the page.content() method to get the HTML content of the page, write the HTML to a file using the fs.writeFile() method, and finally close the browser using the browser.close() method

We actually bypassed the Cloudflare waiting room successfully! But besides the waiting room, Upwork seems to have some other bot detection mechanism of its own. These guys are really serious about bots. Hmmmm! I think we should be more serious about it too.

After some tinkering here and there, I came up with the following script:

const puppeteer = require('puppeteer');
const fs = require('fs');
// EMAIL and PASSWORD live in a .env file; dotenv loads them into process.env
require('dotenv').config();

// helper functions (getRndm, getTime, getTitle, tooOld, tooCheap, ...) are defined in the repo
(async () => {
    // reading keywords from the keywords.txt file (one per line)
    let keywords = fs.readFileSync('keywords.txt', 'utf-8')
        .split('\n')
        .map((k) => k.trim())
        .filter((k) => k.length > 0);

    // launch the browser in non-headless mode
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    // set a realistic user agent string
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36');

    // set the viewport to 1920x1080 to avoid the cookie banner
    await page.setViewport({
        width: 1920,
        height: 1080
    });
    await page.goto('https://www.upwork.com/ab/account-security/login');
    // wait for the page to load
    await page.waitForTimeout(1000);

    // get the email and password from the .env file
    const email = process.env.EMAIL;
    const password = process.env.PASSWORD;

    // enter the email
    await page.type('#login_username', email);
    // click the "continue with email" button
    await page.click('#login_password_continue');
    // some randomness in the mouse movement (kept inside the viewport)
    for (let i = 0; i < 10; i++) {
        await page.mouse.move(getRndm(0, 1920), getRndm(0, 1080));
        await page.waitForTimeout(1000);
    }
    // password
    await page.type('#login_password', password);
    await page.click('#login_control_continue');
    // move the mouse randomly to look more human
    for (let i = 0; i < 10; i++) {
        await page.mouse.move(getRndm(0, 1920), getRndm(0, 1080));
        await page.waitForTimeout(1000);
    }

    let allJobs = [];

    for (let i = 0; i < keywords.length; i++) {
        // go through the first 5 pages of results for each keyword
        for (let j = 1; j <= 5; j++) {
            await page.goto('https://www.upwork.com/ab/jobs/search/?q=' + encodeURIComponent(keywords[i]) + '&page=' + j + '&sort=recency');
            await page.waitForTimeout(3000);
            await page.waitForSelector('div[data-test="main-tabs-index"]', { visible: true });
            // $$ (not $) so we get all the job tiles, not just the first one
            const listings = await page.$$('section[data-test="JobTile"]');

            let jobs = await Promise.all(listings.map(async (listing) => {
                let posted = await getTime(listing);
                if (tooOld(posted) === true) return;
                let title = await getTitle(listing);
                let link = await getLink(listing);
                let description = await getDescription(listing);
                let typeOfJob = await getTypeOfJob(listing);
                if (tooCheap(typeOfJob) === true) return;
                let paymentverified = await isVerified(listing);
                return { posted, title, link, description, typeOfJob, paymentverified };
            }));

            // drop the jobs that were filtered out above
            jobs = jobs.filter((job) => job !== undefined);
            allJobs.push(...jobs);
        }
    }

    // one last random pause before closing, just to be safe
    const randomDelay = Math.random() * 2000;
    await page.waitForTimeout(randomDelay);
    await browser.close();
    fs.writeFileSync('jobs.json', JSON.stringify(allJobs, null, 2));
})();

It starts off by reading a list of keywords from a file called keywords.txt. These are our magic words (one per line, separated by a line feed '\n') that the code uses to search for jobs.
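For example, a keywords.txt could look like this (these particular keywords are just an illustration, use your own):

javascript
react
node.js
web scraping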

Next, the code fires up its trusty web browser and creates a new page. It sets the user agent string to pretend to be a real web browser because it’s too cool to be a robot. It also sets the viewport to 1920x1080 to avoid the annoying cookie banner.

The code then heads over to the Upwork login page and waits patiently for it to load. It snags your email and password from the .env file and enters them into the appropriate fields on the login page. (The reason I chose to log in before scraping is that the quality of jobs you see while authenticated is noticeably different from what you see anonymously.)

To make things more interesting, the code moves the mouse around randomly to mimic a human user. It’s like a little dance to pass the time while the page loads. Too smart haaaaa… not really! Ok.

Once the page is loaded, the code navigates to the Upwork job search page and starts searching for jobs using each keyword in the keywords array. It’s like a treasure hunt for your dream projects! For each keyword, it loops through up to five pages of search results.

For each job listed on a search results page, the code extracts the posting time, title, link, description, job type (hourly or fixed-price) with its budget, and whether the client's payment is verified. It stores all of this juicy information in an array called allJobs.
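The extraction helpers themselves live in the repo, but here's a minimal sketch of what a couple of them might look like. The 'h4 a' selector is an assumption for illustration only; Upwork's markup changes often, so inspect the actual page:

// sketch of two extraction helpers; the 'h4 a' selector is an assumption
async function getTitle(listing) {
    try {
        return await listing.$eval('h4 a', (el) => el.textContent.trim());
    } catch (err) {
        return ''; // tile without the expected markup
    }
}

async function getLink(listing) {
    try {
        const href = await listing.$eval('h4 a', (el) => el.getAttribute('href'));
        return 'https://www.upwork.com' + href;
    } catch (err) {
        return '';
    }
}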

The tooOld() and tooCheap() functions filter out jobs that are too old or too cheap. Too old means the job was posted more than 20 minutes ago; too cheap means it pays less than $500 if fixed-price or less than $15/hr if hourly. You can edit those functions to match your own preferences.
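Here's a rough sketch of what those two filters could look like, assuming "posted" strings like "5 minutes ago" and typeOfJob objects shaped like the JSON output below (the real versions are in the repo):

// assumes posted strings like "5 minutes ago" and
// typeOfJob objects like { type: 'Hourly: ', budget: '$35.00-$46.00' }
function tooOld(posted) {
    // anything not measured in minutes ("hours ago", "days ago", ...) is too old
    const match = posted.match(/^(\d+) minutes? ago$/);
    return !match || parseInt(match[1], 10) > 20;
}

function tooCheap(typeOfJob) {
    // grab the first number in the budget string, e.g. "$35.00-$46.00" -> 35
    const rate = parseFloat(typeOfJob.budget.replace(/[^0-9.\-]/g, '').split('-')[0]);
    if (isNaN(rate)) return false; // no budget listed, let it through
    return typeOfJob.type.startsWith('Hourly') ? rate < 15 : rate < 500;
}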

After running the script, we’ll get a jobs.json file that looks something like this:

[
  {
    "posted": "5 minutes ago",
    "title": "Senior Software and App Engineer",
    "link": "https://www.upwork.com/jobs/...",
    "description": "We need an absolute ninja to go through and clean up our entire platform...",
    "typeOfJob": {
      "type": "Hourly: ",
      "budget": "$35.00-$46.00"
    },
    "paymentverified": true
  },
  {
    "posted": "7 minutes ago",
    "title": "Build a web and mobile application",
    "link": "https://www.upwork.com/jobs/...",
    "description": "You can read the specification document attached...",
    "typeOfJob": {
      "type": "Fixed-price",
      "budget": "$1000"
    },
    "paymentverified": false
  }
]

A simple array of objects, each object representing a job. Now that we have our list, we can do whatever we like with it: use discord.js to send it to a Discord channel, use nodemailer to send it to your email, or send it to your grandma telepathically. The choice is yours.
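If you go the email route, for instance, a minimal nodemailer sketch could look like this (the SMTP settings and addresses are placeholders, fill in your own):

const nodemailer = require('nodemailer');
const fs = require('fs');

const jobs = JSON.parse(fs.readFileSync('jobs.json', 'utf-8'));

// placeholder SMTP settings, swap in your own provider's
const transporter = nodemailer.createTransport({
    host: 'smtp.example.com',
    port: 587,
    auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS }
});

transporter.sendMail({
    from: 'bot@example.com',
    to: 'you@example.com',
    subject: jobs.length + ' new Upwork jobs',
    text: jobs.map((j) => j.title + '\n' + j.link).join('\n\n')
}).then(() => console.log('mail sent'));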

I chose to display it as notifications on my lovely Mac.

Choosing your endpoint

I used jq to parse the JSON and extract all the juicy details about the jobs, like the title, posting date, type, budget, and link.

Then I looped through each job and plucked out the individual details. I used alerter, which is just a wrapper around osascript. For each job, I call alerter with the job details, and it displays a notification with the job title, posting date, type, budget, and link. And if I'm feeling adventurous, I can click the "Open" button to check out the job link in the browser.
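Here's roughly what that notifyjobs.sh looks like; treat it as a sketch, since the exact alerter flags and jq fields depend on your setup:

#!/bin/bash
# loop over each job object in jobs.json
jq -c '.[]' jobs.json | while read -r job; do
    title=$(echo "$job" | jq -r '.title')
    posted=$(echo "$job" | jq -r '.posted')
    budget=$(echo "$job" | jq -r '.typeOfJob.budget')
    link=$(echo "$job" | jq -r '.link')
    # show a macOS notification; alerter prints the clicked action to stdout
    action=$(alerter -title "$title" -subtitle "$posted · $budget" -message "$link" -actions Open -timeout 10)
    if [ "$action" = "Open" ]; then
        open "$link"
    fi
done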

Automating with Cronjobs

Now that we have our bot and notification scripts all set up, it’s time to automate the process so we can sit back and let the bot do all the work for us. To do this, we’ll use cronjobs—little robots that run scripts at scheduled times.

Open up a terminal and type in crontab -e. Since our script will scrape the Upwork website and fill in the jobs.json file, I’ll run it every 7 minutes to make sure I get the latest jobs:

*/7 * * * * node /path/to/bot.js

Next, let's schedule our notification script to run every 10 minutes, so it fires a few minutes after most bot runs:

*/10 * * * * /path/to/notifyjobs.sh

And with that, we’re done! We’ve successfully created a bot to extract job listings from Upwork and display them as macOS notifications. Now we can get the latest job updates from the comfort of our own desktop screens. No more endlessly scrolling through job listings—the bot does all the hard work for us!

Happy job hunting!