I am doing my graduation internship at CBYTE, where I was given an interesting assignment: building a scraping platform. CBYTE needed TikTok data from specific profiles for a client. In the past, they had tried building their own scraper on top of existing open-source code, but it kept breaking. CBYTE then switched to Apify, but because of the high costs they still wanted a scraping platform of their own: one that would host multiple scrapers, remain stable, and share bypassing methods between scrapers. After all, large companies often try to block bot traffic, and therefore scrapers. So it was up to me and a fellow student to build a solution.
We researched existing anti-scraping measures and the techniques used to bypass them. We then translated the results of that research into a functional scraping solution to which new sources can be added quickly and easily. We implemented techniques that automatically bypass anti-scraping mechanisms and developed helper methods that simplify adding new sources.
The final scraping platform consists of three components:
- ScrapeUI (Nuxt): A web application, similar to Apify, where users can easily set up a ScrapeJob. For example, you can specify which TikTok usernames need to be scraped and later view the results of a completed scrape. The UI makes managing scrape tasks accessible—even for non-technical users.
- ScrapeAPI (.NET): A REST API that handles communication between ScrapeUI and ScrapeCore. This API manages the creation, launching, and tracking of scrape tasks. It also serves as a secure gateway to the collected data. Decoupling the API from the UI and ScrapeCore keeps the platform flexible and scalable.
- ScrapeCore (Python): The execution engine of the platform where the actual scraping takes place. ScrapeCore includes a framework for easily adding new scrapers. It also features built-in techniques to automatically detect and bypass anti-bot measures.
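
To give an idea of what "a framework for easily adding new scrapers" can look like, here is a minimal sketch. It is not the actual ScrapeCore code: every name in it (`SCRAPERS`, `register`, `BaseScraper`, `ScrapeJob`, `looks_blocked`, `run_job`) is hypothetical, and the block check is a deliberately naive assumption about what a blocked response might look like.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical registry: maps a source name to its scraper class.
SCRAPERS = {}


def register(source):
    """Class decorator that makes a scraper discoverable by source name."""
    def decorator(cls):
        SCRAPERS[source] = cls
        return cls
    return decorator


@dataclass
class ScrapeJob:
    """Hypothetical job description: which source, which targets."""
    source: str
    targets: list
    results: list = field(default_factory=list)


class BaseScraper(ABC):
    """Every new source only has to implement scrape()."""

    @abstractmethod
    def scrape(self, target: str) -> dict:
        ...


def looks_blocked(status_code: int, body: str) -> bool:
    """Naive block check (assumption): typical status codes or a captcha marker."""
    return status_code in (403, 429) or "captcha" in body.lower()


@register("example")
class ExampleScraper(BaseScraper):
    def scrape(self, target: str) -> dict:
        # Placeholder: a real scraper would fetch and parse a page here.
        return {"source": "example", "target": target}


def run_job(job: ScrapeJob) -> ScrapeJob:
    """Look up the scraper for the job's source and run it for every target."""
    scraper = SCRAPERS[job.source]()
    job.results = [scraper.scrape(target) for target in job.targets]
    return job
```

In a setup like this, adding a new source comes down to writing one class and decorating it with `@register(...)`. Shared helpers such as `looks_blocked` live in one place, so every scraper benefits when a detection or bypass technique is improved.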
I find it a challenging and educational project. We had to research complex technologies and come up with a smart strategy to bypass anti-scraping measures—a constant game of evasion and detection with the ‘anti-scraping police’.
What I learned:
- How companies combat automated traffic (anti-scraping measures)
- Techniques to detect and bypass such blockades
- Effective communication with a client and translating their needs into technical solutions