Case Study: Web Scraping an Electronic Devices Database

A case study on using web scraping to build a comprehensive database for one type of electronic device available in Poland.

Web scraping is an extraordinarily versatile tool that can be adapted to many different markets and niches. In this case study, we dive deep into building a comprehensive electronic devices database.

About the client

This case study describes a real project, but we can’t name our partner at this time.

A company specializing in the electronic devices market approached us. Our client was a relatively young company focused on developing a machine learning-based tool for the electronic devices industry.

The Challenge of building an electronic devices database

The main objective of our client was to create a comprehensive database specifically for one type of electronic device available in Poland.

The initial phase of the project involved scraping all technical specification data about these devices. To achieve this, we identified several websites that served as aggregators for the available devices’ specifications. By scraping data from these sources, we were able to compile a table containing information about different models, basic technical details, and more.

The next step was to collect current prices from different retailers for every model in the database.
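Retailer pages show prices as display strings, so they have to be normalized before they can be stored or compared. The sketch below assumes prices are shown in the common Polish format (space or non-breaking space as thousands separator, comma as decimal separator, "zł" suffix). The function name `parse_price_pln` is hypothetical and not taken from the project.

```python
import re
from decimal import Decimal

def parse_price_pln(raw):
    """Convert a display price such as '1 299,00 zł' to a Decimal, or None."""
    # \s also matches the non-breaking space often used as a thousands separator.
    match = re.search(r"\d[\d\s]*(?:,\d{1,2})?", raw)
    if match is None:
        return None
    digits = re.sub(r"\s", "", match.group(0)).replace(",", ".")
    return Decimal(digits)

print(parse_price_pln("1 299,00 zł"))
print(parse_price_pln("Cena: 849 zł"))
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are later compared across shops or over time.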

Issue #1: Model Identification

The initial challenge revolved around accurately identifying the correct model from the retailer’s e-commerce store. The model names are typically represented by the manufacturer’s name, such as Philips, followed by a string of characters and numbers. These alphanumeric strings might not hold any apparent meaning for individuals outside the manufacturer’s company. Moreover, there were instances where the model identification strings were inaccurately presented on the retailer’s website or where several very similar models existed, differing only by one character in the identification string.

To address this specific challenge, we developed a custom solution for model matching. Through research, we identified several technical parameters that, when combined with the provided model identification string, proved to be highly effective in matching most of the scraped models from retailers’ websites.
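The idea of combining the identification string with technical parameters can be sketched as follows. This is a minimal illustration, not the matcher built for the project: the model codes, the parameter names in `PARAMS`, and the similarity threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical parameter names; the real project chose its own after research.
PARAMS = ("power_w", "capacity_l")

# Hypothetical reference rows built from the specification aggregators.
CATALOG = [
    {"brand": "Philips", "model": "XY5500/12", "power_w": 1500, "capacity_l": 1.5},
    {"brand": "Philips", "model": "XY5501/12", "power_w": 1800, "capacity_l": 1.5},
]

def normalize(code):
    """Upper-case a model code and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())

def match_listing(listing, catalog, min_similarity=0.8):
    """Return the best catalog row for a retailer listing, or None if unsure."""
    code = normalize(listing["model"])
    scored = []
    for row in catalog:
        if row["brand"].lower() != listing["brand"].lower():
            continue
        similarity = SequenceMatcher(None, code, normalize(row["model"])).ratio()
        if similarity < min_similarity:
            continue
        # Compare only the parameters that both sides actually state.
        shared = [key for key in PARAMS if key in listing and key in row]
        if any(listing[key] != row[key] for key in shared):
            continue  # a contradicting parameter rules the candidate out
        scored.append((similarity, len(shared), row))
    if not scored:
        return None
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    # Refuse to guess when the two best candidates are indistinguishable.
    if len(scored) > 1 and scored[0][:2] == scored[1][:2]:
        return None
    return scored[0][2]

# A mistyped code that is equally close to both models is resolved by power_w.
listing = {"brand": "philips", "model": "XY550/12", "power_w": 1800}
print(match_listing(listing, CATALOG)["model"])
```

Returning `None` for ties is a deliberate choice in this sketch: an unmatched listing can be retried later or reviewed by hand, while a wrong match silently corrupts the database.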

The model matching process was integrated just before inserting the data into the database, resulting in an impressive 90% accuracy rate. This achievement greatly improved the precision and reliability of our database, ensuring that the correct models were accurately identified and recorded.

The second matching step ran after all data had been scraped from the various shops. In this phase, the matching considered not only the technical specifications and data from a single shop, but also information from other retailers. After this second phase, accuracy reached 96%.
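One conceivable cross-retailer signal is price: if a listing could belong to two similar models, the prices other shops charge for each candidate can break the tie. The sketch below illustrates only that idea. Whether the project used price for this purpose is not stated, and the function name, tolerance, and sample figures are assumptions.

```python
from statistics import median

def pick_by_price(listing_price, candidates, known_prices, tolerance=0.15):
    """Pick the candidate whose median price elsewhere is closest to this listing.

    Returns None when no candidate lies within the relative tolerance.
    """
    best, best_gap = None, None
    for model in candidates:
        prices = known_prices.get(model)
        if not prices:
            continue  # no other retailer has this model matched yet
        reference = median(prices)
        gap = abs(listing_price - reference) / reference
        if gap <= tolerance and (best_gap is None or gap < best_gap):
            best, best_gap = model, gap
    return best

# Hypothetical prices already matched at other retailers.
known = {
    "XY5500/12": [499.0, 519.0, 509.0],
    "XY5501/12": [699.0, 689.0],
}
print(pick_by_price(695.0, ["XY5500/12", "XY5501/12"], known))
```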

The final step involved manual matching, where our team carefully verified and validated the remaining cases. Through this manual review, we achieved an almost perfect matching rate, successfully identifying the correct models in nearly 100% of the cases.

This three-step approach ensured a high level of accuracy and completeness in our model identification process, resulting in a robust and reliable database for our client in the electronic devices industry.

Issue #2: Extracting data from expert reviews

The third step of the project involved extracting specialized technical parameters that were not provided in the manufacturer’s specifications but were often measured and mentioned by professional reviewers. Additionally, we needed to extract the pros and cons mentioned in these reviews.

This presented a significant challenge as the technical data measured by reviewers were always embedded in plain text without any tables or special formatting. The absence of structured data made it particularly difficult to extract the required information from numerous review sites.

To address this challenge, we first scraped all the articles’ text from relevant categories on reviewers’ blogs. We stored the entire plain text along with metadata from each post.

For the scraped reviews, we implemented a matching algorithm. This was relatively easy because almost every review contained referral links to retailers, and we had already scraped the retailers’ product links. Once the reviews were matched to the proper models, we proceeded with extracting the necessary technical parameters, as well as the mentioned advantages and disadvantages.
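Matching a review through its referral links can be sketched with the Python standard library alone. The example assumes the links point directly at retailer product pages and differ from the scraped URLs only by tracking parameters, a "www." prefix, or a trailing slash. Real affiliate links often pass through redirect services, which this sketch does not resolve. The URLs and names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def canonical(url):
    """Reduce a URL to host + path so tracking parameters do not matter."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def match_review(review_html, product_urls):
    """Return the model a review links to, or None if zero or several match."""
    collector = LinkCollector()
    collector.feed(review_html)
    index = {canonical(url): model for url, model in product_urls.items()}
    found = {index[key] for key in map(canonical, collector.hrefs) if key in index}
    return found.pop() if len(found) == 1 else None

# Hypothetical product URL already scraped from a retailer.
products = {"https://www.shop.example/p/xy5500-12": "XY5500/12"}
html = '<p>Buy <a href="https://shop.example/p/xy5500-12/?utm_source=blog">here</a></p>'
print(match_review(html, products))
```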

For the extraction itself, we combined NLP (Natural Language Processing) algorithms with pattern-matching regular expressions. Because every reviewer’s blog is written and laid out differently, we created custom parsing solutions for each of them. With these parsers, we achieved a high success rate and extracted most of the needed values.
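The regular-expression part of such a parser can be illustrated with a small sketch. The parameters (noise level in dB, power draw in W), the field names, and the sample sentence are assumptions chosen for the example. The pattern accepts both a decimal comma, as used in Polish text, and a decimal point.

```python
import re

# A number with an optional decimal part, followed by a known unit.
MEASUREMENT = re.compile(r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>dB|W)\b")

# Hypothetical mapping from unit to database field.
UNIT_TO_FIELD = {"dB": "noise_db", "W": "power_w"}

def extract_measurements(text):
    """Return the first value found for each known unit in plain review text."""
    result = {}
    for match in MEASUREMENT.finditer(text):
        field = UNIT_TO_FIELD[match.group("unit")]
        value = float(match.group("value").replace(",", "."))
        result.setdefault(field, value)  # keep the first mention only
    return result

# "The measured noise level was 62.5 dB, and the power draw 1480 W."
sample = "Zmierzony poziom hałasu wyniósł 62,5 dB, a pobór mocy 1480 W."
print(extract_measurements(sample))
```

A unit-driven pattern like this is only a starting point: it cannot tell a reviewer's own measurement from a quoted manufacturer figure, which is one reason per-blog parsers and NLP were needed on top of it.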

For certain blogs that presented pros and cons in plain text without any structured formatting, an even more intricate solution was essential. We developed a highly sophisticated approach to extract the pros and cons efficiently from such reviews, ensuring that no valuable insights were overlooked.
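For illustration only, the simplest version of this task looks for cue labels in running text and splits what follows into items. The sketch below is much simpler than the approach described above and makes several assumptions: the labels "Zalety"/"Plusy" (pros) and "Wady"/"Minusy" (cons) are followed by a colon, and items are separated by commas or semicolons within one sentence.

```python
import re

LABEL = re.compile(r"\b(zalety|plusy|wady|minusy)\s*:", re.IGNORECASE)
KIND = {"zalety": "pros", "plusy": "pros", "wady": "cons", "minusy": "cons"}

def extract_pros_cons(text):
    """Collect comma-separated items that follow a pros or cons label."""
    result = {"pros": [], "cons": []}
    labels = list(LABEL.finditer(text))
    for i, match in enumerate(labels):
        end = labels[i + 1].start() if i + 1 < len(labels) else len(text)
        segment = text[match.end():end]
        # Stop at the first sentence end so trailing prose is not swallowed.
        segment = re.split(r"[.!?\n]", segment, maxsplit=1)[0]
        items = [part.strip() for part in re.split(r"[,;]", segment)]
        result[KIND[match.group(1).lower()]].extend(item for item in items if item)
    return result

# "Pros: quiet operation, low power draw. Cons: short cable. In summary, worth it."
review = "Zalety: cicha praca, niski pobór mocy. Wady: krótki kabel. Podsumowując, warto."
print(extract_pros_cons(review))
```

Reviews that weave strengths and weaknesses into ordinary sentences without any label are out of reach for this kind of rule, which is where a more involved solution becomes necessary.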

Through the utilization of these complex NLP and parsing techniques, we successfully mined and analyzed expert reviews, extracting valuable technical data, as well as the pros and cons associated with each device. This comprehensive analysis provided our client with a wealth of qualitative information that enriched their database and strengthened their competitive edge in the electronic devices market.

Overcoming these problems allowed us to gather invaluable insights from expert reviews and supplement our database with essential data not available in standard manufacturer specifications. The enriched database gave our client a comprehensive and well-rounded view of the electronic devices, enabling them to supply better, unique data for training their model.

Issue #3: Parsing Polish language

Parsing the Polish language presented a notably greater challenge compared to English or other widely spoken European languages like Spanish or Italian. Polish is not as widely used as English, which means there are fewer language resources and tools available for its processing.

Furthermore, as a Slavic language, Polish is heavily inflected: the form of a word can vary significantly with its grammatical case, tense, and context. This makes it particularly challenging for automated parsing systems to recognize the same term across its different forms and to interpret the meaning of sentences accurately.

The shortage of NLP (Natural Language Processing) resources tailored specifically for Polish further compounded the difficulty of parsing. As a result, we had to develop and implement custom solutions that could handle the peculiarities of Polish grammar and syntax.

Despite these challenges, we successfully devised specialized parsing algorithms that could navigate the difficulties of the Polish language and extract relevant information from various sources. Our commitment to overcoming linguistic barriers allowed us to deliver a robust and effective solution for our client, ensuring the accuracy and integrity of the data.
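Two small building blocks show the kind of handling Polish text needs. The sketch below is an assumption-laden illustration, not the project's code. `fold` strips diacritics, treating "ł" separately because Unicode defines no decomposition for it. `mentions` matches a keyword by a hand-picked stem so that inflected forms such as "głośność", "głośności", and "głośnością" are all recognized.

```python
import re
import unicodedata

def fold(text):
    """Lower-case text and strip Polish diacritics (ą->a, ł->l, ż->z, ...)."""
    # "ł" has no Unicode decomposition, so it must be replaced explicitly.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def mentions(text, stem):
    """True if any word in the text starts with the given folded stem."""
    return any(word.startswith(stem) for word in re.findall(r"\w+", fold(text)))

print(fold("Łódź"))
# "The noise level is low." The stem covers every case ending of "głośność".
print(mentions("Poziom głośności jest niski", "glosnos"))
```

Stem-prefix matching is crude and can over-match unrelated words that share a prefix. A morphological analyzer is the more robust option when one is available for the vocabulary in question.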

Building an electronic devices database: What Was Achieved

Through our efforts and innovative solutions, we achieved the following key outcomes for our client:

  1. Comprehensive Database: We successfully created a comprehensive database of electronic devices for the specific type requested by the client. This database contained not only standard specification data but also additional valuable information.
  2. Price Tracking: We implemented a price tracking mechanism that allowed our client to monitor and record the price fluctuations of electronic devices over time. This information enabled the client to make informed decisions and adapt to market changes effectively.
  3. Specialized Parameters: By utilizing advanced NLP algorithms and pattern matching, we extracted specialized technical parameters from expert reviews that were not available in standard manufacturer specifications. This enriched the database with valuable insights from professional reviewers.
  4. Pros and Cons from reviews: We successfully mined and extracted pros and cons mentioned in expert reviews, even when presented in plain text format. This provided our client with a well-rounded view of each electronic device’s strengths and weaknesses.
  5. Model Identification: We developed a custom model matching solution that accurately identified different device models, even in cases of ambiguous or improperly stated model identification strings on retailers’ websites.
  6. Language Parsing: Despite the complexities of the Polish language, we overcame the challenges associated with parsing it, enabling us to process Polish text and extract essential data from various sources.
  7. High Accuracy: Our meticulous matching and parsing techniques resulted in a matching accuracy of 90% after the initial step and 96% after the second automated step, with manual review bringing the final rate close to 100%.

In conclusion, we successfully delivered a useful and comprehensive database for our client in the electronic devices market. Our implemented solutions ensured the accuracy and richness of the data, empowering our client to create a unique and well-trained machine learning model.

Do you need our help?

Are you looking to revolutionize your market insights? Want access to a comprehensive database that goes beyond standard specifications? Look no further! At ScrapingZone, we have the expertise and cutting-edge solutions to help your business grow.

Our team has successfully crafted custom solutions for model identification, price tracking, specialized parameter extraction, and even complex language parsing for various clients.

Contact us today to schedule a consultation and discover how our solutions can transform your business.