Learn to extract valuable data from Melmod effortlessly with step-by-step guidance. Read now for a web scraping journey!
Introduction
As I mentioned earlier, I want to crawl some valuable data from Mods For Melon Playground (Melmod). This guide provides a step-by-step walkthrough. Let’s get started!
Setting Up the Project
Begin by creating a new Kotlin project and adding Selenium WebDriver as a dependency in your build file. If you missed the initial setup, check out the details in the previous article.
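For readers who skipped the previous article, a build file for this project might look roughly like the sketch below. It assumes Gradle with the Kotlin DSL, JUnit 5 as the test framework, and OpenCSV as the source of CSVReader and CSVWriter. All version numbers are assumptions, so prefer the ones from your own setup.

```kotlin
// build.gradle.kts (a sketch; every version number here is an assumption)
plugins {
    kotlin("jvm") version "1.9.22"
}

repositories {
    mavenCentral()
}

dependencies {
    // Selenium WebDriver, used to drive Chrome
    testImplementation("org.seleniumhq.selenium:selenium-java:4.16.1")
    // Assumption: CSVReader and CSVWriter come from OpenCSV
    testImplementation("com.opencsv:opencsv:5.9")
    // JUnit 5, for @BeforeEach, @AfterEach and @Test
    testImplementation("org.junit.jupiter:junit-jupiter:5.10.1")
}

tasks.test {
    // Run the tests on the JUnit 5 platform
    useJUnitPlatform()
}
```

The dependencies are declared as testImplementation because the scraping classes in this guide live in the test module.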
Scraping Data from Melmod
Get List of All Article Links
Our first objective is to crawl a list of article links from the Melmod website. To achieve this, create a new class, MelModGetLinks, in a file named MelModGetLinks.kt within the test module.
Structure Project
Inside this class, initialize the WebDriver and the CSVWriter:

    import com.opencsv.CSVWriter // Assumption: CSVWriter comes from OpenCSV
    import org.junit.jupiter.api.AfterEach
    import org.junit.jupiter.api.BeforeEach
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.chrome.ChromeDriver
    import org.openqa.selenium.chrome.ChromeOptions
    import java.io.FileWriter

    class MelModGetLinks {
        private lateinit var driver: WebDriver
        private lateinit var csvWriter: CSVWriter

        @BeforeEach
        fun setup() {
            // Create ChromeOptions
            val chromeOptions = ChromeOptions()

            // Disable image loading
            val prefs: MutableMap<String, Any> = HashMap()
            prefs["profile.managed_default_content_settings.images"] = 2
            chromeOptions.setExperimentalOption("prefs", prefs)

            // In headless mode the browser does not render a UI
            chromeOptions.addArguments("--headless")

            // Initialize the WebDriver
            driver = ChromeDriver(chromeOptions)

            // Navigate to the website
            driver.get("https://melmod.com/mods/")

            // The output CSV file
            csvWriter = CSVWriter(FileWriter("src/test/resources/melmod-link.csv"))
        }

        @AfterEach
        fun tearDown() {
            driver.quit()
            csvWriter.close()
        }
    }
Everything is now set up. Back on the Melmod website, we can see that every article element has post in its class attribute:
https://melmod.com/mods/
So we can get all article names and detail links by finding every element that has post in its class attribute. The following code snippet does this:

    // Members of MelModGetLinks. They also need imports for
    // org.junit.jupiter.api.Test, org.junit.jupiter.api.Assertions and org.openqa.selenium.By.
    @Test
    fun `Get list links from pages - success`() {
        // Add header to the CSV file
        addHeaderForCSV()

        // Get links from the pages
        val firstPage = 1
        val lastPage = 2
        val pages = (firstPage..lastPage).toList()
        for (page in pages) {
            // Navigate to the specific page of the website
            driver.get("https://melmod.com/mods/page/$page/")

            // Get all articles available with class `post`
            val articles = driver.findElements(By.className("post"))
            // JUnit expects (expected, actual); each page should list 10 articles
            Assertions.assertEquals(10, articles.size)

            // Get the link of each article
            articles.forEachIndexed { index, article ->
                val h2 = article.findElement(By.className("entry-title"))
                val a = h2.findElement(By.tagName("a"))
                val link = a.getAttribute("href")
                insertToCSV(10 * (page - 1) + (index + 1), link, h2.text)
            }
        }
    }

    private fun addHeaderForCSV() {
        val header = arrayOf("Index", "Link", "Name")
        csvWriter.writeNext(header)
    }

    private fun insertToCSV(index: Int, link: String, name: String) {
        val row = arrayOf(index.toString(), link, name)
        csvWriter.writeNext(row)
        println("CSV: $index, $link, $name")
    }
This code navigates to each listing page of the Melmod website and extracts every article name and detail link by class name. The result is written to src/test/resources:
melmod-link.csv
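The Index column comes from the expression 10 * (page - 1) + (index + 1), which turns a page number and a zero-based position on that page into one running number across pages. A minimal sketch of that arithmetic follows. The helper name globalIndex and its pageSize parameter are hypothetical and only there for illustration; the value 10 is the page size the test asserts.

```kotlin
// Hypothetical helper mirroring the index expression used in the test.
// page is 1-based, indexOnPage is 0-based (as provided by forEachIndexed).
fun globalIndex(page: Int, indexOnPage: Int, pageSize: Int = 10): Int =
    pageSize * (page - 1) + (indexOnPage + 1)

fun main() {
    println(globalIndex(page = 1, indexOnPage = 0)) // 1
    println(globalIndex(page = 1, indexOnPage = 9)) // 10
    println(globalIndex(page = 2, indexOnPage = 0)) // 11
    println(globalIndex(page = 2, indexOnPage = 9)) // 20
}
```

With two pages of 10 articles each, the rows in the CSV are therefore numbered 1 through 20.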
Extract Valuable Data from Each Article
Once we have the article links stored in src/test/resources/melmod-link.csv, the next step is to open each article and extract the desired data: image, name, and mod file link.
Image: found in the div element that has featured-image in its class attribute.
Name: found in the h1 tag that has entry-title in its class attribute. Since the name was already collected in the first step, there is no need to fetch it again here.
Mod File Link: found in the button that has wp-block-button in its class attribute.
For more information, you can see the image below:
    import com.opencsv.CSVReader // Assumption: CSVReader and CSVWriter come from OpenCSV
    import com.opencsv.CSVWriter
    import org.junit.jupiter.api.AfterEach
    import org.junit.jupiter.api.Assertions.assertEquals
    import org.junit.jupiter.api.BeforeEach
    import org.junit.jupiter.api.Test
    import org.openqa.selenium.By
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.chrome.ChromeDriver
    import org.openqa.selenium.chrome.ChromeOptions
    import java.io.FileReader
    import java.io.FileWriter

    class MelModGetFiles {
        private lateinit var driver: WebDriver
        private lateinit var csvReader: CSVReader
        private lateinit var csvWriter: CSVWriter

        @BeforeEach
        fun setup() {
            // Create ChromeOptions
            val chromeOptions = ChromeOptions()

            // Disable image loading
            val prefs: MutableMap<String, Any> = HashMap()
            prefs["profile.managed_default_content_settings.images"] = 2
            chromeOptions.setExperimentalOption("prefs", prefs)

            // In headless mode the browser does not render a UI
            chromeOptions.addArguments("--headless")

            // Initialize the WebDriver
            driver = ChromeDriver(chromeOptions)

            // Initialize the CSVReader
            csvReader = CSVReader(FileReader("src/test/resources/melmod-link.csv"))

            // Initialize the CSVWriter
            csvWriter = CSVWriter(FileWriter("src/test/resources/melmod-fileMods.csv"))
        }

        @AfterEach
        fun tearDown() {
            driver.quit()
            csvReader.close()
            csvWriter.close()
        }

        @Test
        fun `Get all file links from melmod-link csv - success`() {
            // Write the header for the output file
            addHeader()

            // Read and discard the header of the input file
            csvReader.readNext()

            // Start reading the input data; readNext() returns null at the end of the file
            var nextRecord: Array<String>?
            while (csvReader.readNext().also { nextRecord = it } != null) {
                // Process the data for each row
                val index = nextRecord!![0]
                val link = nextRecord!![1]
                val name = nextRecord!![2]
                findFileLinkAndAddToCSV(index, link, name)
            }
        }

        private fun addHeader() {
            val header = arrayOf("Index", "Name", "Image", "File")
            csvWriter.writeNext(header)
        }

        private fun findFileLinkAndAddToCSV(index: String, link: String, name: String) {
            // Navigate to the mod detail page
            driver.get(link)

            // Get the mod image
            val imageDiv = driver.findElement(By.className("featured-image"))
            val imageTag = imageDiv.findElement(By.tagName("img"))
            val imageLink = imageTag.getAttribute("src")

            // Get the mod file link
            val downloadButton = driver.findElement(By.className("wp-block-button"))
            // JUnit expects (expected, actual)
            assertEquals("download", downloadButton.text)
            val a = downloadButton.findElement(By.tagName("a"))
            val fileLink = a.getAttribute("href")

            insertToCSV(index, name, imageLink, fileLink)
        }

        private fun insertToCSV(index: String, name: String, image: String, file: String) {
            val row = arrayOf(index, name, image, file)
            csvWriter.writeNext(row)
            println("CSV: $index, $name, $image, $file")
        }
    }
This code iterates through each article link, navigates to the corresponding page, and extracts the image and file link.
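The row-reading loop relies on readNext() returning null once the input is exhausted. As an optional alternative, the same pattern can be written with generateSequence, which stops at the first null. The sketch below is a standalone illustration, not the code used in this guide: FakeRowReader, LinkRow, and readLinkRows are hypothetical names, the fake reader only stands in for a real CSV reader, and the example.com links are placeholders.

```kotlin
// Hypothetical stand-in for a CSV reader: readNext() returns the next row, or null at the end.
class FakeRowReader(rows: List<Array<String>>) {
    private val iterator = rows.iterator()
    fun readNext(): Array<String>? = if (iterator.hasNext()) iterator.next() else null
}

// Hypothetical row type for the three columns written in the first step.
data class LinkRow(val index: String, val link: String, val name: String)

fun readLinkRows(reader: FakeRowReader): List<LinkRow> {
    // Read and discard the header row
    reader.readNext()
    // generateSequence keeps calling readNext() until it returns null
    return generateSequence { reader.readNext() }
        .map { (index, link, name) -> LinkRow(index, link, name) }
        .toList()
}

fun main() {
    val reader = FakeRowReader(
        listOf(
            arrayOf("Index", "Link", "Name"),
            arrayOf("1", "https://example.com/a", "Mod A"), // placeholder data
            arrayOf("2", "https://example.com/b", "Mod B"), // placeholder data
        )
    )
    readLinkRows(reader).forEach { println(it) }
}
```

This form avoids the mutable nextRecord variable and the !! operators, at the cost of one small data class.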
Result
After completing these two steps, you’ll have scraped all the data you need. The results are stored in src/test/resources/melmod-fileMods.csv, as illustrated in the image below:
Scraping Result
Drawbacks
While web scraping is a powerful tool, it comes with certain drawbacks that should be considered:
Performance: Currently, extracting the data from each article takes approximately 2 minutes, which is slow. I will look for a way to improve this later.
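Before optimizing, it can help to measure where the time actually goes. The sketch below times a step per item using measureTimeMillis from the Kotlin standard library. The helper timeEach is hypothetical, the links are placeholders, and Thread.sleep merely stands in for the real page load and extraction.

```kotlin
import kotlin.system.measureTimeMillis

// Hypothetical helper: runs `step` for every item and records the elapsed milliseconds per item.
fun <T> timeEach(items: List<T>, step: (T) -> Unit): Map<T, Long> =
    items.associateWith { item -> measureTimeMillis { step(item) } }

fun main() {
    // Placeholder links; in the real test, the step would load and parse each article page.
    val links = listOf("https://example.com/a", "https://example.com/b")
    val timings = timeEach(links) { Thread.sleep(20) } // Thread.sleep stands in for the real work
    timings.forEach { (link, millis) -> println("$link took $millis ms") }
    println("Total: ${timings.values.sum()} ms")
}
```

Per-item timings like these show whether the cost is spread evenly across articles or dominated by a few slow pages.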
Conclusion
Armed with the ability to fetch article links and extract valuable data, you are now equipped to scrape essential information from Mods For Melon Playground.
Remember to check the website’s terms of service and policies before scraping to ensure compliance. Feel free to customize the code according to your specific scraping needs. Happy coding!