Web Scraping
Hi, I am getting the following errors when trying to import beautifulsoup4. Please can you assist?
You don't pip install in the .py file. You do that in the terminal window.
Regards,
Earnie Boyd, CEO
Seasoned Solutions Advisor LLC
Schedule 1-on-1 help
Join me on Slack
That command installs the package. You run it in the terminal, not in the Python file.
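To make the split concrete, here is a minimal sketch. The install command appears only as a comment because it belongs in the terminal, and the helper name `is_installed` is made up for this illustration.

```python
# In the terminal (not in the .py file):
#     pip install beautifulsoup4
#
# In the .py file you only import. The package installs as "beautifulsoup4"
# but is imported under the name "bs4".
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    from bs4 import BeautifulSoup  # safe to import now
    print("bs4 is available")
else:
    print("bs4 is missing: run 'pip install beautifulsoup4' in the terminal")
```

If the second message prints, the install step has not happened yet in the environment that runs the script.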
The solution for scraping the job titles from the Python job board says:
This script finds all the h2 elements of the class listing-company (which are the job titles on this site) and prints the text inside each one.
Here is an example H2 from the page source:
<li>
  <h2 class="listing-company">
    <span class="listing-company-name">
      <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
      Techunting is looking for a Python Sr Dev for a very important client in the US full time staff dev
    </span>
    <span class="listing-location"><a href="/jobs/location/remote-from-argentina-argentina-argentina-or-rest-of-latinamerica/" title="More jobs in Remote from Argentina, Argentina, Argentina or rest of Latinamerica">Remote from Argentina, Argentina, Argentina or rest of Latinamerica</a></span>
  </h2>
  <span class="listing-job-type"> AWS, Python </span>
  <span class="listing-posted">Posted: <time datetime="2024-02-15T16:02:53.761441+00:00">15 February 2024</time></span>
  <span class="listing-company-category"><a href="/jobs/category/developer-engineer/" title="More jobs in Developer / Engineer">Developer / Engineer</a></span>
</li>
Can someone please explain how we equated the h2 to the job title? It is confusing for me, not being an HTML expert. Many thanks.
@google-rayazsiddiqi <h2> is an HTML tag, and a page can contain many <h2> tags. The class named inside the tag refers to a CSS class. That CSS class doesn't need to exist, but if it does, it can modify the appearance of the <h2>BODY</h2>. So the script is looking for all of the <h2> tags whose class is "listing-company".
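As a rough sketch of what that lookup does, assuming BeautifulSoup (bs4) is installed: the HTML is a trimmed version of the example posted above, with one extra plain <h2> added for contrast, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed version of the listing quoted earlier, plus one plain <h2>.
html = """
<li>
  <h2 class="listing-company">
    <span class="listing-company-name">
      <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
      Techunting is looking for a Python Sr Dev
    </span>
  </h2>
  <h2>An h2 without the class</h2>
  <span class="listing-job-type"> AWS, Python </span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with the underscore) filters on the HTML class attribute;
# plain "class" is a reserved word in Python.
matches = soup.find_all("h2", class_="listing-company")

print(len(soup.find_all("h2")))  # every <h2> in the snippet
print(len(matches))              # only those with class "listing-company"
```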
Does this help?
@ssadvisor Yes, it does, but why are we looking for listing-company when the exercise is to scrape the job title?
@google-rayazsiddiqi That's what was chosen by the developer. Can you give me a link to your reference?
@ssadvisor https://learnwithhasan.com/lessons/web-scraping/ Look at the solution section at the bottom.
@google-rayazsiddiqi The important thing you may have missed is:
Hint: Inspect the website and find out the HTML element and class that corresponds to the job titles.
So if you inspect the https://www.python.org/jobs/ webpage (Inspect is an option in the right-click menu), you'll see that the <h2> tag with the class name "listing-company" is what you want to scrape. We don't have control over what the webpage gives us; we only have control over what we obtain from it.
Since we don't have control over the inspected webpage, you will also need to recover from unexpected changes to it if the code is in production.
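One way to recover from such changes, sketched under the assumption that bs4 is installed, is to check every lookup for None before using it. The function name `first_job_title` and the sample HTML strings are hypothetical.

```python
from typing import Optional

from bs4 import BeautifulSoup


def first_job_title(html: str) -> Optional[str]:
    """Return the first job title, or None if the expected markup is absent."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find("h2", class_="listing-company")
    if h2 is None:  # the site may have renamed or removed the class
        return None
    link = h2.find("a")
    if link is None:  # the heading exists but no longer contains a link
        return None
    return link.get_text(strip=True)


print(first_job_title('<h2 class="listing-company"><a href="/jobs/1/">Dev</a></h2>'))
print(first_job_title("<h1>Layout changed</h1>"))
```

Returning None lets the calling code decide whether to log, retry, or alert, instead of crashing with an AttributeError when the page layout changes.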
@ssadvisor I'm sorry, I don't get it. I see the job title below. Apologies for my ignorance, but once I understand it I can be comfortable with web scraping.
Just to elaborate further: I can see h2 with class=listing-company, class=listing-company-name, and class=listing-new, so where do we get the job title from?
@google-rayazsiddiqi It's the actual display text (e.g. "SENIOR SOFTWARE ENGINEER").
- Find the <h2> tags with the class name "listing-company"
- Inside each one, find the <span> tag with the class name "listing-company-name"
- Inside that <span>, find the <a> tag and get the display text between its opening and closing tags.
This will be the Job Title.
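These steps can be sketched in BeautifulSoup roughly as follows. This is an illustration, not the lesson's solution code; it assumes bs4 is installed and uses a shortened copy of the listing posted earlier (the location link is abbreviated).

```python
from bs4 import BeautifulSoup

# Shortened copy of the example listing from earlier in the thread.
html = """
<h2 class="listing-company">
  <span class="listing-company-name">
    <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
    Techunting is looking for a Python Sr Dev
  </span>
  <span class="listing-location"><a href="/jobs/location/example/">Remote</a></span>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for h2 in soup.find_all("h2", class_="listing-company"):      # step 1
    span = h2.find("span", class_="listing-company-name")     # step 2
    link = span.find("a") if span is not None else None       # step 3
    if link is not None:
        titles.append(link.get_text(strip=True))

print(titles)
```

Calling get_text() on the whole <h2> would also return the company description and the location, which is why the steps narrow down to the <a> inside "listing-company-name".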
@ssadvisor Great, thanks very much. It makes sense when you describe it like this. Is there any chance you can relate what you said in your post to the solution code below: