Web Scraping

32 Posts
4 Users
8 Reactions
591 Views
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

Hi, I am getting the following errors when trying to import beautifulsoup4. Could you please assist?

 

 
Posted : 04/05/2024 6:24 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

You don't pip install in the .py file. You do that in the terminal window.
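
To make the split concrete, here is a minimal sketch. The install command appears only as a comment because it belongs in the terminal, while the .py file contains just the import. The helper name `is_installed` is my own for illustration, not something from the lesson.

```python
# Run this ONCE in the terminal (not inside the .py file):
#     pip install beautifulsoup4
#
# The package is installed under the name "beautifulsoup4"
# but imported under the name "bs4".
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    from bs4 import BeautifulSoup  # the only bs4 line the script needs
    print("bs4 is available")
else:
    print("bs4 is missing: run 'pip install beautifulsoup4' in the terminal")
```

If the second message prints, the install step has not been run yet in the environment that executes the script.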

Regards,
Earnie Boyd, CEO
Seasoned Solutions Advisor LLC
Schedule 1-on-1 help
Join me on Slack

 
Posted : 04/05/2024 9:12 pm
Hasan Aboul Hasan
(@admin)
Posts: 1276
Member Admin
 

This command installs the package. You run it in the terminal, not in the Python file.

 
Posted : 04/06/2024 8:36 am
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

The solution for scraping the Python job board job titles says:

This script finds all the h2 elements of the class listing-company (which are the job titles on this site) and prints the text inside each one.

Here is an example H2 from the page source:

 

<li>
            <h2 class="listing-company">
                <span class="listing-company-name">
                    
                    
                    <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
		    Techunting is looking for a Python Sr Dev for a very important client in the US full time staff dev
                </span>
                <span class="listing-location"><a href="/jobs/location/remote-from-argentina-argentina-argentina-or-rest-of-latinamerica/" title="More jobs in Remote from Argentina, Argentina, Argentina or rest of Latinamerica">Remote from Argentina, Argentina, Argentina or rest of Latinamerica</a></span>
            </h2>
            
            <span class="listing-job-type">
                
                AWS, Python
                
            </span>
            
            <span class="listing-posted">Posted: <time datetime="2024-02-15T16:02:53.761441+00:00">15 February 2024</time></span>
            <span class="listing-company-category"><a href="/jobs/category/developer-engineer/" title="More jobs in Developer / Engineer">Developer / Engineer</a></span>
            
        </li>

Can someone please explain how we equated h2 to the job title? It is confusing for me, not being an HTML expert. Many thanks.

 
Posted : 04/08/2024 4:12 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi <h2> is the HTML tag, and a page can contain many <h2> tags. The class named in the tag refers to a CSS class. That CSS class doesn't need to exist in a stylesheet, but if it does, it can modify the appearance of <h2>BODY</h2>. So the script is looking for all of the <h2> tags with a class reference of "listing-company".
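
As a rough illustration, the sketch below runs Beautiful Soup on a trimmed-down copy of the listing quoted earlier in the thread, plus one extra <h2> that I made up to show the filtering. The variable names (`html`, `all_h2`, `listing_h2`) are assumptions for the example only.

```python
from bs4 import BeautifulSoup

# Trimmed-down HTML modelled on the listing quoted earlier in the thread.
# The second <h2> is invented here purely to show what gets filtered out.
html = """
<ul>
  <li>
    <h2 class="listing-company">
      <span class="listing-company-name">
        <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
        Techunting is looking for a Python Sr Dev
      </span>
    </h2>
  </li>
  <li><h2 class="something-else">Not a job listing</h2></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
all_h2 = soup.find_all("h2")
listing_h2 = soup.find_all("h2", class_="listing-company")

print(len(all_h2))      # every <h2> in the document
print(len(listing_h2))  # only the <h2> tags whose class is "listing-company"
```

The class filter is what narrows "all the headings on the page" down to "the headings that wrap a job listing".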

Does this help?


 
Posted : 04/08/2024 6:32 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor Yes, it does, but why are we looking for "listing-company" when the exercise is to scrape the job title?

 
Posted : 04/08/2024 6:39 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi that's what was chosen by the developer. Can you give me a link to your reference?


 
Posted : 04/08/2024 9:47 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor https://learnwithhasan.com/lessons/web-scraping/ look at the solution section at the bottom

 
Posted : 04/08/2024 10:08 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi The important thing you may have missed is:

Hint: Inspect the website and find out the HTML element and class that corresponds to the job titles.

So if you inspect the https://www.python.org/jobs/ webpage (Inspect is an option in the right-click menu), you'll see that the <h2> tag with the class name "listing-company" is what you want to scrape. We don't control what the webpage gives us; we only control what we extract from it.

Since we don't control the inspected webpage, code that runs in production also needs to recover from unexpected changes to the page.
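
One way to cope with such changes, sketched under my own assumptions: wrap the lookup in a function that skips any heading missing the expected link instead of crashing. The function name `extract_titles` and the sample markup are hypothetical and not part of the lesson's solution.

```python
from bs4 import BeautifulSoup


def extract_titles(html: str) -> list:
    """Collect job titles, skipping headings that lack the expected <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for heading in soup.find_all("h2", class_="listing-company"):
        link = heading.find("a")
        if link is None:
            # The markup differs from what we expected: skip, do not crash.
            continue
        titles.append(link.get_text(strip=True))
    return titles


# Made-up sample: the second heading has no link, as if the page had changed.
page = (
    '<h2 class="listing-company"><a href="/jobs/1/">Data Engineer</a></h2>'
    '<h2 class="listing-company">No link here</h2>'
)
print(extract_titles(page))
```

An empty result is then a signal worth logging, since it may mean the site changed its class names or layout.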


 
Posted : 04/09/2024 12:30 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor

 
Posted : 04/09/2024 12:42 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi 


 
Posted : 04/09/2024 1:42 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor I'm sorry, I don't get it. I see the job title below. Apologies for my ignorance, but once I understand this I will be comfortable with web scraping.

 
Posted : 04/09/2024 1:59 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

Just to further elaborate on this, I can see h2 with class=listing-company, class=listing-company-name, class=listing-new, so where do we get the job title from?

This post was modified 9 months ago by Rayaz Siddiqi
 
Posted : 04/09/2024 5:32 pm
SSAdvisor
(@ssadvisor)
Posts: 1139
Noble Member
 

@google-rayazsiddiqi It's the actual display text (e.g. "SENIOR SOFTWARE ENGINEER").

  1. Find the <h2> tags with the class name "listing-company".
  2. Inside each one, find the <span> tag with the class name "listing-company-name".
  3. Inside that span, find the <a> tag and get the display text between the opening and closing <a> tags.

This will be the Job Title.
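
The three steps above can be sketched one line per step. The HTML is a shortened copy of the listing quoted earlier in the thread, and the variable names are my own for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the listing quoted earlier in the thread.
html = """
<li>
  <h2 class="listing-company">
    <span class="listing-company-name">
      <a href="/jobs/7507/">Python Sr Dev Urgent requirement</a><br/>
      Techunting is looking for a Python Sr Dev
    </span>
    <span class="listing-location"><a href="/jobs/location/remote/">Remote</a></span>
  </h2>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for h2 in soup.find_all("h2", class_="listing-company"):   # step 1
    span = h2.find("span", class_="listing-company-name")  # step 2
    link = span.find("a")                                  # step 3
    titles.append(link.get_text(strip=True))

print(titles)
```

The shorter form `h2.a.text` reaches the same link here, because `.a` returns the first <a> tag found anywhere inside the <h2>, and in this markup that first link is the one holding the job title.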


 
Posted : 04/09/2024 6:51 pm
(@google-rayazsiddiqi)
Posts: 95
Estimable Member
Topic starter
 

@ssadvisor Great, thanks very much. It makes sense when you describe it like this. Is there any chance you can relate what you said in your post to the solution code below?

 

job_posts = soup.find_all('h2', class_='listing-company')

# Print the title of each job post
for job_post in job_posts:
    title = job_post.a.text
    print(title)
 
Posted : 04/09/2024 7:59 pm