Web Scraping TikTok with Python
This tutorial demonstrates web scraping with Python. It is divided into two parts: one for scraping TikTok video URLs and the other for scraping hashtags.
Introduction to Web Scraping
Web scraping is the process of extracting data from websites. It
involves sending HTTP requests to a website’s server, receiving the response,
and parsing the HTML content to extract the desired information.
Web scraping can be done manually by a human user, but it is usually
automated using software tools such as web crawlers or bots. These tools can
send requests and parse responses much faster than a human user, allowing large
amounts of data to be extracted in a short amount of time.
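The request, response, and parse steps described above can be sketched with nothing but the standard library. This is a minimal illustration, not part of the original scripts: SAMPLE_HTML stands in for a downloaded response body, and the LinkCollector class and extract_links function are names made up for this example.

```python
from html.parser import HTMLParser

# Stand-in for the body of an HTTP response; a real scraper would
# download this text from a server instead.
SAMPLE_HTML = """
<html><body>
  <h1>Nail art ideas</h1>
  <a href="https://example.com/video/1">First video</a>
  <a href="https://example.com/video/2">Second video</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Parse the HTML text and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

The rest of this tutorial uses BeautifulSoup for the parsing step, which needs far less code than a hand-written parser class.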
Web Scraping vs. Web Crawling
Web scraping and web crawling are related but distinct concepts. Web
crawling refers to the process of automatically navigating through a website by
following links, usually for the purpose of indexing its content. Web scraping,
on the other hand, refers to the extraction of specific data from a website.
In other words, web crawling is about discovering and navigating through
web pages, while web scraping is about extracting data from those pages.
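To make the difference concrete, here is a small offline sketch. The SITE dictionary is a made-up three-page website, and crawl and scrape_title are hypothetical helper names. The regular expressions are a shortcut that only suits this tiny example; real pages call for a proper HTML parser.

```python
import re
from collections import deque

# A tiny in-memory "website" (hypothetical pages) so the sketch runs offline.
SITE = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}


def crawl(start):
    """Crawling: discover pages by following links, breadth first."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for href in re.findall(r'href="([^"]+)"', SITE[page]):
            if href in SITE and href not in seen:
                seen.append(href)
                queue.append(href)
    return seen


def scrape_title(page):
    """Scraping: pull one specific piece of data out of a page."""
    match = re.search(r"<title>(.*?)</title>", SITE[page])
    return match.group(1) if match else None


pages = crawl("/")
titles = {page: scrape_title(page) for page in pages}
print(titles)
```

The crawler only decides which pages to visit; the scraper only decides what to take from a page. The two scripts in this tutorial are scrapers that work on a fixed list of URLs.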
Installing BeautifulSoup
BeautifulSoup is a Python library that makes it easy to parse HTML and
XML documents. It provides methods to search, navigate, and modify the parse
tree.
To install BeautifulSoup, you
can use the pip command:
pip install beautifulsoup4
This command installs the latest version of BeautifulSoup from the
Python Package Index (PyPI). You can find more information on how to install
BeautifulSoup in the official documentation.
After installing BeautifulSoup, you can import it in your Python code
using the following statement:
from bs4 import BeautifulSoup
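A quick way to check the installation is to parse a small HTML string. The markup below is invented for the example; find, find_all, get, and text are the BeautifulSoup calls that the scripts in this tutorial rely on.

```python
from bs4 import BeautifulSoup

# Made-up markup, just to exercise the parser.
html = """
<div class="card">
  <a href="https://example.com/one">One</a>
  <strong>#nailart</strong>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag; find_all returns every match.
first_link = soup.find("a")
link_target = first_link.get("href")   # value of the href attribute
link_label = first_link.text           # text between <a> and </a>
strong_texts = [tag.text for tag in soup.find_all("strong")]

print(link_target, link_label, strong_texts)
```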
Part 1: Scraping for TikTok Video Links
In the first part, the code
imports necessary libraries such as urllib.request, requests, re, time, and BeautifulSoup from the bs4 library. Then, it opens a
file named tiktoknaillinks.txt in write mode with utf-8
encoding.
The code then reads a file named tiktokmainlinks.txt line by line. This is the file that the Part 2 script writes, so Part 2 has to be run first. For each line, it treats the line as a URL and uses urllib.request.urlopen(url).read() to get the HTML content of the page. The HTML content is then parsed using BeautifulSoup with the html.parser.
The code then uses the find_all method of the soup object to find all strong tags with a specific class name (tiktok-23vhki-StrongText ejg0rhn2). To find this class name, right-click on the video in the browser, choose Inspect, and look for the strong tag that has the link of the video as its text. The class name looks auto-generated, so check it against the live page before running the script. If this step is unfamiliar, a YouTube walkthrough on web scraping with the browser's developer tools is a good place to start.
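When class_ is given a string that contains a space, BeautifulSoup compares it with the exact value of the class attribute, so the match breaks as soon as either part of the name changes. The sketch below contrasts that with a looser match on the readable part of the name. The HTML, the second class value, and the idea of matching on "StrongText" alone are assumptions made for illustration, not something the original script does.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: two tags share the "StrongText" part of the class name.
html = """
<strong class="tiktok-23vhki-StrongText ejg0rhn2">first</strong>
<strong class="tiktok-9zzz99-StrongText abc1def2">second</strong>
<strong class="unrelated">third</strong>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: the whole class attribute must equal this string.
exact = soup.find_all("strong", class_="tiktok-23vhki-StrongText ejg0rhn2")

# Looser match: any single class that contains "StrongText".
loose = soup.find_all("strong", class_=re.compile("StrongText"))

exact_texts = [tag.text for tag in exact]
loose_texts = [tag.text for tag in loose]
print(exact_texts, loose_texts)
```

The exact form is what the script uses. The looser form may survive a change in the generated prefix, at the cost of possibly matching tags you did not intend.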
For each link found, the code
writes its text to the file tiktoknaillinks.txt. After all links are processed,
the file is closed and a message “done” is printed.
Here is a detailed explanation of each line of code in this part:
import urllib.request
import requests
import re
import time
from bs4 import BeautifulSoup
These lines import the libraries the script needs: urllib.request to download pages and BeautifulSoup (from the bs4 library) to parse them. requests, re, and time are imported as well, but this part of the code never uses them.
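The script imports time without calling it. One plausible use, and this is only a guess at the intent, is to pause between requests so the site is not hit in a tight loop. In the sketch below, fetch_all and its parameters are hypothetical names; the fetch and sleep functions are passed in so the example runs offline and without waiting.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause before the very first request
        results.append(fetch(url))
    return results


# Offline demonstration: a stand-in fetch function, and a "sleep" that
# only records how long it was asked to wait.
pauses = []
pages = fetch_all(
    ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    fetch=lambda url: "<html>" + url + "</html>",
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In a real run you would leave sleep at its default and pass a function that actually downloads the page.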
f = open('tiktoknaillinks.txt','w',encoding ="utf-8")
This line opens a file
named tiktoknaillinks.txt in write mode with utf-8
encoding. The file will be used to store links to TikTok videos.
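A common alternative to calling open and close by hand is a with block, which closes the file even if an error occurs partway through. This is a sketch of that idiom rather than what the script does; save_lines is a hypothetical helper, and the file is written to a temporary folder so the example leaves nothing behind in your working directory.

```python
import os
import tempfile


def save_lines(path, lines):
    """Write each item on its own line; the file is closed automatically."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


# Write to a throwaway location, then read the file back to check it.
demo_path = os.path.join(tempfile.mkdtemp(), "tiktoknaillinks.txt")
save_lines(demo_path, ["first", "second"])

with open(demo_path, encoding="utf-8") as check:
    saved = check.read()
print(saved)
```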
with open('tiktokmainlinks.txt', 'r') as fa:
    for line in fa:
        url = line.strip()  # drop the trailing newline read from the file
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('strong', class_='tiktok-23vhki-StrongText ejg0rhn2')
These lines read a file
named tiktokmainlinks.txt line by line. For each
line, it treats the line as a URL and uses urllib.request.urlopen(url).read() to get the HTML content of the page. The HTML
content is then parsed using BeautifulSoup with the html.parser.
The code then uses the find_all method of the soup object to find all strong tags with a specific class name (tiktok-23vhki-StrongText ejg0rhn2). To find this class name, right-click on the video in the browser, choose Inspect, and look for the strong tag that has the link of the video as its text. If this step is unfamiliar, a YouTube walkthrough on web scraping with the browser's developer tools is a good place to start.
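The fetch step can be made a little more forgiving. The sketch below strips the newline that comes with each line of the input file, sends a User-Agent header (some servers turn away urllib's default one), sets a timeout, and returns None when urllib reports an error. The function names and the User-Agent value are placeholders chosen for this example, and nothing here guarantees that TikTok will return the markup the script expects.

```python
import urllib.error
import urllib.request


def clean_url(line):
    """Remove the trailing newline and stray spaces from a line of the file."""
    return line.strip()


def build_request(url):
    # The User-Agent value is an illustrative placeholder.
    return urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (tutorial sketch)"}
    )


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if urllib reports an error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
        print("could not fetch", url, "-", error)
        return None


# Building the request does not touch the network.
request = build_request(clean_url("https://www.tiktok.com/tag/nailart\n"))
print(request.full_url)
```

Inside the loop you would call fetch_html(clean_url(line)) and skip the line when the result is None.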
        for i in links:
            f.write(i.text)
            f.write('\n')
These lines sit inside the for line in fa loop, so they run once for every page. They iterate over each link found and write its text to the file tiktoknaillinks.txt. A newline character is also written after each link.
f.close()
print('done')
These lines close the file and print a message “done” to indicate that
the process is complete.
Part 2: Scraping for TikTok Hashtags
In the second part, the code
again imports necessary libraries and opens a file named tiktokmainlinks.txt in write mode with utf-8 encoding. It then defines a URL for a
TikTok hashtag page and uses urllib.request.urlopen(url).read() to get its HTML content.
The HTML content is again parsed using BeautifulSoup with the html.parser. The code then uses the find_all method of the soup object to find all div tags with a specific class name (tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0). To find this class name, right-click on the video in the browser, choose Inspect, and look for the div tag that has the hashtag of the video as text. If this step is unfamiliar, a YouTube walkthrough on web scraping with the browser's developer tools is a good place to start.
For each div found, the code finds all a tags whose href attribute matches the regular expression "https:", which is meant to pick out absolute URLs. For each matching tag, it gets the href attribute value and writes it to the file tiktokmainlinks.txt. After all links are processed, the file is closed and a message “done” is printed.
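One detail worth knowing: BeautifulSoup tests a compiled regular expression with its search method, so the pattern "https:" matches an href that contains that text anywhere, not only at the start. The sketch below shows the difference an anchor makes. The markup and the class name "feed" are invented for the example, and the anchored pattern is a suggestion, not what the original script uses.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with one absolute link, one link that merely contains
# "https:" further along, and one relative link.
html = """
<div class="feed">
  <a href="https://example.com/video/1">absolute</a>
  <a href="/redirect?to=https://example.com">contains https: later</a>
  <a href="/tag/nails">relative</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Pattern as written in the script: matches anywhere in the href.
unanchored = [a["href"] for a in soup.find_all("a", attrs={"href": re.compile("https:")})]

# Anchored pattern: the href has to start with "https:".
anchored = [a["href"] for a in soup.find_all("a", attrs={"href": re.compile(r"^https:")})]

print(unanchored)
print(anchored)
```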
Here is a detailed explanation of each line of code in this part:
import urllib.request
import requests
import re
import time
from bs4 import BeautifulSoup
These lines again import the libraries for web scraping. This part uses urllib.request, re, and BeautifulSoup from the bs4 library; requests and time are imported but never used.
f = open('tiktokmainlinks.txt','w',encoding ="utf-8")
This line opens a file named tiktokmainlinks.txt in write mode with utf-8 encoding. The file will be used to store the links collected from the TikTok hashtag page.
url = "https://www.tiktok.com/tag/nailart"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
link = soup.find_all('div', {'class': 'tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0'})
These lines define a URL for a
TikTok hashtag page and use urllib.request.urlopen(url).read() to get its HTML content.
The HTML content is then parsed using BeautifulSoup with the html.parser.
The code then uses the find_all method of the soup object to find all div tags with a specific class name (tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0). To find this class name, right-click on the video in the browser, choose Inspect, and look for the div tag that has the hashtag of the video as text. If this step is unfamiliar, a YouTube walkthrough on web scraping with the browser's developer tools is a good place to start.
for i in link:
    iref = i.find_all('a', attrs={'href': re.compile("https:")})
    for t in iref:
        f.write(t.get('href'))
        f.write('\n')
These lines iterate over each div found and find all a tags whose href attribute matches the regular expression "https:". For each matching tag, the code gets its href attribute value and writes it to the file tiktokmainlinks.txt. A newline character is also written after each link.
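The nested loop can be tried offline before pointing it at a live page. In this sketch the HTML is made up, collect_links is a hypothetical helper, and an io.StringIO object stands in for the tiktokmainlinks.txt file. Dropping repeated links is an extra step that the original script does not perform; it is included on the assumption that a page may link to the same video more than once.

```python
import io
import re
from bs4 import BeautifulSoup

# Made-up markup that reuses the class name from the script.
html = """
<div class="tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0">
  <a href="https://example.com/video/1">one</a>
  <a href="https://example.com/video/2">two</a>
  <a href="https://example.com/video/1">one again</a>
  <a href="/about">relative link</a>
</div>
"""


def collect_links(page_html, container_class):
    """Return the https links inside the matching divs, without repeats."""
    soup = BeautifulSoup(page_html, "html.parser")
    found = []
    for container in soup.find_all("div", {"class": container_class}):
        for anchor in container.find_all("a", attrs={"href": re.compile("https:")}):
            found.append(anchor.get("href"))
    # dict.fromkeys keeps the first occurrence of each link, in order.
    return list(dict.fromkeys(found))


out = io.StringIO()  # stands in for the tiktokmainlinks.txt file object
for href in collect_links(html, "tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0"):
    out.write(href + "\n")
print(out.getvalue())
```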
f.close()
print('done')
These lines again close the file and print a message “done” to indicate
that the process is complete.
In summary, this code demonstrates how to use web scraping techniques in Python to extract information from web pages. In this case, it extracts links to TikTok videos and their associated hashtags.
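To see how the two parts fit together, here is an offline sketch of the whole flow: collect the links from a tag page, then visit each link and read the strong tags on it. Everything in PAGES is invented, including the class names "feed" and "label", and the function names are hypothetical. The fetch function is passed in, so a dictionary lookup can stand in for the network.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical offline pages keyed by URL; a real run would download them.
PAGES = {
    "https://example.com/tag/nailart": """
        <div class="feed">
          <a href="https://example.com/video/1">v1</a>
          <a href="https://example.com/video/2">v2</a>
        </div>""",
    "https://example.com/video/1": '<strong class="label">#nailart</strong>',
    "https://example.com/video/2": (
        '<strong class="label">#gelnails</strong>'
        '<strong class="label">#diy</strong>'
    ),
}


def video_links(tag_page_html):
    """First step: collect the absolute links inside the feed container."""
    soup = BeautifulSoup(tag_page_html, "html.parser")
    links = []
    for feed in soup.find_all("div", class_="feed"):
        for anchor in feed.find_all("a", attrs={"href": re.compile(r"^https:")}):
            links.append(anchor.get("href"))
    return links


def strong_text(video_page_html):
    """Second step: read the text of the labelled strong tags on one page."""
    soup = BeautifulSoup(video_page_html, "html.parser")
    return [tag.text for tag in soup.find_all("strong", class_="label")]


def run(tag_url, fetch):
    """Chain the two steps; fetch(url) must return the HTML for a URL."""
    results = {}
    for link in video_links(fetch(tag_url)):
        results[link] = strong_text(fetch(link))
    return results


results = run("https://example.com/tag/nailart", fetch=PAGES.__getitem__)
print(results)
```

The scripts in this tutorial do the same thing in two separate runs, with tiktokmainlinks.txt carrying the links from one script to the other.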
Tiktokscrapmain:
import urllib.request
import requests
import re
import time
from bs4 import BeautifulSoup

f = open('tiktoknaillinks.txt', 'w', encoding="utf-8")
with open('tiktokmainlinks.txt', 'r') as fa:
    for line in fa:
        url = line.strip()  # drop the trailing newline read from the file
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Get the class by right-clicking on the video and looking for the
        # strong tag that has the link of the video as text. If you are
        # confused, you might have to watch YouTube videos on web scraping.
        links = soup.find_all('strong', class_='tiktok-23vhki-StrongText ejg0rhn2')
        for i in links:
            f.write(i.text)
            f.write('\n')
f.close()
print('done')
Tiktok scrape video links:
import urllib.request
import requests
import re
import time
from bs4 import BeautifulSoup

f = open('tiktokmainlinks.txt', 'w', encoding="utf-8")
url = "https://www.tiktok.com/tag/nailart"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Get the class by right-clicking on the video and looking for the div tag
# that has the hashtag of the video as text. If you are confused, you might
# have to watch YouTube videos on web scraping.
link = soup.find_all('div', {'class': 'tiktok-yvmafn-DivVideoFeedV2 ecyq5ls0'})
for i in link:
    iref = i.find_all('a', attrs={'href': re.compile("https:")})
    for t in iref:
        f.write(t.get('href'))
        f.write('\n')
f.close()
print('done')
I wrote two scrapers: one to scrape for the links of the videos and the other to scrape for the hashtags.