also a prelude in Python
Objective - read in a webpage, find elements (score/song name, explanatory link) and create a dictionary.
The impressive source: http://silkqin.com/zh02qnpu.htm
1. Extracting and renaming files
2. Correctly obtaining encoded characters - since the webpage contains Chinese characters, we need to ensure they are captured properly
Detour INSIDE BASH (not Python)
pip install chardet
chardetect *.html # run after navigating to the directory where the html file was saved
confirmed to be utf-8
https://pypi.org/project/chardet/
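The same detection can also be run inside Python; a minimal sketch, assuming the chardet package installed above:
import chardet
import requests
rawbytes = requests.get('http://silkqin.com/zh06hear.htm').content # bytes, not str - chardet needs raw bytes
print(chardet.detect(rawbytes)) # e.g. {'encoding': 'utf-8', 'confidence': ..., 'language': ...}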
Refer to – https://stackoverflow.com/questions/31027759/how-to-scrape-traditional-chinese-text-with-beautifulsoup
import requests
url = 'http://silkqin.com/zh06hear.htm'
response = requests.get(url)
page_content = response.content # returns bytes, not str <- this extra step allows detection of special (e.g. Chinese) characters
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_content, 'lxml')
#check how it looks
soup.contents
Results in:
<title>聽絲弦古琴</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="聽絲弦古琴" name="description"/>
<meta content="琴、古琴、聽琴、聽古琴、聽絲弦琴、聽絲弦古琴、絲弦、絲絃、絲線、丝弦、絲絃琴、絲弦琴、絲線琴、絲弦古琴、絲絃古琴、絲線古琴、絲桐、唐世璋、John Thompson" name="keywords"/>
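To confirm the Chinese characters survived the round trip, we can also query the parsed tree directly; a quick spot check:
soup.title.string # '聽絲弦古琴'
soup.find('meta', attrs={'name': 'description'})['content'] # '聽絲弦古琴'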
From looking into the text, there is only a loose pattern in the layout: usually an annotation link followed by a recording link.
For example, in 南風歌 (聽), the annotation page and recording are adjacent to each other and the naming is consistent.
http://silkqin.com/02qnpu/10tgyy/tg01nfg.htm http://silkqin.com/06hear/myrec/1511/tg01nfg.mp3
But sometimes, the naming is not consistent, for example in 墨子悲歌 (聽)
http://silkqin.com/02qnpu/32zczz/mozibei.htm http://silkqin.com/06hear/myrec/1589-1609/1609mozibeige.mp3
And certain annotations are shorter and exist as excerpts within a page collection; there is no consistency in the file names either, e.g. 太簇意 (聽)
http://silkqin.com/02qnpu/07sqmp/sq01dsc.htm#taicouyifn http://silkqin.com/06hear/myrec/1525/xl101tcy102dhy.mp3
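The mismatch is easier to see if we compare just the file-name stems; a small sketch (the stem helper is only for illustration):
from urllib.parse import urlparse
import posixpath
def stem(url): # file name without directory, extension, or '#bookmark'
    return posixpath.splitext(posixpath.basename(urlparse(url).path))[0]
stem('http://silkqin.com/02qnpu/10tgyy/tg01nfg.htm') # 'tg01nfg'
stem('http://silkqin.com/06hear/myrec/1511/tg01nfg.mp3') # 'tg01nfg' -> consistent
stem('http://silkqin.com/06hear/myrec/1589-1609/1609mozibeige.mp3') # '1609mozibeige' -> not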
There is one consistent pattern though - all annotation pages seem to live under the "02qnpu" directory.
import re
from urllib.request import urlretrieve
from urllib.request import urlopen
html = urlopen('http://silkqin.com') # not actually used below - soup from above already holds the parsed page
baseurl='http://silkqin.com/' # the hrefs we collect are relative to this
#The 'a' tag in the html does not have any text directly, but it contains an 'h3' tag that has the text.
all_links = [link.get("href") for link in soup("a")]
all_links
#get rid of None entries, otherwise the substring filters below raise a NoneType error
#on None - https://stackoverflow.com/questions/3887381/typeerror-nonetype-object-is-not-iterable-in-python
clean = [x for x in all_links if x is not None]
# now filter for that directory
links_htm = [k for k in clean if 'htm' in k and '02qnpu' in k]
#earlier variants also excluded bookmarks: and 'htm#' not in k and '\#' not in k and '\~' not in k
There were 164 scores listed on the page, not the 234 that checking the length of this list returns.
This is likely because lyrics pages are separate links, but they cannot be filtered out since they live in the same '02qnpu' subdirectory,
e.g. 清商調 (聽)(看中文歌詞) http://silkqin.com/02qnpu/32zczz/daoyi.htm#qsdfn http://silkqin.com/06hear/myrec/1589-1609/1609qsdge.mp3 http://silkqin.com/02qnpu/32zczz/daoyi.htm#qsdlyr
Let’s grab them all for now, knowing some are just subsections of pages (e.g. #qsdfn above) and some are lyrics
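Since several links differ only in their '#bookmark', we can estimate how many unique pages the 234 links really point to; a quick sketch:
unique_pages = {k.split('#')[0] for k in links_htm}
len(links_htm), len(unique_pages) # 234 links vs. the smaller count of distinct htm files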
#note - don't reuse the loop variable alink as the counter: in the for loop it is a list element (a string), not an integer,
#so trying to use it as an index yields TypeError: list indices must be integers or slices, not str
counter = 0
for alink in links_htm:
    urlretrieve((baseurl + links_htm[counter]), (links_htm[counter].rsplit('/', 1)[-1]))
    #the rsplit takes all characters after the last slash, to use as the local file name
    #regex way -- re.sub(r'^.+/([^/]+)$', r'\1', 'dsf/we/sdfl.htm')
    #more https://stackoverflow.com/questions/7253803/how-to-get-everything-after-last-slash-in-a-url
    counter += 1
Result is an error message: NameError: name 'links_htm' is not defined.
What happened? Checking the directory, there are 210 of these annotation (and lyrics) html files downloaded.
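A more defensive version of the download loop would skip already-downloaded files and keep going past bad links; a sketch (the error handling is an assumption, not what was originally run):
import os
for alink in links_htm:
    fname = alink.split('#')[0].rsplit('/', 1)[-1] # drop the bookmark, keep the file name
    if os.path.exists(fname):
        continue # links differing only in '#' map to the same file
    try:
        urlretrieve(baseurl + alink.split('#')[0], fname)
    except Exception as err:
        print(alink, err) # note the failure and move on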
Let’s collect the downloaded ones (anything with htm) using glob below, and compare against the annotation links list.
Since the glob collection has no subdirectory prefixes, let's strip those from the links list as well (keeping the # page bookmarks for now).
import glob
downloadedhtmfiles = []
for file in glob.glob("*.htm"):
    downloadedhtmfiles.append(file)
links_htm_temp=list(range(0,len(links_htm))) # pre-fill the list; since we assign by index rather than append below, an empty list would raise an error
counter = 0
for alink in links_htm:
    links_htm_temp[counter] = re.sub(r'^.+/([^/]+)$', r'\1', links_htm[counter])
    #the substitution keeps all characters after the last slash
    #more https://stackoverflow.com/questions/7253803/how-to-get-everything-after-last-slash-in-a-url
    counter += 1
#links_htm_temp[0]
def Diff(li1, li2):
    return list(set(li1) - set(li2))
print(Diff(links_htm_temp, downloadedhtmfiles))
Output:
['tg06gjq.htm#lyrchi', 'xl127ysc.htm#jzymusic', 'lh00toc.htm#p5', 'tingqinyin.htm#melody', 'daoyi.htm#qsdfn', 'yqwd.htm', 'xl028yyg.htm#chilyr', 'tg32cjq.htm#1525cjwt', 'xl132src.htm#linzhong', 'jiukuang.htm#chilyr', 'tg36kcyh.htm#music', '03slgj.htm#kzhyy', 'tg01nfg.htm#lyrics', 'hw02qpy.htm', 'xl096yts.htm#lyrics', 'xl127ysc.htm#ysymusic', 'xl054cwy.htm#mjyfn', '27wjctrans.htm#record', '1709qfq.htm#1840muslyr', 'tg28frsg.htm#chilyr', 'daoyi.htm#lyrchi', 'tg10ysc.htm#chilyr', 'xl054cwy.htm#jy', 'qx14wywq.htm', 'tg32cjq.htm#clyrics', 'xl000toc.htm#p16', 'xl021fl.htm#feidianyinfn', 'fm23ygsd.htm#chilyr', 'xl098byd.htm#chilyrfn', 'cx38xsq.htm#lyrics', 'tg24hzd.htm#chilyr', 'zy13ygsd.htm#v1', 'fx33gg.htm#lyricsfn', 'fx42zwy.htm#chilyr', 'sq01dsc.htm#dinghuiyinfn', 'xl046yz.htm#1530', '1709qfq.htm#1709muslyr', 'tg02sqc.htm#lyrics', 'daoyi.htm#qsdlyr', 'tg03xfy.htm#chilyr', 'sj03qjj.htm#chilyr', 'hw15fhts.htm', 'fx40dmyt.htm#chilyr', 'tg09wwq.htm#muslyr', 'tg32cjq.htm#1539cj', 'ty28skj.htm', 'lq12mss.htm', 'fm03qjwd.htm#chilyr', 'fx27wjc.htm#chilyr', 'xl041jyb.htm#qingyeyin', 'qx09lhxx.htm', 'ty28skj.htm#skjmp3', 'ylcx.htm#cgyfn', 'xl041jyb.htm#chilyr', 'yltrans.htm', 'fx32dyq.htm#chilyr', 'sq01dsc.htm#taicouyifn', 'tg16ysc.htm#chilyr', 'ylcx.htm#ylcxmusic', 'xl159qxb.htm#byyfn', 'sq18ghy.htm#daguanyinfn', 'fx45gjx.htm#xllyrfn', 'jiukuang.htm#lyrics', '03slgj.htm#gd', 'fx31lsm.htm#chilyr', 'tg08ksc.htm#lyrics', 'ty6qcby.htm#gy', 'tg07wwc.htm#music', 'xl046yz.htm#chilyr', 'xl007gky.htm#chongheyinfn', 'xl159qxb.htm#qyyfn', 'xl155fqh.htm#chilyr', 'tg25gqlc.htm#chilyr', 'tg35gqf.htm#chilyr', 'sz03olwj.htm']
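Note that Diff() is one-directional: it only reports links with no matching file. To also see files with no matching link, the symmetric difference does both at once:
print(list(set(links_htm_temp) ^ set(downloadedhtmfiles))) # elements in exactly one of the two lists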
Plan: take all the htm links, strip out the directory prefixes and the '#...' bookmarks, and match on the blurb.htm"> pattern. First let's reset the links list in case of any accidental changes above.
#repeat of above code (in case run from this segment)
# Find links
all_links = [link.get("href") for link in soup("a")]
all_links
clean = [x for x in all_links if x is not None]
#links_htm = [k for k in clean if 'htm' in k and '02qnpu' in k and 'htm#' not in k]#and '\#' not in k and '\~' not in k]
#links_htm = [k for k in links_htm if '02qnpu' in k]
links_htm = [k for k in clean if 'htm' in k and '02qnpu' in k]
links_htm_clean=links_htm[:] # take a copy: plain assignment only aliases the same list, which is why reassigning directly in re.sub seemed to overwrite the original as well
links_htm_clean[1] = re.sub(r'.*\/', r'', links_htm_clean[1]) #pat1.*pat2 any number of characters between pat1 and pat2
links_htm_clean[1] = re.sub(r'\#.*', r'', links_htm_clean[1])
print(links_htm_clean[1])
print(links_htm[3])
len(links_htm_clean)
Results:
yltrans.htm
02qnpu/03slgj.htm
234
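The "seems to overwrite the original" effect is Python list aliasing: assignment copies the reference, not the list. A minimal demonstration:
a = ['02qnpu/x.htm']
b = a # alias: both names point at the same list object
b[0] = 'x.htm'
a[0] # 'x.htm' - the "original" changed too
c = a[:] # a shallow copy keeps them independent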
#throwing the tested element into a loop
# links_htm = [k for k in clean if 'htm' in k and '02qnpu' in k]
# gets rid of slashes and anything preceding slash
links_htm_clean=links_htm[:] # copy, not alias (see above)
counter=0
for elem in links_htm_clean:
    links_htm_clean[counter] = re.sub(r'.*\/', r'', links_htm_clean[counter]) #pat1.*pat2 = any number of characters between pat1 and pat2
    # links_htm_clean[counter] = re.sub(r'\#.*', r'', links_htm_clean[counter]) # works, but don't remove the # because sometimes it marks a different song on the same page
    # print(links_htm_clean[counter])
    counter += 1
links_htm_clean[1] = re.sub(r'\#.*', r'', links_htm_clean[1])
print(links_htm_clean[1])
print(links_htm[3])
len(links_htm_clean)
yltrans.htm
03slgj.htm
234
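For reference, the counter loop above collapses to a single list comprehension (same output, no index bookkeeping, and since it builds a new list it sidesteps the aliasing issue entirely):
links_htm_clean = [re.sub(r'.*\/', r'', k) for k in links_htm]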
The annotation, recording, and song name patterns are generally:
<a href="http://silkqin.com/02qnpu/16xltq/xl154lqy.htm">臨邛吟</a>(<a href="http://silkqin.com/06hear/myrec/1525/xl154lqy.mp3">聽</a>)</li>
which means the song name we are looking for is the text between xl154lqy.htm"> and </a>(<a href="http://silkqin.com/06hear/myrec/1525/xl154lqy.mp3".
Let's play with splitting this string, called astring.
# Variation 1 - cluster annotation / song name+record link
astring='<a href="http://silkqin.com/02qnpu/03slgj.htm#kzhyy">開指黃鶯吟</a>(<a href="http://silkqin.com/06hear/myrec/01tangsong/00kzhyy.mp3">聽</a>'
#m=re.split(r'(\.htm\S+?>)',astring)
m=re.split(r'\.htm\S+?>',astring) #cuts at the end of the first '>', giving a list of two: the a-href annotation link, then song name & recording link
n=re.sub(r'(\.mp3)\S+','',astring) #cuts everything from '.mp3' onward to get rid of the trailing 聽</a>
#\S = a non-whitespace character
#+ = multiple \S, with ? added for as few as possible (non-greedy)
#() keeps the separator within the result
print(m)
print(n)
['<a href="http://silkqin.com/02qnpu/03slgj', '開指黃鶯吟</a>(<a href="http://silkqin.com/06hear/myrec/01tangsong/00kzhyy.mp3">聽</a>']
'<a href="http://silkqin.com/02qnpu/03slgj.htm#kzhyy">開指黃鶯吟</a>(<a href="http://silkqin.com/06hear/myrec/01tangsong/00kzhyy'
# Variation 2 - cluster annotation+song name / record link
astring='<a href="http://silkqin.com/02qnpu/03slgj.htm#kzhyy">開指黃鶯吟</a>(<a href="http://silkqin.com/06hear/myrec/01tangsong/00kzhyy.mp3">聽</a>'
#m=re.split(r'(\.htm\S+?>)',astring)
m=re.split(r'</a>\(<a href="',astring) #the literal ( must be escaped, otherwise re treats it as an unbalanced group
n=re.sub(r'(\.mp3)\S+','',astring)
print(m) #a list of two
print(n)
['<a href="http://silkqin.com/02qnpu/03slgj.htm#kzhyy">開指黃鶯吟', 'http://silkqin.com/06hear/myrec/01tangsong/00kzhyy.mp3">聽</a>']
'<a href="http://silkqin.com/02qnpu/03slgj.htm#kzhyy">開指黃鶯吟</a>(<a href="http://silkqin.com/06hear/myrec/01tangsong/00kzhyy'
Continuing with variation 2, let’s regex out the typical patterns surrounding the song name.
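The intermediate steps that produce p and r were not captured above; here is a plausible reconstruction (hypothetical code, inferred from the outputs printed below):
left, right = re.split(r'</a>\(<a href="', astring) # the variation 2 split
p = [re.sub(r'.*\/', r'', left.split('">')[0]), left.split('">')[1]] # [annotation file#bookmark, song name]
r = re.search(r'\S+\.mp3', right) # match object holding the recording link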
p[1] # song name
'開指黃鶯吟'
p[0] # annotations
'03slgj.htm#kzhyy'
Recall how to set up a dictionary. Then fit in the song name and file name extracted above (and hope the pattern holds).
RecCatalogue={}
RecCatalogue={p[1]:r.group(0)}
RecCatalogue
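One caveat for when this moves into a loop: RecCatalogue={p[1]:r.group(0)} rebuilds the dictionary from scratch each time, discarding earlier entries. Assigning by key accumulates instead:
RecCatalogue = {}
RecCatalogue[p[1]] = r.group(0) # adds this song without erasing previous entries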
#print the soup as string text for use
Ssoup=str(soup)
with open('Ssoup.txt', 'w', encoding='utf-8-sig') as f:
    print(Ssoup, file=f) # the with-block ensures the file handle is closed
#for song titles, extract from the mass of text in the Ssoup string
#sub-example "<br/><a href="06hear/myrec/1491/zy08ygd.mp3"><b>聽漁歌調</b></a>"
#links_rec is assumed to have been collected earlier as the .mp3 recording links, e.g. links_rec = [k for k in clean if '.mp3' in k]
sep=[i for i in links_rec if i in Ssoup] #keep only recording links that actually appear in the page text
TitleName=[None]*len(sep) #pre-fill so we can assign by index below
counter=0
for elem in sep:
    lsep=len(sep[counter]) #length of the recording file name
    idx = Ssoup.find(sep[counter]) #note where the recording file name is in the Ssoup string
    _idx=idx-lsep #set the starting index back by the length of the recording file name
    TitleName[counter]=(Ssoup[_idx:idx-12])
    TitleName[counter]=TitleName[counter].split(sep=">",maxsplit=1)[1] # cut everything before <b>, which precedes the title
    TitleName[counter]=TitleName[counter].split(sep="</",maxsplit=1)[0] # cut everything behind </b>, which follows the title, keeping the first element (the title)
    counter+=1
Check if we obtained the name:
TitleName[19]
'廣寒秋'
(Other spot checks along the way returned 5347, 'tingqinyin.htm', and 'ylcx.htm#cgyfn'.)
Readying the loop by setting up the htm list "explan", and cutting out the first element, which is blank for some reason.
#example "<br/><a href=\"http://silkqin.com/02qnpu/32zczz/tingqinyin.htm\">聽琴吟</a>"
teststr="blob" #placeholder, overwritten in the loop
sep = [i for i in TitleName if i in Ssoup] #keep only titles that actually appear in the page text
explan=[None]*len(sep) #pre-fill so we can assign by index below
counter=0
for elem in sep:
    idx = Ssoup.find(elem) #find where the song title is
    idx_ = Ssoup[idx-50:idx].rfind('<a href=\"') #then look back up to 50 characters for the a-href link with the highest index (closest to the song title)
    idx_=idx-50+idx_
    teststr=Ssoup[idx_+9:idx-2] #cut out the a-href framing brackets
    explan[counter]=re.sub(r'.*\/', r'', teststr) # get rid of everything before the last slash in the htm link
    counter+=1
explan
HtmCatalogue=dict(zip(TitleName, explan))
HtmCatalogue
Successful output: