
    孤寒者's blog: Web Scraping in Practice: Scraping the Maoyan Movie TOP100 Ranking (Using Regular Expr…

    Time: 2021-07-25 18:38

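    1. Goal: extract the title, release date, score, poster image, and related information for each movie in the Maoyan Movie TOP100. The target URL is https://maoyan.com/board/4?offset=0, and the extracted results are saved to a file.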
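As a minimal sketch of how the `offset` parameter maps to pages, assuming (as the script below does) ten pages of ten movies each: the helper name `build_page_urls` and the constant `BASE_URL` are illustrative and not part of the original script.

```python
from urllib.parse import urlencode

# Base address taken from the target URL above.
BASE_URL = "https://maoyan.com/board/4"


def build_page_urls(pages=10, page_size=10):
    """Return one URL per ranking page (hypothetical helper, for illustration)."""
    return [
        f"{BASE_URL}?{urlencode({'offset': page * page_size})}"
        for page in range(pages)
    ]


urls = build_page_urls()
print(urls[0])   # first page, offset=0
print(urls[-1])  # last page, offset=90
```

Each URL differs only in `offset`, which advances by the assumed page size of 10.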

    2. Code:

    import requests
    from requests.exceptions import RequestException
    from fake_useragent import UserAgent
    import re
    import json
    import time
    
    def get_one_page(url):
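        """
        Send the request and return the response body.
        :param url: URL of the page to fetch
        :return: the page HTML as text, or None if the request fails
        """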
        try:
            headers = {
                'User-Agent':UserAgent().random
            }
            # A timeout keeps the script from hanging forever on a stalled connection.
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None
    
    def parse_one_page(html):
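        """
        Extract the movie information from the response with a regular
        expression and yield it as structured data.
        :param html: HTML text of one ranking page
        :return: a generator of dicts, one per movie
        """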
        pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',re.S)
        items = re.findall(pattern, html)
        for item in items:
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2].strip(),
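                # Drop the 3-character "主演:" label in front of the cast list.
                'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
                # Drop the 5-character "上映时间:" label in front of the release date.
                'time' : item[4].strip()[5:] if len(item[4]) > 5 else '',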
                'score': item[5].strip() + item[6].strip()
            }
    
    def write_to_file(content):
        """
        Store the data: serialize the dict with json.dumps() and append it
        to a text file.
        :param content: dict describing one movie
        :return: None
        """
        with open('result.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + ',\n')
    
    def main(offset):
        """
        Scrape one of the ten TOP100 pages by setting the offset parameter
        in the URL.
        :param offset: offset of the first movie on the page (0, 10, ..., 90)
        :return: None
        """
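        url = "http://maoyan.com/board/4?offset=" + str(offset)
        html = get_one_page(url)
        if html is None:
            # get_one_page() returns None on a failed request; skip this page
            # instead of passing None to re.findall(), which raises TypeError.
            print('Failed to fetch', url)
            return
        for item in parse_one_page(html):
            print(item)
            write_to_file(item)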
    
    if __name__ == '__main__':
        for i in range(10):
            main(offset=i * 10)
            time.sleep(1)
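Because `write_to_file()` appends each record as a JSON object followed by a comma and a newline, `result.txt` as a whole is not a valid JSON document, but each line can be parsed on its own. The following is a minimal sketch of reading the records back; the function name `parse_result_lines` is hypothetical, and the sample lines are abbreviated records made up for illustration.

```python
import json


def parse_result_lines(lines):
    """Parse lines in the format written by write_to_file().

    Each line holds one JSON object followed by a trailing comma.
    """
    movies = []
    for line in lines:
        text = line.strip().rstrip(',')  # drop the newline and trailing comma
        if text:                         # skip blank lines
            movies.append(json.loads(text))
    return movies


# Abbreviated sample records (illustrative only).
sample_lines = [
    '{"index": "1", "title": "我不是药神", "score": "9.6"},\n',
    '{"index": "2", "title": "肖申克的救赎", "score": "9.5"},\n',
]
movies = parse_result_lines(sample_lines)
print(len(movies), movies[0]['title'])
```

To read the real file, pass the open file object instead of the sample list, for example `with open('result.txt', encoding='utf-8') as f: movies = parse_result_lines(f)`.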
    
    

    3. The regular expression that matches the movie information is written against the page source. One entry looks like this:

                    <dd>
                            <i class="board-index board-index-1">1</i>
        <a href="/films/1200486" title="我不是药神" class="image-link" data-act="boarditem-click" data-val="{movieId:1200486}">
          <img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
          <img data-src="https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c" alt="我不是药神" class="board-img" />
        </a>
        <div class="board-item-main">
          <div class="board-item-content">
                  <div class="movie-item-info">
            <p class="name"><a href="/films/1200486" title="我不是药神" data-act="boarditem-click" data-val="{movieId:1200486}">我不是药神</a></p>
            <p class="star">
                    主演:徐峥,周一围,王传君
            </p>
    <p class="releasetime">上映时间:2018-07-05</p>    </div>
        <div class="movie-item-number score-num">
    <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>        
        </div>
    
          </div>
        </div>
    
                    </dd>
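To make the mapping between the pattern and the markup concrete, here is a minimal sketch that runs the same regular expression as `parse_one_page()` against a trimmed copy of the entry above. The names `PATTERN` and `SAMPLE` are illustrative, and the sample omits some attributes and wrapper `<div>` elements for brevity.

```python
import re

# Same pattern as in parse_one_page(), split so each capture group is labeled.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>'    # group 1: rank
    r'.*?data-src="(.*?)"'                # group 2: poster URL
    r'.*?name.*?a.*?>(.*?)</a>'           # group 3: title
    r'.*?star.*?>(.*?)</p>'               # group 4: cast, with "主演:" label
    r'.*?releasetime.*?>(.*?)</p>'        # group 5: date, with "上映时间:" label
    r'.*?integer.*?>(.*?)</i>'            # group 6: integer part of the score
    r'.*?fraction.*?>(.*?)</i>.*?</dd>',  # group 7: fractional part of the score
    re.S,  # let "." match newlines, since one entry spans many lines
)

# Trimmed copy of the sample entry (assumption: simplified for illustration).
SAMPLE = '''<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1200486" title="我不是药神" class="image-link">
  <img data-src="https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c" alt="我不是药神" class="board-img" />
</a>
<p class="name"><a href="/films/1200486" title="我不是药神">我不是药神</a></p>
<p class="star">
    主演:徐峥,周一围,王传君
</p>
<p class="releasetime">上映时间:2018-07-05</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

match = PATTERN.search(SAMPLE)
rank, image, title, star, release, integer, fraction = match.groups()
movie = {
    'index': rank,
    'image': image,
    'title': title.strip(),
    'actor': star.strip()[3:],     # drop the "主演:" label
    'time': release.strip()[5:],   # drop the "上映时间:" label
    'score': integer.strip() + fraction.strip(),
}
print(movie)
```

The non-greedy `.*?` segments skip over whatever markup sits between two landmarks, so the pattern depends only on the order of the class names, not on the exact attributes in between.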
    


    4. Result:

    {"index": "1", "image": "https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c", "title": "我不是药神", "actor": "徐峥,周一围,王传君", "time": "2018-07-05", "score": "9.6"},
    {"index": "2", "image": "https://p0.meituan.net/movie/8112a8345d7f1d807d026282f2371008602126.jpg@160w_220h_1e_1c", "title": "肖申克的救赎", "actor": "蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿", "time": "1994-09-10(加拿大)", "score": "9.5"},
    {"index": "3", "image": "https://p1.meituan.net/movie/c9b280de01549fcb71913edec05880585769972.jpg@160w_220h_1e_1c", "title": "绿皮书", "actor": "维果·莫腾森,马赫沙拉·阿里,琳达·卡德里尼", "time": "2019-03-01", "score": "9.5"},
    {"index": "4", "image": "https://p1.meituan.net/movie/ac8f0004928fbce5a038a007b7c73cec746794.jpg@160w_220h_1e_1c", "title": "小偷家族", "actor": "中川雅也,安藤樱,松冈茉优", "time": "2018-08-03", "score": "8.1"},
    {"index": "5", "image": "https://p0.meituan.net/movie/609e45bd40346eb8b927381be8fb27a61760914.jpg@160w_220h_1e_1c", "title": "海上钢琴师", "actor": "蒂姆·罗斯,比尔·努恩,克兰伦斯·威廉姆斯三世", "time": "2019-11-15", "score": "9.3"},
    {"index": "6", "image": "https://p0.meituan.net/movie/005955214d5b3e50c910d7a511b0cb571445301.jpg@160w_220h_1e_1c", "title": "哪吒之魔童降世", "actor": "吕艳婷,囧森瑟夫,瀚墨", "time": "2019-07-26", "score": "9.6"},
    {"index": "7", "image": "https://p0.meituan.net/movie/61fea77024f83b3700603f6af93bf690585789.jpg@160w_220h_1e_1c", "title": "霸王别姬", "actor": "张国荣,张丰毅,巩俐", "time": "1993-07-26", "score": "9.5"},
    {"index": "8", "image": "https://p1.meituan.net/movie/580d81a2c78bf204f45323ddb4244b6c6821175.jpg@160w_220h_1e_1c", "title": "美丽人生", "actor": "罗伯托·贝尼尼,朱斯蒂诺·杜拉诺,赛尔乔·比尼·布斯特里克", "time": "2020-01-03", "score": "9.3"},
    {