I have recently been learning crawler development and tried combining Spring Boot, WebMagic, and MyBatis to build a pipeline that scrapes data from the web and stores it in MySQL. Below are the implementation steps and technical details.
The project's main dependencies are Spring Boot (devtools, the test starter, and the configuration processor), MySQL Connector/J, the Druid connection pool, MyBatis, Fastjson, Commons Lang3, Joda-Time, and the WebMagic core.

The dependency section of the project's pom.xml:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.connector.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid-spring-boot-starter</artifactId>
        <version>${druid.spring.boot.starter.version}</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter</artifactId>
        <version>${mybatis.spring.boot.starter.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>${fastjson.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>${commons.lang3.version}</version>
    </dependency>
    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>${joda.time.version}</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>${webmagic.core.version}</version>
        <!-- WebMagic ships an slf4j-log4j12 binding that clashes with Spring Boot's Logback -->
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>
The database configuration file defines the MySQL data source and the Druid connection-pool parameters:
# DataSource configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000
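Note that com.mysql.jdbc.Driver is the driver class for MySQL Connector/J 5.x. If ${mysql.connector.version} resolves to an 8.x release, the class name is com.mysql.cj.jdbc.Driver, and the JDBC URL usually needs an explicit serverTimezone parameter as well.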
A simple CRUD interface is defined:
public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}

The corresponding MyBatis XML mapping file:
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent">
        INSERT INTO cms_content (contentId, title, releaseDate, content)
        VALUES (#{contentId}, #{title}, #{releaseDate}, #{content})
    </insert>
</mapper>
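This mapper is invoked from the WebMagic pipeline (the XXXPipeline bean wired into the scheduled task below). The original post does not include its source, so here is a minimal sketch, assuming the page processor stores "title" and "content" fields in the ResultItems and that contentId is generated as a UUID:

import java.util.Date;
import java.util.UUID;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class XXXPipeline implements Pipeline {

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // The keys "title" and "content" are assumptions; they must match
        // whatever the page processor puts into the ResultItems.
        String title = resultItems.get("title");
        String content = resultItems.get("content");
        if (title == null || content == null) {
            return; // skip pages the processor could not parse
        }

        CmsContentPO po = new CmsContentPO();
        po.setContentId(UUID.randomUUID().toString()); // assumed ID scheme
        po.setTitle(title);
        po.setContent(content);
        po.setReleaseDate(new Date());

        crawlerMapper.addCmsContent(po);
    }
}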
The overall workflow consists of three steps:

Crawler startup: a scheduled task (every 10 minutes) starts the crawler and fetches data from the target site.

Page parsing: the WebMagic framework parses the fetched pages and extracts the fields to store (such as title and content); a sketch of a matching page processor follows this list.

Data storage: the parsed data is inserted into MySQL through MyBatis, with the Druid pool managing the database connections.
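The XXXPageProcessor is likewise not shown in the original post. Below is a minimal sketch; the XPath selectors (an h1 for the title, a div with class "content" for the body) are placeholders that must be adapted to the target site's actual markup:

import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class XXXPageProcessor implements PageProcessor {

    // Retry failed downloads a few times and pause between requests
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Placeholder selectors; adjust to the real page structure
        String title = page.getHtml().xpath("//h1/text()").get();
        String content = page.getHtml().xpath("//div[@class='content']/html()").get();

        if (title == null) {
            page.setSkip(true); // don't hand unparseable pages to the pipeline
            return;
        }

        page.putField("title", title);
        page.putField("content", content);
    }

    @Override
    public Site getSite() {
        return site;
    }
}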
The scheduled task that starts the crawler:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

@Component
public class XXXTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(XXXTask.class);

    @Autowired
    private XXXPipeline xxxPipeline;

    @Autowired
    private XXXPageProcessor xxxPageProcessor;

    // Created locally, not injected, so it must not carry @Autowired
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Run immediately, then again 10 minutes after each run finishes
        timer.scheduleWithFixedDelay(() -> {
            try {
                Thread.currentThread().setName("XXXCrawlerThread");
                Spider.create(xxxPageProcessor)
                        .addUrl("https://www.xxx.com/explore")
                        .addPipeline(xxxPipeline)
                        .thread(2)
                        // run() blocks until the crawl completes, so the fixed
                        // delay is measured from the end of each crawl
                        .run();
            } catch (Exception ex) {
                LOGGER.error("Scheduled crawl task threw an exception", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}

The Spring Boot main class:
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.mybatis.spring.annotation.MapperScan;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private XXXTask xxxTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        xxxTask.crawl();
    }
}

A simple data object is defined:
import java.util.Date;

public class CmsContentPO {

    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}

The following database table is created:
CREATE TABLE cms_content (
    contentId   VARCHAR(40)  NOT NULL COMMENT 'content ID',
    title       VARCHAR(150) NOT NULL COMMENT 'title',
    content     LONGTEXT              COMMENT 'article content',
    releaseDate DATETIME     NOT NULL COMMENT 'release date',
    PRIMARY KEY (contentId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';
With the configuration above, a complete pipeline from web crawling to MySQL storage is in place. The project combines Spring Boot's rapid development model, the WebMagic crawler framework, and MyBatis's flexible database access to collect and store data efficiently.
For more detailed code samples or other technical support, see my GitHub repository or contact me.
Reprinted from: http://zpqfk.baihongyu.com/