crawler-spring-boot-starter

Demo project for Spring Boot

License	The Apache License, Version 2.0
Categories	Spring Boot Container Microservices
GroupId	com.github.houbbbbb
ArtifactId	crawler-spring-boot-starter
Last Version	0.0.1
Release Date	05-Nov-2019
Type	jar
Description	crawler-spring-boot-starter Demo project for Spring Boot
Project URL	https://projects.spring.io/spring-boot/#/spring-boot-starter-parent/crawler-spring-boot-starter
Source Code Management	https://github.com/houbbbbb/crawler-spring-boot-starter

Download crawler-spring-boot-starter

Filename	Size
crawler-spring-boot-starter-0.0.1.pom
crawler-spring-boot-starter-0.0.1.jar	16 KB
crawler-spring-boot-starter-0.0.1-sources.jar	12 KB
crawler-spring-boot-starter-0.0.1-javadoc.jar	122 KB
crawler-spring-boot-starter-0.0.1-exec.jar	27 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.houbbbbb/crawler-spring-boot-starter/ -->
<dependency>
    <groupId>com.github.houbbbbb</groupId>
    <artifactId>crawler-spring-boot-starter</artifactId>
    <version>0.0.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.houbbbbb/crawler-spring-boot-starter/
implementation 'com.github.houbbbbb:crawler-spring-boot-starter:0.0.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.houbbbbb/crawler-spring-boot-starter/
implementation ("com.github.houbbbbb:crawler-spring-boot-starter:0.0.1")

Apache Buildr

'com.github.houbbbbb:crawler-spring-boot-starter:jar:0.0.1'

Apache Ivy

<dependency org="com.github.houbbbbb" name="crawler-spring-boot-starter" rev="0.0.1">
  <artifact name="crawler-spring-boot-starter" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.houbbbbb', module='crawler-spring-boot-starter', version='0.0.1')
)

Scala SBT

libraryDependencies += "com.github.houbbbbb" % "crawler-spring-boot-starter" % "0.0.1"

Leiningen

[com.github.houbbbbb/crawler-spring-boot-starter "0.0.1"]

Dependencies

compile (7)

Group / Artifact	Type	Version
org.springframework.boot : spring-boot-configuration-processor	jar	2.2.0.RELEASE
org.springframework.boot : spring-boot-autoconfigure	jar	2.2.0.RELEASE
org.jsoup : jsoup	jar	1.12.1
org.apache.maven.plugins : maven-gpg-plugin	jar	1.5
org.apache.maven.plugins : maven-javadoc-plugin	jar	3.1.0
org.apache.maven.plugins : maven-source-plugin	jar	3.1.0
org.apache.maven.plugins : maven-release-plugin	jar	2.5.1

Project Modules

There are no modules declared in this project.

crawler-spring-boot-starter

A web crawler framework for Spring Boot.

Maven dependency (from Maven Central)

<dependency>
  <groupId>com.github.houbbbbb</groupId>
  <artifactId>crawler-spring-boot-starter</artifactId>
  <version>0.0.1</version>
</dependency>

Usage

Configuration goes in application.yml. If nothing is configured, the defaults are: thread pool size poolSize=6; timeout timeout=2000.

crawler:
  pool-size: 6
  timeout: 2000
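
To make the fallback to defaults concrete, here is a minimal plain-Java sketch that reads the two settings and substitutes the documented defaults when a key is absent. It is not the starter's own binding code, which presumably goes through Spring Boot's configuration binding; the class name `CrawlerSettingsSketch` and the use of `java.util.Properties` are assumptions made for illustration only.

```java
import java.util.Properties;

// Hypothetical holder for the two settings described above; not part of the starter.
public class CrawlerSettingsSketch {
    public final int poolSize;
    public final int timeout; // unit is not stated in the README

    public CrawlerSettingsSketch(int poolSize, int timeout) {
        this.poolSize = poolSize;
        this.timeout = timeout;
    }

    // Reads both keys, falling back to the documented defaults (6 and 2000).
    public static CrawlerSettingsSketch from(Properties props) {
        int poolSize = Integer.parseInt(props.getProperty("crawler.pool-size", "6"));
        int timeout = Integer.parseInt(props.getProperty("crawler.timeout", "2000"));
        return new CrawlerSettingsSketch(poolSize, timeout);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("crawler.pool-size", "12"); // timeout left unset on purpose
        CrawlerSettingsSketch settings = from(props);
        System.out.println(settings.poolSize + " " + settings.timeout); // prints "12 2000"
    }
}
```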

Code example

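// Imports for the Spring and jsoup types used below; the starter's own classes
// (WebCrawler, FileCrawler, Starter, FileStarter, Requester) must be imported as well.
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;

@Autowired
WebCrawler webCrawler; // crawls web page content
@Autowired
FileCrawler fileCrawler; // traverses local files

// Web crawling
void crawlerTest() {
    Starter starter = webCrawler.getStarter(); // get the crawl starter
    starter.setRootUrl("http://www.xxx.com/"); // root URL to crawl from
    starter.setParser((document, tran) -> {    // page parser; customize how each document is handled
        Elements elements = document.select("a");
        for (Element element : elements) {
            String url = element.absUrl("href"); // resolves the link to an absolute URL
            System.out.println("url " + url);
            new Requester(tran, url); // add this URL to the task queue for crawling
        }
    });
    starter.start(); // start crawling
}

// Local file traversal
void fileCrawlerTest() {
    FileStarter starter = fileCrawler.getStarter(); // get the file traversal starter
    starter.setRootUrl("G:\\exc\\hhh"); // root directory; once set, every file under it is traversed automatically
    starter.setParser((file) -> { // file parser; custom logic that receives each file, including its path
        System.out.println("fileName " + file.getFileName());
    });
    starter.start(); // start the traversal
}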
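
The web crawling example feeds every link it finds back into the task queue. The sketch below illustrates that queue-driven pattern in plain Java, with a visited set so that pages linking to each other are only processed once. It is not the starter's implementation, and this README does not say whether the starter deduplicates URLs itself. The class `CrawlQueueSketch` and the in-memory `links` map, which stands in for fetching a page and extracting its links, are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlQueueSketch {
    // Returns every URL reachable from rootUrl, each exactly once, in discovery order.
    // `links` maps a URL to the URLs found on that page (a stand-in for real fetching).
    public static List<String> crawl(String rootUrl, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>(); // keeps insertion order
        queue.add(rootUrl);
        seen.add(rootUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : links.getOrDefault(url, List.of())) {
                // Set.add returns false for a URL already seen, so cycles terminate.
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "c"),
                "c", List.of("a"));
        System.out.println(crawl("a", links)); // prints "[a, b, c]"
    }
}
```

Without the `seen` check, the cycle between pages a, b and c would keep the queue from ever draining.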

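Note: because crawling is multi-threaded, these examples cannot be run from an @Test method; trigger them through an HTTP request instead.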
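
The README does not explain this restriction. A plausible reason, stated here as an assumption, is that a test method returns as soon as the work has been handed to the thread pool, so the test run can end before the worker threads finish. The plain-Java sketch below shows the general pattern of keeping the caller alive until pooled work completes. It does not use the starter's API, and the class `AwaitWorkersSketch` and its parameters are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AwaitWorkersSketch {
    // Runs `tasks` short jobs on a fixed pool and blocks until they finish
    // or `waitMillis` elapses. Returns how many jobs completed.
    public static int runAndWait(int poolSize, int tasks, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(20); // stand-in for fetching and parsing one page
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // accept no new jobs; already queued jobs still run
        try {
            // Without this wait, the caller would return while jobs are still running.
            pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(6, 12, 5000));
    }
}
```

A long-running web application does not have this problem, since the process stays up after the HTTP handler returns, which fits the advice to trigger the crawl through a request.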

Versions

Version
0.0.1 05-Nov-2019
