EncartaLabs

Apache Nutch

(Duration: 2 Days)

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr, adding web-specific components such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.

PREREQUISITES

  • Prior exposure to Java/J2EE and database skills
  • Some knowledge of an IDE and the Ant build tool
  • Some knowledge of Hadoop

COURSE AGENDA

1. Installing and configuring Nutch
2. Verifying your Nutch installation
3. Crawling your first website
4. Crawling the web, the CrawlDb, and URL filters
5. Parsing and parse filters
6. Nutch plugins and plugin architecture
7. Analysis, link analysis, and scoring
8. Indexing and custom fields
9. Deployment and shard architecture
10. Writing custom tools for Nutch
11. Setting up Solr for search
12. Integrating Solr with Nutch
13. Hadoop architecture

Encarta Labs Advantage

  • One-stop corporate training solution provider for over 3,500 modules on a variety of subjects
  • All courses are delivered by industry veterans
  • Get jumpstarted from newbie to production-ready in a matter of a few days
  • Trained more than 20,000 corporate candidates across India and abroad
  • All our training is conducted in workshop mode, with an emphasis on hands-on practice

View our other course offerings by visiting www.encartalabs.com/course-catalogue

Contact us to have this course delivered as a public/open-house workshop at our venue for a group of 10 or more candidates.
