EncartaLabs

Apache Nutch

( Duration: 2 Days )

Apache Nutch is an open-source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr, adding web-specific components such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.

This Apache Nutch training course covers installation, configuration and writing custom resources.

Prior knowledge of the following technologies is needed to attend this Apache Nutch workshop:

  • Java/J2EE, databases
  • IDE, Ant build tool
  • Hadoop

COURSE AGENDA

Day-1

  • Installation and configuration of Nutch
  • Verify your Nutch installation
  • Crawl your first website
  • Crawling the web, the CrawlDb, and URL filters
  • Parsing and Parse filters
  • Nutch plugins and plugin architecture
  • Analysis, Link analysis, and scoring

Day-2

  • Indexing and custom fields
  • Deployment, shard architecture
  • Writing custom tools for Nutch
  • Set up Solr for search
  • Integrate Solr with Nutch
  • Hadoop architecture

Encarta Labs Advantage

  • One-stop corporate training solution provider for over 4,000 modules on a variety of subjects
  • All courses are delivered by Industry Veterans
  • Go from newbie to production-ready in a matter of a few days
  • More than 50,000 corporate executives trained across the globe
  • All our training is conducted in workshop mode, with a focus on hands-on sessions

View our other course offerings by visiting http://encartalabs.com/course-catalogue-all.php

Contact us to have this course delivered as a public/open-house workshop or as online training for a group of 10+ candidates.