Classifier corpus #22

Open
pbannist opened this issue Jan 26, 2022 · 2 comments

@pbannist

As far as I can envision, there are three high-level sets of data (corpora) for a given site that could be used by the classifier. In all cases, the classifier might produce multiple strong signals about what the content of the site is (and "strong" will also need to be defined).

  1. Looking at the homepage of a site and using the content/other signals there to determine the topics for the site
  2. Looking at all of the content on a site and using that content/other signals to determine the topics for the site
  3. Looking at all of the content on a site and weighting it by usage and then using that content/other signals to determine the topics for the site.

While the first and second options might be appealing because they are simple, they will probably give a very inaccurate view of the content of many sites. I think the third would give the most accurate view of what a site is actually about.
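
To make the difference between the three options concrete, here is a rough sketch. It assumes each page already has per-topic scores from some classifier and a view count; the `topics` and `views` fields, the `site_topics` helper, and the `0.3` threshold standing in for "strong" are all hypothetical and not part of any proposal in this thread.

```python
from collections import defaultdict


def site_topics(pages, weighted=True, threshold=0.3):
    """Average per-page topic scores into site-level topics.

    pages: list of dicts with hypothetical keys 'topics' (topic -> score)
    and 'views' (int). threshold is an assumed cutoff for a "strong" signal.
    """
    totals = defaultdict(float)
    total_weight = 0.0
    for page in pages:
        # Option 3 weights each page by usage; option 2 treats pages equally.
        weight = page["views"] if weighted else 1
        total_weight += weight
        for topic, score in page["topics"].items():
            totals[topic] += weight * score
    if total_weight == 0:
        return {}
    averaged = {topic: total / total_weight for topic, total in totals.items()}
    return {topic: score for topic, score in averaged.items() if score >= threshold}


# Made-up site: the homepage looks like news, but most usage is on recipes.
pages = [
    {"url": "/", "topics": {"news": 1.0}, "views": 100},
    {"url": "/recipes/a", "topics": {"food": 1.0}, "views": 800},
    {"url": "/recipes/b", "topics": {"food": 1.0}, "views": 100},
]

homepage_only = site_topics(pages[:1], weighted=False)   # option 1
all_content = site_topics(pages, weighted=False)         # option 2
usage_weighted = site_topics(pages, weighted=True)       # option 3
```

With this made-up data, the homepage-only corpus yields just `news`, the unweighted corpus yields both `news` and `food`, and the usage-weighted corpus yields just `food`, which is where the traffic actually goes.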

@JamesFinlayson-zz

There's an ingrained assumption here that a site would be limited to one set of topics. I'd like to challenge that assumption, as it is unlikely to work well for many sites. The most obvious classes of sites where I can see it causing problems are:

  1. Major news publishers. 'News', for example, would lack specificity, or even accuracy, when referring to their food & drink section.
  2. Larger businesses with diversified products. To increase ad relevance, it would be helpful to understand whether a user who visited an insurance company's site, for example, was looking at car insurance or life insurance.
@dmarti
Contributor

dmarti commented Feb 22, 2022

@JamesFinlayson It looks like both of your examples would be helped by allowing sites to set a section name for categories of content (#17). A large news site could put food and drink in a separate section from general news, and a diversified company could choose how it wants to organize its products and services for classification purposes. (The sites would not supply their own topics; they would only tell the classifier which pages should be treated as a group for classification purposes.)
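
As a rough illustration of the grouping idea, the sketch below classifies each site-declared section separately. It assumes a hypothetical `section` label on each page plus per-page topic scores from some classifier; the field names, the `topics_by_section` helper, and the `0.3` threshold are assumptions for illustration, and #17 does not define any of them.

```python
from collections import defaultdict


def topics_by_section(pages, threshold=0.3):
    """Group pages by a site-supplied section name, then average topic
    scores within each group. Pages without a section fall into 'default'.
    """
    groups = defaultdict(list)
    for page in pages:
        groups[page.get("section", "default")].append(page)

    result = {}
    for section, members in groups.items():
        totals = defaultdict(float)
        for page in members:
            for topic, score in page["topics"].items():
                totals[topic] += score
        count = len(members)
        # Keep only topics whose average score clears the assumed threshold.
        result[section] = {
            topic: total / count
            for topic, total in totals.items()
            if total / count >= threshold
        }
    return result


# Made-up insurance site: the site only names the sections, the topics
# still come from the classifier's per-page scores.
pages = [
    {"url": "/car/quote", "section": "car", "topics": {"auto insurance": 1.0}},
    {"url": "/car/claims", "section": "car", "topics": {"auto insurance": 1.0}},
    {"url": "/life/plans", "section": "life", "topics": {"life insurance": 1.0}},
]

sections = topics_by_section(pages)
```

Here `sections` has one topic set for `car` and a separate one for `life`, so a visit to the car insurance pages is not blended with the life insurance content.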
