Classifier corpus #22

pbannist · 2022-01-26T21:02:41Z

As far as I can envision, there are three high-level corpora for a given site that the classifier could use. In all cases, the classifier might produce multiple strong signals about what the content of the site is (and "strong" will also need to be defined).

1. Looking at the homepage of a site and using the content/other signals there to determine the topics for the site.
2. Looking at all of the content on a site and using that content/other signals to determine the topics for the site.
3. Looking at all of the content on a site, weighting it by usage, and then using that content/other signals to determine the topics for the site.

While the first and second options might be appealing because they are simple, they will probably give a very inaccurate view of the content of many sites. I think the third would give the most accurate view of what a site is actually about.
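To make the usage-weighted option concrete, here is a minimal sketch. It assumes each page already has per-topic scores in [0, 1] from some page-level classifier and a view count; the function name `site_topics`, the input shape, and the `threshold` used as a stand-in for "strong" are all hypothetical, not part of any proposal.

```python
from collections import defaultdict


def site_topics(pages, threshold=0.2):
    """Aggregate per-page topic scores into site-level topics, weighted by views.

    pages: list of (views, {topic: score}) tuples (hypothetical input shape).
    Returns (topic, share) pairs whose view-weighted share is at least
    `threshold`, strongest first. `threshold` is an assumed placeholder
    for whatever "strong" ends up meaning.
    """
    total_views = sum(views for views, _ in pages)
    if total_views == 0:
        return []

    weighted = defaultdict(float)
    for views, scores in pages:
        for topic, score in scores.items():
            weighted[topic] += views * score

    shares = {topic: w / total_views for topic, w in weighted.items()}
    strong = [(topic, share) for topic, share in shares.items() if share >= threshold]
    return sorted(strong, key=lambda item: item[1], reverse=True)


# Made-up example: the homepage says "news", but most traffic goes to recipes.
pages = [
    (100, {"news": 1.0}),
    (900, {"food and drink": 1.0}),
]
result = site_topics(pages)
```

With these made-up numbers, a homepage-only corpus would report "news", while the weighted view keeps only "food and drink". More than one topic can pass the threshold, which matches the point that the output may contain multiple strong signals.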

JamesFinlayson-zz · 2022-01-27T12:23:02Z

There's an ingrained assumption here that a site would be limited to one set of topics. I'd like to challenge that assumption, as it is unlikely to work well for many sites. The most obvious classes of sites where I expect it to cause problems are:

- Major news publishers. 'News', for example, would lack specificity, or even accuracy, when referring to their food & drink section.
- Larger businesses with diversified products. To increase ad relevance, it would be helpful to understand whether a user who visited an insurance company's site, for example, was looking at car insurance or life insurance.

dmarti · 2022-02-22T17:45:46Z

@JamesFinlayson It looks like both of your examples would be helped by allowing sites to set a section name for categories of content (#17). A large news site could put food and drink in a separate section from general news, and a diversified company could choose how to organize its products and services for classification purposes. (The sites would not supply their own topics; they would only tell the classifier which pages should be treated as a group for classification.)
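A rough sketch of what classifying per section could look like, assuming each page carries a site-declared section label, a view count, and per-topic scores from a page-level classifier. The function name `topics_by_section` and the input shape are hypothetical; how a site would actually declare sections is left to #17.

```python
from collections import defaultdict


def topics_by_section(pages):
    """Pick the top topic for each site-declared section.

    pages: list of (section, views, {topic: score}) tuples (hypothetical shape).
    The site supplies only the section label; the topics still come from
    the classifier's per-page scores. Scores are weighted by views, which
    is an assumption of this sketch.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for section, views, scores in pages:
        for topic, score in scores.items():
            totals[section][topic] += views * score

    # For each section, keep the topic with the largest weighted total.
    return {
        section: max(topic_totals, key=topic_totals.get)
        for section, topic_totals in totals.items()
    }


# Made-up example: an insurer that splits its site into two sections.
insurer_pages = [
    ("car", 300, {"auto insurance": 0.9, "insurance": 0.4}),
    ("car", 100, {"auto insurance": 0.8}),
    ("life", 200, {"life insurance": 0.95, "insurance": 0.5}),
]
section_topics = topics_by_section(insurer_pages)
```

Here each section ends up with its own topic ("auto insurance" for `car`, "life insurance" for `life`) instead of the whole site collapsing into a generic "insurance" label.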
