Detection of the trustworthiness of website content using classification models
Abstract
“Fake information online travels faster, farther, deeper and more broadly than the truth,” according to a study conducted at MIT Sloan. In recent times, the spread of misinformation and conspiracy theories has led to communal riots, public health hazards (as seen during Covid-19), and defamation, and has even undermined the sanctity of the electoral process. Bad actors in society use fake news as a weapon to manipulate people’s beliefs and perceptions without their awareness. This has become a serious problem in recent times due to high internet penetration and accessibility. Information published on the internet is generally not vetted by any professional mechanism that can curb illegitimate websites and surface authentic ones for users. There are also no measures or standards for publishing online; anyone can effortlessly put up any kind of information for public use under a false facade.

We carried out extensive research before diving into the problem area and found that classifying websites based on the metadata of their webpages shows great promise for the future of search engine algorithms. It can be cascaded with several existing search ranking algorithms to boost their performance. Though the technique is simple, it is highly precise, and this simplicity also makes the model a candidate for third-party browser plugins or extensions. Interviews we conducted with experts in this area indicated that metadata signals such as ‘Contact Information’, ‘Privacy Policy’, ‘Author Credits’, ‘Terms of Use’, and ‘About Us’ are indeed useful in identifying the trustworthiness of websites, with some signals carrying more weight than others. More metadata labels can be added in the future to improve the model’s performance.

We investigated the following individual models, shortlisted on the basis of their performance and speed:
• Support Vector Machine
• Logistic Regression
• Generalized Linear Model
• Gradient Boosted Trees
• Naïve Bayes

After evaluating the trade-off between speed and efficiency, Naïve Bayes and Gradient Boosted Trees (XGB) emerged as the models best suited for real-world application in the space of search engine algorithms. Given the unique role of search engines in information discovery and research, they should avoid suppressing points of view, even unpopular ones, for those seeking them. To avoid overstepping the boundaries of free speech and access to information, objective processes and principles should be drafted before such classifiers are deployed in the real world.
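As a rough illustration of the approach outlined above, the sketch below encodes the presence or absence of the metadata signals named in the abstract (Contact Information, Privacy Policy, Author Credits, Terms of Use, About Us) as binary features and fits the two models the abstract highlights, Naïve Bayes and Gradient Boosted Trees. The feature rows, labels, and scikit-learn usage are illustrative assumptions, not the project’s actual dataset or implementation.

```python
# Minimal sketch (not the project's implementation): binary metadata-signal
# features fed to the two shortlisted classifiers.
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each row marks the presence (1) or absence (0) of: Contact Information,
# Privacy Policy, Author Credits, Terms of Use, About Us.
# The rows and labels below are invented for illustration only.
X = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = trustworthy, 0 = not trustworthy (assumed labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "Naive Bayes": BernoulliNB(),  # suits binary presence/absence features
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

In practice, the binary features would be derived from crawled webpage metadata (for example, checking for the relevant links and fields in the page markup) rather than hand-coded as above.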