dc.description.abstract | Data mining is the practice of examining existing databases to generate new information that is useful to decision makers. With the spurt of e-commerce players such as Amazon, Flipkart, OLX and Infibeam in the Indian market, every player in this market needs to understand customer behaviour more quickly and precisely. The motivation behind this project is to use data mining technology to predict the behaviour of customers in the e-commerce industry. The objective of the project is to segment and target the customer market by extracting knowledge and insights from e-commerce customer purchase history data. After studying various scholarly works and technical review papers, we structured the methodology, with feature generation as the main phase of the project. Before that, we performed data pre-processing, which included data reduction using UNIX, data cleaning, data integration and data transformation. Feature generation involved generating features and transforming them to obtain further features. As given in the proposed methodology of the project, once data cleaning was done, we tried to perform regression using various techniques in R. Because the data set was huge, R was initially unable to process it, and each iteration took a long time. Hence, we sampled the data using stratified sampling followed by random sampling, which gave us a representative sample of the population. After sampling the data, we generated features from the transaction data of past purchase history. This helped us improve the prediction accuracy and made computation easier.
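The sampling step can be illustrated with a small sketch. It is written in Python rather than the R used in the project, and the `customers` records, the `repeater` field and the 10 per cent fraction are hypothetical values chosen only for illustration. It draws the same fraction at random from each stratum, so the class proportions of the population carry over to the sample.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keep at least one row per stratum so no group disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 1000 customers, 10% of whom are repeat buyers.
customers = [{"id": i, "repeater": i % 10 == 0} for i in range(1000)]
subset = stratified_sample(customers, lambda c: c["repeater"], fraction=0.1)
```

Here `subset` holds 100 customers, 10 of them repeat buyers, which matches the 10 per cent share in the full list.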
Feature generation was performed after an in-depth analysis of e-commerce customers’ purchase behaviour. Significant features extracted from the purchase history include the recency, frequency and monetary value of individual customers, and factor variables such as whether the customer is a first-time purchaser. Using the generated features, we proceeded to segment the training data for the construction of the model. We segmented the customer data using the k-means algorithm, a common means of segmentation in data mining. The variance of the data points was measured, and clusters were decided on the basis of the similarity of variance among the data points. In the end, we found four clusters of segmented data. Finally, out of all the models we studied, a logistic regression model was fitted to each of the four clusters, and we found that the same model fitted all of them. The model was then validated using 20 per cent of the training data and turned out to be well fitted.
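As an illustration of the recency, frequency and monetary features named above, the following Python sketch (not the project's R code) derives them from a list of transactions. The tuple layout `(customer_id, date, amount)`, the `as_of` reference date and the sample `history` rows are assumptions made for this example only.

```python
from datetime import date


def rfm_features(transactions, as_of):
    """Return {customer_id: features} from (customer_id, date, amount) rows."""
    features = {}
    for customer_id, day, amount in transactions:
        f = features.setdefault(
            customer_id, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        f["last"] = max(f["last"], day)  # most recent purchase date
        f["frequency"] += 1              # number of purchases
        f["monetary"] += amount          # total amount spent
    for f in features.values():
        # Recency: days between the last purchase and the reference date.
        f["recency_days"] = (as_of - f.pop("last")).days
        # Factor variable: customer has bought exactly once.
        f["first_time"] = f["frequency"] == 1
    return features


# Hypothetical purchase history for two customers.
history = [
    ("c1", date(2014, 1, 5), 20.0),
    ("c1", date(2014, 3, 1), 35.5),
    ("c2", date(2014, 2, 10), 12.0),
]
rfm = rfm_features(history, as_of=date(2014, 3, 31))
```

Each entry of `rfm` is a small feature record per customer that could then be scaled and passed to a clustering step such as k-means.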
We applied the model to the test data and generated predictions. The predictions were submitted to Kaggle, which scored them by area under the ROC curve, and achieved an area of 0.52688. From this result, we concluded that for this particular dataset, logistic regression performed very well compared to RFM. We also inferred that the choice of model depends heavily on the type of dataset, and that it is difficult to apply a one-size-fits-all model to all datasets. | en_US |
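For reference, the area under the ROC curve can be computed from true labels and predicted scores as the fraction of positive-negative pairs that the model ranks correctly, with ties counted as half. The Python sketch below illustrates that definition. It is not Kaggle's scoring code, and the `labels` and score lists are made-up toy values.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: two positives and two negatives.
labels = [1, 0, 1, 0]
perfect = roc_auc(labels, [0.9, 0.1, 0.8, 0.2])
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])
```

In this toy case `perfect` is 1.0 and `uninformative` is 0.5, the value produced by scores that carry no ranking information.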