Recommender Systems (Notes from Mining of Massive Datasets )

What is Recommender Systems (RS)?

It is a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc. For example, Linkedin recommends new friends to you, Amazon recommends new products, Pandora recommends musics, you name it.

There are two types of recommender systems: content-based and collaborative filtering. There are two subtypes of collaborative filtering RS: item-item and user-user.

  • Content-based: If you like this item, you may also like this…
  • Item-Item: Customers who likes this also likes …
  • User-User: Customers who are similar to you also liked …

It is usually convenient  to use Utility Matrix to show the connect between users and items.

collaborative-filtering-utility-matrix-resized-600

An example of Utility Matrix

Most of the time, it is sparse. The entries of it can also be 1 and 0. For example, when the item j was bought by user i, the ijth entry is 1. Often, we find a utility matrix with this kind of data shown with 0’s rather than blanks where the user has not purchased or viewed the item. However, in this case 0 is not a lower rating than 1; it is no rating at all.   The goal of recommender system is to fill the blanks, find the rank top n items for each users.

Remarks:

  • Since not all users like to give rate, the information we have collected were from the users who were willing to provide ratings. That is, it is biased.
  • We can make inferences from users’ behavior. More generally, one can infer interest from behavior other than purchasing. For example, if an Amazon customer views information about an item, we can infer that they are interested in the item, even if they don’t buy it.

Content-based recommenders:

Features (explicit and implicit):

For different items, they have different features. For example, movies can use the following features: director, actors, year, genre and so on. For documents, it is common to use TF-IDF as features. Images can be characterized by tags collected from people. Features can be boolean or numerical. For example, if one feature of movies is the set of actors, then imagine that there is a component for each actor, with 1 if the actor is in the movie, and 0 if not. If one feature of movies is the average rating, then it is numerical. It is suggested to normalize the features to avoid some features are dominant.

 

1. User Profiles method:

User Profiles:

For content-based recommenders, we construct profiles for all users. That is,

We not only need to create vectors describing items; we need to create vectors with the same components that describe the user’s preferences from the utility matrix.

Example: Suppose items are movies, represented by boolean profiles with components corresponding to actors. Also, the utility matrix has a 1 if the user has seen the movie and is blank otherwise. If 20% of the movies that user U likes have Julia Roberts as one of the actors, then the user profile for U will have 0.2 in the component for Julia Roberts.

Example:  Consider the same movie information as in Example above, but now suppose the utility matrix has nonblank entries that are ratings in the 1–5 range. Suppose user U gives an average rating of 3. There are three movies with Julia Roberts as an actor, and those movies got ratings of 3, 4, and 5. Then in the user profile of U, the component for Julia Roberts will have value that is the average of 3−3, 4−3, and 5−3, that is, a value of 1.

On the other hand, user V gives an average rating of 4, and has also rated three movies with Julia Roberts (it doesn’t matter whether or not they are the same three movies U rated). User V gives these three movies ratings of 2, 3, and 5. The user profile for V has, in the component for Julia Roberts, the average of 2−4, 3−4, and 5−4, that is, the value −2/3.

Recommending Items to Users Based on Content:

Now with the profiles of users and items,  we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user’s and item’s vectors.

2. Classification and Regression Algorithms:

For each user, build a classifier (for example, decision tree) or make a regression that predicts whether the user will like an item or not, or predicts the rating of all items using item profiles and utility matrices. The predictors are the features of items.

Unfortunately, classifiers of all types tend to take a long time to construct. This method is  applied only when the dataset is small.

Collaborative Filtering:

Recommendation for a user U is  made by looking at the users that are most similar to U in this sense, and recommending items that these users like. The process of identifying similar users and recommending what similar users like is called collaborative filtering.

 How to measure similarities?

  • Jaccard distance: we could ignore values in the matrix and focus only on the sets of items rated. If the utility matrix only reflected purchases, this measure would be a good one to choose. However, when utilities are more detailed ratings, the Jaccard distance loses important information.
  • Cosine distance: when treat blanks as zeros, it is questionable. Since 0 represents dislike.
  • Rounding the data: for example, treat ratings 1 and 2 as 0; 3, 4, and 5 as 1. Then use cosine distance.
  • Normalize rates: subtract user’s average rating before using cosine distance.

The Duality of Similarity

The distances used in finding similar users can be used to find similar items. But there are two ways the symmetry is broken in practice.

  1. To make recommendation for user 1, we can firstly find similar users to user 1 and then get ratings from the (weighted by similarity) average rating of similar users. However, it is not symmetric for items.
  2. The similarity of items makes more sense than that of users.

User-User collaborative filtering:

Let r_x be the average rating by user x, N be the set of k similar users who also rate item i. Denote r_{yi} as the rate by user y in set N to item i. Then the rate of user x to item i can be estimated by

r_{xi}=\frac{1}{k}\sum_{y\in N}r_{yi} or r_{xi}=\frac{\sum_{y\in N}r_{yi}s_{xy}}{\sum_{y\in N} s_{xy}}

where s_{xy} is the similarity of user x and user y.

Since some users may rate higher and some tend to rate lower, take this fact into consideration, then we can also predict r_{xi} by

r_{xi}=\bar{r_x}+\frac{1}{k}\sum_{y\in N}(r_{yi}-\bar{r_y}) where \bar{r_y} is the average rating of user y on all items.

Item-Item collaborative filtering:

Similar to user-user collaborative filtering, we can use the average rating of similar items to item i as the rating by user x to item i: i.e. r_{xi}=\frac{1}{k}\sum_{j\in N(i,x)}r_{xj} or take the weighted version.

Generally, the item-item collaborative filtering performs better then the user-user approach, since items are simpler than users. But user-user approach can be done once, while item-item approach can’t.

Complexity:

The expensive step is finding k most similar users or items, and the complexity is O(|U|), where |U| is the size of the utility matrix.

One can use KNN, dimensional reduction or cluster algorithms, such as hierarchical clustering to find similar users or items.

Cluster:

For example,  if there are N items, then make N/2 clusters. Use the average as the rating of user i to cluster j. After that, cluster M users to be M/2 clusters. And so on. To predict the rating of user x to item j, just find which clusters they belong to and use the average rating as the prediction.

Dimensional Reduction: UV decomposition

Suppose the utility matrix is m by n (m users and n items). Then we can decompose it to be two matrices: M~UV, U (m by  d) and V (d by n). U characterizes users and V characterizes items.  Here ~ means UV approximates M at non-blank entries. How to measure the closeness? We use RMSE(root -mean – square – error): RMSE=\sqrt{\frac{1}{N}\sum_{m_{ij} non-blank}e^2_{ij}}, where e_{ij} is the error of approximation to the ij th element.

How to get U and V so that RMSE is minimized?

  1. Preprocessing of the utility matrix M: subtract the average rate from user i m^*_{ij}=m_{ij}-\bar{m_i}, then $m^*_{ij}-\bar{m^*_j}$. The order of the normalizing can be changed, though the result may be different. The third option is m_{ij}-\frac{1}{2}(\bar{m_i}+\bar{m_j}). When make prediction, don’t forget to undo the normalization.
  2. Initialize U and V. One can choose matrices with equal elements, so that the element of UV equals the average of all non-blank elements of M. Or you can make a protuberance to that by a small number \epsilon and \epsilon follows uniform distribution or a normal distribution. But such initialization can not guarantee that the global optimal will be obtained, so you may initialized U and V many times.
  3. Then update the elements of U and V. For example, we are go update u_{11} to u_{11}+x, x be chosen so that RMSE is minimized among all options of x. (Taking derivative and set it to be 0.) What order should we follow when update the elements of U and V?  You can do it row-by-row, or make a permutation of the list 1,2,3,…,N (N is the number of all elements to be updated), then follow that order. One element can be visited more than once.
  4. When to stop? Set a threshold, when the improvement  of  RMSE is less than the threshold, we stop. Or the improvement of RMSE for individual elements are too little, we stop on that element.

Gradient descent or Stochastic Gradient descent can be used. One problem is overfitting. To avoid that, one can do the procedure above many times and average the results U and V.

Pros and Cons of Collaborative filtering:

Pros: It works for any kind o items (no feature selection needed.)

Cons:

  • Need enough users.
  • Utility matrix is sparse.
  • Hard to find users that have rated the same items.
  • It becomes hard for ‘first’ user or ‘first’ item.
  • Popularity bias: tends to recommend popular items.

Solutions:

  • Hybrid methods
  • Combine two or more recommenders. Global baseline and collaborative filtering.

Some remarks:

  • Users are multi-facetted, so Top k recommendations may be very redundant. We may consider diverse recommendations.
  • Interests change over time.

 

 

 

 

 

 

 

Posted in Machine Learning | Leave a comment

How to Ace a Data Science Interview

supervised learning

Source: How to Ace a Data Science Interview

Posted in Uncategorized | Leave a comment

Learning curves (example)

 


# In this exercise we'll examine a learner which has high variance, and tries to learn
# nonexistant patterns in the data.
# Use the learning curve function from sklearn.learning_curve to plot learning curves
# of both training and testing error.

from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve
from sklearn.cross_validation import KFold
from sklearn.metrics import explained_variance_score, make_scorer
import numpy as np

# Set the learning curve parameters; you'll need this for learning_curves
size = 1000
cv = KFold(size,shuffle=True)
score = make_scorer(explained_variance_score)

# Create a series of data that forces a learner to have high variance
X = np.round(np.reshape(np.random.normal(scale=5,size=2*size),(-1,2)),2)
y = np.array([[np.sin(x[0]+np.sin(x[1]))] for x in X])

def plot_curve():
 reg = DecisionTreeRegressor()
 reg.fit(X,y)
 print "Regressor score: {:.4f}".format(reg.score(X,y))

 # TODO: Use learning_curve imported above to create learning curves for both the
 # training data and testing data. You'll need 'size', 'cv' and 'score' from above.

 training_sizes, training_scores, testing_scores=learning_curve(reg, X, y, train_sizes=np.array([ 0.1, 0.33, 0.55, 0.78, 1. ]), cv=cv, scoring=score, exploit_incremental_learning=False, n_jobs=1, pre_dispatch='all', verbose=0)

 # TODO: Plot the training curves and the testing curves
 # Use plt.plot twice -- one for each score. Be sure to give them labels!

 plt.grid()

 plt.plot(training_sizes, training_scores, 'o-', color="r",
 label="Training score")
 plt.plot(training_sizes, testing_scores, 'o-', color="g",
 label="testing score")

plt.legend(loc="best")

 # Plot aesthetics
 plt.ylim(-0.1, 1.1)
 plt.ylabel("Curve Score")
 plt.xlabel("Training Points")
 plt.legend(bbox_to_anchor=(1.1, 1.1))
 plt.show()

Posted in Python for data analysis | Leave a comment

One Hot Encoding using sklearn

The dataset is the famous Titanic dataset.

import numpy as np
import pandas as pd

# Load the dataset
X = pd.read_csv('titanic_data.csv')
# Limit to categorical data
X = X.select_dtypes(include=[object])

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# TODO: Create a LabelEncoder object, which will turn all labels present in
# in each feature to numbers. For example, the labels ['cat', 'dog', 'fish']
# might be transformed into [0, 1, 2]

le = LabelEncoder()

# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).

for feature in X:
X[feature] = le.fit_transform(X[feature])

# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data. For example, for a feature 'animal' that had
# the labels ['cat','dog','fish'], the new features (instead of 'animal')
# could be ['animal_cat', 'animal_dog', 'animal_fish']

ohe = OneHotEncoder()

# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which will
# first learn of all the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).

onehotlabels = ohe.fit_transform(X)

“onehotlabels” is a <891×1726 sparse matrix of type ‘<type ‘numpy.float64′>’
with 4455 stored elements in Compressed Sparse Row format>.

Part of it:

  (0, 1725)	1.0
  (0, 1574)	1.0
  (0, 1416)	1.0
  (0, 892)	1.0
  (0, 108)	1.0
  (1, 1723)	1.0
  (1, 1656)	1.0
    :	:
  (886, 1725)	1.0
  (886, 1574)	1.0
  (886, 994)	1.0
  (886, 892)	1.0
  (886, 548)	1.0
Posted in Python for data analysis | Leave a comment

Exploratory Analysis of Movielen Dataset using Python

The MovieLens 20M dataset:

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.  20 million ratings and 465,564 tag applications applied to 27,278 movies by 138,493 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data. The download address is https://grouplens.org/datasets/movielens/20m/. The size is 190MB. More details can be found here:http://files.grouplens.org/datasets/movielens/ml-20m-README.html

There are 6 tables:

  1. ratings.csv (userId, movieId, rating,timestamp)
  2. movies.csv (movie, title, genres)
  3. tags.csv (userId, movieId, tag, timestamp)
  4. links.csv (movieId, imdbId, tmdbId)
  5. genome_score.csv (movieId, tagId, relevance)
  6. genome_tag.csv (tag, tagId)

 

Import packages and load datasets:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

movies_pd=pd.read_csv('movies.csv')
rates_pd =pd.read_csv('ratings.csv')
links_pd=pd.read_csv('links.csv')
tags_pd=pd.read_csv('tags.csv')
genome_scores_pd=pd.read_csv('genome-scores.csv')
genome_tags_pd=pd.read_csv('genome-tags.csv')

rates_pd.head()

This is how rates_pd looks like.

userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580

We convert timestamp to normal date form and only extract years.

import datetime
def convert_time(timestamp):
    date=datetime.datetime.fromtimestamp(
    int(timestamp)).strftime('%Y-%m-%d %H:%M:%S')
    return int(date[0:4])
rates_pd['year']=rates_pd['timestamp'].apply(convert_time)

Next, we calculate the average rating over all movies in each year.

avg_rates_year=rates_pd[['year','rating']].groupby('year').mean()

 

This is the head of the movies_pd dataset.

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

We extract the publication years of all movies.


def year(title):
    year=re.search(r'\(\d{4}\)', title)
    if year:
       year=year.group()
       return int(year[1:5])
    else:
    return 0
movies_pd['year']=movies_pd['title'].apply(year)

Since there are some titles in movies_pd don’t have year, the years we extracted in the way above are not valid. We set year to be 0 for those movies.

sub=movies_pd[movies_pd['year']!=0]
fig1, (ax3,ax4) = plt.subplots(2,1,figsize=(20,12))
plt.subplot(211)
plt.plot(sub.groupby(['year']).count()['title_new'])
plt.title('Number of movies vs Years',fontsize=16)

plt.subplot(212)
a2=plt.plot(avg_rates_year)
plt.title('Average ratings vs genres',fontsize=16)
plt.grid(True)
plt.tight_layout()
fig.savefig('Number of Movies and average ratings VS years.jpg')

screen-shot-2017-02-06-at-10-54-26-pm

The picture shows that there is a great increment of the movies after 2009. But the average ratings over all movies in each year vary not that much, just from 3.40 to 3.75.

Next we extract all genres for all movies. That is, for a given genre, we would like to know which movies belong to it.

First, we split the genres for all movies.

import re
def genres_str(x):
    if x=='(no genres listed)':
        keys=['no_genres']
    else:
        keys= re.sub('[|]', ' ', x)
        keys=keys.split()
return keys
movies_pd['genres_split']=movies_pd['genres'].apply(genres_str)

all_genres=['Action','Adventure','Animation',"Children",
 "Comedy","Crime","Documentary","Drama",
 "Fantasy",'Film-Noir','Horror','Musical',
 'Mystery','Romance','Sci-Fi','Thriller',
 'War','Western','IMAX','no_genres']
#genres_classify[genre] gives the moviesIds which can be classified to be genre.
values=[]
for i in range(len(all_genres)):
    values.append([])
genres_classify=dict(zip(all_genres, values))

for i in range(movies_pd.shape[0]):
     for genre in movies_pd.loc[i,'genres_split']:
         genres_classify[genre].append(movies_pd.loc[i,'movieId'])

Now we can consider the  distributions of the ratings for each genre.

data=[]
for g in all_genres:
    #sub is all ratings for the movies in the genres g.
    sub=np.array(rates_pd[rates_pd['movieId'].isin(genres_classify[g])].rating)
    data.append(sub)
from matplotlib.patches import Polygon
#boxplot the ratings vs genres
fig, ax1 = plt.subplots(figsize=(20, 12))
fig.canvas.set_window_title('Boxplot of Movie Ratings VS Genres')
plt.subplots_adjust(left=0.075, right=0.95, top=0.9, bottom=0.25)

bp = plt.boxplot(data, notch=0, sym='+', vert=1, whis=1.5)
plt.setp(bp['boxes'], color='black')
# plt.setp(bp['whiskers'], color='black')
# plt.setp(bp['fliers'], color='red', marker='+')

# Add a horizontal grid to the plot, but make it very light in color
# so we can use it for reading data values but not be distracting
ax1.yaxis.grid(True, linestyle='-', which='major', color='lightgrey',
 alpha=0.5)

# Hide these grid behind plot objects
ax1.set_axisbelow(True)
ax1.set_title('Ratings vs movie genres')
ax1.set_xlabel('Movie Genres')
ax1.set_ylabel('ratings')

# Now fill the boxes with desired colors
boxColors = ['darkkhaki', 'royalblue']
plt.setp(bp['whiskers'], color='green')
#plt.setp(bp['fliers'], color='red', marker='+')
numBoxes=20

medians = list(range(numBoxes))
for i in range(numBoxes):
 box = bp['boxes'][i]
 boxX = []
 boxY = []
 for j in range(5):
 boxX.append(box.get_xdata()[j])
 boxY.append(box.get_ydata()[j])
 boxCoords = list(zip(boxX, boxY))
 # Alternate between Dark Khaki and Royal Blue
 k = i % 2
 boxPolygon = Polygon(boxCoords, facecolor=boxColors[k])
 ax1.add_patch(boxPolygon)
 # Now draw the median lines back over what we just filled in
 med = bp['medians'][i]
 medianX = []
 medianY = []
 for j in range(2):
 medianX.append(med.get_xdata()[j])
 medianY.append(med.get_ydata()[j])
 plt.plot(medianX, medianY, 'k')
 medians[i] = medianY[0]
 # Finally, overplot the sample averages, with horizontal alignment
 # in the center of each box
 plt.plot([np.average(med.get_xdata())], [np.average(data[i])],
 color='w', marker='*', markeredgecolor='k')

# Set the axes ranges and axes labels
ax1.set_xlim(0.5, numBoxes + 0.5)
top = 6
bottom = 0
ax1.set_ylim(bottom, top)
xtickNames = plt.setp(ax1, xticklabels=all_genres)
plt.setp(xtickNames, rotation=90, fontsize=12)
plt.savefig('genres_ratings.png')

genres_ratings

Next we make ranks by the number of movies in different genres and the number of ratings for all genres.

import operator
common={}
for g in genres_classify:
 common[g]=len(genres_classify[g])
common_sort= sorted(common.items(), key=operator.itemgetter(1)) &amp;amp;amp;amp;nbsp;
common_pd=pd.DataFrame(common.items(),columns=['genre', 'numbers'])
common_pd.head()
genre numbers
0 Mystery 1514
1 Drama 13344
2 Western 676
3 Sci-Fi 1743
4 Horror 2611

We can see that Drama is the most common genre; Comedy is the second. The most uncommon genre is Film-Noir.

Remark: Film Noir (literally ‘black film or cinema’) was coined by French film critics (first by Nino Frank in 1946) who noticed the trend of how ‘dark’, downbeat and black the looks and themes were of many American crime and detective films released in France to theaters following the war.

popular={}
i=0
for g in genres_classify:
 popular[g]=len(data[i])
 i+=1
popular_sort= sorted(popular.items(), key=operator.itemgetter(1))
popular_sort
popular_pd=pd.DataFrame(popular.items(),columns=['genre', 'number of ratings'])
popular_pd.head()
summary_genre=popular_pd.merge(common_pd,on='genre',how='inner')
summary_genre.head()

 

genre number of ratings
0 Mystery 5614208
1 Drama 2111403
2 Western 361
3 Sci-Fi 1669249
4 Horror 3298335
genre number of ratings numbers
0 Mystery 5614208 1514
1 Drama 2111403 13344
2 Western 361 676
3 Sci-Fi 1669249 1743
4 Horror 3298335 2611
5 Film-Noir 244619 330

sort_1=summary_genre.sort_values(by='numbers')
sort_2=summary_genre.sort_values(by='number of ratings')

fig, (ax1,ax2) = plt.subplots(2,1,figsize=(20,12))

a1=sort_1.plot.bar(x='genre',y='numbers',ax=ax1,color='darkkhaki')
a1.set_title('Number of Movies vs Genres',fontsize=16)
plt.grid(True)

a2=sort_2.plot.bar(x='genre',y='number of ratings',ax=ax2,color='royalblue')
a2.set_title('number of ratings vs genres',fontsize=16)
plt.grid(True)
plt.tight_layout()

fig.savefig('Number of Movies and Number of Ratings by genres.jpg')
#plt.show()

heatmap-of-ratings

 

Finally, we explore the users ratings for all movies and sketch the heatmap for popular movies and active users.

heatmap

The code is below.

summary_2=rates_pd['rating'].groupby(rates_pd['userId'])
user_counts=summary_2.count().to_dict()
user_counts=pd.DataFrame(user_counts.items(), columns=['userId', 'count'])
active_users=user_counts[user_counts['count']>5000]
INDEX1=active_users.index
popular_movies=counts[counts['count']>25000]
INDEX2=popular_movies.index
INDEX=list(set(INDEX1).union(set(INDEX2)))
rates_pd_sub=rates_pd.iloc[INDEX,]
table = pd.pivot_table(rates_pd_sub, values='rating', index=['movieId'],
columns=['userId'], aggfunc=np.sum)
table_pd=pd.DataFrame(table.fillna(0))
table_pd.transpose()

#plt.figure(num=None, figsize=(20, 20))
plt.figure(num=None, figsize=(25,10), dpi=80, facecolor='w', edgecolor='k')
heatmap=sns.heatmap(table_pd.transpose())

 

Posted in Uncategorized | Leave a comment

SQL commends cheat sheet 1 (W3school)

1. SELECT:

Format:

select column1, column 2 from dataset

select * from dataset

select distinct column from dataset

select column1, column2 from table_name where (conditions)

 Conditions: AND, OR, LIKE, BETWEEN, IN

For example,

  •            WHERE city like ‘s%’  (start with ‘s’.)
  •            WHERE value IN (value1, value2, vlaue3)
  •            WHERE column_name BETWEEN value1 AND value2
SELECT * FROM Products
WHERE Price NOT BETWEEN 10 AND 20;

SELECT * FROM Products
WHERE ProductName BETWEEN 'C' AND 'M';

SELECT * FROM Orders
WHERE OrderDate BETWEEN #07/04/1996# AND #07/09/1996#;

 

2. ORDER BY:

Format:

order by column1 ASC|DEC, column2 ASC|DEC

3. INSERT INTO:

Format:

insert into table_name  values(v1,v2,…)

Examples:

INSERT INTO Customers (CustomerName, ContactName, Address, City, PostalCode, Country)
VALUES (‘Cardinal’,’Tom B. Erichsen’,’Skagen 21′,’Stavanger’,’4006′,’Norway’);

INSERT INTO Customers (CustomerName, City, Country)

VALUES (‘Cardinal’, ‘Stavanger’, ‘Norway’);

Note: the CustomerID field will of course also be updated automatically.

4.UPDATE:

Example:

UPDATE Customers

SET City=’New York’, ContactName=’Alfred’

WHERE CustomerID=1;

5. DELECT

Format:

DELECT FROM table_name

WHERE some_coln=some_value AND …

Note: DELECT FROM table_name will delect all

6. SQL injection

Example:

 

SELECT * FROM Users WHERE UserID=105 or 1=1;

Customers input:

User Name: John Doe

Password: myPass

uName = getRequestString(“UserName”);
uPass = getRequestString(“UserPass”);

sql = ‘SELECT * FROM Users WHERE Name =”‘ + uName + ‘” AND Pass =”‘ + uPass + ‘”‘

RESULT

SELECT * FROM Users WHERE Name =”John Doe” AND Pass =”myPass”

Customers input:

User Name: ” or “”=”

Password: ” or “”=”

Then result will be:

SELECT * FROM Users WHERE Name =”” or “”=”“ AND Pass=”” or “”=””

Since “=” is always true, it will return users.

Example:

SELECT * FROM Users; DROP TABLE Suppliers

txtUserId = getRequestString(“UserId”);
txtSQL = “SELECT * FROM Users WHERE UserId = ” + txtUserId;

Userid: 105; DROP TABLE Suppliers

Then the result will be:

SELECT * FROM Users WHERE UserId = 105; DROP TABLE Suppliers

7. Parameters for Protection

http://www.w3schools.com/sql/sql_injection.asp

8. LIMIT, TOP

Example:

SELECT *
FROM Persons
LIMIT 5;

Another way to get 5 rows:

SELECT column_name(s)
FROM table_name
WHERE ROWNUM <= number; (ROWNUM is a key word.)

SELECT TOP 2 * FROM Customers;

SELECT TOP 50 PERCENT * FROM Customers;

Wildcard

Description

%

A substitute for zero or more characters

_

A substitute for a single character

[charlist]

Sets and ranges of characters to match

[^charlist]
or
[!charlist]

Matches only a character NOT specified within the brackets

SELECT * FROM Customers
WHERE City LIKE ‘[bsp]%’; (start with b, or s or p.)

SELECT * FROM Customers
WHERE City LIKE ‘[a-c]%’; (start with a, b, c)

SELECT * FROM Customers
WHERE City LIKE ‘[!bsp]%’; (not start with b s p.)

Another way:

SELECT * FROM Customers

WHERE City NOT LIKE ‘[bsp]%’;

9. Alias

SELECT column_name AS alias_name
FROM table_name;

SELECT column_name(s)
FROM table_name AS alias_name;

SELECT CustomerName, Address+’, ‘+City+’, ‘+PostalCode+’, ‘+Country AS Address
FROM Customers;

Tip: It requires double quotation marks or square brackets if the column name contains spaces:

SELECT CustomerName, CONCAT(Address,’, ‘,City,’, ‘,PostalCode,’, ‘,Country) AS Address
FROM Customers;

SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM Customers AS c, Orders AS o
WHERE c.CustomerName=”Around the Horn” AND c.CustomerID=o.CustomerID;

 

10. JOIN

  • INNER JOIN: Returns all rows when there is at least one match in BOTH tables
  • LEFT JOIN: Return all rows from the left table, and the matched rows from the right table
  • RIGHT JOIN: Return all rows from the right table, and the matched rows from the left table
  • FULL JOIN: Return all rows when there is a match in ONE of the tables

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
(INNER) JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
INNER JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name=table2.column_name;

SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;

SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name=table2.column_name;

 

 

Posted in Uncategorized | Leave a comment

Questions on K means method

What is K means algorithm?

Screen Shot 2016-12-24 at 8.48.54 PM.png

What is the optimization function?

Screen Shot 2016-12-24 at 8.50.42 PM.png

The partial derivative of Distortion J  with respect to each center location must be zero.

Will the algorithm stop?

When J is minimized, (1) each x^{(i)} is encoded by its nearest center and (2) each center must be at the centroid of points it owns. The algorithm will terminate in a state at which neither (1) or (2) change the configuration. The reason is as follows: there are only a finite number of ways of partitioning m records into K groups. So there are only a finite number of possible configurations in which all centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the cost function. So each time the configuration changes it must go to a configuration it is never been to before. So if it tried to go on forever, it would eventually run out of configurations.

But it doesn’t mean we will alway arrive at the  optimal configuration. It matters where you start. The picture below is an example that the local optimal is obtained but not the global optimal.

Screen Shot 2016-12-24 at 9.01.30 PM.png

To avoid this case, many initializations have to be tried and then choose the one whose lost function is minimal.

How to choose K?

This is a difficult problem. There is no general rule. One possible way is the elbow method below. But it can not be guaranteed that it always works.

Screen Shot 2016-12-24 at 9.05.35 PM.png

Sometimes, running K means is used for downstream purpose, so the evaluation depends on later purpose.

There is an excellent answer on this question: How do I determine k when using k-means clustering?

Common uses of K-means:

  1. Often used as an exploratory data analysis tool.
  2. In one-dimension, a good way to quantize realvalued variables into k non-uniform buckets.
  3. Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization).
  4. Also used for choosing color palettes on old fashioned graphical display devices!

(From Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials)

Drawbacks of K means:

Here is the link of this question: How to understand the drawbacks of K-means?

I  summarize it below.

The drawbacks are as follows:

  • k-means assumes the variance of the distribution of each attribute (variable) is spherical;
  • all variables have the same variance;
  • the prior probability for all k clusters is the same, i.e., each cluster has roughly equal number of observations;

Counter examples:

slpl1

It can be saved by transforming your data into polar coordinates.

zai1g

gn1im

The data is uniformly distributed and there is no cluster. But K means can still cluster it. This is bad.

K means and EM:

Both are iterative algorithms to assign points to clusters. K-means is EM’ish, but makes ‘hard’ assignments of samples to clusters. Gaussian mixture model is equivalent to K Means when \Sigma_k = \sigma I, and σ → 0.

(return to this.)

References:

[1] Coursera Machine learning

[2] Andrew’s tutorials

 

 

Posted in Machine Learning | Leave a comment

SVM notes

Support vector machine can be used for classification and regression. It has successful applications in many fields, like bioinformatics, text, image recognition, etc. Its main ideas are large margin and kernel trick.

Margins:

1.Intuition

To make things easier, we suppose there are two classes and two features x_1, x_2.  To separate the two classes using a hyperplane (a line in the simplified case),  one reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. In Figure 1, the red line is the best.

512px-svm_separating_hyperplanes_28svg29-svg

Figure 1: (From Wikipedia)

In logistic regression, P(y=1|x)=h_{\theta}(x)=g(\theta^Tx), where g(z)=\frac{1}{1+exp(-z)}. The larger \theta^Tx is (\theta^Tx>>0), the more confident we are to predict that y=1; the smaller \theta^Tx is (\theta^Tx<<0), the more confident we are to predict that y=0. If we choose \theta^Tx=0 as the decision boundary, then the far a point is away from the decision boundary, the more confident we are about the prediction. But for the points near the boundary, a slight change of the parameters \theta may lead to different predictions. So it would be nice if we can find a decision boundary such that the distances of all training points to the boundary are greater than some threshold.  That’s the idea of margin.

Note: if there are n features, then  x=(1, x_1, x_2, \cdots, x_n)^T and \theta^T=(\theta_0, \theta_2, \cdots, \theta_n)=(b, \omega^T).

2. Functional and geometric margins:

Suppose y\in \{1,-1\}. For a training example, the functional margin is defined to be \hat{\gamma}^{(i)}=y^{(i)}\cdot \theta^Tx^{(i)}. By the distance formula, it is easy to see that

\hat{\gamma}^{(i)}= \gamma^{(i)} \cdot \|\omega\|

where \gamma^{(i)} is the distance from x^{(i)} to the hyperplane \theta^Tx=0, and we call it the geometric margin. When \|\omega\|=1, the functional margin is the same as the geometric margin. For a training set, the functional margin and the geometric margin are defined to be the minimum functional and geometric margin over  all margins of the samples in the training set:

\hat{\gamma}=\min_i\hat{\gamma}^{(i)} and \gamma=\min_i {\gamma}^{(i)}.

The bigger the margin of a dataset is, the more confident of the prediction, so our objective is to maximize the geometric margin for a given training set. That is

arg\max_{\omega,b} \{\min_{i} y^{(i)} \frac{(\omega^Tx^{(i)}+b)}{\|\omega\|} \}

3

Figure 2

Since the geometric margin is invariant to the scaling of the parameters \theta, we can impose an arbitrary scaling constraint on \omega. This point is important in the following optimazition problem.

Optimization:

Based on the discussion above, training a SVM is converted to the following optimization problem:

max_{\gamma, \omega, b} \gamma s.t. y^{(i)}(\omega^Tx^{(i)}+b)\geq \gamma, i=1,2,\cdots, m, \|\omega\|=1.

It is equivalent to min_{\omega, b} \frac{1}{2}\|\omega\|^2 s.t.

y^{(i)}(\omega^Tx^{(i)}+b)\geq 1, i=1,2,\cdots, m. (*)

It is a hard margin. A soft margin is  min_{\omega, b, \psi\geq 0} \frac{1}{2}\|\omega\|^2+C\sum_i \psi_i s.t.

y^{(i)}(\omega^Tx^{(i)}+b)\geq 1-\psi_i, i=1,2,\cdots, m. (**)

The drawback of it that it is sensitive to C.

Screen Shot 2016-12-24 at 9.54.03 AM.png

Figure 3: (wlog, b=0)

Screen Shot 2016-12-24 at 10.34.33 AM.png

Screen Shot 2016-12-24 at 10.37.19 AM.png

Subject to the functional margin is no less than 1 to minimize the norm of \omega, we have to maximize the geometric margin, which is consistent with our intuition when choosing the best hyperplane to separate two classes.

Mathematically, this form (*) can be solved efficiently.  The objective function is a convex quadratic one and the constraint is linear. When the objective function is convex, and the constraint space is convex, local extremes are global extremes. (This is an easy exercise.)

The soft margin form (**) is also a  convex problem.

Note: f is a function of n variables, then f is convex if its Hessian matrix is positive semidefinite at any x. Quadratic function f(x)=x^TQx+b^Tx is convex if Q is positive semidefinite. Since its Hessian matrix is Q.

We may also view the optimization problem above from the point view of cost functions.

In logistic regression, when y=1, the lost function is -log h_{\theta}(x); when y=0, the lost function is -log(1-h_{\theta}(x)).

Screen Shot 2016-12-23 at 10.25.05 PM.png

In SVM, we instead use cost_1 and cost_0 as above, which are approximations to the lost function in logistic regression.

Also by taking C=1/\lambda, the lost function of SVM is

C\sum_{i=1}^m y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})+\frac{1}{2}\sum_{i=1}^n\theta_i^2.

So if y=1, we want \theta^T x \geq 1 not just larger than 0; if y=0, we want \theta^T x \leq -1 not just less than 0.

Hence if C is very large, then minimizing the cost function is equivalent the following question:

min \frac{1}{2}\|\omega\|^2 subject to \theta^T x\geq 1 if y=1 and \theta^T x\leq -1 if y=0. This is the same as the optimization question (*).

Nonlinear decision boundary(Kernels):

When the boundary is not linear, in logistic regression, we may introduce polynomial terms.

Screen Shot 2016-12-24 at 10.39.08 AM.png

 

This idea can be generalized: instead of $\theta^T x$, we use $\theta^T f(x)$, where f is a vector. In the polynomial case, for example, f(x)=(1,x_1,x_2,x_1x_2,x_1^2,x_2^2).  If \theta^T f\geq 0, then predict 1; otherwise, predict 0. The elements of f  can be chosen to be similarity functions.

Screen Shot 2016-12-23 at 10.55.21 PM.png

f_i\in[0,1]. The more similar, the more closer to 1.

Where to get the landmarks? We can let l^{(i)}=x^{(i)}, i=1,2,\cdots, m. And f_0=1. Then the cost function becomes

C\sum_{i=1}^m y^{(i)}cost_1(\theta^Tf^{(i)})+(1-y^{(i)})cost_0(\theta^Tf^{(i)})+\frac{1}{2}\sum_{i=1}^n\theta_i^2.

Remark(Gaussian kernel): Gaussian kernel is used frequently. For large $\delta^2$, the gaussian kernel varies smoothly, it corresponds to high bias and lower variance.

Some suggestions:

n is the number of features and m is the number of training  samples.

  1. If n is large relative to m, (say n=10,000, m is from 1 to 1000), then use SVM without kernel or logistic regression.
  2. If n is small, m is intermediate (say n is from 1 to 1000, m from 10 to 10,000), then use SVM with Gaussian kernel.
  3. If n is small, m is large, (say n is from 1 to 1000, m is 50,000 ) then create or add more features, then use logistic regression or SVM without a kernel.
  4. Neural network works well for most settings, but it is time-consuming for training sometimes.
  5. When using Gaussian kernels, do perform feature scaling.
  6. Not all similarity functions make valid kernels. Need to satisfy technical condition called “Mercer’s Theorem” to make sure SVM package works.

An example in R:


rm(list=ls())
#SVM on iris data
url='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df=read.table(url,sep=',')
head(df)
df[,'V5']=as.factor(as.numeric(df[,'V5']))
set.seed(0)
index_tr=sample(nrow(df),100)
train <- df[index_tr, ]
test <- df[-index_tr, ]
table(train['V5'])
table(test['V5'])
#train a SVM
library(kernlab)
xtrain=train[,1:4]
ytrain=train[,5]
testx=test[,-5]
svp <- ksvm(V5~.,data=train,type="C-svc",kernel='rbfdot',C=100)
pred=predict(svp, testx)
table(pred,test[,5])

References:

[1] CS 229, Stanford

[2] Coursera Machine learning

[3] SVM Convex Optimization

Posted in Machine Learning | Leave a comment

A brief introduction to Neural Networks with an example in R

Motivation:

Suppose there are two predictors and the decision boundary is non-linear, then one can introduce quadratic terms in the logistic regression, i.e.

$ latex  h_{\theta}(x)=g(\theta_0+\theta_1 x_1+\theta_3 x_2+\theta_4 x_1^2+\theta_5 x_1x_2+\theta_6 x_2^2),$

where g(z)=1/(1+e^{-z}) is the logistic function.

image

However, most questions have more than two predictors. If the number of predictors are large or you introduce terms like x_i x_j x_k and so on, then logistic model is not the best choice. Probably you should consider Neural networks.

Neural Networks: model representation

Logistic Unit:

We first introduce logistic unit.

screen-shot-2016-12-17-at-1-45-55-pm

The inputs are x_1, x_2, x_3, which are the predictors. The output is h_{\theta}(x)=g(\theta \mathbf{x}). The elements of \theta are also called weights in neutral network.

Neural network:

33

a^{l}_i is the ith unit(neuron) in the lth layer. \theta^{l} is the matrix of weights from layer l to layer l+1.

z_1^2=\theta^1_{10}x_0+\theta^1_{11}x_1+\theta^1_{12}x_2+\theta^1_{13}x_3,

z_2^2=\theta^1_{20}x_0+\theta^1_{21}x_1+\theta^1_{22}x_2+\theta^1_{23}x_3,

z_3^2=\theta^1_{30}x_0+\theta^1_{31}x_1+\theta^1_{32}x_2+\theta^1_{33}x_3,

and a_i^2=g(z_i^2), i=1,2,3 with a_0^2=1.

Then z^3_1=\theta^2_{10}a^2_0+\theta^2_{11}a^2_1+\theta^2_{12}a^2_2+\theta^2_{13}a^2_3 and h_{\theta}(x)=g(z^3_1).

Briefly, it is given by the forward propagation below.

Forward propagation:
{1.\ a^{(1)} = x}
{2.\ \text{for }l = 1,...,L-1\text{ do}}
{3.\quad z^{(l+1)} = \Theta^{(l)} a^{(l)}}
{4.\quad a^{(l+1)} = g(z^{(l+1)})}
{5.\ \text{end}}
{6.\ h_{\Theta}(x) = a^{(L)}}

So if the jth layer has s_j units and the j+1th layer has s_{j+1} units, then the size of the weights matrix \theta^j is s_{j+1}\times (s_j+1).

Examples:(A. Ng’s notes)

image2017

Multiclass classification:

When there are k classes (K>2),  we expect h_{\theta}(x) be a vector  with length K. If  h_{\theta}(x)=e_i=[0,0,\cdots,1,\cdots, 0]^T,  the sample is identified to be the ith class.

Suppose (x^i, y^i), i=1,2,\cdots, m are m samples. We rewrite y^{i}=e_s if the label of the ith sample is in the sth class.

An example of a neural network diagram with K=4 output units for a multi-class classification problem.

multiclass_neural_network_example

An example with K=4 output for a multi-class classification

The only difference from the binary classification is that it has one extra hidden layer and the last hidden layer has k units. The K units form the vector h_{\theta}(x), that is, the ith element of h_{\theta}(x) is the ith output in the last layer.

Cost functions:

Recall the cost function of logistic function is

J(\theta)=\frac{1}{m}-\sum_{i=1}^m (y^i\log (h_{\theta}(x^i))+(1-y^{i}log(1-h_{\theta}(x^i)))),

similarly, the cost function of neural network with regularization is

\displaystyle \begin{array}{rcl} J(\Theta) & = & -\frac{1}{m} \bigg[\sum _{i=1}^m \sum_{k=1}^K y_k^{(i)} \log h_{\theta}(x^{(i)})_k + (1-y_k^{(i)})\log(1 - h_{\theta}(x^{(i)})_k) \bigg] \\ & + & \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum _{i=1}^{s_l} \sum _{j=1}^{s_{l+1}}(\Theta _{ji}^{(l)})^2, \end{array}

where L is the total number of layers. The cost function is not convex, so local min may be obtained.

To minimize the cost function, most optimazition methods can be applied. A basic method is the gradient decent method, where the partial derivatives of the cost function with respect to all weighs are needed. The back propagation algorithm is effective and we introduce it below.

 Back propagation algorithm:

First we define the “error”  \delta^l_j  at the jth unit in the lth layer.
{1.\ \delta^{(L)} = a^{(L)} - y}
{2.\ \text{for }l = L-1, ..., 2\text{ do}}
{3.\quad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l + 1)} .* g'(z^{(l)})}
{4.\ \text{end}}

Note that we don’t have \delta^{(1)}. The “error”  is related to the partial derivatives in this way:

{\frac{\partial}{\partial \Theta ^{(l)}}J(\Theta) = \delta^{(l+1)}(a^{(l)})^T + \lambda \Theta}.

So the following algorithm is used to get the derivatives of the cost function:
{1.\ \text{Set }\Delta^{(l)} = 0,\text{ for }l = 1, ..., L-1}
{2.\ \text{for }i = 1, ..., m\text{ do}}
{3.\quad \text{Set }a^{(1)} = x^{(i)}}
{4.\quad \text{Compute }a^{(l)}\text{ for }l=2,...,L\text{ using forward propagation (Algorithm 1)}}
{5.\quad \text{Using }y^{(i)}, \text{compute } \delta^{(l)} \text{ for } l = L, ..., 2\text{ using the backpropagation algorithm (Algorithm 2)}}
{6.\quad \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \text{ for }l=L-1, ..., 1}
{7.\ \text{end}}

Then

D^{(l)}:=\frac{1}{m}\Delta^{(l)}+\lambda \theta^{(l)}(1-1_{j=0}).

Note that  \frac{\partial J}{\partial \theta^{l}}=D^{(l)}.

Some suggestions:

  1. A  reasonable default is to have one hidden layer with the number of units equal to {2} times the number of input units. If more than one hidden layer, use the same number of hidden units in every layer. The more hidden units the better, the constrain here is the burden in computational time that increases with the number of hidden units.
  2. When using numerical optimization you need initial values for {\Theta}. Randomly initialize the parameters {\Theta} so that each {\theta_{ij} \in [-\epsilon, \epsilon]}. Initializing {\theta_{ij} = 0} for all {i,j} will result in problems. Basically all hidden units will have the same value, which is clearly undesirable.
  3. During the building of your neural network algorithm, use numerical derivatives to make sure that your implementation of the back propagation is correct. The reason we don’t use finite difference method(central) to get the partial derivatives of the cost function is because it is not effective in computation. So it is only used for checking the back propagation.

An example in R:

We use the neuralnet package in r to classify the types of iris. The data can be found from UCI: iris.

#install package 'neuralnet'
#use neuralnet to classify iris data.
rm(list=ls())
library(neuralnet)
library(nnet)
df=read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',sep=',')
head(df)
# there are four predictors V1,V2,V3,V4. V5 is the label.
labels=df['V5']
mydf &amp;lt;- cbind(df[, 1:4], class.ind(df$V5))
colnames(mydf)[5:7]=c('setosa','versicolor','virginica')
head(mydf)

#split the data into train set and test set
set.seed(123)
smp_size &amp;lt;- floor(0.75 * nrow(mydf))
train_ind &amp;lt;- sample(seq_len(nrow(mydf)), size = smp_size)

train &amp;lt;- mydf[train_ind, ]
test &amp;lt;- mydf[-train_ind, ]
testX=test[c('V1','V2','V3','V4')]
test_labels=labels[-train_ind,]
#Train Neural network model
print(net&amp;lt;-neuralnet(setosa+versicolor+virginica~V1+V2+V3+V4,data=train,hidden=c(4,4),linear.output=FALSE))
plot(net,rep='best')
predictOut=compute(net,testX)
predict_results=data.frame(predictOut$net.result)
colnames(predict_results)=c('Iris-setosa','Iris-versicolor','Iris-virginica')
results=colnames(predict_results)[apply(predict_results,1,which.max)]
table(results, test_labels)

 The training model looks like this:

Screen Shot 2016-12-17 at 4.21.43 PM.png

Reference:

Thiago G. Martins

Machine Learning(coursera)

Posted in Machine Learning, R | Leave a comment

Spectral methods in dynamics

In the dynamics seminar here at Houston, we’re beginning a series of expository talks on statistical properties of dynamical systems. This week’s talk was given by Andrew Törö…

Source: Spectral methods in dynamics

Posted in dynamical system and probability | Leave a comment

Reviews from Amazon’s customers

The dataset is from http://snap.stanford.edu/data/amazon/productGraph/. I analyzed the data from two department: beauty and Toys and Games. The features includes

  • asin – ID of the product
  • reviewerID – ID of the reviewer
  • reviewTime – time of the review
  • overall – rating of the product
  • department

I mainly used Tableau to visualize the data. From the visualization, I found something interesting.

First, one can notice that the number of products and users has an obvious increment  after 2006 compared to that in 1998.

number-of-products

It is weird that the number of customers and products in 2014 is less than that in 2013. After checking the data carefully, I found that the data in the fourth quarter of 2014 is not in the dataset.

Next let’s have a closer look at that after 2007.

number-of-products-and-customers-since-2007

It is out of our surprise that the number of reviews show similar trend.

number-of-reviews-for-years5

In December and January, the number of reviews is obvious more than the other months due to the festivals and more discounts.

number-of-reviews-every-month

I would like to know when customers prefer to write reviews, so we can send emails to them at that time to remind them. But it seems that the data collected is only part of the whole data, since all the reviews are at 7pm and 8pm.

all-reviews-are-at-7-and-8-clocks

It is interesting to know the distribution of the number of reviews for each customer. This is the histogram of it.

 

dectect-suspective-customers

The customer with id AJGU56YG8G1DQ gives more than 1500 reviews. The second customer is around 600 reviews. The big difference caught my attention and I doubt whether the fist customer is not real and it is the account used to check the system, or maybe this customer is hired to promote the average review.

I found that most of the time,  the customer with id AJGU56YG8G1DQ gave reviews to be 5 stars, no matter in beauty or in toys and games departments. So it is obvious an outlier of the dataset.

 

After the outlier is filter out, I studies the distribution of the reviews. Review is a categorical variable with five values: 1,2,3,4,5. Star 5 ranks first and star 4 ranks second. It may be explained by the fact that only when customers are very satisfied or satisfied with the products, they will write reviews to the products.

distribution-of-reviews

The average review every year for both departments is always above 4.

average-review-for-years4

Posted in Uncategorized | Leave a comment